How We Moved Our Product to Google Cloud with No Downtime While Doubling Traffic

12-07 23:08

Play by Play: Moving the NYT Games Platform to GCP With Zero Downtime

Recently I wrote about moving the platform behind The New York Times Crossword to the Google Cloud Platform and mentioned we were able to cut costs in the process. I did not get to mention that the move occurred during a period when our traffic more than doubled, and that we managed to do it with zero downtime.

Even from the start, we knew we wanted to move away from our LAMP stack and that its replacement would likely be written with the Go programming language, leaning on GCP’s abstractions wherever possible. After much discussion, we came up with a microservice architecture and a four-stage process for migrating public traffic over to it. We drafted an RFC and distributed it internally to get feedback across the company and from our Architecture Review Board. Before long, we were ready for stage 1 and about to run into our first round of surprises.

Stage 1: Introducing a Simple Proxy

For the initial stage, we wanted to simply introduce a new pure proxy layer in Google App Engine (GAE). Since all traffic flows through Fastly, we were able to add a rule to point all crossword traffic at a new * domain and proxy all traffic into our legacy AWS stack. This step gave us ownership over all of our traffic so that we could move over to the new stack, one endpoint at a time, and monitor the improvements along the way.

Of course, right off the bat we ran into issues, but for the first time ever, we also had an array of tools to let us peer into our traffic. We found that some web customers were unable to access the puzzle, and traced the problem to App Engine’s 16KB limit on the size of outbound request headers. Users with a large number of third-party cookies had their identity stripped from the proxied request. We made a quick fix to proxy only the headers and cookies we needed and we were back in action.
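
A minimal sketch of that kind of fix in Go, assuming a standard library reverse proxy in front of the legacy stack. The header and cookie names in the allowlists are placeholders (the article does not say which ones were kept), and `newEdgeProxy` is a hypothetical helper name.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Hypothetical allowlists: the real header and cookie names are not
// given in the article. Header names are in canonical form.
var (
	allowedHeaders = []string{"Accept", "Content-Type", "User-Agent"}
	allowedCookies = []string{"user_session"}
)

// filterHeaders returns a new header set that holds only the allowlisted
// headers and a Cookie header rebuilt from only the allowlisted cookies.
func filterHeaders(in http.Header) http.Header {
	out := http.Header{}
	for _, name := range allowedHeaders {
		for _, v := range in.Values(name) {
			out.Add(name, v)
		}
	}
	// Wrap the headers in throwaway requests to reuse net/http's
	// cookie parsing and serialization.
	src := &http.Request{Header: in}
	dst := &http.Request{Header: out}
	for _, name := range allowedCookies {
		if c, err := src.Cookie(name); err == nil {
			dst.AddCookie(&http.Cookie{Name: c.Name, Value: c.Value})
		}
	}
	return out
}

// newEdgeProxy builds a reverse proxy to the legacy backend that forwards
// only the filtered headers, keeping the outbound request small.
func newEdgeProxy(legacy *url.URL) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(legacy)
	director := proxy.Director
	proxy.Director = func(r *http.Request) {
		r.Header = filterHeaders(r.Header)
		director(r) // rewrites scheme, host and path for the backend
		r.Host = legacy.Host
	}
	return proxy
}
```

With this shape, a request carrying dozens of third-party cookies is reduced to the one identity cookie before it leaves the proxy, so the outbound header size no longer depends on what other sites have set in the browser.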

The next problem came from our nightly traffic spike, which occurs when the next day’s puzzles are published at 10pm Eastern time. One of App Engine’s strengths is auto-scaling, but the system was still having problems scaling up fast enough for our 10x+ jump over the course of a few seconds. To get around this, we use an App Engine cron task combined with a special endpoint that calls an admin API to raise our service’s scaling settings right before we expect a surge in traffic. With a handle on these two problems, we were ready to move to the next stage.
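
The pre-scaling idea can be sketched roughly as follows. Everything specific here is an assumption for illustration: the window lengths, the instance counts, and the `ScalingClient` interface, which stands in for whatever admin API call actually changes the scaling settings.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Illustrative numbers only; the real settings are not in the article.
const (
	publishHour     = 22 // puzzles publish at 10pm Eastern
	warmupMinutes   = 10 // scale up this long before publish
	cooldownMinutes = 20 // keep the floor raised this long after publish
	surgeMinIdle    = 50
	normalMinIdle   = 2
)

// minIdleAt returns the idle-instance floor for a wall-clock time
// expressed in Eastern time.
func minIdleAt(hour, minute int) int {
	m := hour*60 + minute
	start := publishHour*60 - warmupMinutes
	end := publishHour*60 + cooldownMinutes
	if m >= start && m < end {
		return surgeMinIdle
	}
	return normalMinIdle
}

// ScalingClient is a hypothetical wrapper around the admin API call
// that updates a service's scaling settings.
type ScalingClient interface {
	SetMinIdleInstances(service string, n int) error
}

// prescaleHandler is meant to be hit by cron shortly before and after
// the surge window; it applies whichever floor fits the current time.
func prescaleHandler(c ScalingClient, service string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		eastern, err := time.LoadLocation("America/New_York")
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		now := time.Now().In(eastern)
		n := minIdleAt(now.Hour(), now.Minute())
		if err := c.SetMinIdleInstances(service, n); err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		fmt.Fprintf(w, "min idle instances set to %d\n", n)
	}
}
```

Keeping the schedule logic in a pure function like `minIdleAt` makes it easy to test separately from the API call itself.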

Stage 2: Building Out Endpoints and Syncing Data in Realtime

Between all of NYT’s puzzles and game progress for all of our users, there was a lot of data in our existing system. In order to smooth the transition to the new system, there needed to be a mechanism to replay all of our data and keep it in sync. We ended up using Google PubSub to reliably push data into our new stack.

  • For puzzle data, we added a hook to publish any updates from our internal admin to our new “puzzles” service. This service would manage upserting the data into Datastore and invalidating any caches.
  • For game progress, we went the duct-tape route and simply added a process with a cron to query the legacy database for new updates and emit them over PubSub to a new “progress” service in App Engine.
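
The cron-driven sync for game progress might look something like the sketch below. The `ProgressRow` fields, the cursor-by-timestamp approach, and the function-typed `query` and `publish` parameters are all assumptions; in a real system they would wrap a SQL query against the legacy database and a PubSub publish call.

```go
package main

import "encoding/json"

// ProgressRow is a hypothetical shape for one row of legacy game progress.
type ProgressRow struct {
	UserID    int64 `json:"user_id"`
	PuzzleID  int64 `json:"puzzle_id"`
	UpdatedAt int64 `json:"updated_at"` // unix seconds
}

// syncOnce is one cron tick: it asks the legacy database for rows updated
// after cursor, publishes each one as JSON, and returns the new cursor.
// If anything fails, it returns the old cursor so the whole batch is
// retried on the next tick; the subscriber therefore has to tolerate
// duplicates (for example by upserting).
func syncOnce(
	query func(cursor int64) ([]ProgressRow, error),
	publish func(data []byte) error,
	cursor int64,
) (int64, error) {
	rows, err := query(cursor)
	if err != nil {
		return cursor, err
	}
	next := cursor
	for _, row := range rows {
		data, err := json.Marshal(row)
		if err != nil {
			return cursor, err
		}
		if err := publish(data); err != nil {
			return cursor, err
		}
		if row.UpdatedAt > next {
			next = row.UpdatedAt
		}
	}
	return next, nil
}
```

Retrying a whole batch trades some duplicate messages for simplicity, which fits the duct-tape spirit of this step.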

While we were able to rely on PubSub’s push-style subscriptions and App Engine for the majority of our data, we did have one use case that was not a good fit for GAE: generating PDFs for our puzzles. Go has a nice PDF generation library but some of the custom fonts we needed to use led to unacceptable file sizes (>15MB). To get around this, we had to pipe the PDF output through a command-line tool called Ghostscript. Since we could not do this on App Engine, we added an extra hop in our PubSub flow and created a small process running on Google Container Engine (GKE) that listens to PubSub, generates the PDF, and then publishes the file back out to PubSub, where it is consumed by the “puzzles” service and saved to Google Datastore.
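
As a rough illustration of the Ghostscript step, the sketch below shells out to `gs` from Go. It assumes the `gs` binary is on the PATH, uses temporary files rather than a literal pipe, and picks the `/ebook` quality preset arbitrarily; none of these details come from the article.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// ghostscriptArgs builds a gs invocation that rewrites a PDF with the
// pdfwrite device. The /ebook preset is an assumption; the input path
// goes last, after all the switches.
func ghostscriptArgs(inPath, outPath string) []string {
	return []string{
		"-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
		"-sDEVICE=pdfwrite",
		"-dPDFSETTINGS=/ebook",
		"-sOutputFile=" + outPath,
		inPath,
	}
}

// compressPDF writes the generated PDF to a temp dir, runs Ghostscript
// over it, and returns the rewritten (hopefully much smaller) file.
func compressPDF(pdf []byte) ([]byte, error) {
	dir, err := os.MkdirTemp("", "puzzle-pdf")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)

	in := filepath.Join(dir, "in.pdf")
	out := filepath.Join(dir, "out.pdf")
	if err := os.WriteFile(in, pdf, 0o600); err != nil {
		return nil, err
	}
	cmd := exec.Command("gs", ghostscriptArgs(in, out)...)
	if msg, err := cmd.CombinedOutput(); err != nil {
		return nil, fmt.Errorf("ghostscript failed: %w: %s", err, msg)
	}
	return os.ReadFile(out)
}
```

Because this needs an external binary and local disk, it is the kind of work that fits a container better than the App Engine sandbox described above.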

This is the stage where we learned a lesson on managing costs when doing heavy work in Google Datastore. The database uses the count of entity reads and writes to determine costs and, while replaying all of our historical game play, our user statistics were getting signalled to be reaggregated almost constantly. This reaggregation led to many collisions and recalculation failures which unexpectedly resulted in us spending thousands of dollars one weekend. Thanks to Datastore’s atomic transactions, we were able to toss a locking mechanism around statistics calculations, and the next time we replayed all user progress to the new environment, it was a fraction of the cost.
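
One way such a lock could work is a per-user lease that is read and written atomically. In the sketch below, an in-memory map guarded by a mutex stands in for a Datastore transaction over a single lock entity; the type names, the 60-second lease, and the skip-if-locked policy are all assumptions rather than the team's actual design.

```go
package main

import "sync"

// statsLock is a hypothetical lease record for one user's statistics.
type statsLock struct {
	Owner     string
	ExpiresAt int64 // unix seconds
}

// lockStore stands in for Datastore. The mutex plays the role of a
// transaction that reads one lock entity and then writes it back.
type lockStore struct {
	mu    sync.Mutex
	locks map[int64]statsLock
}

func newLockStore() *lockStore {
	return &lockStore{locks: map[int64]statsLock{}}
}

// tryAcquire takes the lease unless another owner holds an unexpired one.
func (s *lockStore) tryAcquire(userID int64, owner string, now, ttl int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	cur, held := s.locks[userID]
	if held && cur.ExpiresAt > now && cur.Owner != owner {
		return false
	}
	s.locks[userID] = statsLock{Owner: owner, ExpiresAt: now + ttl}
	return true
}

// release drops the lease if it still belongs to owner.
func (s *lockStore) release(userID int64, owner string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cur, ok := s.locks[userID]; ok && cur.Owner == owner {
		delete(s.locks, userID)
	}
}

// recalcStats runs recalc only if this worker wins the lease. When another
// worker is already reaggregating the same user, it skips the work instead
// of colliding with it. The bool reports whether recalc actually ran.
func recalcStats(s *lockStore, userID int64, owner string, now int64, recalc func() error) (bool, error) {
	if !s.tryAcquire(userID, owner, now, 60) {
		return false, nil
	}
	defer s.release(userID, owner)
	return true, recalc()
}
```

The expiry on the lease means a crashed worker cannot block a user's statistics forever, and skipping instead of retrying is what keeps the read and write counts down during a bulk replay.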

Architecture behind replaying game progress, calculating statistics and archiving data.

With our data reliably synced in near-realtime, it was time to start turning on actual endpoints in GCP.

Stage 3: Turning On Endpoints in GCP

Soon after data began to sync over to the new stack, we started making changes at the “edge” service to point to our newer implementations, one endpoint at a time. For a while we were at a pace where we were confidently switching over one endpoint a day.

Rewriting existing endpoints to the new stack wasn’t our only job during this timeframe. We also had a new, read-only endpoint to implement for the new iOS home screen. This new screen required a mix of highly cacheable data (e.g. puzzle metadata) and personalized game data (e.g. today’s puzzle solve time). We have two different services for hosting those two different styles of data in our new stack and we needed to combine them. This is where our “edge” service became more than a dumb proxy and enabled us to combine information from our two sub-services.
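
A minimal sketch of that combining step, assuming the edge service fetches from both sub-services concurrently and merges the results. The `PuzzleMeta`, `Progress` and `HomeScreen` types and their fields are hypothetical, and the two fetch functions stand in for HTTP calls to the “puzzles” and “progress” services.

```go
package main

import "sync"

// PuzzleMeta is the cacheable, user-independent part of the response.
type PuzzleMeta struct {
	ID    int64
	Title string
}

// Progress is the personalized part of the response.
type Progress struct {
	PuzzleID     int64
	SolveSeconds int
}

// HomeScreen is the combined payload returned to the app.
type HomeScreen struct {
	Puzzle   PuzzleMeta
	Progress *Progress // nil when the user has not started the puzzle
}

// buildHomeScreen calls both sub-services at the same time and merges
// their answers, so total latency is roughly that of the slower call.
func buildHomeScreen(
	fetchMeta func() (PuzzleMeta, error),
	fetchProgress func() (*Progress, error),
) (HomeScreen, error) {
	var (
		wg               sync.WaitGroup
		meta             PuzzleMeta
		prog             *Progress
		metaErr, progErr error
	)
	wg.Add(2)
	go func() { defer wg.Done(); meta, metaErr = fetchMeta() }()
	go func() { defer wg.Done(); prog, progErr = fetchProgress() }()
	wg.Wait()

	if metaErr != nil {
		return HomeScreen{}, metaErr
	}
	if progErr != nil {
		return HomeScreen{}, progErr
	}
	return HomeScreen{Puzzle: meta, Progress: prog}, nil
}
```

Keeping the cacheable and personalized data in separate services means the metadata call can be served from cache while only the progress call has to hit per-user storage.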

In this stage, we also replatformed the endpoints in charge of saving and syncing game progress across multiple devices. This was a major step as all related endpoints dealing with user statistics and streaks also had to be migrated. The initial game progress launch was a little rockier than we had hoped. One endpoint was experiencing much higher than expected latency and a plethora of odd edge cases popped up. In the end, we were able to cut out an unneeded query to remove the extra latency on the slow endpoint but the edge cases were a bit tougher to chase down. Once again, thanks to the observability tooling available in Google App Engine, we were able to track down the worst of the bugs and we were back to smooth sailing.

Stage 4: The Last Piece of the Puzzle

Once the systems around puzzle data and game progress were stable and running purely on Google’s infrastructure, we were able to set our sights on the final component to be rewritten from the legacy platform: user and subscription management.

Users of the crossword app are allowed to purchase their subscription directly through their device’s app store. (For example, an iPhone user can purchase an annual NYT Crossword subscription directly from the iTunes store.) When they do so, their device is given a receipt and our games platform uses that receipt to verify the subscription when the app is loaded.

Since verifying such a receipt is a task that could possibly be used by other teams at The New York Times, we decided to build our “purchase-verifier” service with Google Cloud Endpoints. Cloud Endpoints manages authentication and authorization to our service so another team in the company could request an API key and start using the service. Given an iTunes receipt or a Google Play token, this service tells us if the purchase is still valid and when it will end. To authenticate direct NYT subscribers and to act as an adapter to translate our existing authorization endpoints to match the new verification service, we added a small “ecomm” service to the mix.

The final public endpoint went live on GCP a little over two months ago and we’ve been actively resolving small edge cases and tuning the system for efficiency and cost. Thanks to GCP’s observability tooling, it’s not uncommon for the platform to have a day with 99.99%+ success rates and far lower latencies than we had in the past.

We still have a PHP admin component running in AWS for managing our system’s assets and feeds, but we’re currently redesigning and rewriting it to run on App Engine. Our next iteration is already reading and writing to a Google Cloud SQL instance so we hope to be out of AWS completely in the coming months.

For the future of the games platform, we’re looking into adding some new exciting features like social leaderboards and real-time collaborative multiplayer crosswords. By leaning on managed solutions from Google Cloud Platform like Firebase’s Realtime Database, we’ve had some very successful prototypes and we hope to have them available to the public sometime next year.

If you’ve been intrigued by some of the engineering behind The New York Times, you may like to know that we’re currently hiring for a variety of roles and career levels.
