Cutover to AWS: Preparation for moving the BookMyShow infra to Cloud

Utsav Kesharwani
Published in We Are BookMyShow · 8 min read · Aug 25, 2021

Moving from On-Prem to AWS Cloud: The BookMyShow journey

Introduction

Switching public traffic from one data center to another is never easy when you operate at very high volume on a complex on-prem setup that has been running for years.

We needed to migrate traffic from the on-prem data center to the AWS cloud. BookMyShow operates at a massive scale: 200M monthly user visits and 5B monthly page views, powered by hundreds of services (micro, moduliths, monoliths, legacy, etc.), a variety of databases (standalone, clustered, sharded, SQL/NoSQL), and terabytes of data. We had been running on an on-prem setup for the last 15+ years, and it had grown complex over time.

High-Level technical architecture of BookMyShow infrastructure before AWS migration

The migration (cutover) plan had to be thought through, detailed out, debated, discussed, and eventually scripted step by step for execution to avoid surprises and failures. Failure would most likely mean a negative business impact, extended service downtime, upset customers, or, in the worst case, a rollback to on-prem to be retried later.

Another thing that was clear: we could not afford to lose any critical data, of any period. Hence, the data migration strategy had to be one of the most crucial aspects of the plan.

Note to readers: The rest of this post assumes there is already a parallel, production-like environment set up in the new data center (AWS cloud), benchmarked for functionality, load, and scale at a minimum of 2x peak traffic. How the team at BookMyShow achieved this will be covered in a separate post. Keep watching!

How the migration journey started

For cloud migrations of such a huge scale involving multiple teams, the first thing one needs to do is document the current state. This is a challenge that has to be overcome by meticulous planning and crisp execution rather than by writing code.

At BookMyShow, there are 20+ broad teams (verticals/horizontals) spanning Movies, Live Entertainment, Stream, Transaction, Payments, SRE/DevOps, and more. Each vertical was tasked with taking a close look at its infrastructure, understanding what each component does and what it depends on.

Exploring legacy components in a running system was most challenging

It seems easy at a high level, but it comes with many challenges. A lot of legacy components had been built and shipped over time. With little or no context for such apps, planning their data center migration felt like pulling parts out of a running car engine and analyzing what broke.

However, getting this information accurately was a must-have, since it laid the foundation of the overall cutover strategy. It also helped forecast and keep a tab on the associated costs in the AWS environment.

Initially, we shared a common Google Sheet to collect all the information. This may not have been the best way, but it gave us somewhere to start. The sheets were formatted so that each one was self-explanatory, with examples, to drive collaboration quickly. All the teams were asked to fill in the following information (a simplified sketch of one such entry follows the list):

  1. Project Name
  2. Code and configuration deployment plans
  3. Associated databases
  4. Criticality
  5. Application dependencies
  6. List of other dependent components on the project
  7. Associated DNS (public and private)

The continuously evolving approach

The team started brainstorming on how to move to AWS component by component. This seemed the most logical way to do it, and we expected it to involve minimal dependencies and low downtime: migrate a single component to the cloud and observe whether it works. If not, tune it, mend it, fix it! We planned to repeat this activity, one new component at a time, until we were fully on the cloud.

The flip side was cross-data-center communication. From the moment the first component migrated until the last component moved to AWS, we would be in a hybrid stage. A service communicating with another service over the public internet was a big no: despite having ways to secure the data transfer (firewalls, etc.), the latency would have been too high. AWS Direct Connect seemed to offer a ray of hope to minimize the latency and make this feasible.
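As a rough illustration of why this mattered, cross-DC latency can be sampled with something as simple as timing a TCP handshake to a service endpoint in the other data center. The host and port below are placeholders, and a real benchmark of AWS Direct Connect would of course be far more thorough:

```python
import socket
import time

# Placeholder endpoint; in practice this would be a real service in the other DC.
HOST, PORT, SAMPLES = "service.other-dc.example.com", 443, 20

def tcp_connect_ms(host: str, port: int, timeout: float = 2.0) -> float:
    """Time a single TCP handshake to the given endpoint, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000

samples = [tcp_connect_ms(HOST, PORT) for _ in range(SAMPLES)]
print(f"min={min(samples):.1f}ms avg={sum(samples)/len(samples):.1f}ms max={max(samples):.1f}ms")
```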

It’s worth mentioning here that the team was clear we didn’t want to stay in the hybrid stage for long, but the ease and the expectation of minimal downtime with this approach seemed to outweigh the cons.

Now a new question arose: how do we decide which component moves first, and which moves last?

Migration based on app criticality

Given we’d already collected information on application criticality, we decided to migrate all the critical apps together. The assumption was that within a single short downtime window, the teams would be able to coordinate and migrate these applications and DBs quickly, after which the less critical apps and DBs could be migrated by individual teams without much need for cross-team coordination or downtime.

There was still one challenge, though: a lot of our legacy apps were marked as critical! All teams were cooperative and ready to commit, but this plan was mostly based on speculation and assumptions. We had the following concerns:

  • What if Direct Connect didn’t work out for our use case?
  • What if there are unexpected issues and we have to revert?
  • Should we migrate the critical ones first, or at the very end?
  • We had tested our cloud set-up, but the real test happens when user traffic comes. Was the parallel set-up on the cloud ready to take on user traffic?

With these questions in mind (and a few stressful days later), we eventually came up with a newer approach.

Migration based on multi-DC & single-DC set-up

After a few days of brainstorming, we noticed that a few of our apps supported multi-DC set-ups and deployments. We thought of taking advantage of this to move to the cloud.

Immediately, we started gathering the information from the different teams and collating it in another Excel sheet for further analysis.

Our apps were categorized into the following, based on how DB reads and writes happen:

Active-Active set-up

  1. The apps are deployed in two data centers. User traffic is split at Cloudflare based on a pre-defined algorithm.
  2. Service-to-service communication stays within the same DC (if the other services also have a multi-DC set-up).
  3. The databases are in two data centers; reads and writes can happen in either DC.
Active-Active set-up

Active-Passive set-up

  1. The apps are deployed in two data centers, as in the previous case.
  2. Service-to-service communication stays within the same DC (if the other services also have a multi-DC set-up).
  3. Though the databases are in two data centers, writes happen only in one DC (the primary/master nodes). Reads can happen in either DC.
Active-Passive set-up

Single-DC set-up

  1. Here the apps and DBs are in a single data center. Some of the DBs may be replicated for DR, but the critical ones were mostly kept on a single node.

In this new approach, we planned to first migrate the apps that supported a multi-DC set-up.

For active-active set-ups, we decided to build a parallel set-up on the cloud, where the apps are deployed and DB reads and writes happen within the respective DC.

For active-passive set-ups, the parallel set-up on the cloud has the apps deployed and reads initially served within the respective DC, while writes still go to the on-prem DB (primary node).

After this is achieved, perform some basic quality checks. Then start sending a small share (~5%) of end-user traffic and monitor the performance. We specifically wanted to test the cloud set-up with actual end-user traffic: check DB replication delays, observe any latency between the apps, benchmark AWS Direct Connect, and fine-tune the network bandwidth requirements between both DCs, to become more confident. Almost all of the concerns with the previous approach would be addressed by this one.
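To make the traffic-split idea concrete, here is a minimal sketch of a weighted split between the two data centers. In our case the split itself is done at Cloudflare, not in application code, and the pool names and the 5% weight below are illustrative assumptions:

```python
import random
from collections import Counter

# Hypothetical origin pools; names and weights are illustrative only.
POOLS = {
    "on-prem-dc": 0.95,   # 95% of end-user traffic stays on-prem initially
    "aws-cloud":  0.05,   # ~5% of traffic is sent to the AWS set-up
}

def pick_data_center() -> str:
    """Pick a data center for an incoming request using weighted random choice."""
    names = list(POOLS)
    weights = list(POOLS.values())
    return random.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    # Rough check: the observed split should approach the configured weights.
    counts = Counter(pick_data_center() for _ in range(100_000))
    print(counts)
```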

The plan was that if everything went well, we would gradually increase the end-user traffic (~15% or more). For active-passive set-ups, the primary DBs (where writes happen) could move to AWS along with this change, and the replication direction would reverse. Keep monitoring, eventually divert 100% of end-user traffic for these apps to the AWS set-up, and start decommissioning the on-prem set-up for the same.
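For the active-passive case, the key idea is that reads stay local while writes always go to wherever the primary currently lives, so the primary can be switched from on-prem to AWS without changing application code. Below is a minimal, hypothetical sketch of that routing logic; the DSNs and the CURRENT_PRIMARY switch are assumptions for illustration, not our production implementation:

```python
# Hypothetical connection endpoints; values are illustrative only.
LOCAL_REPLICA_DSN = "mysql://replica.local-dc.internal:3306/bms"
PRIMARIES = {
    "on-prem": "mysql://primary.on-prem.internal:3306/bms",
    "aws":     "mysql://primary.aws.internal:3306/bms",
}

# Flipping this value is, conceptually, the "primary moves to AWS and the
# replication direction reverses" step described above.
CURRENT_PRIMARY = "on-prem"

def dsn_for(query_kind: str) -> str:
    """Route reads to the local replica and writes to the current primary."""
    if query_kind == "read":
        return LOCAL_REPLICA_DSN
    return PRIMARIES[CURRENT_PRIMARY]

print(dsn_for("read"))   # local replica, same DC as the app
print(dsn_for("write"))  # on-prem primary until the cutover, then AWS
```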

Later, take a downtime window and migrate the apps deployed in a single DC. With that, the cutover would be complete.

Runbooks

While working on the cutover approach (or approaches), one thing the core team realized was that we had not captured the applications in enough detail. That’s why we kept going back and forth with the application owners for more information.

Also, any disconnect between how we thought the cutover should happen and how the application owners executed the plan would lead to total chaos.

We decided to fix this. How? You guessed it right, another Excel!

Spreadsheets actually proved quite helpful for collecting information!

All the teams were given a base template (referred to as the “runbook”) to list the information about their services’ infrastructure. The approach was also laid out step by step, so application owners could adapt it to their use cases with minimal tweaks.

The runbook had the following sections:

  1. Team and contact details
  2. Whom to reach out to in case of confusion
  3. Application details: Name, Type (HTTP server, background/scheduled job), DNS, Config management & Monitoring tools, etc
  4. Database details: DNS, Technology (SQL, NoSQL, etc)
  5. Test cases: Smoke, Regression
  6. Run-sheet

Details on the Run-sheet

The run-sheet was basically an activity sequence containing:

  • Pre-cutover steps: Preliminary checks on DB configs, access, and data sync delays (for example, a replication-lag check like the sketch after this list), and informing stakeholders about the downtime, if any
  • During-cutover steps: How to initiate downtime, how to handle background jobs that don’t handle concurrency, which DC’s message queues (RabbitMQ, Kafka) to publish to and consume from, etc.
  • Post-cutover steps: Smoke and regression tests
  • Rollback steps (just in case, be prepared!): Try fixing the issue; if that fails, roll back the specific component(s), do an RCA, and re-plan the migration
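As an example of the kind of pre-cutover check listed above, replication delay on a MySQL-style replica can be verified with a small script like the sketch below. The host, credentials, and the 5-second threshold are assumptions for illustration; other datastores would need their own equivalent:

```python
import pymysql  # assumes a MySQL-compatible replica; other DBs differ

# Placeholder connection details and threshold; tune per database.
REPLICA = {"host": "replica.aws.internal", "user": "monitor", "password": "***", "db": "bms"}
MAX_LAG_SECONDS = 5

conn = pymysql.connect(cursorclass=pymysql.cursors.DictCursor, **REPLICA)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")   # "SHOW REPLICA STATUS" on newer MySQL versions
        status = cur.fetchone() or {}
finally:
    conn.close()

lag = status.get("Seconds_Behind_Master")
if lag is None or lag > MAX_LAG_SECONDS:
    raise SystemExit(f"Replica not ready for cutover (lag={lag}s)")
print(f"Replication lag {lag}s is within the {MAX_LAG_SECONDS}s threshold; proceed.")
```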

Testing the approach and execution

After going through a lot of these Google Sheets with this much data, the team was fairly confident about the execution. To be fully sure, we decided to test the plan in our staging environment, just to surface the challenges of actual execution.

More on this later. Here’s a small hint on how it went.
