Rebooting Traffic @BookMyShow: Get, Set, Go!

Ralstan Vaz
Published in We Are BookMyShow
Feb 2, 2022 · 5 min read

There is no denying the impact the Covid-19 pandemic has had on the entertainment industry, which was the first to shut down and the last to get back on its feet. With a natural dip in overall transactions, running a massive infrastructure was not prudent, so we decided to scale down. Drastically! And optimise as much as we could.

With traffic slowly but surely making a comeback, it was time to get back in the driver’s seat and get set to handle the rise again!

One of the simple ways of doing this was to add more infrastructure. But that would mean an increase in cost!

We needed a more effective way of doing this. And so, we decided to take the road less travelled and fix the problem at its roots. Because if there is anything the pandemic has taught us, it is to Do More With Less.

It may mean more work and more hours, but it is most definitely the most effective approach.

So now our ultimate goal was:

Handle more traffic, with as little infra as possible.

The important thing to understand here is that we don’t get heavy traffic every day. Load varies through the week: weekends carry the heaviest traffic, and even then it peaks at specific times of the day.

Release dates: Spider-Man: No Way Home (16th Dec 2021) and Pushpa: The Rise (17th Dec 2021)
Weekends have higher traffic (17th onwards was a weekend)
Our peak traffic typically falls between 10 am and 2 pm

This means we must be able to handle burst traffic.

And the solution was to introduce “Blockbuster Day”.

The idea is to simply pretend we are going to receive heavy traffic for a popular movie, prep the system for the same and see how it behaves.

Learn as fast as possible by failing even quicker.

We use these Blockbuster Days to detect problems at an earlier stage, rather than having issues with live traffic.

Every Blockbuster Day would teach us something new or surface an unknown challenge, which we would fix before the next one, ultimately resolving all problems before the real high-traffic day was upon us. Each run has its own objectives, typically defined by the learnings of the previous one.

In order to do this, we had to pick a date and drive an artificial load through the entire system.

Our teams knew about this day and the exercise, just as they would ahead of such movie releases.

“What load should we target?” was the first question that popped up.

We decided to mimic the load from a heavy traffic movie in the past year.

Note: We run on AWS and use EKS; our apps (pods) are auto-scaled based on CPU and memory.
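For illustration, here is a minimal sketch (hypothetical names, not our actual manifests) of what such an autoscaling policy looks like when created with the kubernetes Python client, assuming a Deployment called sample-app and the 50% CPU / 65% memory utilisation targets mentioned later in this article.

# A minimal sketch (placeholder names, not our real manifests) of a CPU- and
# memory-based HPA created with the kubernetes Python client (autoscaling/v2).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster

def utilisation_metric(name, target_percent):
    # Scale on average resource utilisation (e.g. cpu at 50%, memory at 65%)
    return client.V2MetricSpec(
        type="Resource",
        resource=client.V2ResourceMetricSource(
            name=name,
            target=client.V2MetricTarget(type="Utilization",
                                         average_utilization=target_percent),
        ),
    )

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="sample-app"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="sample-app"),
        min_replicas=2,   # placeholder replica counts
        max_replicas=20,
        metrics=[utilisation_metric("cpu", 50),
                 utilisation_metric("memory", 65)],
    ),
)

client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)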

So far we have had 4 Blockbuster Days:

Blockbuster Day 1: Defend Ourselves

Blockbuster Day 2: Scaling Policies/Rightsizing

Blockbuster Day 3: Content-Based Routing

Blockbuster Day 4: Warm pooling

In this article, we will cover Blockbuster Day 1.

Blockbuster Day 1: Defend Ourselves

We started with 20% of the target load (generated with JMeter) and gradually increased it; a rough sketch of this stepped ramp follows. As the load increased, there was one clear challenge that kept returning: Pods Crashing Constantly!
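For illustration only (our actual load was driven by JMeter test plans, not this script), here is a minimal Python sketch of the stepped ramp-up idea: hold each load level for a while, then step up towards the full target. The URL, target concurrency and hold times are made up.

# Hypothetical sketch of a stepped load ramp (20% -> 100% of target).
# Our real tests used JMeter; this only illustrates the shape of the load.
import concurrent.futures
import time
import requests

TARGET_URL = "https://example.com/api/health"  # placeholder endpoint
TARGET_CONCURRENCY = 200                       # made-up full-load figure
STEPS = [0.2, 0.4, 0.6, 0.8, 1.0]              # 20% -> 100%
HOLD_SECONDS = 300                             # hold each step for 5 minutes

def fire_requests(workers, duration):
    # Keep `workers` threads issuing requests for `duration` seconds.
    deadline = time.time() + duration

    def worker():
        while time.time() < deadline:
            try:
                requests.get(TARGET_URL, timeout=5)
            except requests.RequestException:
                pass  # errors are expected while pods crash or scale

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)

for fraction in STEPS:
    workers = int(TARGET_CONCURRENCY * fraction)
    print(f"Ramping to {int(fraction * 100)}% ({workers} concurrent workers)")
    fire_requests(workers, HOLD_SECONDS)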

The constant crashing meant we could not push the load any further, so we decided to drill down into it. Our pods were set to auto-scale at or above 65% RAM and 50% CPU.

Here’s what was happening.

Requests (API calls) flow like this: Client → CDN → HAProxy → L1 Apps → L2 Apps…

L1 Apps: These are the apps that are accessed by the client

L2 Apps: These are the apps that L1 is dependent on

(Diagrams: pods receiving load → 1st pod crashing → 2nd pod crashing)

The moment burst traffic started coming in, CPU and RAM would spike to a level that crashed the pod. Auto-scaling would trigger, but while the new pod was coming up the older ones would keep crashing, the new pod would meet the same fate, and the cycle continued.

Of course, running a higher minimum pod count could help, but that is not cost-effective.

The pods just needed a moment to breathe. No really, we simply needed to give them some time to scale.

That got us thinking! We could simply defend ourselves using the almighty HAProxy.

The answer was that simple: send just enough traffic for the pods to handle, and queue up the rest of the requests.

Requests getting queued up at HAProxy

We did a couple of load tests on the individual L1 Apps to understand the load that each pod could handle.

Based on this, we computed the concurrency and added those values to HAProxy.
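As a back-of-the-envelope illustration (the throughput and latency figures below are hypothetical placeholders, not our benchmark results), the per-pod concurrency can be estimated with Little's law and then translated into a per-server maxconn value:

# Hypothetical sketch: estimating a per-pod concurrency limit from load-test
# results using Little's law (concurrency = throughput x response time).
# The numbers below are placeholders, not our actual figures.

def max_concurrency(rps_per_pod, avg_response_secs, safety_factor=0.8):
    # Concurrent requests one pod can hold before it starts to degrade,
    # with a safety margin so we never run it flat out.
    return int(rps_per_pod * avg_response_secs * safety_factor)

# Example: a pod that sustains 150 req/s at ~250 ms average latency
per_pod_maxconn = max_concurrency(rps_per_pod=150, avg_response_secs=0.25)
print(per_pod_maxconn)  # ~30, which would become "maxconn 30" for that server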

Now, HAProxy would send only as much traffic as the apps could take. This meant the apps no longer crashed.

Here is a sample of how we did it:

backend backend_sample_app
...
# maxconn 30: each server handles at most 30 concurrent requests; the rest queue in HAProxy
# inter/fall/rise tune the health checks; the pool-* settings control idle connection reuse
default-server inter 7s fall 3 rise 2 maxconn 30 pool-max-conn 30 pool-purge-delay 1
...

We used HAProxy’s maxconn to limit the number of concurrent requests each app could take. Anything more than this would get queued up at HAProxy.

Of course, the users in the queue would have to wait, but we had a solution for that too (it will be covered later in this series).

Keep an eye out for the next article in this series (Blockbuster Day 2).

We hope the learnings in this post take you one step closer to a scalable system. Watch this space for more such experiences.

In the meantime, you can also follow us on Facebook, Twitter, Instagram, LinkedIn.
