Logs Cleanup

Ralstan Vaz
Published in We Are BookMyShow
May 10, 2021

In this post, I will take you through our journey of logging.

We started with the same approach as everyone else: log everything.

And that’s exactly what we did!

Logged EVERYTHING!

So…

What’s the problem with that?

  • Logs help you debug
  • They give you insights into the state of the system
  • If you’re feeling ambitious, you can even build beautiful dashboards out of them

That’s all great stuff!

But...

It can easily be misused.

During the pandemic, we began to optimize our systems with respect to cost and capacity.

We jumped in to find the leaks in our system, and guess what came out of it… Massive Logs!!!

We realized we were logging hundreds of GB of logs a DAY!!

That’s when our eyes opened 👀

We saw how it negatively impacted our apps in terms of performance, infra, and cost of running them.

CPU
Every log line you print has to be executed, and that consumes CPU. This can slow down your application.
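One common way this CPU cost sneaks in is eager string formatting: the log message is built even when the log level means it will never be emitted. A minimal sketch with Python’s standard `logging` module (the logger name and payload here are illustrative, not from our actual services):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("transaction")

payload = {"order_id": 1234, "items": list(range(1000))}

# Eager: the f-string (including the repr of the whole payload) is
# computed even though DEBUG is disabled, burning CPU for a log line
# that is never emitted.
logger.debug(f"processing payload: {payload}")

# Lazy: logging formats the message only if DEBUG is enabled, so the
# disabled call costs little more than a level check.
logger.debug("processing payload: %s", payload)
```

The second form is what most logging frameworks recommend for exactly this reason.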

Network
When you aggregate and ship logs to tools such as Splunk, Coralogix, or Kibana, they consume network bandwidth.

Storage
This is one of the primary costs: it scales directly with the number of GBs logged.

Confusion
Badly formatted logs can make it very tricky to debug issues in an ocean of logs.
To describe a log line, some services used description as a key while others used desc or statement.

Similarly for enums,

To identify the type of log, some applications logged the type as error while others used err; similarly, a warning was called either warn or warning.

{
  ...
  "type": "error",
  "application": "transaction",
  "description": "something we should worry about",
  ...
}

Or

{
  ...
  "type": "err",
  "app": "transaction",
  "desc": "something we should worry about",
  ...
}

Which key or enum are we supposed to use to search for the relevant logs?
How do you run analysis across all the errors?
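Until every service agrees on one schema, one stopgap is to normalize the aliases at query or ingest time. A minimal sketch; the alias tables below are illustrative (the canonical names "application", "description", "error", and "warning" are our assumption, not a standard):

```python
# Hypothetical key and enum aliases observed across services,
# mapped to one canonical spelling.
KEY_ALIASES = {"desc": "description", "statement": "description", "app": "application"}
TYPE_ALIASES = {"err": "error", "warn": "warning"}

def normalize(entry: dict) -> dict:
    """Rewrite a parsed log entry to use canonical keys and enum values."""
    out = {KEY_ALIASES.get(k, k): v for k, v in entry.items()}
    if "type" in out:
        out["type"] = TYPE_ALIASES.get(out["type"], out["type"])
    return out
```

With this in place, a single query for `type:error` finds every variant, regardless of which service emitted it.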

How did we resolve it?

Cleaning up years of logs is not a simple task, especially when you may not even know why something was logged in the first place.

Figure out which log lines were being logged the most

By using the _size metric in our logs we were able to aggregate and get the size of each logline per day. This was our first target, it helped us remove a lot of heavy logs quickly.
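The aggregation itself is straightforward: group log entries by an identifying key and sum their `_size`. A toy sketch of the idea (the sample entries and the grouping key are illustrative; our real aggregation ran inside the logging platform):

```python
from collections import Counter

# Toy sample standing in for a day's worth of parsed JSON log entries;
# "_size" is the byte size of each log line.
lines = [
    {"application": "transaction", "description": "payment ok", "_size": 180},
    {"application": "transaction", "description": "payment ok", "_size": 175},
    {"application": "search", "description": "query served", "_size": 4200},
]

# Sum bytes per (application, description) so the heaviest
# log lines float to the top.
totals = Counter()
for entry in lines:
    totals[(entry["application"], entry["description"])] += entry["_size"]

for key, size in totals.most_common():
    print(key, size)
```

The top of that ranking is your hit list: removing or trimming a handful of heavy lines often recovers a disproportionate share of the volume.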

Figure out which applications log the most

This part was relatively easy; luckily, we already had a decent bifurcation of systems (Kubernetes namespaces) and subsystems (app names) that was pushed into our logging platform. This made it really simple to target the applications where we could create the most impact.

Held educational sessions and war rooms

It’s easy to log, but what’s hard is to log right!

It’s important that every techie in the company knows what, when, and how to log. To achieve our goal of efficient logging, we needed everyone to be in sync. That’s exactly what we did: we held educational sessions on logging best practices so everyone understood the impact. In certain cases, we also ran war rooms with major application owners, which gave us really good results.

Ensuring this does not happen again

Holding sessions makes devs aware of how to log, but we also need to enforce it in the implementation.

We need to ensure that there is consistency across microservices.

We did this by creating logging libraries across languages that bake in logging best practices.
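The core idea of such a library is that developers can no longer choose their own keys or enum values. A minimal sketch of what one of these helpers might look like; the field names, allowed types, and JSON-lines output are assumptions, not our actual library:

```python
import json
import sys
import time

# The only log types the helper accepts; anything else is rejected,
# so "err" vs "error" inconsistencies cannot reach the log stream.
_ALLOWED_TYPES = {"error", "warning", "info", "debug"}

def log(application: str, log_type: str, description: str, **extra) -> str:
    """Emit one structured log line with canonical keys; returns the line."""
    if log_type not in _ALLOWED_TYPES:
        raise ValueError(f"unknown log type: {log_type}")
    record = {"ts": time.time(), "application": application,
              "type": log_type, "description": description, **extra}
    line = json.dumps(record)
    print(line, file=sys.stderr)
    return line
```

Because every service calls the same helper, the keys and enums are consistent by construction, and searches like `type:error` work across the board.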

We also created dashboards, monitored every day, to keep a check on log sizes.

This does not mean you stop logging, it means you start logging smartly!

In just around a month we were able to reduce logs by 80%.

Yes, that’s right!

80% less CPU

80% less Network

80% less Storage

80% less COST!!

We hope the learnings in this post bring you one step closer to a cleaner system. Do watch this space for more such experiences.

In the meantime, you can also follow us on Facebook, Twitter, Instagram, and LinkedIn.

PS: Keep an eye out for a follow-up post on smart logging practices.
