Today, we have Srinivas Devaki with us, sharing his SRE story. Srinivas was building and operating systems at Zomato until recently. He is now building a product for continuous cost optimization for software companies.
Hello Srinivas, please introduce yourself.
My name is Srinivas. I went to IIT Dhanbad, and I was lucky to get into computer science. I liked the studies, but within the first semester, I got bored. From the second semester onwards, I started liking competitive programming. I never knew anything about web development; I was doing competitive programming for most of those three years. I got an internship in my final year, where I worked on the front end and Android. I knew nothing related to SRE or DevOps.
I got into Zomato also by luck. I started in the web team at Zomato. The web team was in charge of the front end for the browser and the mobile PWA. I stayed there for about a year. In the first few months, I got a chance to solve all the HackerOne issues related to the front end – all the security-related bounties involving content security policies, XSS attacks, and SQL injection attacks.
While solving those issues, I was able to design a proper convention across the organization to always use templates correctly. If you do not use a proper templating language, injection attacks can get through. I developed that standard and set up a linter across the organization within those first two months. It was great to set up that linter so that new code entering the system doesn't run into these issues. That was a perfect opportunity to learn about different attack vectors and kinds of attacks you would never imagine.
The issues are usually straightforward: HTML, SQL, or code injection. But every so often, there is a sophisticated attack that is fun to work on.
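The stack at the time was PHP, but the principle the linter enforced translates to any language. As a minimal sketch of it in Go (an illustration only, not Zomato's code), a context-aware templating engine escapes untrusted input for you, while hand-built HTML is exactly what a linter would flag:

```go
package main

import (
	"html/template"
	"net/http"
)

// html/template is context-aware: anything rendered through {{.Query}} is
// escaped for HTML, so a payload like <script>alert(1)</script> shows up as
// text instead of executing in the browser.
var page = template.Must(template.New("search").Parse(
	`<html><body><h1>Results for {{.Query}}</h1></body></html>`))

func searchHandler(w http.ResponseWriter, r *http.Request) {
	q := r.URL.Query().Get("q") // untrusted input straight from the URL

	// Safe: the template engine escapes q for the HTML context.
	_ = page.Execute(w, struct{ Query string }{Query: q})

	// Unsafe equivalent that a linter should flag:
	//   fmt.Fprintf(w, "<h1>Results for %s</h1>", q) // XSS if q contains markup
}

func main() {
	http.HandleFunc("/search", searchHandler)
	http.ListenAndServe(":8080", nil)
}
```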
After that, I got another project on the front end, and I broke the restaurant page while working on it. We were using PHP those days. In PHP, if you have a bug and the script crashes in the middle, your HTML page renders only half, exactly half. In this case, it showed the restaurant name and the description, and then half of the page suddenly disappeared. The issue was that I was using an undefined variable. At that time, I was slightly immature, and I got defensive that it was not my fault: code reviewers should be taking care of this, or the system should take care of this. But I had an excellent team, and they made me understand the complete picture. They helped me understand how things are deployed end to end.
That was one of my first opportunities to get into the DevOps space. At that time, we were running plain PHP on EC2 instances. So I went in thinking that I had introduced that bug because it's hard to spot an undefined variable by eye; if I could use something like PHPStorm, I could catch it. We were developing on EC2 instances, which are remote machines; everyone codes on a VM hosted somewhere else, so you can't run PHPStorm against that VM. The only solution I could see was containers. I had heard about containerization during my internship, so the answer was to Dockerize the stack and run it locally so that I could use PHPStorm.
So it was all about ensuring your dev and prod setups are consistent and you can replicate the issue locally.
The production setup at the time was still on VMs, but I wanted to Dockerize so I could run PHPStorm locally with the application code. Replicating that VM on macOS would have been pretty hard: getting Apache 2, PHP, and all the extensions to work, when some are custom C extensions written in-house long ago that only work on Linux. It didn't make sense to run it on a Mac directly. It took two weeks, and then I had Dockerized everything. I didn't pick up any other project. And that's also one good thing about Zomato: the amount of freedom. I could stop all other work for two weeks in my first year and Dockerize it.
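As a rough idea of what that containerization can look like (a hypothetical sketch, not Zomato's actual image; the versions and the extension name are placeholders), the point is to build the Linux-only pieces, Apache, mod_php, and the in-house C extension, inside the image so the same stack runs locally on a Mac:

```dockerfile
# Hypothetical sketch: Apache + mod_php plus an in-house C extension,
# built inside the container so the stack runs the same way locally as on
# the EC2 VMs. "legacy_ext" and the versions are placeholders.
FROM ubuntu:18.04

RUN apt-get update && apt-get install -y \
    apache2 libapache2-mod-php php php-dev build-essential

# Build and enable the in-house C extension.
COPY ext/legacy_ext /usr/src/legacy_ext
RUN cd /usr/src/legacy_ext && phpize && ./configure && make install \
    && echo "extension=legacy_ext.so" > /etc/php/7.2/apache2/conf.d/99-legacy.ini

# Application code where Apache expects it (or bind-mount it for local dev).
COPY src/ /var/www/html/

EXPOSE 80
CMD ["apachectl", "-D", "FOREGROUND"]
```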
That put me on the map, and the CTO got to know about it. I didn't know that at the time; I only found out later.
After those first six months, I got more front-end projects, and I got tired of it. One day I was so frustrated that I just left my laptop and went home. I had to work on some CSS animation. At that time, there were not that many frameworks; Zomato was still using jQuery.
I had to get that animation working using CSS, and I got too frustrated and left for home. My roommate brought the laptop back that day. But everyone on my team got the sense that I was too frustrated with the work. They also felt I had a knack for DevOps because of the kind of projects I enjoyed on the team, and most of them recommended that I move. But I felt the only people I knew were those seven on my web team, so I put off the decision for almost two months. That went on, and eventually, I asked my CTO. He remembered my Dockerization project, and then I moved. Down the line, it took two more years to settle into the SRE/DevOps role. It took another year and a half for our scaling challenges to really arrive. That's when we picked up the Dockerization of the production setup. We wanted to be able to revert to the old stack quickly if we needed to, and Dockerization seemed like a good fit. We did that in 2019.
That's when I entered SRE. Initially, I got to work on the CI and CD flows at Zomato. There were multiple inflection points across Zomato's lifetime when it scaled up rapidly. It was so rapid that if you got downtime this Sunday, you could guarantee that the same downtime would happen next Sunday, because you had touched something. Sunday dinner, you know. Most of the failure points are change-driven, and those are easy to target. You need to build that observability: what are all the changes happening across the architecture? Then you can quickly address the incident and solve it. But what's surprising, and what you can't predict, is systems that break with scale.
There can be interconnected pieces, and they can break haphazardly.
Yeah. Those are very hard to predict because there is no one specific thing. The unknown virtual limit could be connections, some system-level limit, or a file limit. These are all virtual limits. For physical limits, you can easily see CPU utilization and memory utilization, but virtual limits are hard to see: thread limits, some tuning factor somewhere you'd never know about, or deadlocks. And these only break when you reach a particular scale. During that time, we had to solve problems and develop systems for Zomato's scale within a week. Because if something broke this week, you had to deploy a connection pooling proxy across the system before next Sunday. And you can't deploy on Saturday; Saturday was still a peak day because it was a day of significant business. You need at least two days of testing to gain confidence that the solution will work, because you don't want to break the system even more. So you need two days of confidence, which means you need to ship it within those three or four days. A connection proxy requires code changes, which means only a few people can be involved; it can't be a team project. One or two people must do it within two or three days. And we solved many such complex problems around that time. For example, we had a few outages where a single key was causing an outage for the entire Memcache, and for that, we had to extend a tool called memsniff. I had learned k-means clustering in college: it clusters words whose distance is small. The same idea applies to Memcache keys.
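As a very rough stand-in for that idea (a simplified sketch in Go, not the memsniff extension or the clustering itself), even just normalizing observed keys into patterns and counting accesses surfaces the hottest key families:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"sort"
)

// Simplified stand-in for hot-key detection: collapse numeric IDs in each
// observed cache key to "*" so that restaurant_menu_123 and
// restaurant_menu_456 count toward the same pattern, then rank patterns by
// access count.
var idRe = regexp.MustCompile(`\d+`)

func main() {
	counts := map[string]int{}

	// Assume one cache key per line on stdin, e.g. piped from a sniffer.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		pattern := idRe.ReplaceAllString(sc.Text(), "*")
		counts[pattern]++
	}

	// Rank patterns by hit count, descending.
	type kv struct {
		pattern string
		hits    int
	}
	var ranked []kv
	for p, n := range counts {
		ranked = append(ranked, kv{p, n})
	}
	sort.Slice(ranked, func(i, j int) bool { return ranked[i].hits > ranked[j].hits })

	// Print the top candidates for hot key patterns.
	for i, r := range ranked {
		if i >= 10 {
			break
		}
		fmt.Printf("%8d  %s\n", r.hits, r.pattern)
	}
}
```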
How do you identify a specific key pattern that is much more heavily used than the rest? You cluster the access patterns. We did that clustering and found out what the hot keys were and which key patterns consumed the most traffic, because there are scalability limits within Memcache as well, and we wanted to use as few shards as possible. So, the goal was to address a lot of these complex problems. We even developed a MySQL circuit breaker before we could develop a service mesh proxy with circuit breaking, because at that time, the main breaking point was always MySQL, and there was always one bad query.
Production systems are very complex most of the time, but most of the features are optional. There are a lot of features because too many experiments are going on. The core product is always about ordering food, but that's different from what everyone works on. So, almost always, that one bad query comes from something that isn't critical, yet it can take down the core flow. After two or three incidents, we realized we wanted a circuit breaker to identify and terminate the bad query automatically. Within three to six months, we developed connection pooling for every datastore: MySQL, Memcache, and Redis. We created these circuit breakers to identify failure points. We did a lot of that stuff in 2019.
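The shape of such a MySQL circuit breaker can be sketched roughly like this (a minimal illustration of the idea, not Zomato's implementation; the DSN, threshold, and driver choice are assumptions): watch the server's process list and kill statements that run past a threshold.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // assumed driver; any MySQL driver works
)

// maxQuerySeconds is the threshold beyond which a SELECT is treated as a
// "bad query" and killed. A real circuit breaker would be far more nuanced;
// this sketch only shows the core watchdog loop.
const maxQuerySeconds = 5

func killSlowQueries(db *sql.DB) error {
	rows, err := db.Query(
		`SELECT id, time, info FROM information_schema.processlist
		 WHERE command = 'Query' AND time > ? AND info LIKE 'SELECT%'`,
		maxQuerySeconds)
	if err != nil {
		return err
	}
	defer rows.Close()

	for rows.Next() {
		var id, secs int64
		var query sql.NullString
		if err := rows.Scan(&id, &secs, &query); err != nil {
			return err
		}
		log.Printf("killing query %d (%ds): %s", id, secs, query.String)
		// KILL QUERY terminates the statement but keeps the connection alive.
		if _, err := db.Exec(fmt.Sprintf("KILL QUERY %d", id)); err != nil {
			log.Printf("kill failed: %v", err)
		}
	}
	return rows.Err()
}

func main() {
	// Placeholder DSN; a watchdog user needs PROCESS and KILL privileges.
	db, err := sql.Open("mysql", "watchdog:secret@tcp(127.0.0.1:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	for range time.Tick(2 * time.Second) {
		if err := killSlowQueries(db); err != nil {
			log.Printf("sweep failed: %v", err)
		}
	}
}
```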
Beyond that, most of the difficulties came from the sheer number of microservices rather than the vertical scale of a single microservice or a single system. Post 2019, we split into microservices more aggressively, because in the monolith no one was really responsible for any single unit. Once you hit a certain tipping point, the monolith itself starts becoming a bottleneck for things like knowledge sharing and code proximity, and tightly coupled parts of the code start causing problems. At one point, a standard structure makes sense so everyone is on the same page, but that breaks down quickly as teams grow. In a monolith, each of these problems is much, much harder to solve. So we decided to move to microservices. We started out in Java, but we needed something else: within six months, we learned that there is too much boilerplate and a steep learning curve before you can be productive. One core principle is that if you hire someone, they must be effective within a week or two; two weeks is already too much, so we aim for within a week, and we push code to production as quickly as possible. Moving to Java didn't make sense then because you can't get productive that quickly with Spring. And most of the time you are developing a new microservice, so if a new product needs to be launched, it's pretty hard. Even if you mess things up, it's much harder to recover in Java than in Golang. We built five microservices in Java, then scrapped them and started with Golang. That was also the first time I understood this principle of divergence and convergence in design standards.
You diverge and explore different technologies, but as soon as you finalize one thing, you converge on that technology. You want to avoid ten teams working in ten different languages across the organization; you can never leverage the engineering team if they're all working in entirely different languages. And this translates to reliability as well, because of that convergence. If you see a type of incident in one microservice, there is a high probability, more than 80%, that the same kind of incident will happen in a different microservice within two weeks. You experiment with one or two microservices, but once you converge, that pattern gets deployed across hundreds of microservices. So, if it happens in one system now, it will happen in another soon. That's why it's essential to solve the incident not just within that service but across the organization, as quickly as possible, especially for Zomato.
Did you set up a centralized platform team to ensure standardization is followed across the organization?
In Zomato, till 2022, the SRE team was the combination of platform engineering, reliability, cost optimization, developer productivity, everything. When an incident happens, you do a quick sync-up, you help the team, and whatever solution they identify or you help them with, you think about how to make it a long-term solution. You have to think of the short-term fix and also of what solution you can deploy across the organization. That means you can immediately tell them to apply some fix, but you also need to understand whether that fix is something you can convert into a process. There are ten ways to improve things; nine of them are fine to leave as learned knowledge within that microservice's team. But you must also consider converting that fix from human knowledge into a system or a process. Some fixes can stay human knowledge: you can say the team learned it and will catch it from the next review onwards. But for some solutions, you need to build a system out of them. That could be a centralized library, core libraries, things like that. So, the platform team plays a vital role by involving the other teams. Again, you don't want to block them while the system for that fix gets developed, because there is a high chance things will break again tomorrow night. Let them fix it, and then build on that approach to protect the other microservices.
So that's the approach that worked. But slowly, I realized it's not a scalable approach anymore. It's hard when you want everyone on your team to replicate this process. If you want a fresher to be able to repeat it, it's pretty hard, because the fresher has to sync up with other teams to gather all the previous knowledge and context. The more you converge, the more standards develop, and the context required to make even a tiny change increases, even if you approach everything from first principles. So, that centralized system was not scalable anymore; we saw that the SRE team was getting bottlenecked. The SRE team was small, around 14 people, covering these three vectors: cost, reliability, and productivity.
You need to get buy-in from every team, you need to do follow-ups, and you need to get into their timeline, because they have already planned that timeline, say a one-week timeline. You must break into that timeline and get your task inside it. As our team was becoming a bottleneck, I realized that, from a knowledge perspective and in terms of what teams can learn, it's much better to employ a decentralized approach. So I followed an approach where the SRE team acts in an advisory role, but the teams themselves develop the systems that apply across the organization. If the "Menu team" causes an incident, they should fix their system immediately, but they should also think about an organization-wide standard.
How can they help all microservices avoid that incident? It's like they took a hammer, and they hit themselves. Now, they have learned that it's painful. They need to teach everyone not to beat themselves with a hammer.
I left Zomato at the start of 2023, so I don't have quantitative stats on this process yet; I was only able to apply it to a few initiatives. But I would go with a more decentralized system. The starting steps could be different: you could hold sessions, you could do RFCs.
Even when we had a centralized approach, we still used to do RFC discussions with the teams. For any standard we were adopting, we would write an RFC. And because we were using Golang, we did proposals the way Golang does: enhancement proposals. So we wrote GEPs for Golang enhancements and IEPs for infrastructure enhancements, and started the enhancement proposal discussions. We made sure you don't have to be too formal when writing a proposal; you can be as ad hoc and as notes-heavy as you like. We designed the process to keep the cost of writing a proposal low. Most of the time, if someone really wants to contribute, they will take the time to understand your proposal. It doesn't matter that much how structured your proposal is, because they can quickly resolve questions with a sync-up, by coming to your desk and getting the full picture.
This way, the cost of creating a new proposal stays very low. The cost of understanding a proposal is higher, but that's fine; that cost is always high anyway, and inside a company you can use leveraged approaches like in-person discussions, which you can't do in open source. You get a better understanding from talking to the author and hearing their perspective than from reading a bare doc, where it's hard to get any of that. So that worked for us.
I want to shift gears slightly. Regarding the current observability landscape, what are a few things that you are excited about, and what are a few things you are not enthusiastic about?
One problem I see is with tracing. If you think about an organization that is already using metrics and an APM tool, the use cases that tracing would solve at that point become pretty narrow. If you go to any team and sit with them, they only want to look at their part of the architecture. They already know which service they are calling, and they already know how to escalate. If the cost of tracing were low, then whatever number of use cases it solves would be great. But for the unique use cases that tracing does solve, the cost is too high to afford, even with a 1% sampling rate. So, there needs to be a lot more improvement there.
We explored tracing, and we deployed it. Surprisingly, none of the teams found a use case, and it just sat there. The teams couldn't find good use cases they actually wanted, and they couldn't see what problems it was solving. So, it stayed there without adoption for a year and a half, and then we deprecated the stack, because most teams can solve their problems with just APM and metrics.
Another is that the monitoring systems, and the solutions implemented for metastable states, like circuit breaking and rate limiting, still need to be developed for databases. If you think about MySQL, there is no solution that stops a bad query automatically. If a bad query gets into your system, you're done; the whole microservice goes down. And it's almost always impossible to predict which endpoint a bad query will come from.
Beyond that, the cost of adding metrics is high. Suppose you want to quickly add metrics, not infrastructure metrics but business metrics. The cost of that kind of experimentation is still high, even though you want to make changes and test things out quickly.
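For concreteness, adding a business metric with a typical metrics library, for example Prometheus's Go client, looks roughly like the sketch below (the metric name and labels are invented); the instrumentation line itself is small, and the experimentation cost usually comes from everything around it: deploying the change, reviewing label cardinality, and building dashboards.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A hypothetical business metric, not infrastructure metric: orders placed,
// broken down by city and payment method. Name and labels are made up.
var ordersPlaced = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "orders_placed_total",
		Help: "Orders placed, by city and payment method.",
	},
	[]string{"city", "payment_method"},
)

func placeOrder(w http.ResponseWriter, r *http.Request) {
	// ... business logic would go here ...

	// The instrumentation itself is one line once the counter exists.
	ordersPlaced.WithLabelValues(r.URL.Query().Get("city"), "upi").Inc()
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/order", placeOrder)
	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus
	http.ListenAndServe(":8080", nil)
}
```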
How do you recharge from all of this work with systems?
I take long walks, and I go to the gym. Sometimes, I swap the gym for a 5-to-8-kilometer run or a jog.
I go out watching movies, just hanging out with friends. That's it. I need very little external engagement to be content with the rest of my work life. Some people need a lot; some people need less. I need very little.
Where do you find all this information about all the things that are happening in observability? Do you follow specific blogs or something?
I mostly find it on Twitter. There is this interesting Slack group, the Rands Leadership Slack, RLS; there I find very, very good discussions and good ideas. Beyond that, for more formal knowledge, I use USENIX. Whenever the SRE conference happens, I start reading and noting the important papers. I am also inspired by Cindy Sridharan. She writes distributed systems summaries and makes them seem effortless. I like reading them a lot.
What are you doing these days? I know you're building something cool. So, if you want to talk to us about it, that would be awesome.
I'm currently trying to build a cost optimization tool. Nowadays, most companies don't focus on cost optimization and don't worry about the cost of what they're developing, but they do keep reliability and security in mind whenever they design a new system. You need to think about your core product and your architecture, and thinking about infrastructure, physical resources, and requests, and predicting that capacity, is also challenging. It also comes down to the design and implementation split: you can't bring low-level implementation details into your high-level architecture discussion. So, currently, it's costly to think about cost during the design phase. If you want to be frugal, it's too expensive, too much cognitive overhead. Visibility could be better, and so could feedback. The idea is to give customers and developers rapid feedback as they're making changes, so that they can build while thinking about cost, unit costs, and everything.
Then, when they design and deploy systems, they are not surprised by the bill at the end of the year. They shouldn't have to, once a year, stop everything, drop everything, and focus on cost. That's not how you should be doing it; it should be continuous cost optimization. So, the product is a suite of tools and sub-products focusing on continuous cost optimization, not one-time things like reservations. In one line: continuous cost optimization and removing compounding costs.
Thanks, Srinivas, for sharing your SRE story with us. Srinivas writes blog articles on Golang, SRE, and Observability here. You can also find him active on Twitter.