Karan from Zerodha on Open-Source Tools and Observability
Open-source tools, observability, and sheer persistence.
Karan, a software developer specializing in infrastructure/ops and observability at Zerodha, talks about his SRE journey, one shaped by years of hands-on experience and a practical mindset.
With a background in self-hosted tools and open-source projects, he’s well-acquainted with the challenges of managing complex systems. His journey has been a constant process of learning, adapting, and focusing on long-term stability.
In our conversation, Karan discusses open-source tools, how patience helped him in the SRE space, and how mastering the basics can make all the difference. His approach could offer just the perspective you need!
Prathamesh:
Would love to start with your introduction—how you got started, what drew you to SRE/DevOps-related work, and how your journey has been so far.
Karan:
I started as a backend engineer, writing Python and a bit of JavaScript—full stack. Over time, when I had to deploy my applications and take them live in production, I realized the need for monitoring. So, I started setting up small utilities for my production apps.
At that point, my organization didn’t have a comprehensive monitoring suite or observability infrastructure. We were relying on basic tools, like what AWS offered with CloudWatch, but there wasn’t anything like Prometheus or the ELK stack. I became interested in setting up Prometheus and Grafana, and I installed Node Exporter on a few servers to see how monitoring worked.
I shared this interest with the CTO, who was supportive. My role gradually shifted to focus more on DevOps—about 80% of my time—while I still spent 20% on backend work. That’s when I dove into setting up Prometheus, learning about other monitoring systems, and provisioning infrastructure using infrastructure as code.
Prathamesh:
What year was this, just to give an idea of the timeline?
Karan:
Early 2019.
Prathamesh:
So, around five years?
Karan:
Yeah, it’s been a full cycle for me. Now, I’m leaning more toward backend work because I’ve spent a lot of time dealing with infrastructure. We did a full Kubernetes migration, but eventually realized Kubernetes wasn’t the best fit for us. So, we moved to HashiCorp Nomad, and it’s been working great in production.
Along the way, I’ve gained experience with Consul, Vault, Nomad, and the whole HashiCorp ecosystem—things like Terraform and Packer.
On the monitoring side, we started with Prometheus for time-series data, then migrated to VictoriaMetrics because we had multiple Prometheus instances writing to a remote database. It’s been much more efficient for us.
Logs, however, have been a consistent pain point. We began with the ELK stack and experimented with various tools, like Loki, before eventually building our own logging infrastructure using ClickHouse. We use Vector to transform the logs and store them in ClickHouse.
I’ve really enjoyed working through these phases—setting up monitoring, provisioning infrastructure, and diving into containerization with Kubernetes and Nomad.
Yeah, I’ve worked with most of the tools and aspects in this space.
Prathamesh:
Nice! I also know you’re pretty open about sharing your work.
I’ve seen your articles and talks on Nomad and ClickHouse—really great stuff. But, when you started in 2019, you were the only one handling all of this at your organization, right?
Do you have a full-fledged DevOps team now, or is it still a small group managing everything?
Just for context for our audience, I think you’re running one of the biggest workloads in the country if I’m not mistaken.
Karan:
We use AWS for our entire setup. When I first started, one other person was handling AWS, but we were mostly provisioning instances through the UI. That was pretty much our setup back then.
Our first step into infrastructure as code was using Packer to build AMIs. We set up common utilities on the servers, like tweaking sysctl parameters and configuring default HAProxy or nginx servers. Over time, we realized we needed to automate most of what we were doing manually in the UI, and that’s how we started evolving.
We didn’t jump straight into Terraform—this wasn’t an overnight change. We had plenty of resources set up in the UI, and we gradually began migrating them into Terraform. Our first migration was Route 53 records. We created a Terraform module for DNS and made it a rule that no DNS records should be created in the UI anymore; everything had to go through the Terraform pipeline.
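For readers who haven't made that jump, here's a minimal sketch of what a Terraform-managed Route 53 record can look like. The zone and record names are illustrative, not Zerodha's actual module:

```hcl
# Illustrative only: a DNS record declared in code instead of the AWS console.
resource "aws_route53_zone" "internal" {
  name = "example.internal"
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.internal.zone_id
  name    = "api.example.internal"
  type    = "A"
  ttl     = 300
  records = ["10.0.1.25"]
}
```

Once records live in code like this, every change goes through review and the pipeline rather than the console, which is exactly the rule described above.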
We did this piece by piece. Our organization uses multiple AWS accounts: one for front-facing services and another for back-office operations that shouldn’t be exposed to the Internet.
We also adopted a philosophy of reducing dependency on managed services. For us, AWS is mostly S3, EC2, ELBs, and other foundational building blocks. We do use Lambdas, but only for specific workloads. We avoid services like RDS and DynamoDB unless absolutely necessary.
Prathamesh:
So, do you self-host alternatives for those?
Karan:
Yes, we run self-hosted instances and alternatives for those services. We also manage a variety of databases. But it’s not like I’m handling all of this on my own. Developers take ownership of their own projects. DevOps focuses on initial provisioning, monitoring, and setting up the basics.
If there’s an issue with something like a Postgres server used by the back-office team, they’re the ones fixing it—whether that’s checking slow queries or optimizing database indices.
It would have been impossible for one person to manage everything. Back then, our scale was about one-tenth of what it is now. The blast radius of something going wrong was still significant, but far less than it is now. With the increased number of users and higher request volumes, we have to be much more cautious with changes to ensure stability.
Unlike many other companies, we can’t do continuous deployment at all. We can’t deploy during trading hours because it could disrupt users, and with money at stake, we have to be extra careful.
Prathamesh:
So there are different kinds of challenges in this setup.
Karan:
Yes, we schedule deployments outside of trading hours, but even then, it’s rare unless we’re gradually rolling out an entirely new infrastructure. For 99% of cases, we mandate A/B deployments or phased rollouts.
So, initially, 5% of users get the new feature, then 10%, and so on.
Features that would’ve gone live immediately in the past now take multiple weeks to reach 100% of users. That’s just how our business has evolved.
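As a rough sketch of what such a phased rollout can look like at the load-balancer layer (Karan mentions HAProxy elsewhere in the conversation), weighted backends split traffic between the current and the new build. The addresses and the 95/5 split here are hypothetical:

```
# Hypothetical HAProxy backend: ~5% of traffic goes to the new build.
backend app
    server app-current 10.0.1.10:8000 check weight 95
    server app-canary  10.0.1.20:8000 check weight 5
```

Bumping the canary's weight over successive deploys is one way to walk from 5% to 100%.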
Right now, our team is small—just two people—but we’ve just hired one more. By December, we’ll be a team of three.
Prathamesh:
Yeah, it’s fascinating to see how it’s evolved. One point I liked was that it wasn’t like this from day one. You ran into issues, automated parts of the workflow, and gradually made things more consistent.
But yeah, I’ve seen your work in open source and the community as well. I’ve always thought of you as a tinkerer because you run these small projects across different domains—JavaScript, Go, DevOps, and more.
I’m guessing you enjoy tinkering with things, playing around, seeing how things go, and building your own side projects. I’d love to hear your thoughts on that. How do you approach these projects? What’s your thought process?
Karan:
I’ve always enjoyed open-sourcing things, but a lot of the credit goes to my organization, especially Kailash. He encourages and promotes open source in our work. For example, if we run into a problem, he’ll suggest abstracting it into a module or library and then open-sourcing it.
What’s great is that we can open-source things under our personal accounts; it doesn’t have to be tied to Zerodha’s GitHub. We just add a badge in the README to indicate it’s used at Zerodha. The philosophy is pretty simple: since we rely so much on open source at work, why not give back, even if it’s in small ways?
A lot of my Nomad projects are things we use in production. For example, in the Kubernetes ecosystem, there’s an external DNS tool that maps service records to DNS providers. But there wasn’t anything like that for Nomad, and we needed it for production. So, I built it for AWS as a provider and open-sourced it, hoping it would help others using Nomad.
Kailash also writes a lot of libraries himself. One example is sql-jobber (now dungbeetle), which we use in Console, our back-office platform.
For instance, if a user generates a report for the last 365 days, we don’t want them waiting on the front end until the report is done. This library handles asynchronous query mechanisms and is also open source. He’s always pushing us to open-source our work because, in the end, it helps people and contributes to the community.
Prathamesh:
That's great.
Karan:
We don’t have a fixed allocation like some organizations do. Some companies set aside, say, one day a week for open-source work, but we don’t have anything like that. You can just work on it whenever you see the need; it’s part of the open-source culture here.
Prathamesh:
That’s amazing. How is your work setup? Do you use certain tools or command-line tools?
Karan:
I use a Linux ThinkPad. We all use Linux machines, though one or two people use Macs for building iOS apps. My team uses ThinkPad X1s. Personally, I use Pop!_OS, but there are no restrictions; anyone can use any Linux distro.
I use standard tools like Visual Studio Code and Firefox, and on the command line I use jq for JSON querying. When I worked with Kubernetes, I used a tool—I forget the name of it.
Prathamesh:
Was it K9s?
Karan:
Yes, I used that. It was pretty nice. It’s been a while since I’ve used anything related to Kubernetes.
Prathamesh:
That’s one point I wanted to touch on. You mentioned that you realized Kubernetes wasn’t the right fit for your organization. Could you shed some light on that?
These days, Kubernetes is pretty much the go-to for running workloads, so I’d love to hear your thought process on why it wasn’t the best option for you, despite its popularity. What led you to choose an alternative?
Karan:
So the context is that most of our applications were deployed on EC2 instances, and there was no standard EC2 instance provisioning with Terraform. We used GitLab CI/CD pipelines to deploy the binaries, for example, to S3, and then the app server would pull from there and restart.
For the longest time, we didn’t migrate to Kubernetes at all. We were only migrating internal tools, utilities, and low-risk applications to Kubernetes. But the thing is, the developer ecosystem at my organization wasn’t very comfortable with Kubernetes. Basic deployments weren’t a problem because we had templates set up.
But when something went wrong, trying to debug the issue became a challenge. If there’s a latency spike, how do you figure out whether it’s at the network proxy layer or something else that’s slow?
Kubernetes has so many layers of abstraction that it’s hard for developers to fully understand. Even simple things like setting CPU limits, memory limits, and resource quotas were major challenges.
Prathamesh:
That’s the biggest question—how do you even set memory limits?
Karan:
Exactly. These things were easier to manage with our old stack, which just used standard EC2 instances. It was much more straightforward for developers to understand. There’s this modern trend where you’re not supposed to SSH into production servers, but when you have thousands of microservices running, it makes sense. However, we didn’t have a microservice architecture.
We had a mix of monoliths and services. Microservices, in the sense of each function being its own service, wasn’t our case. It was common practice for developers to log in and troubleshoot when things broke. But with Kubernetes, things got more complicated: pods get created and destroyed, and logs are lost.
While we tried to solve this with central logging pipelines, running simple tools like Netcat or Telnet became problematic. Developers weren't comfortable with the many layers of abstraction in Kubernetes. Our DevOps team created small reusable templates to deploy applications, but once we gave those to developers, they didn’t care about what was running behind the scenes—they just wanted to get their application up and running as soon as possible.
So we stepped back and realized: everyone’s running Kubernetes, but if a major catastrophic failure happens on EKS, which is what we were using, what’s our disaster recovery scenario?
We weren’t comfortable with the idea that we might need support just to recover the control plane. Sure, we could debug basic issues, but if something happens on a trading day, and we’re just waiting for support, that’s not a good position to be in. We need to understand how our stack is running, so we can confidently debug it ourselves if needed. So, we started searching for alternatives.
What I was proposing was a simpler orchestration platform, something not as complex as Kubernetes. We built a system with Ansible scripts for provisioning, using Terraform to spin up autoscaling groups every morning and destroy them after trading hours.
Once the instances were provisioned, they would pull templates and configs. We used a tool called Consul Template to watch for changes to key-value pairs or config changes. When a change was detected, we could deploy it centrally to all our servers, and Consul Template would reload our application or HAProxy.
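A minimal consul-template configuration in that spirit might look like the following; the paths and reload command are illustrative, not Zerodha's actual setup:

```hcl
# Illustrative consul-template config: re-render haproxy.cfg whenever the
# watched Consul keys or service entries change, then reload HAProxy.
template {
  source      = "/etc/haproxy/haproxy.cfg.ctmpl"
  destination = "/etc/haproxy/haproxy.cfg"
  command     = "systemctl reload haproxy"
}
```

The `.ctmpl` source pulls live values with functions like `{{ key "config/haproxy/maxconn" }}` or `{{ range service "web" }}`, so a single change in Consul propagates to every server running the agent.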
Now, while this was an in-house system and worked to some extent, it had its bugs and issues. We wanted to replicate this setup in a more formal way, where we didn’t have to keep writing custom tooling. We started looking into Nomad, but we didn’t take it seriously at first.
Nomad was missing a lot of features, like custom CNI support, and adoption was minimal—mostly hobbyists playing around with it. But in late 2021, we saw a blog post from Cloudflare about their use of Nomad for part of their architecture, and that gave us some confidence that it was worth evaluating.
The first two to three months were spent understanding how Nomad and Consul work together. We began migrating simple applications—like stateless Go services. Even if something went wrong, we could cut the application off at the ELB level, stop routing traffic to Nomad, and switch back to Kubernetes. This way, we gradually gained confidence.
Now we use Nomad similarly to how we run things for Kite. We create EC2 instance groups, which act like Kubernetes namespaces; for example, we create 10 EC2 instances for one namespace. And just as Kubernetes uses constraints for scheduling—target labels, pod labels, node labels—we do the same in Nomad.
This helps prevent the "noisy neighbor" problem. We avoid running multiple different applications on the same node, instead managing this directly at the EC2 layer. We also manage EC2 autoscaling through autoscale groups. So essentially, we’re orchestrating EC2 instances with Nomad.
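A stripped-down Nomad job in that style might look like this sketch, assuming clients are tagged with a `meta` stanza; the job, datacenter, and `instance_group` names are hypothetical:

```hcl
# Illustrative Nomad job: pin allocations to one EC2 instance group
# (exposed via client node metadata) to avoid noisy neighbors.
job "reports" {
  datacenters = ["aws-apsouth1"]

  group "web" {
    count = 3

    constraint {
      attribute = "${meta.instance_group}"
      value     = "back-office"
    }

    task "server" {
      driver = "docker"

      config {
        image = "registry.example.internal/reports:1.4.2"
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB
      }
    }
  }
}
```

The `constraint` block plays the same role as a Kubernetes node selector: allocations only land on clients whose metadata matches.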
At this point, I would recommend Nomad to anyone migrating to containers or moving up from Docker Compose.
It’s simple, and you don’t have to dive into the complexities of Kubernetes right away. For teams on AWS, ECS or Fargate are also great alternatives—they're simpler to understand. Kubernetes has an amazing ecosystem, and it’s become the de facto standard, but for small teams, it’s worth considering alternatives first. If those don’t work, then Kubernetes can always be the fallback option.
Prathamesh:
That's great advice. How do you define reliability, specifically software or system reliability, based on your experience at Zerodha?
Karan:
Reliability goes beyond a simple health check, like checking if Redis or the DB is down. That kind of monitoring can help set alerts, but true reliability is about ensuring your application is performing as it should.
One way to achieve that is by implementing DB-level background checks and anomaly reporting. If there's a significant anomaly spike, that can point to something going wrong. In our setup, we use custom metrics for this.
For example, we have a Kafka streaming producer application that writes to a Kafka queue. Instead of just monitoring the Kafka server, we write metrics within the application layer itself. This allows us to track error counts and other useful data.
This approach is far more effective than just relying on standard HTTP health checks or Kubernetes liveness probes, which just confirm the app is up but don’t give any insight into its actual reliability.
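A minimal sketch of that pattern, using Prometheus's Go client and a stand-in for the actual Kafka write (none of this is Zerodha's code):

```go
// Illustrative producer-side instrumentation: count successes and failures
// in the application itself, not just at the broker or health-check level.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	published = promauto.NewCounter(prometheus.CounterOpts{
		Name: "producer_messages_published_total",
		Help: "Messages successfully written to Kafka.",
	})
	publishErrors = promauto.NewCounter(prometheus.CounterOpts{
		Name: "producer_publish_errors_total",
		Help: "Failed writes, by the producer's own accounting.",
	})
)

// writeToKafka stands in for a real Kafka client call.
func writeToKafka(msg []byte) error { return nil }

func publish(msg []byte) {
	if err := writeToKafka(msg); err != nil {
		publishErrors.Inc()
		return
	}
	published.Inc()
}

func main() {
	publish([]byte("hello"))
	// Expose the counters for Prometheus (or vmagent) to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```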
Another thing people could do is set up alerts based on error logs. By monitoring spikes in error logs or checking the count of error logs, you can get early indications of potential issues.
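Building on the hypothetical counter above, an error-spike alert could be expressed as a Prometheus rule like this; the threshold and durations are placeholders to tune against real traffic:

```yaml
# Illustrative alerting rule: fire when the producer's own error counter
# grows unusually fast for a sustained period.
groups:
  - name: producer-alerts
    rules:
      - alert: ProducerErrorSpike
        expr: rate(producer_publish_errors_total[5m]) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka producer errors are rising faster than normal"
```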
Prathamesh:
You’ve mentioned working with self-hosted tools, which ties into the philosophy you follow. This means you must have a lot of experience with open-source tools—evaluating them, getting used to them, sometimes running into challenges, and finding ways to overcome those.
As an individual, or as a backend developer or DevOps engineer, how do you typically learn about these open-source tools? What’s your approach to understanding them, assessing their integrity, and troubleshooting issues when they arise? Any examples from your experience would be really helpful.
Karan:
So, we try to use tools with a strong community and ecosystem. For instance, we use Discourse for internal communication, which has a really solid community, and their upgrade pathways are stable. The same goes for GitLab; its upgrades have been smooth and stable. We avoid tools that aren’t backward-compatible or don’t handle breaking changes well. Of course, you can’t always know how a tool will evolve from the start.
For example, we initially adopted Rocket.Chat during the COVID era in 2020 because we needed a mature platform that supported threaded conversations, similar to Slack. It worked well for about a year and a half, but then issues started cropping up with each update; things would break in the messaging interface. We reported a lot of bugs on GitHub to alert others about these issues, but eventually we realized Rocket.Chat wasn’t working for us anymore, so we reevaluated our options.
We tried out alternatives like Element (based on the Matrix protocol), but that didn’t suit our needs either. Finally, we settled on Mattermost, and it’s been working great for us. It’s written in Go and performs well, even with a lot of users online at once. So, sometimes decisions can go wrong, and all you can do is self-correct, rather than letting things slide.
In addition to Mattermost, we also use GitLab and Sentry, and our entire monitoring stack is open source, with tools like VictoriaMetrics, vmagent for metrics collection, Alertmanager, and Grafana. These are industry-standard tools with solid ecosystems, so if anything breaks, we can troubleshoot effectively.
Recently, we started evaluating a new tool called Plane, which we’ve been testing over the last few months.
Plane is like Linear for task management. We saw the need for such a tool, but it's still in the alpha/beta stages, so we’re helping out by reporting issues to improve the product. We even had a call with the main co-developer, and he was excited to see how our feedback was shaping the product. Interestingly, it's an Indian project.
Prathamesh:
That's great! And I assume the risk here isn’t huge, since it’s a task management tool.
Karan:
Exactly. If our task management system is down, it’s not the end of the world. We’re okay with taking risks like that, as long as the consequences are manageable. For anything critical, we make sure to go with tools that have a stable, well-established community.
Prathamesh:
Yeah, that makes sense. So, is running all of these tools also efficient and economical compared to managed solutions? A lot of times, the build versus buy decision comes into play. What has been your experience with that? Is open-source cheaper or more efficient in some ways?
Karan:
In some specific scenarios, running your own systems could be more expensive, but overall, it’s been cheaper for us.
A lot of SaaS products charge you per user, per month. But if a tool can handle 500 users or 5000 users without much difference, paying more for only a slight increase in benefit doesn’t make sense. Running your own systems lets you scale without those added costs. But, of course, there’s the matter of developer time and effort. I haven’t done the exact unit economics calculation for that part.
For us, it’s more about owning the infrastructure than just cost-saving. That said, cost savings are definitely a nice side effect. Plus, we’re a regulated entity, so most SaaS tools that store data in non-Indian regions aren’t even an option for us.
Prathamesh:
That makes perfect sense. So the regulatory aspect also plays a big role in that decision-making.
Karan:
Most of these tools aren’t over-provisioned or anything, so at that level cost isn’t a huge concern. Our AWS bills are all bundled, including all of our applications. And honestly, it’s been much better for us.
Prathamesh:
Shifting gears a bit—how do you recharge from work? What do you do to get away and do something else?
Karan:
I used to play badminton, and I still play sometimes. Lately, I’ve gotten into road trips. I also like listening to music. And yeah, that’s pretty much it.
Prathamesh:
Good to know! What trends are you excited about in observability and monitoring? Given your experience, are there any trends that excite you or ones you’re not too thrilled about?
Karan:
I’m excited about incorporating GPT into observability tools. Imagine you’re looking at a Grafana dashboard with a GPT plugin embedded. It could tell you if something looks off, how to optimize it, or even suggest improvements.
For instance, when you’re learning about Prometheus and metric types like counters and histograms, it can be tricky to understand things like counter resets. But if you had a tool that let you type your question in plain English and generated the PromQL query for you—that would be amazing.
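As a tiny example of the kind of translation he means, assuming a standard Prometheus counter named `http_requests_total`:

```promql
# "How many requests per second has the API served, averaged over 5 minutes?"
rate(http_requests_total{job="api"}[5m])
```

`rate()` is also what absorbs the reset behavior mentioned above: when a restarted process resets a counter to zero, `rate()` detects the drop and compensates instead of reporting a huge negative spike.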
Another thing I’m excited about is having a bot running in your cluster (like Nomad or Kubernetes) that constantly monitors logs and events. It could build a decision tree based on the state changes happening and alert you if something’s off—like when something deviates from the normal pattern.
Prathamesh:
That’s a cool vision! It could definitely make troubleshooting a lot easier.
Karan:
GPT has been incredible, especially since the launch of GPT-3.5. I use it every day for one thing or another. It's perfect for writing bash scripts, for example—whenever you need to automate something and don’t want to spend too much active time on it, GPT is a great tool. I’m really excited about the future of GPT and LLM use cases in infrastructure.
Prathamesh:
That’s awesome! Is there anything you don’t like?
Karan:
Honestly, I have a problem with Kubernetes becoming the de facto standard everywhere. It’s gotten to the point where, if someone’s not using Kubernetes, people start questioning whether they’re doing it right. It’s almost like everyone has just normalized it.
From my personal experience interviewing people for DevOps roles, I’ve noticed that the further people drift from the fundamentals, the more they lose touch with the core concepts.
Simple tasks, like adjusting a sysctl parameter or basic sysadmin work, are often overlooked. People focus too much on tools and less on how things actually work. This isn't a trend, but more of a personal observation from my experience.
With LLMs and other tools, it’s easy to get lost in the specifics of a tool, but it’s important to understand the fundamentals first. That’s how you realize when an LLM is giving you the wrong answer.
Prathamesh:
So, it’s about understanding the core first, so you know when GPT or a tool might lead you astray.
Karan:
Exactly! When you know the fundamentals, you can spot the errors and look for better solutions.
I think the number one trait for someone starting in SRE is patience—persistence, really. Sometimes you’ll get stuck on a problem for days, and it can be frustrating. The key is being able to push through and keep at it, even when you’re not getting instant results or gratification. It’s all about staying persistent and not getting discouraged.
Prathamesh:
Yeah, exactly. That mindset really helps when things aren’t moving as quickly as you’d like.
Karan:
In the backend or frontend, you usually have a stack trace or something concrete to work with. Even if you encounter ghost bugs that show up occasionally, you can still trace them.
But in SRE, you're juggling between multiple systems, correlating logs, metrics, and events. So, the key is to be good at finding patterns and persistently working through the problem. Another important skill is knowing how to efficiently Google things to find what you're looking for.
It’s all about system knowledge and applying first principles when things go wrong. If there’s a firewall issue, for example, you need to check things like iptables or whether UFW is activated. You have to dig deep and follow these steps until you get it right.
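As a concrete sketch of that first-principles walk (host names and ports here are made up):

```sh
# Work down the stack until the failing layer shows itself.
sudo ufw status verbose                # is ufw active, and what does it allow?
sudo iptables -L -n -v --line-numbers  # which rules are actually loaded?
ss -tlnp | grep 5432                   # is anything listening on the port?
nc -vz db.internal 5432                # can this host reach it at all?
```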
Prathamesh:
Yeah, a lot of invisible work happens in SRE. It’s often not seen, but it’s crucial. It's mostly visible only when something goes wrong.
Karan:
Yeah, exactly.
By the way, I recently found a site called sadservers.com. They give you scenarios where a production server is broken, and you have to fix it. It’s a cool resource. I’ve been doing it for the past couple of days and plan to write a blog post about it soon.
Prathamesh:
That’s great! I’ll plug that in for our SRE folks.
That brings us to the end of our conversation with Karan. His wealth of experience in infrastructure, observability, and SRE stands out. With years of hands-on experience managing complex systems, Karan has a practical, down-to-earth approach that truly emphasizes long-term stability.
His journey highlights the value of being adaptable and continuously learning—traits that have helped him thrive in the world of observability. What sets him apart is his belief in mastering the basics and his patient, methodical approach to troubleshooting, always focused on building resilient systems.
We’d love to hear your thoughts!
What are your experiences with open-source tools, reliability, or troubleshooting in SRE? Got any tips for others navigating this field? Or perhaps you know someone with a similar passion? Let us know!
Thanks again, Karan, for sharing your journey. If you’d like to connect with him and learn more about his passion for open-source tools and contributions, you can find him on LinkedIn.