Behind the Scenes: Suman’s Journey Scaling Distributed Systems
Observability, Essential Tools, and Unforgettable War Room Moments
Suman, a Principal Engineer at Airbnb, is a leading expert in observability and infrastructure. With a passion for building and operating large-scale systems, he has played a key role in developing foundational tools like Zipkin, Astra (previously KalDB), and Yuvi, an alternative to OpenTSDB.
His work spans all three pillars of observability—distributed tracing, log search, and metrics—cementing his status as a pioneer in the field.
In a recent conversation, Suman shared insights from his early career and how he stays ahead in the observability landscape.
Disclaimer: These are Suman's personal opinions and do not reflect those of Airbnb.
Prathamesh: How did you get into SRE?
Suman: My journey into SRE began when I was a software developer. My first real exposure to SRE came in 2009 when I started working on Amazon EC2. Back then, neither SRE nor even the concept of operations was well-defined. When I joined Amazon, the EC2 team was very small; AWS had only about 200 people and a few products—EC2, EBS, and S3. There were just three data centers, and I was involved in network security, monitoring network traffic, and preventing malicious activities—tasks that were early forms of observability.
We built systems to monitor hosts and servers, collect telemetry data, and then act on that data. This involved deploying these systems across hundreds of thousands of machines at Amazon.
This experience taught me a great deal about building distributed systems and observability, long before those terms were commonly used. It was all about monitoring traffic, analyzing it, and enforcing necessary actions, which gave me my first significant exposure to SRE and automation.
After Amazon, I moved to Facebook, where I worked on a browser-based IDE, which was a novel concept at the time—this was before tools like Visual Studio Code existed. I also contributed to the development of Hack, a programming language. Later, at Twitter, I focused on container orchestration, working on projects like Mesos and Aurora. This was essentially DevOps before the term was coined. We had to build the infrastructure for orchestrating containers, a task that is now more commonly done using Kubernetes.
During my time at Twitter, I also delved deeper into observability and distributed logging. We initially deployed Elasticsearch, but it had many challenges, especially with scalability and reliability. This led me to lead the development of LogLens, a system that significantly stabilized our logging infrastructure and ran at Twitter until 2020.
I also became the tech lead for Zipkin, which introduced me to distributed tracing. My work with Zipkin pulled me into defining the OpenTracing spec, which later evolved into OpenTelemetry.
After Twitter, I moved to Pinterest, where I worked on VM orchestration tools like Teletraan, which was based on Amazon's Apollo. I also built PinTrace, a comprehensive end-to-end distributed tracing system. During this time, I contributed to the OpenTracing spec and was the first to implement it in production.
Additionally, I worked on improving Pinterest’s metrics infrastructure by developing Yuvi, a more performant and scalable distributed storage system for metrics, as OpenTSDB didn’t scale well for our needs.
At Slack, I built Slack Trace, an end-to-end distributed tracing system that incorporated lessons learned from my previous projects. I also managed Slack’s Kafka infrastructure and built the entire Kafka team there. Another significant project I worked on at Slack was Astra (previously KalDB), a distributed log search system.
Now, at Airbnb, I’m responsible for broader infrastructure projects, with a strong focus on observability—managing logs, metrics, and traces using in-house observability infrastructure. Throughout my career, I’ve always been deeply involved in every aspect of the systems I’ve built—from designing and developing to deploying and maintaining them. This hands-on experience has given me a deep understanding of the complexities of SRE work.
Prathamesh: You’ve had quite an extensive journey with building infrastructure and observability systems. How did your experience with tracing systems at Pinterest and Slack differ, especially considering the evolution of open standards like OpenTracing and OpenTelemetry? Did you rely on open-source components, or did you approach each new system from first principles?
Suman: That’s a great question. The tracing systems I built at Pinterest and Slack had some core similarities, but they also differed due to the maturity of the tools and standards available at the time.
At Pinterest, when I started working on distributed tracing, OpenTracing was still being defined, and we were one of the first companies to implement it in production. This meant we had to build a lot from scratch, focusing on solving the immediate challenges Pinterest faced.
By the time I got to Slack, OpenTracing was more mature, and OpenTelemetry was on the horizon. This allowed us to leverage existing open-source components more effectively. However, I still approached the problem from first principles. Every company has unique requirements and constraints, so I always start by understanding the specific problems we need to solve. At Slack, for example, the lessons I learned from Pinterest helped me design a more robust and scalable tracing system, but I still had to tailor it to Slack’s infrastructure and needs.
When building these systems, I rely on open-source components where they make sense, but I don’t shy away from building custom solutions if that’s what’s needed to solve the problem effectively. It’s a balance between using tried-and-true tools and innovating where necessary to meet the specific needs of the company.
At Airbnb, where I’m currently working, I continue to focus on observability, handling logs, metrics, and traces. The approach remains the same—understand the problem deeply, leverage existing tools when possible, and build custom solutions when needed.
There’s always a balance when deciding how to approach building these systems. When we started at Pinterest, and even at Twitter, we didn’t have much of a choice—we had to build Zipkin because there wasn’t anything else available. The Pinterest tracing system, which we called PinTrace, was based on Zipkin. As part of that, we contributed to the development of the OpenTracing spec. At that time, tracing didn’t even have a standardized span format, and OpenTracing was still just an interface for instrumentation.
When I moved to Slack, we continued to use the OpenTracing interface for our traces. However, OpenTelemetry was still in its infancy, and the span format was mostly inspired by an intersection of Jaeger, Zipkin, and Google's internal practices. Jaeger’s span format was a bit unconventional, while Zipkin’s was more straightforward but lacked certain fields. So, OpenTelemetry eventually integrated these different approaches into a more unified span format.
At Slack, we picked the OpenTracing interface for our tracing needs but developed a custom span event format. This was because, at that time (around 2017), OpenTelemetry was still very new, and its span format wasn’t as solid as it is today. Our custom format was very similar to what OpenTelemetry would later adopt, so in a way, we were ahead of our time. Back then, even vendors didn’t fully support tracing or standard formats, so we adapted an open-source approach and built something that suited Slack’s needs.
Prathamesh: But today, I think the data formats are more mature.
Suman: Exactly.
Nowadays, it makes a lot of sense to start with open-source formats because you get so much for free—instrumentation, debugging tools, and so on. It’s usually the logical starting point. However, there are cases where the open-source OpenTelemetry format might be overly complex for your specific needs. Sometimes, the tracing system you’re using may not even support all the features that the data format provides. In those situations, it might be a different story. But generally, starting with an open format and open-source system is the way to go for both instrumentation and backends.
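As a rough illustration of what "starting with the open format" looks like on the instrumentation side, here is a minimal sketch using the OpenTelemetry Python SDK. The service and span names are hypothetical, and the console exporter simply stands in for whatever backend you eventually choose.

```python
# A minimal OpenTelemetry tracing setup in Python (illustrative names).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up the SDK once at startup: spans are batched and, in this sketch,
# printed to stdout; a real deployment would swap in an OTLP exporter.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("booking-service")  # hypothetical service name

# Instrument a unit of work. Because the span follows the open format,
# any OTLP-compatible backend can ingest it without custom glue code.
with tracer.start_as_current_span("confirm_booking") as span:
    span.set_attribute("booking.nights", 3)
    span.set_attribute("booking.currency", "USD")
```

The appeal of starting here is that the instrumentation code does not change when the backend does; only the exporter configuration moves.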
Prathamesh: A lot of times these days, people talk about having a single storage or a unified system for logs, metrics, and traces—creating a unified view for everything together, right? But what are your thoughts on that? Is that a viable strategy, or is it something that needs specific tools for specific problems depending on the kind of problem you're trying to solve?
Suman: Yeah, those are interesting questions.
When it comes to unifying logs, metrics, and traces, most people just pick one database and claim it’s the best for all three, which I think is misleading.
Most of the time, these systems primarily support logs and maybe traces, but not all three comprehensively. There's no system out there that truly supports all three, despite what the marketing might suggest.
For example, while some systems can ingest metrics, querying them effectively is another story—there are a lot of nuances in querying metrics that these systems often don't handle well. The storage engine for metrics is fundamentally different from the one for logs. People might use columnar stores for traces, but even for traces, custom storage engines could potentially offer better performance.
So, to answer your question, I think when a system claims to support all three—logs, metrics, and traces—it’s usually more of a marketing claim.
In reality, logs and traces can be unified fairly easily, which is what we did at Slack. Unfortunately, OpenTelemetry, with its emphasis on three separate pillars of observability, tends to push in the opposite direction of that unification.
Metrics, on the other hand, are different: they are pre-aggregated and should be treated separately from logs, traces, and events, which are raw events.
So I think about the unification of metrics, logs, and traces along two dimensions. One dimension is telemetry emission, where you can unify metrics, logs, and traces. On the storage side, though, you need to choose the right storage engine based on your query patterns.
For metrics, that means something purpose-built—a storage engine meant for logs or traces won’t be as performant or as easy to use as one designed for metrics. Something like a Prometheus storage engine is what you want for metrics.
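To make those two dimensions concrete, here is a hedged sketch, again using the OpenTelemetry Python SDK with hypothetical names, of the emission side being unified while the storage side stays separate: the counter is pre-aggregated in-process into one time series per label set, whereas each span is exported as a raw event, and the two exporters could point at different storage engines.

```python
# Unified emission, separate storage (illustrative sketch, hypothetical names).
from opentelemetry import metrics, trace
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Metrics path: values are aggregated in-process and flushed periodically,
# so the backend only ever sees pre-aggregated time series.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter(),
                                       export_interval_millis=60_000)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
requests = metrics.get_meter("booking-service").create_counter(
    "http.server.requests", unit="1", description="Completed requests")

# Trace path: every span is a raw event, exported individually.
tracer_provider = TracerProvider()
tracer_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer("booking-service")

with tracer.start_as_current_span("confirm_booking"):
    requests.add(1, {"route": "/book", "status": "200"})
```

In practice the console exporters would be replaced by, say, a Prometheus-style engine for the counter and a columnar or log store for the spans, which is exactly the per-signal storage choice described above.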
Prathamesh: Shifting slightly from the technical side, how has your day-to-day work setup changed? Are you managing a team now? Do you find yourself in more meetings and doing less coding? How has your role evolved over the years?
Suman: When you're a junior engineer, coding is pretty much all you do, right? But now, as a Principal Engineer at Airbnb, I’m doing less coding than I’d like. A significant part of my role has shifted towards leadership—writing and reviewing design documents and other leadership responsibilities—and that takes up most of my time now. However, I plan to get back to more coding soon, especially on the Astra (previously KalDB) project.
Prathamesh: What are your thoughts on the Google SRE book? I've heard contrasting opinions from people. Some say it’s invaluable, while others argue that it’s tailored for Google’s scale and might not apply to smaller scales. Do you find the practices in the Google SRE book still relevant today?
Suman: I actually think the Google SRE book is more tailored to Google’s specific needs than for the broader industry.
That said, a lot of SRE practice over the last 15 to 20 years has been heavily influenced by Google's SRE book. It set a trend, and in many ways defined best practices, for organizations that want to adopt site reliability; they treat it as both an inspiration and a handbook.
When it was written, the landscape was quite different—people were running their own infrastructure, the cloud wasn't as dominant, and the problems Google addressed were large-scale issues. Some parts of the book, like how to observe a service using user metrics, are still relevant. But overall, the book assumes two things: that you have a dedicated SRE team, and that SREs are heavily involved in operations.
This isn't true for many companies today.
Most companies don’t have their own data centers or a dedicated SRE team for every service. They rely on cloud providers like AWS and use various vendor products. In this context, the Amazon operational model, where engineers are responsible for building, defining, and running their systems, is more practical and valuable for most organizations.
And Amazon actually shares a lot of these insights in The Amazon Builders' Library. If you want to understand operations better, I'd recommend following that over the Google SRE book. The Builders' Library articles are much more relevant for day-to-day operations.
Prathamesh: Now, with so many open-source tools and vendors out there, when someone is planning an observability strategy, they often get overwhelmed by the tools rather than focusing on the strategy itself.
What would you recommend as a plan of action or guiding principles for someone looking to improve reliability or implement an observability strategy in their organization?
Suman: For a new company, you typically have two options: a vendor solution or open-source tools. For most, going with a vendor is the easier path, but the downside is the cost. Observability becomes a significant challenge at scale, regardless of whether you use a vendor or open-source tools. If you're working with a small-scale system, just picking a reliable off-the-shelf tool is often sufficient.
The biggest mistake I see people make is chasing fads. If I were to join a young company, I'd focus on just two things: log search and metrics. And I'd make sure to do both of them really well.
For a new company, I'd recommend starting with simple, open-source tools for observability—just focus on logs and metrics initially. Vendors can complicate things and make observability more challenging. Following their best practices might sometimes lead to less reliable systems. Keep it straightforward and prioritize key metrics like utilization, saturation, errors, and duration.
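As one hedged example of keeping it straightforward, here is a minimal Python sketch using the prometheus_client library that covers the errors and duration side of that list; the metric and route names are hypothetical, and utilization and saturation would typically come from host or runtime exporters rather than application code.

```python
# Minimal request-level metrics: errors and duration (hypothetical names).
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Completed requests",
                   ["route", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request duration",
                    ["route"])

def handle_booking():
    # Time the request and record its outcome under the same labels
    # you would later alert on (error rate, latency percentiles).
    with LATENCY.labels(route="/book").time():
        time.sleep(random.uniform(0.01, 0.1))          # simulated work
        status = "500" if random.random() < 0.01 else "200"
    REQUESTS.labels(route="/book", status=status).inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for scraping
    while True:
        handle_booking()
```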
Start from user problems rather than building observability for its own sake. The importance of reliability can vary by company.
For example, at Slack, uptime and latency are critical because real-time messaging is key, and any downtime directly impacts the user experience. On the other hand, Airbnb's traffic patterns demand a different point of view depending on the use case. Understanding the impact of service failures based on the nature of the product and use case is essential.
Prathamesh: How do you define reliability? It seems to vary based on the specific problem you're addressing.
Suman: My approach to defining reliability is based on the user context. For instance, at a company like Goldman Sachs, where much of the work is batch processing, reliability isn't as critical because a failed job can simply be rerun without significant impact.
Prathamesh: So, you're saying it's about the job that needs to be done?
Suman: Exactly. For high-engagement platforms like Amazon, Facebook, or Twitter, reliability and low latency are crucial because they affect user engagement. These platforms need to ensure quick and reliable interactions to keep users engaged.
Prathamesh: And what about for a company like Slack?
Suman: For Slack, reliability is also critical, but for different reasons. Since it's a real-time messaging service, any delay or downtime can disrupt communication.
Prathamesh: How does this apply to other companies, like Airbnb?
Suman: At Airbnb, reliability is important, but it varies with the situation. For example, if the booking system is down, it impacts users trying to book. We need much higher reliability on the messaging path, where a user might be messaging a host to check in to their Airbnb in the middle of a rainy night. While the precise reliability needs vary, reliability still matters for ensuring a good user experience. Overall, my key takeaway is to think about reliability in terms of the customer’s needs and context.
Prathamesh: Are there any trends in observability that you're particularly excited about or ones you find less appealing?
Suman: I'm excited about the standardization efforts, like OpenTelemetry. Despite some complexities and performance challenges, it's a positive development for the community. The trend towards cloud-native solutions in observability storage is also promising, though the term "cloud-native" can be quite broad and variable. I believe unifying logs, traces, and events will become more prevalent, and there might be interesting developments in merging these with profiling data.
Overall, I think the infrastructure space is mature, but there's still room to significantly improve observability systems. I'm particularly interested in how observability is driving innovation in database technologies, especially in analytical databases. This is an area where I see a lot of exciting advancements on the horizon.
Prathamesh: One question I always ask is about becoming a good SRE. What are some important lessons you've learned over the years about being a valuable member of an SRE team?
Suman: I have a slightly controversial take on what makes a good SRE.
In the past, the focus was on building reliable systems, particularly stateless services. However, with tools like Kubernetes, Envoy, and gRPC, many of these problems are now well-handled.
Today, I believe a great SRE can have a significant impact by focusing on storage systems, which still need a lot of attention and application of SRE principles. The skills of debugging issues, understanding reliability, and tying problems back to the customer are crucial in this area.
Prathamesh: That's an interesting perspective. Are there any memorable incidents in your career you’d like to share?
Suman: There are several memorable incidents.
One that stands out happened when I was at Slack. During my first week on call, we faced a major outage due to our Kafka cluster going down. The version of Kafka we were using was outdated and unrecoverable. The outage lasted almost a day, making headlines on TechCrunch and Hacker News.
What made this incident particularly memorable was the critical decision we had to make. We had two options: fix the existing Kafka cluster, which could prolong the outage, or deploy a new Kafka cluster. We ended up involving the CTO in the decision-making process. We decided to build and deploy a new Kafka cluster in under four hours, cutting over major use cases while continuing the migration throughout the night. By the end of the three-day incident, we had successfully upgraded Kafka and restored service.
This incident was also very visible in terms of its impact. Not only did it affect users, but it also impacted the stock price. Slack had a policy where downtime was compensated with credits—if the service was down for one second, they would provide credits worth ten or a hundred seconds. Since this incident lasted a day, the credits added up significantly, affecting revenue and, consequently, the stock price. This was one of the few times I’ve seen an incident directly influence stock price, making it a particularly memorable and impactful experience for me.
And that wraps up our enlightening conversation with Suman. His enthusiasm for observability and infrastructure is evident, and his extensive experience in building and managing large-scale systems is truly impressive.
From his early coding days to his current role at Airbnb, Suman has accumulated a wealth of knowledge and insights. His balanced approach to work and dedication to the field offer valuable perspectives for anyone navigating the complexities of observability and SRE.
We'd love to hear from you!
Share your SRE experiences, and thoughts on reliability, observability, or monitoring. Know someone passionate about these topics? Suggest them for an interview. Let's connect on the SRE Discord community!
Thanks a lot, Suman, for sharing your journey with us. If you’re passionate about observability and infrastructure, connect with Suman on LinkedIn.