Salim’s Insights from 21+ Years of SRE at Google
The Evolution of SRE and Today’s Observability Challenges
Introduction:
Salim, a Site Reliability Engineer at Google for over two decades, has been at the forefront of managing and scaling complex systems. His deep experience spans from early challenges in storage and distributed systems to today’s advanced reliability practices.
In our conversation, Salim shares his experiences, the evolution of SRE practices, and how he navigates the complex world of observability today.
Prathamesh: How did your SRE journey start?
Salim:
I was a systems engineer working on our corporate infrastructure services like DNS and mail—primarily internal-facing. At the time, Google's external products were search, ads, a shopping service called Froogle, and a few other fairly self-contained things.
However, it became clear that the company needed software engineers with experience running critical infrastructure to take responsibility for these systems.
I distinctly remember that during a group meeting, the managers asked for volunteers to learn about and eventually run our production storage service. I stood up; I was the only one in the room who did. So, I got chosen for the job, and it was an extremely fortunate opportunity. It was a great path for me, though I didn’t fully know what I was getting into.
I had a hobby project I’m still involved with—a small distributed network that includes shared computing, node storage, and a few other services.
I thought I had some experience with distributed storage, but it turned out to be nothing like what Google was getting into.
Google had already evaluated and dismissed the idea of using a third-party distributed storage system because the available systems didn’t meet our reliability requirements. So Google's engineers built their own. This was where I came in. They said, “We need someone to run this, carry the pager, figure out how to allocate resources, and automate turning up new instances.” Remember, this was 20 years ago, and I thought, “Alright, Python? I know Perl, I can do this.”
However, the system itself didn’t have any sort of API or a real control plane, and this is where many opportunities emerged. At the time, I didn’t have the vocabulary to understand that we were building a control plane or that we needed a management console, but those were the things that took shape, along with various attempts to automate the system into more reliable states.
As various nodes within the storage cluster failed—whether it was a disk-level failure or an enclosure failure—we needed software to report that failure. Then, our management software, the stuff I was building, would say, “Okay, here are the choices I have to repair and heal the system.”
Some of what we wrote worked, and some didn’t.
We discarded the approaches that either took too long, weren’t reliable, or just didn’t work. Over time, we integrated a lot of these features into the core storage service as it evolved.
That was the beginning of what was both SRE at Google and SRE for me. It was very much as Google describes SRE: having people with a software engineering perspective take responsibility for operating production systems.
I did have a background as a software engineer—it’s what I did before coming to Google. I thought it was a very good blend of perspectives on challenges that were becoming more prevalent in commercial computing. The companies where I’d worked before coming to Google were all monoliths.
We had big databases—physically big and voluminous—and each one was a special instance. It had a name, and it had to be running for our system to operate. If it wasn’t, then that was an emergency—someone had to drive to a data center to figure out what was wrong with it.
Prathamesh: Specialized setup specifically built for running those systems?
Salim:
Yes. Other parts of the systems were less special in terms of having multiple web servers or application servers, but things like data storage were a different story. The places I worked before didn’t have the same level of redundancy, sharding, or replication; none of those strategies was being used in a mature way.
That’s another part of site reliability that I find exciting—we can identify strategies for how to distribute data, bring the data closer to where it’s consumed, and defend against various failure modes.
Prathamesh: One interesting question around this—you mentioned not having sharding or other capabilities in your previous organizations. Was that also because those organizations, before Google, didn’t have the same strict reliability requirements as the ones you faced once you joined?
Salim:
I didn't hear the term "service level objective" until my second year at Google.
The notion of having a measurable indicator of what a system could do was still foreign to me. But then I heard about it.
By that time, I had moved from working solely on storage to also working on distributed consensus.
I was discussing with some of the other engineers how we could make the service more mature, and someone said, "Well, we need a service level objective." I thought, this is amazing.
I knew the different RPCs that were critical to client operations, and I knew what clients expected because I had talked to many of the engineers working with them. So, I could form an SLO.
It took me months to get the instrumentation in place, collect the data, and understand it in a way that we could report on. All of that was done on the side, rather than integrated into the core system. But over time, we integrated these ideas into all the core pieces of software. Now, almost 20 years later, it’s no surprise that almost all our systems report data automatically.
The data was collected automatically, allowing us both to issue ad hoc queries to understand performance and to run reports from stored queries. This was around 2003. I began working on storage in 2003 and on distributed consensus in 2004.
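(Editor’s note: to make the idea concrete, here is a minimal sketch of how an availability SLI over a handful of critical RPCs might be computed and checked against an SLO target. The RPC names, counters, and target are illustrative assumptions, not Google’s actual instrumentation.)

```python
# Minimal sketch: compute an availability SLI for a few critical RPCs
# from success/total counters, then check it against an SLO target.
# All names and numbers are illustrative, not Google's instrumentation.

CRITICAL_RPCS = {
    # rpc name: (successful requests, total requests) over the window
    "Read":  (999_420, 1_000_000),
    "Write": (499_100,   500_000),
}

SLO_TARGET = 0.999  # 99.9% of critical RPCs succeed over the window

def availability_sli(counters):
    ok = sum(success for success, _ in counters.values())
    total = sum(total for _, total in counters.values())
    return ok / total

sli = availability_sli(CRITICAL_RPCS)
error_budget_used = (1 - sli) / (1 - SLO_TARGET)
print(f"SLI = {sli:.5f}, SLO met: {sli >= SLO_TARGET}, "
      f"error budget consumed: {error_budget_used:.0%}")
```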
Prathamesh: Okay, fast forward to today—how have things changed? Back then, you were working on core problems like collecting data and defining objectives. How does your day look now?
Salim: For a lot of applications at Google, it looks very similar, but the tools and level of sophistication have increased. Now, we understand that we have this data. As we build new features and release new services, we work from the beginning with clients—the users of the service. We understand what we call the user journey and how clients will use the service.
We then build the service-level indicators to support that journey and describe it with a service-level objective. So, we're working at a higher level now.
Many of the nuts and bolts are built into our platform, which allows us to deliver features that are immediately useful to our users. Users can be either internal or external.
Prathamesh: The concept of customer journeys—was that term also developed at Google while building the vocabulary around site reliability engineering?
Salim: I believe so, but the notion of user journeys likely emerged within the last 10 years. It was used to describe our motivation and the relationship between the engineers building a platform or application and the people using it.
Prathamesh: How does your typical day look now? Does it involve a lot of meetings? You mentioned the SRE course—does that take most of your time?
Salim: My job now, and for the last four, almost five years, has focused on external activities like education, presentations, and publications.
I spend most of my day in discussions rather than meetings, often talking with people outside of Google. I try to understand the challenges that other engineers, particularly site reliability and DevOps practitioners, face. In the back of my mind, I'm always thinking about how I can match what Google does with solutions that might answer these questions.
Some of what Google does is very specific to our systems and not necessarily useful to others due to the tight integration with our infrastructure. However, many of our practices are universally applicable. For example, about a year ago, we published a paper on our production continuous integration system, and the methodology behind it is something that others can benefit from.
We've also shared information on our Canary Analysis Service and our approach to application security. While the implementations might be company-specific, the principles can be adapted by anyone to solve similar problems.
My day involves talking to people, exploring emerging technologies within Google and across other companies, and trying to bring order to all the different possibilities. I encourage my colleagues to present at conferences, publish papers and articles, and engage with other companies to support the SRE dialogue and build community.
Another significant part of my day is dedicated to external education. This includes standalone workshops we've published, mainly about service-level objectives and large system design.
We're also planning to release online courses that introduce SRE principles, which will give people an opportunity to explore SRE as a career. Even though SRE has been around for 20 years, it's still evolving as a career path. We hope that this course will help bridge the gap between traditional computer science concepts and their application in the field of reliability.
(Editor’s note: These online courses are offered through a training partner! Read more at our website: https://sre.google/resources/practices-and-processes/sre-fundamentals-course/)
When we talk about software engineers taking responsibility for production systems, there's often a big gap between what people learn in school and what we do in the real world. Taking concepts from an algorithms course and then applying them to build a load balancer, write a caching system, or evaluate a caching system for optimal use is the gap we want to address with this course.
Prathamesh: Do you think that running systems at scale in production requires not just technical skills but also operational skills? For example, setting up processes, tooling, and ensuring everything runs smoothly. Is that mandatory for becoming a good SRE?
Salim: Absolutely.
The items you mentioned are essential to understanding the production environment. It’s not just about having your software or binary running in production; it’s about knowing how it got there, ensuring the correct version is running, and verifying that it’s built from the right source code and has gone through the proper release process.
This falls under what’s now being called software supply chain security.
Capacity planning is another pillar of SRE.
While there are systems that can automatically scale deployments, understanding the decisions behind those systems is equally important. Even with automation, if you don’t set the right guardrails, the system might not respond as expected.
For example, a colleague at another company discovered they could save several hundred thousand dollars a year just by tweaking their auto scaler parameters. It didn’t affect their SLOs or the number of requests they could handle, but it reduced the unused headroom. This kind of operational insight is critical for SREs.
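(Editor’s note: the headroom saving Salim describes can be sanity-checked with back-of-the-envelope arithmetic like the sketch below. The instance counts, prices, and utilization targets are hypothetical, not figures from the conversation.)

```python
# Back-of-the-envelope sketch of the savings from tightening an autoscaler's
# target utilization, i.e. carrying less unused headroom. All numbers are
# hypothetical; they are not figures from the conversation.

peak_load_cores = 4_000            # cores actually needed at peak
cost_per_core_year = 250.0         # hypothetical $/core/year

def provisioned_cores(target_utilization):
    # The autoscaler provisions enough cores that peak load sits at the
    # target utilization; everything above that is headroom.
    return peak_load_cores / target_utilization

before = provisioned_cores(0.55)   # conservative target: 45% headroom
after = provisioned_cores(0.70)    # tighter target: 30% headroom

savings = (before - after) * cost_per_core_year
print(f"provisioned: {before:.0f} -> {after:.0f} cores, "
      f"saving about ${savings:,.0f}/year")
```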
Prathamesh: Are there any tools you depend on in your day-to-day work or that you used when you were coding? Many programmers are curious about the tools others use and want to learn from them.
Salim:
Vim for editing: I’ve been using it since college and am very comfortable with it.
Spreadsheets: Handy for modeling outcomes; surprisingly powerful for various tasks.
Notebooks: Jupyter and Colab for quick prototyping and understanding data sets.
Collaborative editing tools: Crucial for SREs for shared knowledge and effective communication.
Good documentation: Up-to-date guides, how-tos, and readme files are invaluable for writing and sharing code, especially during incident management.
Prathamesh: Is it like a set of steps someone can refer to, similar to the checklists that medical practitioners use?
Salim:
There's a reason checklists are crucial in fields where decisions made in seconds can have a huge impact. They provide an ordered list of steps to follow, much like writing an algorithm but in document form.
When I was working on a storage system in the early days, that's exactly what we did. We would write out lists of steps to solve a problem, automate parts of it, and then identify the steps that couldn't be automated—like when two people needed to talk to each other. Eventually, we found ways to automate those steps too, which is where distributed consensus came in.
So, it was an evolution: we started with written procedures, which led to scripts, and then to building the necessary APIs into the core software for more reliable automation. Written communication is incredibly important in this process.
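(Editor’s note: a toy illustration of the progression Salim describes, where a written checklist becomes a script in which some steps run automatically and others still prompt a human. The step names and functions are invented for the example.)

```python
# Toy illustration of a checklist turning into a script: automated steps run
# as functions, while steps that still need a human pause and prompt the
# operator. Step names and actions are invented for the example.

def drain_traffic():
    print("  draining traffic from the affected node...")

def mark_disk_failed():
    print("  marking the failed disk in the inventory system...")

RUNBOOK = [
    ("Drain traffic from the node", drain_traffic),
    ("Mark the disk as failed", mark_disk_failed),
    ("Confirm with the datacenter tech that the disk was swapped", None),
    ("Re-enable traffic once replication catches up", None),
]

def run(runbook):
    for step, action in runbook:
        print(f"* {step}")
        if action is not None:
            action()
        else:
            input("  manual step -- press Enter when done: ")

if __name__ == "__main__":
    run(RUNBOOK)
```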
Prathamesh: When tackling a complex problem like managing a storage system, it seems like the approach itself takes time, considering all the scenarios and potential failures. In such cases, is automation the best way to address most use cases, or is it more about writing detailed specifications, identifying gaps, and deciding what can be done now versus what can wait? How do you approach such complicated problems? Many programmers tend to jump straight to coding. What’s your take on this?
Salim: This is where technical program management (TPM) can play a crucial role, though every SRE can handle this responsibility. Especially in larger organizations, TPMs help with prioritizing tasks.
The core of the issue is understanding what can be automated and in what order. It’s about evaluating the potential rewards. You ask questions like, "How long will this take?" and "How much time will it save?" For example, if something takes 10 hours to implement, test, and release, will it save at least 10 or 20 hours over the next quarter? You then look at incident analysis data to see how often a particular issue occurs. If it’s frequent, the automation might save significant time; if it's rare, it might not be worth automating right now.
Additionally, sometimes automating one part of the system benefits other parts. SREs often have a broad perspective, understanding how different components interact. For instance, solving a problem in the storage node might reduce the load on the computing system, freeing up resources. So, making such assessments is critical to deciding what to automate and when.
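(Editor’s note: the cost-benefit question Salim raises can be written down directly. The sketch below uses made-up incident numbers to show the break-even calculation; none of the inputs come from the conversation.)

```python
# Sketch of the break-even question: is automating a toil task worth it this
# quarter? All inputs are made up for illustration.

hours_to_automate = 10         # implement, test, and release the automation
incidents_per_quarter = 8      # how often the issue occurs (from incident data)
manual_hours_per_incident = 2  # hands-on time saved each occurrence

hours_saved = incidents_per_quarter * manual_hours_per_incident
net = hours_saved - hours_to_automate

print(f"saves {hours_saved}h per quarter vs {hours_to_automate}h to build "
      f"-> net {net:+}h; {'worth doing now' if net > 0 else 'defer it'}")
```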
Prathamesh: We've touched on reliability a few times, but how do you define reliability? How do you view it from your perspective?
Salim: Reliability, to me, is all about meeting the user's expectations. A system that's 99.9% reliable but not being used doesn't really matter. The work that went into achieving that level of reliability isn't as valuable as a system with three nines that is being actively used by thousands of clients worldwide.
So, reliability is about understanding what users expect and then ensuring those expectations are met.
Prathamesh: You mentioned that your role these days involves external communication and talking with others about the challenges they face. How does SRE at Google compare to SRE outside of Google? Do you find the same patterns and ideas but different ways of implementation?
Salim: SRE at Google is quite similar to SRE elsewhere, with differences mostly in implementation and toolsets. From my discussions, often at conferences, the core principles remain consistent, though the specifics can vary.
One key difference is that, due to its size and early investment in SRE, Google has many dedicated SRE teams for specific services. This model can be challenging for smaller companies, where having dedicated SRE teams might be too costly. SRE often becomes a cost center.
In contrast, many startups and enterprises integrate reliability into the role of all engineers. They promote an understanding of reliability principles, including capacity planning, incident management, and integration processes. With supportive tooling, engineers can manage these aspects without dedicating their entire role to SRE. By grasping concepts like failure domains and dependency management, engineers in other roles can effectively contribute to reliability without being full-time SREs.
Prathamesh: What are some of the trends in site reliability that you’re excited about, and are there any you’re not so enthusiastic about?
Salim: One trend I’m not excited about is MLOps. My skepticism stems from a project I worked on a few years ago where we used machine learning to optimize data placement in a storage system. Although the ML model was accurate 90-95% of the time, it wasn’t enough to justify the investment. The occasional inaccuracies led to extra latency and operational overhead, which reduced the benefits. I worry that at a larger scale, the return from MLOps might not outweigh the costs due to similar issues with accuracy and operational complexity.
Conversely, I’m very excited about the growing focus on the human element within SRE. Understanding the emotional and personal aspects of people’s roles in reliable organizations is gaining prominence. Reliability is increasingly being integrated into various engineering roles rather than being a standalone function.
For instance, discussions at SRECon Americas emphasized the value of war gaming and role-playing scenarios. These exercises, which cover not just technical challenges but also interpersonal dynamics, are incredibly valuable. They help teams prepare for disasters and failures, build trust, and support each other’s growth. It’s about fostering a nurturing environment where team members feel confident, share responsibilities, and rely on one another, even in high-pressure situations.
Prathamesh: What if you weren’t an SRE? What would you be?
Salim: That’s a tough question because SRE has become such a core part of my professional identity. I think I’d apply the same principles of reliability and redundancy to whatever I was doing, though. For instance, there’s a movie called Ronin where a character says, “I never walk into a room I don’t know how to get out of.” I apply this mindset to my daily life. Whether it's planning a family vacation or navigating NYC traffic, I always think about what might go wrong and how I can adapt. It’s a useful approach for ensuring I’m prepared for unexpected situations.
Prathamesh: We’ve talked a lot about becoming a good SRE. Is there anything else you think is important for someone in this role?
Salim: Absolutely. Here are a few more tips:
Ask Questions: When you encounter something you don’t understand, ask questions. Whether it's talking to a colleague or discussing it out loud (rubber duck debugging), asking why a system behaves a certain way can be very insightful. Documenting these questions and answers helps in the future.
Document Your Findings: Keep a lab notebook or record your shell history. This helps track what you tried, what worked, and what didn’t. It’s a valuable habit for understanding failure modes and improving automation.
Communication Skills: Understand how to effectively communicate with others, considering that different people have different preferences (email, chat, etc.). Good communication is crucial, especially during incidents. Knowing who to contact and how to reach them can make a big difference in managing emergencies.
Human Aspect: Invest time in building relationships with your colleagues. Effective teamwork and understanding each person’s preferred communication style can significantly improve how well you handle challenges together. This personal investment pays off in creating strong, reliable teams.
Prathamesh: Any interesting or memorable incidents from your career that you're particularly proud of?
Salim: One incident that stands out involved updating the backend storage model for our leader election service. This major update required stopping each instance of the service one by one, freezing the data store, upgrading to a non-backward compatible version, converting the data format, and then restarting everything. Initially, this process was manual, but we later automated it with a shell script.
During one of these updates, things went awry. We had around 25 instances to update, and a mistake could potentially render an entire data center unusable. Unfortunately, this update led to an outage for internal services, including corp Gmail, which made email inaccessible for all of Google for about half an hour.
In response, I worked with a colleague to develop a protocol buffer tool to identify and correct problematic data entries. The experience was memorable not only due to the scale of the impact—receiving a call from one of Google’s founders about the email outage—but also because of the collaborative effort. Despite being a junior SRE, working closely with a senior engineer made it a very rewarding experience.
Prathamesh: What questions would you like to ask other SREs that you find interesting or important?
Salim: I would ask other SREs to reflect on the impact of the systems they're building and their broader influence on the world. This includes considering how the quality and bias of data can affect the final product, especially as AI-driven technologies become more common.
It's important to think about how we can advocate for inclusive and transparent decision-making within our systems. For instance, as we work with AI and machine learning, understanding the sources and quality of the data we use and the implications of our decisions can have significant effects on users. I encourage SREs to actively engage in these conversations and influence how data is handled and presented.
Prathamesh: How do you think AI will change or impact the world of observability and site reliability?
Salim: AI could significantly enhance observability, particularly for tasks like anomaly detection, where its ability to process large volumes of data can be beneficial. However, I've noticed some limitations with current AI systems, especially in handling complex arithmetic and statistical problems. For example, recent attempts to use AI for a multivariable problem yielded incorrect results.
While AI shows promise for improving observability, especially in detecting anomalies, I remain cautious. Generative AI, which focuses on language processing, may not always be well-suited for time series or statistical data. Therefore, I plan to use AI tools for data analysis but will continue to verify their outputs manually to ensure accuracy.
Final Thoughts:
We’ve just scratched the surface of Salim’s remarkable journey in site reliability. His insights into the evolution of SRE practices and the balance between technology and human factors provide a valuable perspective.
As he continues to drive innovation at Google, Salim’s experiences highlight the importance of adaptability and a deep understanding of the dynamic field of site reliability.
We'd love to hear from you!
Share your SRE experiences and thoughts on reliability, observability, or monitoring. Know someone passionate about these topics? Suggest them for an interview. Let's connect on the SRE Discord community!
Thanks a lot, Salim, for sharing your journey with us. If you’re just starting out, Salim’s experience will inspire you to embrace opportunities and take bold steps. Connect with Salim on LinkedIn to learn more about his work in SRE.