The SRE Experience: Isaac on Automation, Challenges, and Mentoring

SRE Experience, Automation and Challenges

Sep 06, 2024

Introduction:

Isaac Good's journey into SRE is nothing short of inspiring. Starting his tech odyssey at a remarkably young age, he's carved a niche for himself as a seasoned professional. In a recent conversation, Isaac shared candid insights into his career path, the evolution of SRE, and the essential skills needed to thrive in this dynamic field.

Prathamesh: I'd love to learn about your journey so far. You’re currently a Reliability Engineer at Two Sigma, but how did you start, and how did you reach your current role as an SRE?

Isaac: My tech journey started pretty young. When I was nine years old, we had a Pentium One at home, and my older brother was learning to program from a book called "C for Dummies." I wanted to do everything he did, so I started learning C as well. We had a machine with DOS 6.22, and my brother had set up a batch script in the autoexec.bat file to create menus and sub-menus for launching games. I began automating things, like adding new games to the batch script, so I was automating tasks from a young age.

From there, I went to the University of Toronto for Computer Engineering. That’s where I first got introduced to Linux, as the computer labs used Red Hat Linux. I started experimenting with shell scripts, though nothing too complicated at that time.

The next step towards automation was during a summer job at BlackBerry (formerly Research In Motion). I was working there when I first heard about scripting languages like Perl and Python. I picked Perl to learn first, for no particular reason, and I got decent at it over the summer.

I was playing an online browser-based game where I needed to take an action every hour to avoid being attacked. I realized I could automate the process, so I did. By the end of the summer, I had fully automated the game and only needed to log in once a week. This experience really got me hooked on automation.

Later, someone asked me to automate another browser-based game, and I did. I even had a friend who worked at a DMV driving school who asked me to automate the booking of driving tests for students who needed to take the test quickly. I was able to repurpose my game automation scripts to help them book tests, which was a cool project.

As I continued with university, I got more into Linux. I installed Ubuntu, tried Gentoo briefly, and eventually settled on Arch Linux. I graduated and started working as a software engineer, but I hadn’t heard of SRE at that point.

Prathamesh: So, when did you first learn about SRE?

Isaac: I didn’t hear about SRE until 2013. Before that, in 2010-2011, I was in grad school, working in a lab with a research cluster of 130 servers. We didn’t have a sysadmin, so the responsibility of upgrading systems and managing the cluster fell to us grad students. I ended up becoming the sysadmin for the cluster, automating a lot of the processes like upgrading systems and setting up disk imaging.

In early 2012, I quit grad school and got my first full-time job as a software engineer. I held that position for about a year, and then a recruiter from Google reached out to me about a Site Reliability Engineering (SRE) role. At that time, I had no idea what SRE was, but I thought it was worth exploring since it was Google. I went through the interview process, got an offer, and moved to California to start my professional journey as an SRE.

By the time I got to Google, I was already automating things, writing shell scripts, and figuring out ways to avoid doing the same work twice. At Google, I learned SRE best practices and gained a deeper understanding of SRE. I’ve been working in SRE since 2013.

Prathamesh: You’ve worked at Google, Two Sigma, and several other companies. How does the SRE practice or culture differ between Google and other companies?

Isaac: Google essentially coined the term SRE and established the practice. They wrote the best practices book and set the standard for SRE, mainly because of the scale they operate at. Google needed to be a leader in the industry due to its vast scale. They invest heavily in SRE tooling, practices, and the SRE role as a whole.

Google has dedicated teams that create software specifically for their developers to use internally, including custom editors and code review tools. The tooling at Google is very mature, and the power given to SREs is substantial. SREs at Google have a lot of leeway and leverage, and they are empowered to make significant changes. They don’t have to fight for resources or respect: SREs are highly valued and have considerable influence.

If an SRE at Google says a system needs more monitoring or a specific approach to building, they can effect change easily, which allows them to perform their job more effectively. This makes Google a great place to work as an SRE, where they can accomplish a lot of good.

In contrast, at other companies, it can be much more difficult for SREs to do their job effectively. The culture varies greatly between companies, and some are better at supporting SREs than others.

Prathamesh: What does your typical day look like these days? Does it involve more coding, more hands-on work, or more communication with other people in the organization?

Isaac: I think any job involves a certain degree of communication with other people—there's no way around that. The more senior I get, the more communication is required. While I definitely don't want to become a manager and prefer staying on the individual contributor (IC) track, being more senior means spending more time talking to, helping, and collaborating with others.

As an SRE, I carry a pager one week out of every N, where N is the number of people on my team. When I'm on call, that's my week: I’m carrying the pager, fixing things, closing out issues, and handling typical SRE work, the life of an on-caller. For the remaining weeks of the year, I spend a lot of time on automation.

Whenever I find a task we do manually or see a runbook where commands are copied and pasted, I think, "No, I'm not going to copy-paste commands from a wiki. I'll write a script to automate it." Rather than encoding it in a wiki, I prefer encoding it in a Python script to automate the process. So, I focus a lot on automating workflows.

Coming from Google, I place a high value on code cleanliness and code health. Sometimes, I'll notice that some code doesn't follow best practices, and I'll spend a week cleaning up a codebase or simplifying it.

I'm fortunate to have a lot of leeway to chase these down and fix things. It's not all I do—I do have my goals and OKRs to hit, but I also get a lot of free time to work on other stuff, clean up things, or automate tasks I find.
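
Editor's illustration: the idea of moving copy-pasted runbook commands out of a wiki and into a Python script can be sketched roughly as below. This is not Isaac's actual tooling. The `Step` class, the `run_runbook` function, and the example steps are hypothetical names invented for this sketch, which only assumes that each runbook step is a single command that should stop the run if it fails.

```python
import shlex
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One runbook step: a human-readable description plus the command to run."""
    description: str
    argv: list[str]


def run_runbook(steps: list[Step], dry_run: bool = True) -> list[str]:
    """Run each step in order and stop at the first failure.

    Returns the commands as shell-quoted strings so they can be logged or reviewed.
    """
    executed = []
    for number, step in enumerate(steps, start=1):
        command = shlex.join(step.argv)
        print(f"[{number}/{len(steps)}] {step.description}: {command}")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so a failed step halts the runbook instead of being skipped past.
            subprocess.run(step.argv, check=True)
        executed.append(command)
    return executed


# Hypothetical steps; a real runbook would call the team's own tools here.
EXAMPLE_STEPS = [
    Step("Show interpreter version", [sys.executable, "--version"]),
    Step("Confirm the interpreter can run code", [sys.executable, "-c", "print('ok')"]),
]

if __name__ == "__main__":
    # Dry run by default; pass --execute to actually run the commands.
    run_runbook(EXAMPLE_STEPS, dry_run="--execute" not in sys.argv)
```

Defaulting to a dry run is one possible design choice: the script prints exactly what it would do, much like reading the wiki page, and only runs the commands when asked to.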

Prathamesh: You mentioned Python. Any other tools that you use daily that you depend on?

Isaac: Predominantly, the languages I work with are Python and Bash. At Google, I briefly used Go for about a year or two.

I also rely heavily on tools like awk and jq, depending on what I'm working on. Additionally, I use a tool called HTTPie for interacting with REST APIs; it's fantastic and makes life a lot easier. Recently, I’ve also started using yq, which is like jq but for YAML files. jq and yq make reading, modifying, and parsing JSON and YAML in the shell much easier.
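
Editor's illustration: for readers who have not used jq, the kind of reading and modifying it does on JSON can be approximated in Python with the standard `json` module. The payload, field names, and function names below are invented for this sketch, and the jq filters quoted in the docstrings are only rough equivalents.

```python
import json

# Invented sample payload, standing in for what a REST API might return.
RAW = """
{
  "jobs": [
    {"name": "nightly-backup", "status": "ok"},
    {"name": "log-rotate", "status": "failed"},
    {"name": "reindex", "status": "failed"}
  ]
}
"""


def failed_job_names(document: dict) -> list[str]:
    """Roughly what `jq -r '.jobs[] | select(.status == "failed") | .name'` selects."""
    return [job["name"] for job in document["jobs"] if job["status"] == "failed"]


def mark_all_ok(document: dict) -> dict:
    """Roughly what `jq '.jobs[].status = "ok"'` does: return a modified copy."""
    return {**document, "jobs": [{**job, "status": "ok"} for job in document["jobs"]]}


document = json.loads(RAW)
print(failed_job_names(document))
print(json.dumps(mark_all_ok(document), indent=2))
```

The jq versions fit on one line in a shell pipeline, which is a large part of the appeal described above.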

Prathamesh: You also mentioned on-call schedules. When you think about an incident, is there any memorable one that you faced and are proud of that you'd like to talk about? I’d love to know.

Isaac: The most memorable one to me is probably the first one I caused when I was back at Google, at the very beginning of my SRE journey. I was a Spanner SRE working on the database, and we had a tool to increase user quotas in an automated fashion. The quota system had different resources, and we had hardcoded the group count to something like 10,000, which was pretty consistent across users.

Everything was great until one day, I added a quota for a special user who had more than the normal number of groups. The quota system reset their group count to the default, and the service went down for about 10-15 minutes. It’s scary to think about how much money I could have cost Google in those minutes because people were unable to sign up and become customers during that time.
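
Editor's illustration: the failure mode Isaac describes, where an update writes back a hardcoded default and silently overwrites a user's larger custom value, can be sketched as follows. This is not Google's quota tool. The `Quota` class, the `storage_gb` field, the function names, and every number except the "something like 10,000" default are hypothetical.

```python
from dataclasses import dataclass, replace

DEFAULT_GROUP_COUNT = 10_000  # the hardcoded default mentioned in the interview


@dataclass(frozen=True)
class Quota:
    storage_gb: int   # hypothetical resource being increased
    group_count: int  # the field that was hardcoded


def increase_storage_buggy(current: Quota, extra_gb: int) -> Quota:
    """Rebuilds the whole quota, silently resetting group_count to the default."""
    return Quota(storage_gb=current.storage_gb + extra_gb, group_count=DEFAULT_GROUP_COUNT)


def increase_storage_safe(current: Quota, extra_gb: int) -> Quota:
    """Changes only the requested field and carries every other field over."""
    return replace(current, storage_gb=current.storage_gb + extra_gb)


# A "special" user with more groups than the default allows for.
special_user = Quota(storage_gb=500, group_count=25_000)

print(increase_storage_buggy(special_user, 100))  # Quota(storage_gb=600, group_count=10000)
print(increase_storage_safe(special_user, 100))   # Quota(storage_gb=600, group_count=25000)
```

For ordinary users both functions return the same result, which is why a bug of this shape can go unnoticed until an unusual account comes along.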

Prathamesh: That sounds both fun and terrifying.

Isaac: It was terrifying, but we got it fixed fairly quickly. I was there when I broke it and remembered what had changed. I pushed the change and wasn’t sure if it was related, but thankfully, the tech lead was excellent at figuring out what was going on. We rolled it back pretty quickly. It’s one of my most memorable incidents because it was the first time I single-handedly took out a major system.

Prathamesh: Do you have any dashboards that you look at every day? It’s something I ask everyone—do you start your day with some dashboards, or not really?

Isaac: The team I'm currently working on doesn't directly run external customer-facing or time-sensitive services. We manage a lot of offline work, pipelines, and processes. So typically, we’re more focused on poking at bugs and tickets, pushing things along, and we don’t often deal with major outages.

Prathamesh: So it's not a typical on-call rotation for you?

Isaac: Not usually. We do have dashboards that show tickets and things that are failing—the standard queue of support tickets. So I keep an eye on that, especially since I’ve rewritten a lot of those systems.

When I came into my current role, there were a lot of shell scripts that I rewrote in Python, so I'm familiar with many of those systems. I do try to keep an eye on them, and if there are issues in the code I wrote, it’s usually pretty easy for me to figure them out.

But in my current role over the last two or three years, we don’t have systems that users are directly dependent on, so I’m not generally watching dashboards too closely.

Prathamesh: Are there any trends that you're excited about in the observability space? And are there a few trends you're not excited about? I'd love to hear about both.

Isaac: I'm glad that monitoring and observability are becoming more commonplace and much more accessible.

With tools like Grafana, it’s nice to see how easy it is to set up a stack and add metrics.

The fact that the industry is gradually realizing what good metrics to measure—like focusing on what the customer is experiencing instead of just internal request failure rates—is promising. Overall, I'm glad we're moving towards a more reliable or at least more observable world.

The latest hot topic that everyone’s discussing is AI. AI has a lot of potential, but I don’t even know exactly what it’s going to do. It's clear that AI is going to change the industry; that’s the one thing I’m certain of. It's probably something everyone should be keeping at least half an eye on, because it’s changing the world around us.

Prathamesh: What keeps you excited about your work?

Isaac: I really like automation.

There's something very satisfying about creating a tool that takes a task that was previously done by hand—something that took time and could lead to mistakes—and turning it into something automated and reliable.

I love being able to say, "Don't worry about those manual steps anymore. Just use this tool." It simplifies the process, makes it quicker, and eliminates errors.

As I advance into more senior roles, I'm also coming to terms with the shift towards enabling others. I really enjoy teaching and helping people learn new things. Seeing someone’s eyes light up when they understand a new concept or skill is incredibly rewarding for me. It’s one of the aspects of my job that I find most fulfilling.

Prathamesh: Based on your experience and interactions with others, what attributes or traits do you think are essential for becoming a good SRE and a valuable team member?

Isaac: Curiosity is crucial. You need to be willing to ask questions, challenge the status quo, and explore better ways to approach problems.

It’s also important to have a broad range of skills and to be open to trying new things and learning new areas. Additionally, having grit is essential—you need to be persistent and not get frustrated when things aren't working right.

Finally, enjoying the work is important. SRE is a broad field with many different areas, and having a passion for some aspect of it can help sustain your interest and commitment over time.

Prathamesh: Absolutely. Enjoying what you do is key to maintaining long-term motivation and satisfaction in the field. How do you define reliability? What does it mean to you?

Isaac: Reliability, to me, involves two main aspects: availability and predictability. Reliable systems are ones you can depend on to perform as expected. They should have high availability, meaning they are accessible and operational when needed. Additionally, when failures occur, they should fail predictably. This predictability makes it easier to understand and diagnose issues, helping maintain overall system reliability.

Prathamesh: How do you recharge yourself from work? Do you take breaks or have any specific ways to get back to speed?

Isaac: I'm lucky in that I really enjoy automation, so I don't get too burnt out when I'm doing that stuff. But some weeks at work, you know, I'm writing documentation, doing other tasks, or on call, and I don't get to do the stuff I like. Often, I'll take an evening to work on personal projects, automate a task related to work, or write some Python code.

Sometimes, I'll just need to write Python code, so I'll create a silly tool or engage in self-adventive coding. I also mentor a lot on the exercism.org platform, so I'll find some way to do something with Python that I find fun.

Prathamesh: How has your experience with Exercism been? Do you enjoy it, and does it give you a break from work?

Isaac: I've been involved in Exercism for about three or four years now, primarily working with the Python track, and also maintaining the Bash, AWK, and jq tracks. I briefly worked on the Go track during a cohort push. I'm pretty active in the community: I moderate the forum and Discord, and I even wrote a Discord bot that's quite popular.

It's been both fun and challenging. The bot I wrote reacts to user posts and provides track information, which people seem to enjoy. I'm also involved in syncing documents and updating exercises, even for tracks I’m not directly involved with, like Haskell.

Overall, I'm more involved in Exercism than I probably should be. I enjoy the community and the work, but I know I should balance it better with other hobbies.

And that concludes our engaging conversation with Isaac. Isaac's passion for automation is contagious, and his dedication to building reliable systems is inspiring. From his early coding days to his current SRE role, he's gained invaluable insights. His balanced approach to work and life provides practical guidance for anyone navigating the complexities of SRE.

We'd love to hear from you!

Share your SRE experiences and your thoughts on reliability, observability, or monitoring. Know someone passionate about these topics? Suggest them for an interview. Let's connect on the SRE Discord community!

Thanks a lot, Isaac, for sharing your journey with us. Connect with Isaac on LinkedIn, and for more about his experience, visit his webpage.
