Dan Slimmon’s SRE Lessons from the Frontlines
A Candid Chat on Resilience, Team Communication and Observability
Dan Slimmon, a seasoned Site Reliability Engineer (SRE) with over 16 years of experience, has become a leading voice in incident response and operational resilience. Dan’s approach combines technical know-how with a practical mindset, helping companies find their footing in tough situations.
In our conversation, Dan shared his insights into the intricacies of observability and the challenges of maintaining high-performance systems. He emphasized the importance of effective communication in SRE roles – how conveying technical decisions can significantly impact team dynamics and project outcomes.
Beyond his work, he enjoys learning Japanese and playing music with his daughter. His journey, I feel, is a refreshing mix of professional wisdom and personal flair, making him someone you can easily relate to.
Prathamesh: How did you become a Staff SRE at HashiCorp? What was your journey to reaching this point?
Dan: Well, I went to college for physics and math. Eventually, I realized I wasn't going to make it as a physicist or mathematician, so I took a job as a sysadmin.
I started writing Perl scripts to improve our deployment processes. The company I worked for focused on political fundraising, offering a suite of SaaS tools for politicians and nonprofits running their campaigns.
We worked on the Barack Obama campaign in 2008. Initially, we thought it would just be a 6-month project since we didn't think he would beat Hillary Clinton, but he did. It was a very intense six months trying to get that website up to speed for a national presidential election.
After that, I worked at an IoT company in Minnesota for a bit. Then I got a job at Etsy on their observability team, where I worked with Logstash, the ELK stack, and Graphite.
You know, the open-source, run-it-yourself kind of work: managing self-managed observability infrastructure.
And now I'm here at HashiCorp. That's the whole story. I live in New Haven and work on Terraform Cloud.
Prathamesh: And how does your typical day look these days?
Dan: My job lately has been similar to a sysadmin/SRE role, whatever you call it in any given decade. I tend to get distracted by unusual things in the production data, asking, "What's that? What's going on?"
I dive down those rabbit holes, which means I'm probably not the fastest at cranking out code. However, that curiosity has become my niche at HashiCorp: I focus on finding problems in production and fixing them before they escalate.
I go to a decent number of meetings about projects, discussing whether this or that will work. I read some proposals in the morning and spend maybe half an hour to an hour a day, sometimes more, looking for anomalies in various data sources. I'll poke at some graph dashboards to figure out which issues I find interesting and which ones I don't. If something seems worth digging into further, I'll file tickets with the relevant teams.
The rest of my time mostly goes to investigating issues myself or consulting with other employees about strange problems they've encountered.
For example, if I notice a spike in network latency from 3 a.m. last night, I'll look into that. Or if I see that 500 errors become more common at higher throughput, I'll think, "Well, that's interesting," and dig in to see whether it needs a ticket.
Prathamesh: This is one question I ask everyone: How many dashboards do you start your day with?
Dan: That's a really interesting question. Right now I have one dashboard with two graphs on it, corresponding to two outstanding issues that I know might get worse, and I check it every day to see if things are worsening. Beyond that, there's no specific dashboard I check daily. On any given day, if I feel like looking at a particular system, we have dozens of dashboards to choose from.
Prathamesh: So you look around and see what you find. Do you use metrics, logs, traces—everything?
Dan: Yes, we use Datadog for all our monitoring needs. I use database monitoring extensively to keep an eye on our PostgreSQL instance. We also use APM for tracing and logs.
There’s no substitute for logs, no matter how many traces you have. I find myself using logs a lot and metric dashboards probably less frequently. Often, I’ll dump logs into a CSV, run a script against them, or use Google Sheets to analyze the data.
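To make that logs-to-spreadsheet workflow concrete, here is a minimal sketch in Python, assuming a hypothetical CSV export with timestamp, status, and duration_ms columns; the file name and field names are placeholders, not part of Dan's actual setup.

```python
# Minimal sketch of offline log analysis from a hypothetical CSV export
# with "timestamp", "status", and "duration_ms" columns (placeholder names).
import csv
from collections import Counter
from statistics import median

status_counts = Counter()
durations = []

with open("requests.csv", newline="") as f:  # placeholder export path
    for row in csv.DictReader(f):
        status_counts[row["status"]] += 1
        durations.append(float(row["duration_ms"]))

total = sum(status_counts.values())
errors = sum(count for status, count in status_counts.items() if status.startswith("5"))
print(f"requests={total} error_rate={errors / total:.2%} median_ms={median(durations):.1f}")
```

The same questions could be answered with a pivot table in Google Sheets; the point is that raw logs make this kind of ad hoc aggregation possible.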
Prathamesh: Any programming tools that you depend on every day?
Dan: I write all my code in Vim, not for any ideological reason, just because it's what I know.
I use Delve to debug Go code and the Chrome Developer Tools if I need to debug some JavaScript.
I don't really go out of my way to find new tools that will make me marginally more effective. With the tools I have, I'm already effective enough. It's more about asking the right questions. Sure, I might be 10% faster with VS Code instead of Vim, but that matters much less than whether I'm doing the right thing in the first place.
Prathamesh: Are there any trends you see in the current observability landscape that excite you? I'd like to know both something you're excited about and something you're not particularly interested in.
Dan: Sure! I don’t get overly excited about tools in general, but I’ve noticed a strong focus on distributed tracing lately. Developers are getting more involved in tracing their own code, which I think is super valuable.
Also, there are some excellent database performance analysis tools emerging, especially at Datadog. They’re doing amazing work with database monitoring these days. That’s exciting because database issues can often be dry and challenging to understand. Any bit of visualization or anomaly detection I can get from a tool is incredibly helpful.
On the flip side, I’m definitely skeptical about anomaly detection in monitoring, particularly AI-based anomaly detection. I find that humans are quite good at detecting anomalies.
Let me share my theory on this. Everyone in my organization has a mental model of how our system works, and we write code and make changes based on that model. If our mental model drifts too far from reality, that’s when problems arise. To effectively detect an anomaly, I believe a human should look at it and say, “Huh, that’s weird. That doesn’t fit with my mental model.”
Having an AI do that for me doesn't seem right, because I'm the one with the model; the AI doesn't have a model of how the system works. It's just a black box. It only records that there were more observations than some threshold, but it doesn't know what's interesting. If I'm personally surprised by something, that indicates a disconnect between my mental model and reality.
And that’s a signal to follow. I don’t really get involved in the black box anomaly detection stuff, even though everyone seems to be pushing that more and more.
Prathamesh: Okay. I have three questions for you. I'll start with distributed tracing. As an SRE, do you think that distributed tracing helps you understand system health? I primarily look at observability data for two use cases: understanding system health to make decisions and debugging for root cause analysis. In your experience, where does distributed tracing help you as an SRE?
Dan: Mostly, I use it for the second purpose—troubleshooting and debugging. For instance, when a request behaves unexpectedly, I look into what went wrong.
I also use it for system health investigations. One technique I employ is to select an endpoint and sort the traces by decreasing latency. Then, I examine the top few traces to gather insights.
Prathamesh: So you perform some aggregation on top of that?
Dan: Yes, like taking the top few traces. I can also sort them in ascending order by latency to determine the baseline—what's the least amount of time a request can take? From there, I can analyze what causes requests to take longer, often looking for the components that might be breaking down.
Prathamesh: That sounds super helpful.
Dan: It really is. Additionally, tracing data breaks down by subsystem, which lets me examine latency per component and check whether it's flat or varies over time. If it spikes during the day and is lower at night, that points to contention somewhere. That gives me a clue that there might be a problem, and I can use APM tools to dig deeper.
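To illustrate the sorting technique Dan describes, here is a rough offline sketch, assuming traces have already been exported into plain records with hypothetical trace_id, started_at, and duration_ms fields; it is not a Datadog API example.

```python
# Sketch of the trace-sorting technique on exported trace records.
# The fields (trace_id, started_at, duration_ms) are hypothetical.
from collections import defaultdict
from datetime import datetime
from statistics import median

traces = [
    {"trace_id": "a1", "started_at": "2024-05-01T03:10:00", "duration_ms": 42.0},
    {"trace_id": "b2", "started_at": "2024-05-01T14:05:00", "duration_ms": 930.0},
    {"trace_id": "c3", "started_at": "2024-05-01T14:20:00", "duration_ms": 610.0},
]

by_latency = sorted(traces, key=lambda t: t["duration_ms"])
print("baseline (fastest):", by_latency[:3])       # the least time a request can take
print("worst offenders:", by_latency[::-1][:3])    # top traces to inspect in detail

# Group latency by hour of day to spot daytime spikes that hint at contention.
by_hour = defaultdict(list)
for t in traces:
    by_hour[datetime.fromisoformat(t["started_at"]).hour].append(t["duration_ms"])
for hour in sorted(by_hour):
    print(f"hour={hour:02d} median_ms={median(by_hour[hour]):.1f}")
```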
Prathamesh: Let’s talk about database monitoring as you mentioned. Database monitoring has two main aspects: logical analysis, where you identify issues like missing indexes and performance problems, and infrastructure monitoring, which involves checking CPU and memory usage. In your experience, where do you find most problems? Are they more on the infrastructure side or the logical side?
Dan: For the application I support as an SRE, the infrastructure metrics are not particularly helpful. While it's useful to know, for example, that the system is running at 70% CPU, I primarily focus on analyzing individual query performance. For instance, when a query that was once fast becomes slow, I need to understand why. Maybe we used to run it once every second, but now it's being executed a hundred times a second.
I find that EXPLAIN plans are essential for this analysis. They let me see when and how a query's plan has changed. Additionally, having samples of which queries were running at any given time is incredibly valuable for performance analysis.
Identifying which query holds a lock that another query needs is crucial.
Database traffic is often non-linear, meaning that aggregated system-level metrics may not reveal what's truly important. A query might be doing nothing for a while, but a small change—either in the query or the underlying dataset—can suddenly impact the entire database. By focusing on individual query performance, I can catch many issues before they escalate.
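For the lock-holder question Dan raises, one generic way to ask PostgreSQL directly is to combine pg_stat_activity with the built-in pg_blocking_pids() function. This is a sketch of that idea, not the query Dan's team uses, and the connection string is a placeholder.

```python
# Sketch: list blocked sessions alongside the queries holding the locks they wait on.
# Uses PostgreSQL's pg_stat_activity view and pg_blocking_pids() function.
import psycopg2

BLOCKING_SQL = """
SELECT blocked.pid   AS blocked_pid,
       blocked.query AS blocked_query,
       blocker.pid   AS blocking_pid,
       blocker.query AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity blocker ON blocker.pid = b.pid;
"""

conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(BLOCKING_SQL)
    for blocked_pid, blocked_q, blocking_pid, blocking_q in cur.fetchall():
        print(f"pid {blocked_pid} is waiting on pid {blocking_pid}: {blocking_q[:80]}")
```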
Prathamesh: That makes sense. I've found that using EXPLAIN and EXPLAIN ANALYZE in PostgreSQL is fantastic for understanding execution plans and identifying potential issues. Do you use those features extensively, or do you rely on Datadog's offerings?
Dan: Datadog surfaces EXPLAIN plans, and they're useful as a starting point. However, for queries I'm particularly interested in, I usually pull an exact example from the database and run EXPLAIN or EXPLAIN ANALYZE directly in the database CLI. Sometimes I do that on a clone if I'm concerned about the query's impact.
I had a fascinating case a few months ago where a specific query caused the database to run out of memory. This was a significant database, and when I ran just EXPLAIN on the replica, it consumed 200 gigs of memory and crashed the database. I was shocked! The query was so complex and nested that just trying to plan it caused the system to run out of memory.
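As a generic sketch of that workflow (not Dan's exact commands), you might capture a representative query and run EXPLAIN or EXPLAIN ANALYZE against a replica; the connection string and query below are placeholders.

```python
# Sketch: run EXPLAIN / EXPLAIN ANALYZE for a captured query against a replica.
# The DSN and the query text are placeholders for illustration.
import psycopg2

REPLICA_DSN = "dbname=app host=replica.internal"  # placeholder
QUERY = "SELECT * FROM jobs WHERE state = 'queued' ORDER BY created_at LIMIT 10"  # example only

conn = psycopg2.connect(REPLICA_DSN)
with conn, conn.cursor() as cur:
    # Plain EXPLAIN only plans the query; adding ANALYZE actually executes it,
    # so be deliberate about where you run this.
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + QUERY)
    for (line,) in cur.fetchall():
        print(line)
```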
Prathamesh: That's quite an insight! The next question is one I'm always excited about: war room incidents. I'm sure you've been part of many interesting war rooms. Is there a memorable incident you ran into and fixed that you're proud of?
Dan: Let's talk about the most significant one.
It took us about a month or two to figure it out. We encountered an issue where long-running transactions in the database led to a severe pile-up of processes, causing everything to grind to a halt. This happened after about 30 to 45 minutes of a transaction running, resulting in processes getting stuck in a state related to something called MultiXact SLRU. We had to dive deep into the internals of PostgreSQL to understand what was happening.
Here's how MultiXact works: when PostgreSQL locks a row, it records the lock information in the tuple on disk. If multiple transactions hold a lock on the same row simultaneously, there isn't enough space in the tuple to store all their transaction IDs. In that case, PostgreSQL uses a separate area called the MultiXact region, which acts like a linked list of the transactions holding locks on that row.
We found ourselves in a tricky situation because we were using PostgreSQL as a queue, which is not advised. If there's a long-running transaction, PostgreSQL can't finish its vacuuming process for the table.
The vacuuming process is crucial because it clears out old MultiXact data. If there are old tuples that the vacuum cannot clear due to the ongoing transaction, PostgreSQL has to keep all the MultiXact entries for those old tuples, even if the corresponding rows are already gone.
As a result, the MultiXact SLRU region became enormous. Reading it took longer and longer because access is guarded by a mutex: if you were reading the MultiXact data, nothing else could read it until you were finished. Query times increased linearly, so a lookup that initially took 10 microseconds could balloon to 20 microseconds or more, creating a cascade of delays and leading to an outage.
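If you want to watch this region yourself, PostgreSQL 13 and later expose SLRU cache statistics through the pg_stat_slru view; the sketch below is a hedged illustration of that idea (the exact SLRU names vary between versions, and the connection string is a placeholder).

```python
# Sketch: inspect MultiXact-related SLRU counters via pg_stat_slru (PostgreSQL 13+).
# SLRU cache names differ between versions, so match them loosely.
import psycopg2

conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT name, blks_hit, blks_read FROM pg_stat_slru "
        "WHERE name ILIKE '%multixact%'"
    )
    for name, hits, reads in cur.fetchall():
        print(f"{name}: blks_hit={hits} blks_read={reads}")
```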
To resolve this, we modified our queuing logic. We implemented two main strategies:
Lowering Lock Timeouts: We adjusted the lock timeout on queries so that if they waited too long for a lock, they would simply abort (see the sketch after this list).
Identifying Long-Running Transactions: We conducted a thorough investigation to identify sources of long-running transactions. By pinpointing and fixing these areas in the code, we significantly reduced the instances where multiple transactions would block the same row simultaneously.
As a result, the rate at which we were generating these MultiXact objects decreased substantially.
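As a rough sketch of both mitigations (the timeout value, age threshold, and connection string are illustrative assumptions, not the values Dan's team chose):

```python
# Sketch of the two mitigations: a session-level lock_timeout so queries abort
# instead of piling up behind a lock, plus a check for long-running transactions.
# The DSN, timeout, and age threshold are illustrative assumptions.
import psycopg2

conn = psycopg2.connect("dbname=app host=localhost")  # placeholder DSN
with conn, conn.cursor() as cur:
    # 1) Abort a query if it cannot acquire a lock within 2 seconds.
    cur.execute("SET lock_timeout = '2s'")

    # 2) List transactions that have been open for more than 5 minutes.
    cur.execute(
        """
        SELECT pid, now() - xact_start AS xact_age, state, left(query, 80) AS query
        FROM pg_stat_activity
        WHERE xact_start IS NOT NULL
          AND now() - xact_start > interval '5 minutes'
        ORDER BY xact_age DESC
        """
    )
    for pid, age, state, query in cur.fetchall():
        print(f"pid={pid} age={age} state={state} query={query}")
```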
Prathamesh: That sounds like quite a rabbit hole.
Dan: We were frantically reading through the PostgreSQL source code to get to the bottom of it all.
Prathamesh: How do you recharge? I see some guitars behind you. Is playing music your go-to when you want to take a break from work?
Dan: I actually do a bit of everything. I work on my Japanese flashcards—I’m learning Japanese right now, which I find relaxing.
As for music, I mostly play the piano these days as my three-year-old loves to play on her little plastic keyboard. We often do fun things like covers of Devo songs together. I guess I recharge by doing different kinds of work. For better or worse, that’s just how I’m wired.
Prathamesh: That's great! I know you love being an SRE, but if you weren't in this role, what would you want to do instead?
Dan: I think I’d like to be a linguist.
The science of language fascinates me, especially syntax. I’m intrigued by how our brains process language and the rules that govern it. What are the built-in parts of our brains that facilitate this, and what aspects are subject to variation? Those questions really interest me.
Prathamesh: That’s a fascinating choice!
Prathamesh: For someone aspiring to be a good SRE, what traits or attributes do you think are important?
Dan: I often tell people that while you can learn the technical skills on the job, one aspect that often gets overlooked is communication.
Many individuals become technically proficient, but once they get promoted or take charge of a larger team, they realize it’s not just about technical skills. They need to articulate the reasons behind their decisions and effectively explain things to others. If they haven’t practiced those communication skills, they can struggle at that point.
So, I advise people from day one to explain every decision they make—no matter how small—to their coworkers. It should become a habit. By the time they’re in a position of greater responsibility, they’ll have those communication skills well-developed.
Have that skill, and you'll be ready to go.
Prathamesh: Absolutely! Those communication skills can make a significant difference in how effective someone is in a leadership role.
Thank you, Dan, for taking the time to chat with us!
Our discussion gave us a glimpse into your journey as an SRE, shaped not just by technical expertise but by a thoughtful approach to handling challenges.
It’s refreshing to hear how you balance the demands of your role with your interests in learning Japanese and playing music with your daughter. Your experiences remind us that while tech can be daunting and challenging at times, it’s important to stay grounded and make time for what we love outside of work.
We'd love to hear from you!
Share your experiences in SRE and your thoughts on reliability, observability, or monitoring. If you know someone passionate about these topics, suggest them for an interview. Also, join us in the SRE Discord community!
If you find yourself resonating with Dan's experiences and insights from the war room, connect with him on LinkedIn!