SRE Story with Alex Hidalgo
Becoming a better SRE by understanding the human connection with software systems
Alex Hidalgo, author of the SLO book and Principal Reliability Advocate at Nobl9, shares his SRE story.
This story was recorded as a conversation between Alex and Prathamesh on April 14, 2023.
Welcome to the show, Alex. How has your journey been so far in the SREverse?
I have an interesting path to where I've gotten today. I grew up as a computer geek. My dad started teaching me programming when I was nine years old. My friends and I tried programming our first 3D first-person shooter engine in high school. I didn't go to college right after high school because I assumed I could get a computer job, and I could!
I ended up doing network security work for the Department of Energy. I quit after about a year and a half because I hated it. I thought it was because of working with computers, and I wanted computers to be a hobby and not a career. So I decided to go to school. I studied philosophy and history. I worked in the service industry as a bartender and at various restaurants, in both the front and back of the house. I worked in a warehouse for a while.
Then I moved to New York almost on a whim. I had been living in Richmond, Virginia, a much cheaper town, and I moved to New York City, one of the most expensive cities on the planet, right at the height of the last recession. This was around early 2009, right after the 2008 collapse. The economy was still recovering, and no one was hiring. I couldn't find a job, and right as my money was about to run out, I ran into someone who needed a desktop support person, and I said, you know what, I still know computers.
I took that job thinking it was just to pay the bills, but I ended up loving it. I love working with computers, especially when a human factor is involved, when there's a human on the other end. I was helping people every day, even if it was just defragging their hard drive or removing a virus or one of these simple little things, whatever it might be. I was helping someone, and I started to connect that with how much I loved working in the restaurant industry. I was helping people. I was making their day better.
A few years later, I became a technical operations engineer at Admeld, one of the early adopters of DevOps. The whole DevOps thing was just getting started. The term had just been coined. Everything was about the humans, better communication, so I loved it because I love humans. Google acquired Admeld, and suddenly, my title changed. I was now an SRE.
At first, I didn't even know what that meant. But the work felt similar. I ended up loving the SRE work. I love the customer focus, measuring things, and the user impact. The customer is not always a paying customer; sometimes, another team depends on you. I spent a long time at Google on various teams and eventually ended up on the CRE team. The Customer Reliability Engineering team is a group of experienced site reliability engineers who focus on helping Google's most prominent cloud customers build more reliable services.
Read more about the CRE team at Google.
Eventually, my time at Google was over. I went to Squarespace. At Squarespace, I spent much of my time on doing Service Level Objectives better. That's when I wrote the book about SLOs, Implementing Service Level Objectives.
Eventually, I ended up at Nobl9 - the Service Level Objectives company, quote unquote. That's my basic story, and it's important to know that there is a human thread through all of it.
I learned as much about how to do my current job well from working in restaurants, serving people coffee, selling them furniture, and all those other jobs that I did in my twenties. That has been as important to me as anything I learned at Google.
What does your work day look like?
It depends on the day. I am in an interesting role. We are an SLO-tooling startup. In my position, I have to help out where needed. That means some days I'm very customer focused. I am helping people understand how to do SLOs better, how to use our product, and how Nobl9 works. Some days it is about assisting people in troubleshooting. Some days I am sales focused. I am out there with new prospects, helping show them why we're so excited about what we're building.
Some days I am more into marketing. I also work with SLOConf speakers to help them prepare for their talks, and I work on planning the event. I still help with product development, assisting the engineers with architecture decisions. I am in a fun role where I get to do a little bit of everything.
Wow! But that also means you have to switch contexts and wear different hats. How do you manage it?
How do I do that? Perhaps not as well as I could :) Because you're right, it isn't easy. And that means that I could do better. Sometimes I don't pay the right amount of attention to the right problem because I'm being distracted by something else, but that's okay to admit. None of us are perfect. Part of the process is understanding that things aren't perfect and that humans need to help each other. I don't want to make it sound like I do poorly, but I like to focus on the concept that sometimes work is difficult. Running computer services is difficult. Computers often break because complex systems often break. Everything's complex, including our sociotechnical systems and our interactions with others. It's essential to focus on the fact that it's okay. Let's iterate; let's learn. Every time I don't context switch well, that's an opportunity for me to learn how to do it better. That's how I try to approach everything, with that SRE mindset: let's learn, let's improve, let's iterate, and let's get better and better and better.
Are there any tools that you heavily depend on?
I use Google Calendar very heavily. It is constantly open. Not just because I have a lot of meetings. Sometimes I don't have any meetings. But I have a lot of reminders about tasks. I create a lot of blocks of time that help me organize my work. For example, if I have to write a blog post for someone or spend some time troubleshooting a problem, I use my calendar to block that time off. My calendar looks almost full for weeks, even if only half of the entries are actual meetings. The other half are just little reminders to myself. That works reasonably well for me. I start every day by looking at my calendar, seeing what I might want to change and move. It's not a highly advanced system. But it's the one that works for me. And I am constantly in that tab. I have also subscribed to almost everyone's calendar in my company. Before I bug someone, I scroll through and click their name to see whether they are in a meeting right now. Or are they even working today? Maybe they're off. I constantly use that to help me figure out when's the right time to ask this person for assistance or clarification.
Do you work remotely?
We are about half and half. We have an office in Poland, and that's one-half of the company. The other half is mainly distributed across the US. We do have a small office in Boston. But that's just a handful of people. Most of the company is remote. We're stretched across a nine-hour difference in time zones between the West Coast and Poland. We have people in every single time zone in the US. That is part of ensuring we're doing all the proper coordination. But overall, everyone should love this remote work-from-home culture.
For a while, we didn't have a dog walker. I have a dog; I love him :) He needs a lot of walks. He needs daytime walking. We have a dog walker now, but for a while, our old dog walker had to leave, and it became just a really lovely kind of break in my day because I used to walk him. I have a block in my calendar for the dog walk time :) It was an excellent way to break things up. I'm happy that he has a dog walker again, but it's a perfect example of when you work from home, how you can spend some time with your dog or your kids. I'm famous on Twitter for talking about "WalkOps". Some of my best work happens when I'm walking and just thinking. That's so much easier to fit into your day with this remote culture than going to an office. Having more flexibility in my daily schedule has been fantastic for me.
You mentioned being part of the CRE team at Google and helping cloud customers on their reliability journey. I assume most of those customers must be enterprise customers. Making changes in such large organizations is extremely hard. How was your experience working with such customers?
So many! I was on the CRE team for about a year and a half. Yet I saw almost everything you could find on that spectrum. Large enterprise customers were very excited to "Do SRE". We used SLOs as the common vocabulary, and they were very much on board that – “Yes, we want to do SLOs to better think about our reliability and identify where we need to make changes”.
On the opposite end of the spectrum, I have seen people saying – “we have a twenty-four-hour NOC team. We have people who stare at computer screens. It works for us”. After we tried to explain that there is a better way, they would respond with – “No, we're going to stick with this old model”.
My biggest takeaway is that you can't, from the outside, assume what the culture inside of a large enterprise might be. Some of them are, at least on the technology side, nimble and willing to learn, change, and adapt. Others are stagnant and may seem old-fashioned, stuck in the past.
It was interesting to see a wide range of different approaches and different amounts of willingness to listen. But again, there were also a lot of customers who were very willing to learn and much more nimble than you might expect a large enterprise could be.
You have contributed to the Google SRE book as well. Was the book's content based on the lessons learned while handling customers in the CRE team, or on the experience of running and maintaining systems at Google? I ask because the book has shaped the SRE industry in some ways.
I've recently gone on record saying that I think Google made some mistakes in publishing both the SRE books. Those books were, I would say, overly ambitious. The SRE team at Google did only some of those things. Some teams did some things very well, and some did all. But once the books were released, especially the first one, into the industry, too many people looked at them and said, "Oh, we have to do it this way now". I think the books were not framed well enough in terms of the fact that they were solutions to Google-scale problems. But you are not Google. You need to solve your problems in your own way.
There's a ton of wisdom in the books. I don't feel bad about them or anything like that. But I wish people wouldn't hold them like holy artifacts that they must follow blindly, because that can often end in failure. You need to make your own decisions for your own problems and use what's in those books as a starting point or a way to think about things. Many brilliant people with great experience wrote those and put much time and effort into them. So again, I'm not anti-Google books, but I wish people understood more that it's how Google solved their problems. It doesn't mean that that's how you should solve yours. I hope that it's better understood that they're very ambitious books. Use them as frameworks and snippets of wisdom, but don't follow them as some road map.
I have come to love the phrase “the map is not the territory”. The map can help you. It can set you in the right direction, but when you get there, you need to figure out and find your own path and action.
You have worked on a lot of dev tools. Also, Nobl9 is building an SLO product, a dev tool. What is essential to building a dev-tool product in the observability landscape?
I've worked on infrastructure tools my entire DevOps/SRE career. I've always been on teams where our customers were other people at the same company. Other people relied on us to do their jobs, and their work might in turn serve an actual external paying customer. The most important part is to remember that there are humans on the other side of the table. They don't have to be paying customers. It could be a team down the hall from you. It could be a team across the planet, but the best way to build, maintain, observe, and think about those tools is to frame it around what my users need. It's been fascinating being at Nobl9 because it's the first time I've worked on a product that is, explicitly, directly one step away from the customer. I was always building and maintaining tools for other people at my companies before. But the same approach works across the board. It doesn't matter who relies on you, whether they're paying your company directly, someone you know, or just a user out there on the internet.
People don't pay to use Google search. But the only way you can keep Google search reliable is to still think about the people using it. You still have to think about the humans on the other side; otherwise, you will measure the wrong thing and make bad decisions.
How do you find observability and reliability-related topics and ideas to write and talk about?
Let me backtrack slightly. I probably spend too much time online chatting with other people in the space, whether it's in community Slacks, on Twitter, on Mastodon, or just with friends in real life. I often come across situations that I realize many people are struggling with. I've heard four or five people discussing this as a problem for them and how we can solve it. Much of it is also from my experience, what I'm seeing. Especially now that I've left the cathedral of Google.
Part of it is based on my experience - here is a thing I did, and it worked well; let me share it with others. Part of it is based on other people struggling with something and how we can better address that, and how I can use my years and years of experience to see if I can give a solution to the problem.
But the other thing is I'll go back to that concept of “WalkOps”. I come up with many of my ideas by taking long walks and thinking about things. If you look at one of my conference talks, chances are that I wrote most of it in my head over weeks, if not months, just thinking about it in the background on these walks, sometimes listening to music or podcasts, and it just kind of marinates. And then I can often put the talk together in just a day or two. But that doesn't mean I haven't been writing the talk the whole time I've been thinking about it. Letting things marinate and progress. One of the other essential ways I develop these ideas is through meditative mindfulness. Go on a walk, mostly turn my brain off, and see what emerges.
Do you follow blogs or subreddits to discover what's happening in the SRE space?
This story was recorded on April 14, 2023, before Twitter became X.
I use Twitter as a starting point. I think it's a shame the state that it's in and how it's kind of dying. It's going to be interesting to see what emerges from that. Mastodon, BlueSky, maybe it's something else entirely. But I would say that's my primary starting point because that's where people share a lot of their blog posts and a lot of their articles, their research, everything from academic white papers down to – “here's a five-hundred-word blog about a thing that happened to me at work”, and everything in between.
I don't have many blogs I follow necessarily or even newsletters. But if you're following the right people and the right amount of people, they will share that with you. Finding the right people online and using them as a resource works for me. That's also what I try to do – when I read something, or I learn something, I try to boost that out to the world like here's a cool thing I just read, or here's a great book, and be able to share that with people in turn with people sharing stuff with me – that's I think my primary way of finding things to learn and know what's going on in the industry. Finding out what people struggled with or finding out what people's solutions are. It's been a great resource, and that's why I'm kind of sad to see it slowly dying. But we'll see what the world looks like in six months or a year.
Do you have any recommendations for some books or courses for people just starting their site reliability journey?
There are a ton of great books out there. I don't want to try to name them because I'm going to skip someone or skip a book. What I'd say is – generally, you can trust people attempting to share the right message. Take some of it with a grain of salt because there's always some marketing; someone will always sell you something. I might be trying to sell you something. My advice here is not to list many individual things but to be thoughtful about what you consume. Start with good intentions; assume that people are trying to share something with you because they genuinely want you to be able to do your job better and live a better life because of it. But always realize they might also be selling you something at the same time.
In my last conference talk, I have a sentence: “Maybe don't listen to me even, maybe don't always listen to the people on stage.” So learn from others, but also be thoughtful, be mindful, be aware of the situation, of how that information is being presented to you and why, and ultimately make your own decisions.
We can return to “the map is not the territory” and “all models are wrong, but some are useful”. There are great quotes like that. Regardless of how many books or talks or whatever it might be exist, make sure you're using that information to make your own decisions.
Are there any interesting trends you are excited about in the observability space?
I think people finally understand that you have to measure complete user journeys, especially in a world where everyone's running microservices on Kubernetes, getting away from just resource monitoring. Like, who cares what your CPU utilization is? That might be a good metric to have because it may help troubleshoot something down the road, but what you need to think about daily is what the user journey experience looks like. This is becoming more and more common. People understand that's what they need to be measuring. That's what they care about on a day-to-day basis.
Better distributed tracing and OpenTelemetry, which I'm very excited about. It's cool to see the adoption. Open standards, in general. OpenSLO is also a cool project, a vendor-neutral approach to how you might define, think about, and modify your SLIs and SLOs.
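To make the contrast with resource monitoring concrete, here is a minimal sketch of a journey-level SLI. It is an illustration only and not something from the conversation: the `JourneyAttempt` type, the two-second latency threshold, and the idea of a multi-step "checkout" journey are all assumptions chosen for the example. An attempt counts as good only if every step succeeded and the whole journey finished fast enough, regardless of what CPU utilization looked like.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class JourneyAttempt:
    """One user's attempt at a whole journey (hypothetical example type)."""
    steps_ok: Tuple[bool, ...]   # outcome of each step, e.g. cart, payment, confirm
    total_latency_ms: float      # end-to-end time as the user experienced it


def is_good(attempt: JourneyAttempt, latency_threshold_ms: float = 2000.0) -> bool:
    """Good means every step succeeded AND the journey was fast enough."""
    return all(attempt.steps_ok) and attempt.total_latency_ms <= latency_threshold_ms


def journey_sli(attempts: Sequence[JourneyAttempt],
                latency_threshold_ms: float = 2000.0) -> Optional[float]:
    """Fraction of good journeys; None when there is no traffic to judge."""
    if not attempts:
        return None
    good = sum(is_good(a, latency_threshold_ms) for a in attempts)
    return good / len(attempts)


# Usage with made-up data: two of the four attempts are good.
attempts = [
    JourneyAttempt((True, True), 800.0),    # good
    JourneyAttempt((True, False), 500.0),   # a step failed
    JourneyAttempt((True, True), 3500.0),   # too slow end to end
    JourneyAttempt((True, True), 1200.0),   # good
]
print(journey_sli(attempts))  # 0.5
```

The threshold and the definition of "good" are the parts a team would need to decide for its own users; the sketch only shows the shape of the measurement.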
Now that a few years have passed since publishing the SLO book, do you still find it relevant today? Have things changed or remained the same?
I am immensely proud of the book and believe it is still relevant. There are some things that I would update. I'd spend more time discussing how to use your error budgets. So much of the book is about how to get better data. I would focus more now on consuming this data to make better decisions.
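As a rough illustration of what "using your error budget" can mean in practice, the sketch below computes how much budget is left in a window and maps it to a decision. It is not taken from the book or the conversation: the function names, the 25% threshold, and the suggested actions are hypothetical placeholders for whatever policy a team agrees on.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent in the window.

    1.0 means nothing has been spent, 0.0 means fully spent,
    and a negative value means the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic, so no budget has been consumed
    allowed_bad = (1 - slo_target) * total_events  # bad events the SLO tolerates
    actual_bad = total_events - good_events
    return 1 - actual_bad / allowed_bad


def suggest_action(remaining: float) -> str:
    """Hypothetical policy; the thresholds here are assumptions, not a standard."""
    if remaining <= 0:
        return "pause risky launches and prioritize reliability work"
    if remaining < 0.25:
        return "slow down and review recent changes"
    return "ship features as planned"


# Usage with made-up numbers: a 99% target over 10,000 events tolerates
# about 100 bad events, and 50 have occurred, so roughly half the budget is left.
remaining = error_budget_remaining(0.99, good_events=9950, total_events=10000)
print(suggest_action(remaining))  # ship features as planned
```

The point of the sketch is that the number only matters once it is tied to a decision; the decision table itself is the part each organization has to write.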
How do you recharge yourself from the work?
I love scuba diving, and I've only gotten to do one dive since before the pandemic. I'm planning a trip soon. My favorite place in the world might be Bonaire. It's just a tiny little island, a Dutch island off the coast of Venezuela, a literal desert island where most of the vegetation is cacti. There are only about sixteen thousand people. But it has the most beautiful reefs. They surround the entire island. There are large painted yellow rocks that indicate – “here is a dive site”. So you can drive around with a few friends, find these sites, and go out and explore. I find scuba diving immensely meditative and calming. It's one of my favorite ways to relax and connect. It is my favorite activity in the world.
If you were not an SRE, what would you be?
I would love to open a dog rescue center one day, maybe. I love dogs so much. They are such pure creatures. They want to be good. I love dogs, and if one day I can work with them more closely, that would be great.
Any suggestions for questions I can ask future participants?
I always love learning what people have been able to apply to their site reliability engineering journey from the processes they learned outside of the industry. Being interdisciplinary and learning from other industries is very important, and I always love hearing people say, “Oh, here's a thing I learned doing X.” It might be a hobby, it might have been a previous career, or it might have been studying something academically that isn't computer science related at all.
Thanks a lot, Alex, for sharing your SRE Story with us!