Ariel Richtman's SRE Lessons and Laughs

Tools, Tales and Laughs

Nov 28, 2024

Ariel's voice really stands out in the SRE world. He has this amazing way of turning even the most stressful war room moments into stories that make everyone laugh, no matter how high the pressure gets.

His journey started back in the sysadmin days, and he’s always been the kind of person who embraces every challenge.

When someone says, “I want to use this tool, but it doesn’t do that,” Ariel’s all about finding a way to make it work. This mindset is what keeps him excited about the work he does, whether it's fine-tuning systems or making life easier for those around him.

In our chat, Ariel opened up about his journey, his daily routine, his go-to tools, and what he enjoys doing when he's not working.

And, for those of you who have been in the SRE space, Ariel has a question for you at the end! Don’t forget to tag him and let him know what you think.

Prathamesh:
Let's start with your journey so far. How did you get into the DevOps/SRE community, and what has the experience been like?

Ariel:
I got into coding in primary school, around age 11 or 12, starting with VBA and similar tools. We had an older sysadmin and his assistant at the time. The sysadmin would get furious because I was always tinkering in his computer lab, but the assistant and I had this cat-and-mouse game where I'd try to break his setup.

We were on Novell NetWare, version 4-something. I'd even write a little malware or boot Linux from a CD. The assistant found it amusing, but the sysadmin nearly banned us, and my friend actually did get banned for pushing it a bit too far.

Prathamesh:
That’s interesting. When was this?

Ariel:
I was about 11 or 12—let’s say around 2000. After that, coding took a backseat. I finished school and studied robotics engineering, which had a bit of coding, but it wasn’t the focus, so it faded again.

Later, I taught English as a Second Language (ESL) for a few years. Then, I randomly applied for an ICT job with the government. They took forever to respond, so I assumed it was a "no." But eventually, they called back—they were just that slow. They needed more people and didn’t have a specific role yet, but asked if I was interested. So, that’s how I got into ICT, around 2016.

My first role was as a sysadmin for about three to four years. Then I moved into DevOps for a couple of years. Finally, in November 2021, I joined SilverRail as an SRE.

Prathamesh:
So, from breaking computers to maintaining them—you’ve really come full circle!

Ariel:
Yeah! Fun fact—I got fired from one of my jobs for, well...let’s just say old habits die hard. So that’s the timeline.

At SilverRail, the SRE role was still evolving. It was labeled as SRE, but it covered a mix of responsibilities. I've been pushing it more toward platform engineering, setting things up to make sense rather than just putting out fires. Initially, it was more reactive than engineering-focused.

It’s been an interesting journey, and having the CTO’s support has been great. The hardest part, though, has been navigating the people side and driving a cultural shift.

Prathamesh:
Got it. So, when you're setting up platform engineering processes, do you have a team working with you, or are you more of a lone warrior here?

Ariel:
We have two other SREs in our Brisbane and Australian offices. One of them is closer to a traditional SRE, while the other has been with the company for over 20 years. He’s indispensable and knows everything inside out, but it's been tough to bring him along on this journey due to a knowledge gap, and he’s very tied up with his product team.

We’re embedded within teams rather than as a separate portfolio, so he's tied to his product manager, product owner, and team lead. His workload is heavy, especially now that there's a global push to unify our products technically. The goal is to integrate our applications into a more cohesive suite instead of a loose collection of tools patched together ad hoc.

As part of this, the DevOps and platform engineering team has been leading the way on infrastructure, which is foundational for this initiative. The director overseeing this effort has been away for several weeks due to personal issues, so I’ve been pulling things together.

I’m reaching out to team members across time zones to gather documentation and processes, and we’re finally getting a standardized pattern for Terraform and Terragrunt with the right permissions—this has varied every time we deployed before. So that gives you an idea of how the "team" is structured.

Prathamesh:
What does a typical day look like for you? Do you have a lot of meetings?

Ariel:
My busiest day for meetings is sprint initiation day, with retrospectives, reviews, planning, and everything shifting around. I’m more of a Kanban person—spending too much time planning doesn’t change the work that needs to get done.

On a regular day, I start by assessing, “What’s on fire?” Lately, things have been stable, so I usually review outstanding merge requests first. I aim to keep a daily turnaround on feedback since even a 24-hour delay can drag things out. I also handle updates from our Renovate bot and merge any CI-approved changes.

I make a point to block off focus time in my calendar and shut down Slack and Outlook, so I’m not distracted by chat notifications. Some people type out paragraphs in chat, and I’d just be sitting there watching, waiting to see what they’re writing!

There are usually a few support requests throughout the day, often from interns needing upskilling. For example, I recently spent a couple of hours with an intern, walking her through Docker workflows and contexts.

Then there’s the planned work: discovery and design of new solutions, which is rewarding because we’re not so big that everything’s already solved.

We still have opportunities to extend existing tools and address requests like, “I want to use this tool, but it doesn’t do that.” I enjoy figuring out solutions to those kinds of challenges.
We also work on updating older systems to reduce risks, like adding Terraform where it’s missing. It’s a mix of tasks, but it keeps things interesting.
I like understanding different use cases from my sysadmin days—identifying the software paradigms and fitting them in a way that just works without needing constant revisits.

Prathamesh:
Once it’s done, you can replicate it in other areas if possible. You mentioned starting the day by looking at what’s breaking or on fire. Do you use dashboards for that? I usually ask everyone about their daily check-in process. Do you have a set of dashboards or similar tools you check each morning?

Ariel:
We do have some tools, like Redash, but our dashboarding and data capabilities are limited. There’s a Grafana instance, but it mostly supports our Kubernetes platform, so it’s not comprehensive. For legacy systems, we often rely on the basic EC2 dashboards, which pull whatever information they can gather.

Prathamesh:
Got it. And what tools do you use regularly? You mentioned Terraform—are there others you work with daily?

Ariel:
Yes, I work a lot with Linux and Nix, which has been about 90% of my workload. Also, shoutout to Helix Editor—it’s been great. People love their Neovim or Emacs, right?

Prathamesh:
Absolutely! I’m an Emacs person myself.

Ariel:
Right, those editors are classic and feature-packed. But I thought, “I don’t need another hobby of learning Lua and writing scripts.” So, Helix was a perfect solution for me.

As for other tools, I use Terraform and Terragrunt a lot—anything as code. We’re about to roll out Argo CD pilots soon, which will be helpful because it runs without needing much manual touch.

We also use Atlantis for automated infrastructure deployment, which streamlines Terraform operations.

A big shoutout to Nix as well—it's a game-changer. We do a lot of repo hopping and context switching, so being able to reproduce environments without installing dozens of versions on your machine is incredibly useful.

I just drop a definition in the repo, and as soon as you enter the directory, it sets up everything you need to make sure it works.

We also use Python here, but my philosophy is that, as much as I enjoy writing code, the best code, and the easiest to maintain, is the code you didn’t write.

Prathamesh:
Yeah, no code!

There’s a famous quote by Kelsey Hightower about how the best code is the code you don’t write, which is exactly what you're saying.

One other thing I wanted to ask—about Terraform and the tools you mentioned. In many companies these days, I’ve noticed they keep infrastructure-related code separate from application code, treating configurations separately from product code. Do you follow that practice, or do you use a monorepo where both are part of the same workflow?

Ariel:
This came up in discussion today! The Australian office focuses on centralized infrastructure, similar to what AWS calls a landing zone. It defines the minimum requirements to manage an AWS account—things like reporting, automation accounts, and an EKS cluster.

We try to align infrastructure as code, Helm charts, application code, and container definitions all in one repository. I’ve seen the chaos that happens when you separate everything—when the container definition is in one repo, publishes to a registry, and the Helm chart is in another. It’s a mental overload!

I’ve been there at 7 PM, juggling four different repos on different branches, committing, pushing, changing tags, and deploying, only to have it all blow up.

People warn against putting all Terraform code in one massive blob, which can get unmanageable. We’re revisiting this, and I’m discussing it with someone who wants to consolidate everything into one repo. But you’ll always hit a boundary somewhere, and some level of coupling is inevitable.

For me, aligning application code with the container and infrastructure you're deploying is key. If you try to shove everything into one repo and hope it all lines up, it can get tricky.

I’m looking for a term to describe the scenario where a URL has to match exactly across different layers, such as the environment variable, the Helm chart, and the application config. If you come up with a phrase for it, let me know! It definitely deserves its own name, the way “yak shaving” has one!
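
Editor's note: the drift Ariel describes can be caught mechanically before a deploy. The following is a minimal sketch, not something from the interview: the layer names, the example URLs, and the helper functions `normalize` and `find_url_drift` are all hypothetical, and it assumes the URL has already been read out of each layer into a plain dictionary.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to a comparable form: lowercase scheme and host, no trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def find_url_drift(layers: dict[str, str]) -> dict[str, str]:
    """Return the layers whose URL differs from the most common normalized value."""
    normalized = {name: normalize(url) for name, url in layers.items()}
    values = list(normalized.values())
    # Treat the most frequent value as the expected one; ties resolve to the first seen.
    expected = max(values, key=values.count)
    return {name: url for name, url in normalized.items() if url != expected}


# Hypothetical values, as they might be read from each layer.
layers = {
    "environment variable": "https://api.example.com/v1/",
    "Helm values": "https://api.example.com/v1",
    "application config": "http://api.example.com/v1",
}
drift = find_url_drift(layers)
print(drift)
```

In this made-up example the trailing slash is treated as harmless, while the `http` scheme in the application config is reported as the odd one out. A check like this could run in CI, though which differences count as harmless is a judgment call for each team.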

Prathamesh:
Exactly! Maybe something like config shaving or a similar term. But shifting gears a bit, you’ve been in this industry for about 10 years now, right? What keeps you excited about your work? With so many changes happening and trends evolving, what drives you to stay engaged every day?

Ariel:
I consider myself deeply technical, so I’m always reading and learning. I often hop on the treadmill or bike and listen to lectures from conferences like Goto or LinuxConf.

What works for me, though, is the facilitative role I play—helping others do their jobs and make things easier. My journey started as an English teacher, transitioned to sysadmin, and now I’m in infrastructure and process delivery.

I couldn’t do pure ops where the same tasks are repeated. I thrive on new challenges! When I see technology come together and really “sing” for people, when everything fits into place and creates something greater than the sum of its parts—that’s incredibly rewarding for me.

Prathamesh:
When you mention watching talks from Goto or other conferences, how do you stay informed about what’s happening in the industry? Do you follow specific people, blogs, or accounts?

Ariel:
That's a great question! I have several email subscriptions that keep me updated. For example, I enjoy SRE Weekly by Lex Neva, as well as TLDRSec and WeeklyTF. And of course, I can’t forget SRE Stories—it’s essential! There’s also a platform engineering newsletter I subscribe to, though it's a bit infrequent.

I’m part of a few Slack and Discord channels, but they tend to be pretty quiet, especially compared to the fediverse. I hopped off Twitter a while back, so I rely more on these platforms and email newsletters.

Social media has its advantages too. Following people allows for more interaction. I can tag someone with a technology or question, say, “Hey, I think it works like this, but is there a better solution?” and people will jump in to correct me if I'm wrong, which is fantastic for learning!

Prathamesh:
Based on all the information you gather, what trends in the SRE space excite you, and what trends are you less enthusiastic about?

Ariel:
That’s a great question!

I have a strong feeling that the current generation of DevOps tools will eventually be superseded, much like how they replaced Salt, Ansible, Puppet, and Chef.
While those tools aren’t dead, they’ve fallen out of favor due to a shift towards more disposable infrastructure—just build it from scratch again. I’m not particularly excited about whatever I’ll have to maintain that’s generated by AI, either.
On the brighter side, there are a couple of hot topics right now. Tools like System Initiative and Dagger.io for declarative CI/CD are gaining traction. We’ve become clever enough with YAML that we’re starting to hit its limitations, so I think we’re ready for a shift.
Similarly, while we’ve been using Terraform and Terragrunt, HCL has its limitations. I suspect we might eventually move toward a more general-purpose language for infrastructure as code, perhaps something like Cuneiform or another data structuring language.

Prathamesh:
When you mention AI, I know some tools are trying to integrate it into observability, like the Grok query engine from New Relic. Do you think AI will help SREs or DevOps professionals with monitoring and debugging, or is that too far-fetched at this point?

Ariel:
I think there’s potential there! Markov Chain-based AI, for example, could play a role in observability. It can help beginners generate boilerplate code, which is always easier than starting with a blank page. I wouldn’t dismiss it outright.

There’s been a noticeable uptake of tools like GitHub Copilot, and surveys suggest that many users appreciate it. Where I see AI making a significant impact in the SRE space is through machine learning.

It can sift through vast amounts of data, identifying anomalies or statistically unusual events. This capability can help generate a list of alerts for SREs to review—enabling them to assess if these anomalies warrant alarms or if they need tuning.

Overall, I see a lot of potential in using AI for observability and monitoring, as long as we approach it with a critical eye.

Prathamesh:
You mentioned statistics earlier, which brings me to anomaly detection in observability. Often, people expect these tools to magically detect issues, but behind the scenes, it involves statistical models or AI/ML algorithms.

How do you view these expectations? What are your expectations from anomaly detection in observability tools?

Ariel:
My first consideration is the intrusiveness of the solution. Some frameworks, like Prisma Cloud, may require extensive proxies that can be quite invasive. For instance, Dynatrace might want to deploy a hefty 60 MB agent on your Docker containers, which raises concerns.

Regarding anomaly detection itself, my experience with data tells me that context and data modeling are critical. Simply having access to raw statistics isn't enough; the data needs to mean something relevant.
Anyone can track the rate of change, but if it's Black Friday morning, that context changes the significance of the data.
What’s essential is understanding what combination of data points represents the objects we want to monitor.
For instance, how am I using rolling windows? What does year-on-year data mean for my specific case? This kind of analysis takes considerable effort. It's easy to deploy a dashboard using tools like Kibana, but getting to a point where someone can look at several charts at 2 AM and confidently identify the issue—that’s the real challenge.
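
Editor's note: to make the rolling-window and context points concrete, here is a minimal sketch of one common approach, a z-score against a trailing window. It is an illustration under stated assumptions, not a description of any tool mentioned in the interview: the function `rolling_anomalies`, its parameters, and the traffic numbers are all hypothetical.

```python
from statistics import mean, stdev


def rolling_anomalies(values, window=5, threshold=3.0, expected_peaks=frozenset()):
    """Flag indexes whose value sits more than `threshold` standard deviations
    from the mean of the preceding `window` points.

    `expected_peaks` holds indexes where a spike is anticipated (for example a
    known sale day), so those points are skipped instead of alerting.
    """
    flagged = []
    for i in range(window, len(values)):
        if i in expected_peaks:
            continue
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # Flat history: any change at all is unusual.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical requests-per-minute samples with two spikes.
traffic = [100, 102, 98, 101, 99, 100, 300, 101, 99, 100, 102, 98, 100, 310]

print(rolling_anomalies(traffic))                       # both spikes flagged
print(rolling_anomalies(traffic, expected_peaks={13}))  # second spike is expected
```

Passing `expected_peaks` is a crude stand-in for the Black Friday kind of context: the same number is an anomaly on an ordinary day and normal on a planned peak. Note also that a spike inflates the window that follows it, so points just after an incident are judged against a distorted baseline. That is one reason tuning these windows takes real effort.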

Prathamesh:
Absolutely! The trust factor is crucial. At 2 AM, you need to trust that the observability tool is showing you the right information and guiding you on what to do next.

Ariel:
Exactly! You hit the nail on the head. Trust in the system is crucial.

Prathamesh:
One of my favorite questions is about war room stories. I learn a lot from them, and many SREs relate to them. Do you have any memorable incidents from your experience that you're particularly proud of? Something that offered valuable lessons in complexity or learning?

Ariel:
Absolutely!

One incident stands out because it was unexpected and taught me a lot about failure modes. We had an instance of Artifactory, which our jobs use to publish artifacts. If it goes down, it's not mission-critical, but it does impact developer workflows.
So, we had a scheduled security update for the EC2 instance running it. After running the update and rebooting the machine, things went awry. It turned out that somewhere along the line, one of the config files had changed, and the schema wasn’t updated to match.
We had snapshots and retention policies in place, but we had no idea how long this latent issue had been lurking, waiting for a reboot.
As we dug deeper, we discovered that we were pulling our images from this instance onto our production Kubernetes. And Kubernetes tends to move pods around. So suddenly, what had seemed like a minor issue became much more critical. If production started doing its thing, we were potentially in big trouble.
We spent a long day troubleshooting. I even had people digging through logs while I worked on pulling some of the older machines, which were also impacted.
At one point, I had to hand over the situation to another senior engineer. He eventually managed to get one of the old machines working, but the bizarre part was that it just needed a reboot—after bouncing it multiple times, it finally came back up.
This incident was memorable because it highlighted a strange failure mode. It was like a landmine waiting for someone to step on it, and it taught me the importance of understanding our dependencies and failure scenarios.

Prathamesh:
I love your description of the incident as a landmine waiting to be triggered. It highlights how failures are often inevitable, especially when building durable systems. This is where resiliency and reliability come into play. How do you define the reliability of a software system? What does reliability mean to you as someone maintaining that software?

Ariel:
Reliability, to me, means that a system behaves as expected consistently. It's important to note that reliability isn't the same as availability. A system can be available but still not function correctly.

For example, we had a recent issue with our OpenSearch cluster. The application hit a shard limit and began rejecting writes, returning a 429 error. I mentioned to the principal architect that we had an incident because the OpenSearch cluster was essentially down. He responded, “What do you mean? It's available!”

I pointed out the classic debate: Is it truly available if you can't log into it? Reliability is about the system consistently performing as intended. While the system must meet functional needs, you also need to be mindful of its availability. If it behaves exactly as you want but isn’t consistently accessible, that’s a problem. Context matters a lot in these discussions.
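
Editor's note: the gap between "it's available" and "it's working" shows up clearly when the two are measured separately. The sketch below is an assumption-laden illustration, not the team's actual metrics: the `Request` record, the two functions, and the sample data are hypothetical, and "served as intended" is simplified to mean a 2xx status.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    # None means no response at all (timeout, connection refused).
    status: Optional[int]


def availability(requests):
    """Share of requests that received any HTTP response (assumes a non-empty list)."""
    answered = sum(1 for r in requests if r.status is not None)
    return answered / len(requests)


def success_rate(requests):
    """Share of requests that were served as intended, simplified here to 2xx."""
    ok = sum(1 for r in requests if r.status is not None and 200 <= r.status < 300)
    return ok / len(requests)


# Hypothetical sample: the cluster answers every request but rejects most writes.
sample = [Request(200)] * 2 + [Request(429)] * 8

print(availability(sample))  # every request got an answer
print(success_rate(sample))  # but only a fraction did what the caller wanted
```

With this sample, every request gets a response, so the availability figure looks perfect, while the success rate shows that most writes were rejected. Which of the two numbers a team alerts on depends on the context Ariel mentions.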

Prathamesh:
That makes total sense. For someone starting their career in the SRE space, what traits do you think are essential to be successful?

Ariel:
There are two key traits I believe are vital. First, you need to be inquisitive. This means asking questions and seeking to understand how things work. Second, you should feel a sense of annoyance when something isn't quite right. That discomfort will drive you to dig deeper, pull at the threads, and ultimately unravel the problem to find a solution.

Prathamesh:
Absolutely! I think that the annoyance you mentioned is crucial. Without it, there wouldn’t be the motivation to fix issues. If you weren't in SRE, what would you be doing? What alternative career path do you envision for yourself?

Ariel:
Realistically, I’d probably be unfulfilled but still teaching. While teaching has its rewards, I felt I could see the limits of it for myself. That said, outside of DevOps and SRE work, I enjoyed my time as a Release Train Engineer.

That role involved cross-team product coordination, which I found quite fulfilling. Interestingly, when you step away from coding, the urge to jump back in and tinker with YAML—or whatever the tools of the trade may be—can reignite with a passion!

Prathamesh:
Absolutely! Taking a break from the routine can often reignite your enthusiasm. Speaking of breaks, how do you recharge? The SRE and DevOps roles can be quite demanding, especially with on-call duties and the need to connect systems and people. How do you find your balance?

Ariel:
Exercise is a big help for me. I enjoy activities that allow me to zone out and process my thoughts—like hopping on a stationary bike or treadmill. But sometimes, I just need to step away from the keyboard entirely and go outside. It’s all about pacing yourself and ensuring you're well-rested.

Prathamesh:
That’s great advice! If you could ask future participants of SRE Stories a question, what would you want to know about how they approach their work?

Ariel:
That’s an interesting question!

I’d love to hear how they manage people, especially when it comes to balancing creativity in coding with the need for standardized processes. How do they handle individuals’ desire to create their own tools or methods when there might be a more maintainable solution available?

Prathamesh:
I’ll be sure to include that in my discussions!

Thank you, Ariel, for taking the time to chat with us!

As we wrap up our chat with Ariel, it’s clear that his journey in SRE is more than just a career—it's a blend of technical prowess and a genuine love for problem-solving. He reminds us that behind the systems we manage, there are stories, laughter, and lessons learned from those “uh-oh” moments.

It’s truly heartwarming to see someone so passionate about their work and always eager to learn and share insights. I’m sure everyone at some point feels like they’re struggling with people and cultural shifts, but Ariel’s experience should give you hope that, eventually, you’ll handle it all.

We'd love to hear from you!

Share your thoughts on reliability, observability, or monitoring with us. If you know someone with a passion for these topics, suggest them for an interview. And hey, why not join our SRE Discord community to connect with like-minded folks?

I know you’ll want to connect with Ariel and learn how he makes it all seem effortless—so be sure to connect with him on LinkedIn!
