SRE Story with Ricardo Castro

Applying Software Engineering principles to the world of Operations

Sep 11, 2023

Today, we have Ricardo Castro sharing his SRE Story with us. Ricardo is a Principal Engineer and SRE at Blip.pt/FanDuel. He is also a tech author and speaker.

Ricardo, why don't you introduce yourself? How did you become an SRE?

Hey everyone, I am Ricardo. I have a master's degree in Computer Science from the University of Porto, Portugal. At some point, I decided I wanted to pursue a PhD. So I did the first year of my PhD, but then I said, okay, this is not for me; I want to enter the industry. I researched the private sector before starting my first job. There, you did software engineering and whatever else needed to be done. For many years, I was a software engineer, a normal one, just building products. I started to specialize in the back-end, as that interested me more. Eventually, I spent a couple of years in London, where I got my first taste of the difference between Ops and Dev. Throughout my career, I always had some operational responsibility when building products. In London specifically, I was the lead developer at a big company. We were a small team, so we outsourced many small projects to freelancers or agencies that could build stuff for us quickly.

Eventually, we realized we were using a similar technology stack in many places. We needed to deploy this, put it live, and build tooling so that freelancers, for example, could deploy without needing us. And we needed to do the other stuff. That's when I started to be more hands-on with automation and all the CI/CD work.

The DevOps term was beginning to be popular at the time. Around 2015, when I moved back to Portugal, my first job title was DevOps Engineer; let's say my title switched from Software Engineer to DevOps Engineer. Since then, I've worked more on operations, always with the software engineering mindset. I always think: how can I solve this using my software engineering skills? So, when I started listening, reading, and talking with people about SRE, the appeal was obvious. The genesis of SRE is the question of how I can approach operations as a software engineering problem. That's what I've been trying to do for a very long time now. I try to use my software engineering skills to approach this.

What resonated with me was the need for a way to define Reliability. It's one of the most important, most essential features of a system. I can develop whatever I want, but if the user is unhappy with it, because it's not up or because it misses whatever our Reliability measure is, they won't use it. I've primarily focused on SRE work for the last three to four years. I work for FanDuel at the Porto headquarters of Blip, a subsidiary of the company. We are trying to build an SRE team from scratch for FanDuel. Before doing any technical work, I am working on defining what Reliability means to us. This means understanding the product as a whole, understanding our customers, and trying to see what they value from our platform. What does it mean for us to be reliable?

You touched upon the need to define Reliability. How do you represent Reliability today based on your experience so far?

I use the SLO framework, although we need to adapt it occasionally. Before even going to SLOs and having a definition of an SLO, the way that I approach it is to talk with the product people, the business people. I consult them about the business flows or the user stories that are most important in our system. Once we understand what the user does with the platform, we ask what Reliability means for each specific flow. Or, to put it positively: what does it mean for this flow to be reliable so that users continue using our product? This sometimes involves talking with the customers themselves.

Sometimes, a specific flow needs to be fast, so we deal with latency. For other flows, I don't need it to be that fast, but it must be accurate. For example, think about a banking system. Usually, users are okay with sacrificing speed to be sure that funds were correctly transferred from point A to point B. They're not so worried about this being done in 500 milliseconds or something like that. Once we have this information, we define SLOs for it. SLOs can be more technical. For example, I can have an SLO for latency. They can also be business-related, and then we translate them into technical terms. For instance: over the last 30 days, 99.9% of my checkouts are successful. Then, we define what it means for a checkout to be successful: it must be completed in under 500 milliseconds, the response code must be different than 500, and the checkout must be accurate. Okay. Do we have the metrics, logs, and traces to support it? If not, we look at the observability part. I need to incorporate the signals that will allow me to ensure that the SLO can be set up. In this way, I usually go from understanding the user, to mapping out the business flows, to incorporating Observability, which allows us to track and measure Reliability.
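To make the checkout example concrete, here is a minimal sketch of how that SLO could be evaluated over a window of events. It is an illustration under stated assumptions, not the tooling used at FanDuel: the `Checkout` record, its field names, and the helper functions are all hypothetical, and in practice these numbers would usually come from a metrics backend rather than an in-memory list.

```python
from dataclasses import dataclass
from typing import Iterable

SLO_TARGET = 0.999        # 99.9% of checkouts must be good (from the example above)
LATENCY_LIMIT_MS = 500.0  # a good checkout completes in under 500 ms


@dataclass(frozen=True)
class Checkout:
    """Hypothetical record of one checkout attempt inside the 30-day window."""
    latency_ms: float
    status_code: int
    accurate: bool  # assumed to come from some business-level reconciliation


def is_good(event: Checkout) -> bool:
    # Mirrors the three conditions in the text: fast, not a 500, and accurate.
    return (
        event.latency_ms < LATENCY_LIMIT_MS
        and event.status_code != 500
        and event.accurate
    )


def sli(events: Iterable[Checkout]) -> float:
    """Fraction of good events; an empty window counts as fully reliable."""
    events = list(events)
    if not events:
        return 1.0
    return sum(1 for e in events if is_good(e)) / len(events)


def meets_slo(events: Iterable[Checkout], target: float = SLO_TARGET) -> bool:
    return sli(events) >= target


def error_budget_remaining(events: Iterable[Checkout], target: float = SLO_TARGET) -> float:
    """Share of the error budget still unspent (negative once it is exhausted)."""
    events = list(events)
    allowed_bad = (1.0 - target) * len(events)
    actual_bad = sum(1 for e in events if not is_good(e))
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad
```

The useful part of this shape is that "good" is defined once, in business terms, and both the SLI and the error budget are derived from it. If any of the three fields cannot be populated, that is the Observability gap to close first.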

This means you interact with a lot of people all the time. What does your typical day look like?

Yeah. We are building a horizontal team that will build tooling, libraries, and a lot of stuff for other teams. We're not part of a specific team; we will be something like a centralized team that builds things for others. At this point, we focus on mapping out the business flows, ensuring Observability, and setting some standards because our services are getting quite big. Having standardization means that we can understand what's going on. As a principal engineer, I'm still very hands-on but must network with other teams. This means speaking with other teams to understand their concerns, what they need, and what they don't. It also involves spreading our vision. So I have a lot of meetings. Some days, it will be more hands-on, and some days, it will be heavier on meetings and alignment with the heads of other departments and teams.

What does your work setup look like? Do you work remotely, or do you work with people in the office?

Yeah, I'm currently remote, but our company gives us the option of how we want to work. So I can work from the office, I can work in hybrid mode, or I can work entirely remotely. My option was fully remote, but I regularly go to the office. Although it's fully remote, in practice, it's hybrid. That said, we also have people in the US. Our team is spread between Portugal and the US. As a team, we are a distributed team by nature.

Are there any tools you depend on, like a particular editor, command line tool, or some language you use daily?

I have worked with several languages over the last few years, specifically for software engineering. Python was the lingua franca in most companies I worked for. In the last few years, I've been mainly using Golang, building CLIs or APIs with it, but now I am somewhere in between. Although most of our systems are JVM-based (Java, Scala, and Kotlin), our team primarily uses Python for the Ops work, as there is a lot of Python knowledge in the company. But yes, my strongest programming languages now would be Python and Golang.

Are there any tools that you use every day apart from these languages?

Yeah, I use Terraform for infrastructure as code, Ansible and Chef for configuration management, Kubernetes for orchestration, and a lot of stuff around Kubernetes: things like Helm, GitOps with Flux or Argo CD, and service meshes with Istio. Many of these are CNCF projects we currently use.

How do you decide how to choose a particular Observability tool? There are so many tools present in the CNCF landscape. What points do you consider while evaluating a specific tool?

Regarding Observability, you want a tool that allows you to understand what's happening with your systems. You can go for a complete SaaS solution, something like New Relic or Datadog, or whatever it is. The pro of that is that most things are already integrated. If you use their libraries and SDKs, everything comes almost for free. You pay a price, but everything is taken care of for you. One of the things, as you said, with the CNCF landscape is that there is so much stuff that it takes effort to understand what's going on and choose tools. I go with popular tools because, usually, those have better documentation, more people who can help, more vibrant communities, and more development.

One of the things that is somewhat fragmented at the moment with observability tools is that you need to jump around between tools. So, for example, for traces, you need to go to Jaeger. Then, for metrics, maybe you have Prometheus and Grafana. Then you have logs, and you go somewhere else. The Grafana stack, with Loki and their other tools for metrics and logs, gives you a central point where you can go and correlate things with each other. It makes the experience similar to using a SaaS solution like New Relic or Datadog, where you go to one place, everything is there, and you can jump around from logs to traces to metrics.

So I'm moving towards that kind of setup, where I can go to a central place and see everything instead of jumping around and copying things between tools. Otherwise, it takes a lot of engineers' time.

In this observability space, we see more of an effort in standardization. So we're seeing more and more of these tools appearing. It's a good option as well.

Does the single source of data play an essential role? Do you see it becoming a critical factor in choosing a tool?

I would think so. Consider having several tools, even if they are SaaS products: for logs, you need to go into Elastic Cloud; for metrics, you need to go to Datadog or something like that. It is very cumbersome and tiring. And then you will struggle to correlate things because the information is fragmented across tools. If you can pull those things together, you can browse around easily. So it will be a decisive factor in the future. Of course, for that, we still need some standardization. There was news yesterday that they will merge the conventions from OpenTelemetry and Elastic's Metrics Convention, which is an excellent step towards a standard way of doing this kind of stuff. We're still at the beginning of standardization for Observability. But in the end, the solutions, either open source or commercial, will have everything. That doesn't mean there won't be pieces moving around behind the scenes, like microservices: one takes care of metrics; another takes care of logs. But you have a central place, a UI, and an API where you can query for all data types.

Talking about building an SRE team, how is the experience of ramping up new people?

In our case, we already have cloud engineering teams and DevOps teams. One of my responsibilities was deciding what the SRE team would focus on. We identified a couple of things that needed to be covered and that we wanted to cover. One example is Observability. We want someone who takes care of it and ensures we have enough Observability to understand whether our systems are reliable. We also started to produce and collect a lot of internal information. Our team must first understand the business because we are responsible for the platform's Reliability. So, we're looking at this at a platform level. It didn't make sense to us to focus on a single service and optimize for that. We want to optimize the platform as a whole. So, the new engineers needed to understand the business: what it does, what our types of customers are, and what it means to be reliable for them.

Of course, this involves collecting a lot of information: how deploys are done now and how the infrastructure is set up. Building a knowledge base of what we want to tackle and what information is out there is critical. Of course, there are all the other parts related to people management as well.

Switching over to your work, do you have a fixed set of dashboards you look at daily?

Yeah, kind of. We're still working a lot on finding some standardization. But we do have metrics at the system and business level. The business-related metrics are essential. They tell us if something degrades or something funky is going on. We need to look at what we want to build in the future to capture if the user is happy. Then, of course, we can observe specific services or smaller flows. We're doing some POCs for building internal tooling to be smarter than just having a static threshold. We have some internal hackathons where we're thinking a lot about this, and we'll eventually get into all the AI machine-learning stuff. Still, for now, we're just keeping it simple.

What's your process for responding to incidents?

There's an underappreciated art in incident management: people often forget about mitigation. People dive in and try to find the root cause head-on. Sometimes that is required, but most of the time you can mitigate the problem first. This is funny because that's one of the questions I usually ask in interviews: how do engineers approach incident management? And many people say we need to find the root cause, blah, blah, whatever it is. There are better approaches than that. I usually give an extreme example. Imagine you work in a bank, you're responsible for the banking system, and money is being stolen from your customers. Are you going to start by looking for the root cause? It's not the best approach.

Shutting everything down is the best mitigation that you have, perhaps. So, in incident management - mitigating the issue should be your primary concern. When you mitigate the problem, you buy time to understand the issue. It may not be possible in all cases, but even if you make the issue less concerning, you can buy some time. You can say now I can breathe, and everyone is calmer. I can try to understand what the hell is going on.

And sometimes, there are extreme examples where you need to shut everything off because maybe money is being stolen. There may be some PII that someone has access to and shouldn't. Other times, you have to contain the issue and understand the degree of impact. For example, there is system degradation, but not everything is down: the user fails two times, but the third time it works, so the user can still perform the task. My approach and my team's approach, and it's a good approach, is to try to mitigate first. We need to understand what's going on. Can we make this less painful? We can switch, buy some time, and then go and see.

The other important point is about communication. People need to understand that every once in a while, especially when dealing with technical stuff, you have to communicate with people about the business impact of the incident. You need someone who can make that communication and ensure that executives understand what's going on, know that people are working on it, and have a good understanding of the problem. Not in terms of this service or a coding issue, but in terms of the business impact. So comms, both internal and external, are critical so that our customers and senior management understand what's going on and can be calm because the best engineers are working on the problem and trying to fix it.

Where do you think organizations struggle or need help in their observability journey?

Typically, I have seen two kinds of organizations: those with insufficient Observability and those with too much Observability. Insufficient Observability means, for example, having only logs or only metrics; something happens, and you can't find out why. Every organization goes through this. What the best organizations do is identify the gap and say: okay, I only have logs. The next time this incident happens, I need metrics or traces, so that I have the information to make better decisions. And that's a regular journey. If you're building a company, you'll work on the most important things. At some point, it will be, okay, my Observability needs to improve, and then you will work on it. Other organizations go the other way, and it's like, okay, I will add everything I can from the get-go. And they have metrics, logs, traces, stack traces, and real-time user monitoring. They have the whole shebang.

Usually, in such cases, there is a lack of clarity about where to look for information. The information is spread out. I also see a lack of standardization: each team does what it wants. Teams create the metrics they want with the names they want. This is just one example. And that lack of standardization makes it very hard to make correlations.
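One lightweight way to push back on "the metrics they want with the names they want" is to lint metric names against an agreed convention, for instance in CI. The sketch below is hypothetical: the convention itself (lowercase snake_case ending in a unit suffix, loosely modelled on common Prometheus naming practice), the `ALLOWED_SUFFIXES` list, and the `check_metric_name` helper are assumptions for illustration, not a standard described in this interview.

```python
import re

# Assumed house convention: lowercase snake_case with at least two segments,
# ending in one of a small set of agreed unit suffixes.
ALLOWED_SUFFIXES = ("_seconds", "_bytes", "_total", "_ratio")
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")


def check_metric_name(name: str) -> list[str]:
    """Return a list of convention violations; an empty list means the name passes."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not lowercase snake_case")
    if not name.endswith(ALLOWED_SUFFIXES):
        problems.append("missing an agreed unit suffix")
    return problems
```

For example, `check_metric_name("checkout_request_duration_seconds")` returns an empty list, while `check_metric_name("checkout_latency_ms")` flags the missing unit suffix. The specific rules matter less than having one shared, automated check that every team's metrics pass through.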

And, of course, that usually comes with one of the biggest problems: Cost. Because then you'll have to store all that information you may not even use properly.

Too much Observability can also impact the services themselves. If I enable tracing without sampling, my service will be slow. That is one of the biggest problems that organizations get themselves into. Then there's the discussion about whether to build or buy. And for this, there's no one-size-fits-all answer. If Observability is critical for your organization, you go with standards, and at least in the beginning, you can buy, and then you can build. I know this is not a very specific answer; it depends on a lot of context. For most organizations, if they go with some standard such as OpenTelemetry, they can move between vendors. Now, I'm using this specific provider, but okay, things are getting too expensive, or I need to renegotiate. Or I found out that Observability is more critical than I thought, so I'll likely build a team inside that can handle this. Because you're already using standards, migration is more manageable. For most organizations, starting with some standards and using a provider where they can send information can be a good option, because they're focusing on building their products rather than on managing Observability.

If they find out that Observability is very critical for them, they can say, "Okay, I'll build my Prometheus cluster. Or I'll create my own OpenSearch cluster, whatever it is." But only if it makes sense. Again, this will be a company-by-company, organization-by-organization decision.
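On the earlier point about enabling tracing without sampling: a common remedy is ratio-based sampling keyed on the trace ID, so that only a fixed share of traces is recorded and every service makes the same keep-or-drop decision for a given trace. The sketch below illustrates the idea in plain Python. It is not the OpenTelemetry SDK API (which ships its own samplers), and the `should_sample` function and the string trace-ID format are assumptions made for the example.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of all traces, keyed on the trace ID.

    Hashing the ID (instead of calling random()) means every service that sees
    the same trace ID reaches the same decision, so traces stay complete.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    threshold = int(rate * 2**64)              # share of that range we keep
    return value < threshold
```

With `rate=0.1`, about one trace in ten is kept, which cuts both the runtime overhead and the storage cost mentioned above. The trade-off is that rare failures may fall outside the sample, which is why some setups sample errors at a higher rate.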

Where do you find the information about the new things that are happening? Do you follow specific newsletters?

Yeah, so I follow many blogs and websites. I follow the technical blogs of particular projects. For example, I follow OpenTelemetry closely, and I follow some of the most active people involved with those projects. Those are usually core contributors or developers, and they do the heavy lifting for you: okay, this new thing is happening, and they just want to make you aware of it. So, I follow a bunch of technical blogs for specific technologies, for example, OpenTelemetry, Kubernetes, and Prometheus. I use Feedly, so I have those subscriptions. Every day, I go through the links. I do a first skim to decide whether something interests me or is not essential. If it's a short article, I read it immediately. If not, I make a note: okay, I need to read into this new feature and understand it. I follow open-source contributors because, on top of the technical blogs, they also share things like meetups or talks done at conferences. Or they had a meeting with someone from the core team, and there's this new thing they're thinking about, and perhaps they're sending out a form to ask for feedback, for example. I follow a bunch of people on both Twitter and LinkedIn so that they do the heavy lifting for me and tell me what I should know, and I don't have to sniff around everything myself.

A few interesting publications I follow -

Is there anything in the world of Observability that you're excited about? Is there anything that you are worried about?

The thing that I was worried about, but not so much now, is the trend of rebranding everything as Observability. There's a difference between Monitoring and Observability. But all of a sudden, everything was Observability. It was mainly a rebrand for a lot of tooling.

I was also worried about the explosion of projects and tools in the Observability space. If things keep going this way, we'll never be able to catch up on everything. So, some of these projects need to merge, and some standardization needs to emerge.

And now we are starting to see the beginning of that. What got me most worried about Observability is starting to get tackled. So now we are beginning to see projects saying that it doesn't make sense that OpenTelemetry has its conventions and that Elastic's standard conventions also exist. We may need to merge these two projects or their conventions. That should be the way to go. That doesn't mean that there won't be competing things. But it's one thing if you have one or two or three things vs. twenty. Now, the observability space is starting to realize this promise. This consolidation and standardization are beginning to emerge, and that excites me about this space because it will make our lives a lot easier. Because we will have standards, we can build tooling around them. We can move even further with auto-instrumentation and provide tools that engineers can use, not out of the box, but almost out of the box. That means that engineers can focus on what matters rather than on specific details of a technology they don't care about daily.

Any memorable incident that you worked on in the recent past that you are proud of?

Yeah, let me think about it. So there are a lot of incidents that I've been part of. Some have been our problems, and some have been problems with providers. I remember, many years ago, I was working for a company in the financial sector, and we were tracking a business metric: how exposed we were to the market in terms of assets. We had thresholds for exposure. The business defined those, with steps for what to do if we exceeded a limit. We built an automatic tool that would cover those cases. It could buy some assets or sell some to keep us within bounds. This was late December; I'm sure it was when we had our company Christmas dinner. We started to see the graph where we mapped out that exposure going up and down like crazy. And we were like, what the hell is going on here?

So, the whole team came together, and we did an excellent job trying to understand the issue. The problem was not on our side; it was with the provider that we used. Without getting into too many details, the provider was not giving us accurate information; it was giving us outdated information, which meant that our system, which automatically manages our exposure, was acting on obsolete information. That's why we kept seeing the graph going up and down. That was very intense discovery work. I'm proud of that event because it involved almost every team in the company: the operations team, which was my team, development engineers, and people from business. It involved software engineers because we all needed to understand what was happening. Okay, the problem is not with us, so where is the problem? Then there was the business side. What happens if we go above or below certain levels? So we needed to communicate and coordinate with the business. Okay, we have this problem. How critical is this? Do we have bounds where we need to shut everything off or do some degradation?

It took us quite some time to figure it out, but it went well in the end, and there were no significant issues.

If you want to talk about two essential traits that an SRE should have, what would they be?

Focus on the client or customer, whatever you want to call it. I still see a lot of SREs struggling with that. I understand, because sometimes that reflects the organization they're working in. I can appreciate that there are obvious things that need to be addressed. For example, if you provide some services from a website and the site keeps going down daily, okay, we need to address this ASAP. That's one thing, but the other is focusing on the user. When you're at the point where you want to define, measure, and assess Reliability, always focus on the user or the customer, whatever you want to call it. That's the main thing that SREs need to focus on.

Then, there are many other things. You will need a Reliability framework. You can use SLOs, but the focus should always be on the customer. In a customer's eyes, what does it mean to be reliable, and what are we doing to match that? The other trait relates to an anti-pattern, which I call the rebranding anti-pattern, where we keep giving a new name to Ops teams. This also happens in other areas, but I'll use the Ops world as an example. So you had sysadmins, and they got converted to DevOps engineers. Some of those have been converted to SREs, but they keep doing the same thing. They may use a new tool. They may have been using bash scripts, and now they're using Terraform. But nothing changes in their day-to-day work. It may differ from organization to organization, but SREs need to understand the scope of their work and approach Reliability or operations with a software engineering mindset. Everyone can learn to code. There are lots of resources online. That doesn't mean you need to know exactly what software engineers know, but approach the operations problem as a software engineering problem, because that's the only way to create and maintain massive systems now. We need to approach these things with a software engineering mindset and say, okay, some of these things can be solved with code and automation. Those are the biggest things I say to SREs.

One is to focus on the customer when you're trying to define Reliability. The other is to approach operations as a software engineering problem, the premise from which SRE was born within Google: can I write code to fix this problem?

If you're not an SRE, what would you be?

If it were in tech, I would return to back-end engineering because that's where I feel most comfortable. If it were not in tech, I'd be a musician because I like that. I played guitar for many years, and then I stopped. It's something that I enjoy. I'm into guitar and metal music, so I'd probably try to follow a music career.

Thanks, Ricardo, for sharing your story with us. Ricardo can be reached on LinkedIn and Twitter. He also writes extensively on his personal blog.
