SRE Story with Iris Dyrmishi

OpenTelemetry and building Observability Platforms

Jan 09, 2024

Today, Iris Dyrmishi, Senior Observability Engineer at Miro, is sharing her story with us.

Iris, let's start.

Hello, my name is Iris. I am from Albania originally, and I moved here to Portugal around three years ago. I work as a senior observability engineer at Miro now. Previously, I worked as a platform engineer with a focus on Observability.

I started my career nearly four years ago in Bulgaria. After earning my bachelor's degree in computer science, I worked as a backend engineer for three months. Then I worked for a company that offered services to other companies. They needed a DevOps engineer, and they made me one. So, I started my training in DevOps. That's where my passion for Observability, monitoring, and metrics started.

After I moved to Portugal, I started working for a luxury retail company. This is where Observability became my career's primary focus. I learned a lot, and eventually, I transitioned to being a senior observability engineer doing the same thing I was doing, building an observability platform with my co-workers. We build a platform for other engineers so they can take advantage of the tools we provide to develop their dashboards, alerts, and Observability for their applications in general.

I have seen a lot of blogs from you on topics related to OpenTelemetry. How is your experience with OpenTelemetry?

OpenTelemetry has been my focus area, and it started in my previous company. OpenTelemetry is one of those tools that, in Observability, is backward compatible with almost everything. So, it's one of those tools that is very easy to implement and brings significant benefits.

I started writing my blog because I saw that many engineers who work in Observability wanted to give OpenTelemetry a shot but found it daunting. They didn't find it very hard technically. Still, it looked like a significant change, and when you are offering services to other teams, you think a lot about the risks associated with what you implement. When I started working with OpenTelemetry, I realized it was the opposite. It's effortless to implement, and it's very easy to substitute the tools you already have. The example I give is the transition from Jaeger to OpenTelemetry, which we managed in the past without any downtime while also improving performance and the general experience for the engineers.

So I decided to write about it, first to show that it's not as scary as it looks. Then, as my experience increased with tracing, metrics, and, to a lesser extent, logs, I wanted to share how much work another team would need to do what I was doing. Of course, the circumstances are different in every organization. But I wanted to share my experience so that the many hours of research I did would be ready and summarized for some other engineer. OpenTelemetry has helped the organizations I have worked for, and many other engineers who share their experiences in the cloud native community, to centralize the collection of all the telemetry signals (metrics, logs, and traces) in one place. It has also helped them become completely vendor agnostic and stop relying on vendor agents. This has given companies complete control to change their tooling based on their needs. Another great benefit of OpenTelemetry is the standardization of the telemetry signals, which significantly benefits data correlation.

Iris's blog can be found here.

How do you adopt OpenTelemetry in large organizations? Tell me about the journey and your experience.

Many organizations have engineers who are usually very well-informed about the newest technologies, but knowledge is not enough; more is needed (for example, support from upper management). OpenTelemetry usually starts to be discussed, especially at a higher level, when something happens or there is a significant need for change. And usually, what drives it is the need for standardization and centralization. It is not easy to use many tools for Observability. It requires many engineers to keep everything up-to-date and optimized. So many companies see this as a tool that can help centralize all this information in one place and jump on the opportunity to implement it, not only to centralize but also to improve the quality of the information. If asked, I would recommend starting the OpenTelemetry adoption with the pillar of Observability that you find the weakest in your company. And I've noticed that usually, it is tracing. Tracing is one of those forgotten pillars. It's becoming trendy now. Still, many companies leave tracing behind, focusing primarily on metrics and logging. So I recommend always starting with the one you think is the least developed in your company. It's a new technology. So, by the time you have transformed your least developed pillar, you will have the experience and knowledge to move to the other ones, making it easier and faster.

Has OpenTelemetry also helped reduce costs in your experience?

I have seen cost reduction, but not only because we are using OpenTelemetry. OpenTelemetry is a transporter: it collects, transforms, and transports, but the actual cost usually comes from the back end where the data is being processed. But I could give you an example of cost saving from what I've experienced. Let's take Jaeger as an example. There are a limited number of backends you can use Jaeger with unless you have built something custom. So, if you're using one of the expensive databases like Cassandra, it will be costly. The OpenTelemetry collector has many exporters for many, many back ends. If there isn't one now, you will probably see it available in 10 days because it has so much community support.

One of the many backends you can export to is Grafana Tempo. So, for example, switching the backend from Cassandra to Grafana Tempo, which is based on object storage like S3 and Azure storage accounts, will be a lot cheaper. OpenTelemetry gives us this kind of flexibility.

It's like that because you can choose the back end; it doesn't have to be open-source. It could be an observability vendor doing the processing for you. It could still be cheaper because you have the luxury of sending information how and where you want it. The flexibility provided by using OpenTelemetry effectively results in cost savings.

Do you recommend any resources for people to get started with OpenTelemetry?

The first thing that I recommend is to join the CNCF Slack channel for OpenTelemetry. It is a vast community with much to give and teach, and everyone is accommodating. So you'll learn a great deal just by joining and following the conversations and what is happening there. But to start the OpenTelemetry journey, go to the official documentation; everything is there. And if you see that something is not done correctly, you can open pull requests.

Join the CNCF Slack here.

What does your typical workday look like?

Well, it's constantly changing. In platform engineering, your work is always evolving. But for me, at the moment, it's like this. The moment I wake up, we have a stand-up meeting. Our team is small, but we are in different countries. We sync with each other while having a cup of coffee. The first thing after that is to review the merge requests. Then I immediately jump on the task that I'm doing based on priorities. Sometimes, I also take one or two hours to read, stay on top of everything happening in the community, and be observant because bringing all these ideas to the team is vital.

Are you also on-call at times?

It's crucial to have an on-call rotation. The observability engineers are just like other SREs. We maintain all clusters, our namespace, and our components. I'm currently not on an on-call rotation. It's too early. I joined the team in Miro two months ago. But I have done on-call in the past. I've had some stressful situations, but it's also nice to feel in control and know that you can fix something that improves the other teams' experience.

For an effective on-call system, alerting needs to be reliable. What do you think?

A properly tuned alert is the most important thing because you do not want to receive a P1 in the middle of the night, when you're sleeping, only for it to turn out to be a false positive. That is very important. In our organization, and usually in the companies I have worked with, all the teams are owners of their alerts and dashboards. So we're not the ones creating them. We create alerts for our stack. We have guidelines, and we help and support other engineers so that they reach the level of maturity that they need in terms of alerting and dashboarding.

When we create an alert, we ensure it's appropriately tuned before it is ready for production, so it will not wake someone up for no reason. We give it some time and adjust it properly, and only then does it become a high priority. We make sure that we're reviewing alert rules from time to time. We make sure that the rules are correctly documented, so you don't receive an alert and say, what is this? We make sure that incidents that happened in the past are correctly documented in the troubleshooting pages, so other engineers have an easy guide.

We do these things for our work, show them, and suggest them to the other teams. But of course, every team has its own processes. The best that we can do is enforce a few guidelines for alerts. Make alerting as centralized as possible because if alerts come from 10 different tools, it will be more confusing for the teams and harder to know if something needs fixing. If it is a central tool, it's easier to debug and fix immediately, and easier to maintain and improve so that it's always in top shape.

What does your work setup look like?

I have an external computer screen. I have a good camera and am considering buying a good microphone for podcasting and public speaking. But it's a straightforward setup, just a computer screen, camera, and a good chair. Of course, being in a good, relaxed position is essential, especially when you're talking in a meeting.

Are there any programming tools you use daily?

Visual Studio Code is my best friend; it has the best extensions, and I use it often. Of course, I primarily work with Helm because most of our stack is in Kubernetes. Of course, Git, as well.

What are your thoughts on GitOps?

We use GitOps a lot. I'm not part of the team maintaining GitLab and GitOps, but we use it heavily. We use Jenkins to have very controlled releases and implement security practices. Especially when working with open source, everything must go through the Jenkins pipeline for security validations. But to that extent, I'm just a user of what another team provides for us.

Where do you find information on what's new in OpenTelemetry?

I usually go on LinkedIn because, throughout my career, I have habitually followed all the people that interest me, especially in the Observability and tech space. So, scrolling through LinkedIn is like a source of information; I find everything. And now Twitter, which is X, has also become like that. So LinkedIn, Medium, O11y News, OpenTelemetry blog. I also use New Releases to know when some new software version that interests me gets released.

What are you excited about OpenTelemetry in the coming future? What is something you are not happy about?

I'm most excited about something we're currently using that keeps improving: instrumentation. Right now, all our engineers have to update their framework themselves. It can quickly become outdated because it's not their main priority. The OpenTelemetry operator offers auto-instrumentation capabilities, so you have the SDKs with a framework you can inject with a simple annotation, which is fantastic and mind-blowing. That's something that I love. There are a lot of languages already supported, like .NET, Java, and Python, that we are currently using. Golang is still a work in progress. And that excites me because there is so much support that unique features are being added there. Sometimes, we don't even need any manual intervention.
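To make the "simple annotation" concrete, here is a minimal sketch of what such an injection request can look like. It assumes the OpenTelemetry Operator is installed and an Instrumentation resource already exists in the workload's namespace; the helper name `injection_patch` is hypothetical and only builds the patch, it does not talk to a cluster.

```python
import json

# Pod annotation keys the OpenTelemetry Operator looks for when deciding
# which language's auto-instrumentation to inject.
INJECT_ANNOTATIONS = {
    "java": "instrumentation.opentelemetry.io/inject-java",
    "python": "instrumentation.opentelemetry.io/inject-python",
    "dotnet": "instrumentation.opentelemetry.io/inject-dotnet",
}


def injection_patch(language: str) -> dict:
    """Build a merge patch that adds the injection annotation to the
    pod template of a Deployment (hypothetical helper for illustration)."""
    key = INJECT_ANNOTATIONS[language]
    # "true" asks the operator to use the Instrumentation resource
    # available in the workload's own namespace.
    return {"spec": {"template": {"metadata": {"annotations": {key: "true"}}}}}


patch = injection_patch("java")
print(json.dumps(patch, indent=2))
```

The printed JSON is the kind of document that could be applied to a Deployment as a merge patch; how it is rolled out (Helm values, GitOps, or a manual patch) depends on the team's own setup.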

When it comes to something that I'm skeptical about, I don't know. I'm such a passionate person about OpenTelemetry, and I like the movement. Honestly, it looks all very positive. It's an environment that I enjoy speaking about and being in.

How do you evaluate an observability tool with so many options being available?

It usually depends on the goals that the company has. The first one that I see is cost. For example, you don't have to pay for many open-source tools. Still, the resources needed for the maintenance that comes with them, or for whatever you have to build, could be enormous. So that's something that needs to be very carefully evaluated. The other one is the quality of the information that you're getting. For example, for metrics, there could be different solutions, and there could be different setups. Knowing our company's needs, how big it is, and how different the architecture is, we decide on the best tool. Another one is how modern that tool is. There are tools of the past that many companies still use, but since we have this platform that people actively invest in, we're always trying to work with the ones that are keeping up with the times. We try to find a tool that is stable but also modern and has all the features that a company needs today. So I'll say the speed, the costs, the quality, and of course the features it offers.

Is open source cheap compared to managed solutions?

Well, one of the factors is the human factor because, for example, if you buy a tool that you need a license for, it's already ready for you, and you can use it. You don't have to have the extra staff to support or maintain it. They do everything. You're just buying the subscription, and everything comes to you. If it is an open-source tool that you don't buy a license for, you need people to maintain it, improve it, and adapt it to the current architecture. So that is something that makes a difference.

Many vendors use the tools so efficiently that, for example, when we use ten terabytes of storage, they use just four because they have better compaction. So, the price of that license is worth it compared to our platform. It could be about CPU, memory, or performance. Even performance can be better because you have a list of people working daily to improve the product. Open source is great. I love it, but it requires more people to build, improve, and make it efficient for your organization.

What does reliability mean to you?

Reliability means that there will never be a team blind on their journey at any time of the day. When they go to our platform and need some information, some metrics, some logs, some traces, they will never be blind. That is the number one criterion. Number two is that they will always get alerted about the things they need to be alerted about. That is also up there on the priority levels. The other aspect of reliability is that the teams always have quality information about their applications. You could have millions of metrics, traces, and logs, and an engineer goes to the platform and is lost. So it must be reliable in that it is always available no matter the time, but also with high-quality data. This is the most important part of what makes our system reliable. And another criterion, I don't know if it gets to the point of reliability, but it's up there: the platform, the solution, needs to be very global. Every engineer can go there and find their information. It's kept for more than just Kubernetes metrics. Everyone can onboard into it and have their information transformed, seen, and collected.

What is an essential attribute for someone to be a good platform engineer?

You need to be very fast at learning new technologies and at adapting. You need to be able to learn quickly, have curiosity about things, and know that you always need to improve because platform engineering is a lot about open source, which moves a lot. And also, there is a touch of passion that needs to be there. If you do not like this line of work and you don't have passion for it, it gets very overwhelming. But it's perfect for a person who wants change and likes to move fast.

How do you take time away from work?

It depends on the team structure. We have a team with a great support structure. We are all very good at what we do. If one of us feels burned out or needs some time off, we are not reluctant to take this time away from work because we know someone from the team will pick up our tasks and ensure the deadlines are met. Or maybe you're in the middle of a project and want a break from it: I know how to do this, but I don't want to; I prefer to do something else. You can switch. There's so much to choose from. But that really depends on how the team is built. And we have this lovely culture where you can move to whatever makes you feel comfortable. And, of course, we have personal goals. To reach those goals, we have to work hard. But you have the freedom to change if you need to.

Are there any books or authors you follow?

There is a book about staff engineering by Will Larson. There is also Observability Engineering: Achieving Production Excellence, by Charity Majors, Liz Fong-Jones, and George Miranda.

Do you think AI will affect platform engineering and SRE?

We're going to be the last ones to be affected by AI. It will significantly help us improve our lives when analyzing metrics and exploring the data. It would be great. But we're the last ones affected because our job has much to do with a comprehensive view of the architecture that AI may get someday, but it is not there yet.

What would you do if you could change your career to something else?

I've always wanted to work in tech. But when I was a kid in kindergarten, I wanted to be a heart surgeon. But once I learned about tech, I knew I wanted to be a techie.

My father is a detective in Albania, and he's my idol.

If I were to change my profile in tech, I would want to be a cybersecurity specialist and work with the police to fight crime. So I could achieve that dream of following in my father's footsteps and staying in tech. That would be awesome.

Any memorable incident you were part of that you fixed and want to share with us?

I'm proud to say there are so many, to be honest, and let's hope I'm not jinxing it because this is the end of a weekend.

But you have those incidents where the cause is right before your eyes, and you do not find it. I'll share one because it's recent. There are a lot more. We had a lot of complaints about alerts not firing. And it was not the fault of the alert manager. We knew that. And I was checking; it was a predict_linear expression. So, I was researching, reading, and going crazy. What could be wrong with this expression? And I had missed it. The expression referenced a label that did not exist on the data. It took me one week, and only when I was debugging with another co-worker did we find the extra label that should not have been there.

How did you find it at the end?

We took a breath and realized that we were checking letter by letter without seeing the bigger picture, and then we noticed that extra label that was never supposed to be there.
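For readers wondering why one extra label can silence an alert, here is a small sketch of the mechanism. It is not the expression from the incident: the metric, label names, and the `select` helper are all hypothetical, and the matching is a simplified stand-in for how a label selector in an alert expression filters time series.

```python
# Hypothetical time series, represented only by their labels.
SERIES = [
    {"__name__": "disk_free_bytes", "instance": "node-1", "mountpoint": "/"},
    {"__name__": "disk_free_bytes", "instance": "node-2", "mountpoint": "/"},
]


def select(series, matchers):
    """Return the series whose labels satisfy every equality matcher.

    Simplification: a label missing from a series never equals a
    non-empty value, so such a matcher excludes that series.
    """
    return [s for s in series if all(s.get(k) == v for k, v in matchers.items())]


# The selector the alert was meant to use.
intended = select(SERIES, {"__name__": "disk_free_bytes", "mountpoint": "/"})

# The same selector with one extra label that no series carries.
with_extra_label = select(
    SERIES,
    {"__name__": "disk_free_bytes", "mountpoint": "/", "cluster": "prod"},
)

print(len(intended))          # 2 series selected, the alert can evaluate
print(len(with_extra_label))  # 0 series selected, the alert can never fire
```

Because the second selector returns nothing, any function applied on top of it has no data to work with, and the rule stays silent without raising an error. That is what makes this kind of mistake so hard to spot by reading the expression alone.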

If you would like to ask a question to future people who are coming on SRE stories, what would that be?

One thing I also like to do in interviews is ask what they think about Observability. If you are wondering why, it's because so many people, especially when switching from engineering to platform engineering to SRE, need to understand the importance of Observability. So it's crucial for me as someone who is going to be part of my team or who is going to work in this context: what does Observability mean for them? How vital is Observability? And now, if they're reading this, they will know how to answer.

Thanks a lot, Iris, for sharing your story with us. Iris is active on Medium and LinkedIn.
