Inside Observability: Maude's Experiences from Her Time at Slack!
Scaling Systems, Observability, and Balancing Roles
Maude Lemaire, Principal Engineer at GitHub and an active contributor to LeadDev, has been a pivotal force in backend systems and observability.
With her extensive experience in performance tooling and distributed systems, she made significant strides during her time at Slack, where she tackled scaling challenges from both sides of the equation: solving them through broad refactors and simulating them through highly flexible load-testing tools.
What sets Maude apart is her genuine excitement for technology, whether it’s frontend or backend, and her knack for handling diverse challenges across the board.
In our conversation, which took place in April while Maude was still at Slack, she opened up about the intricacies of maintaining high-performance backend systems, the journey towards adopting new observability tools like Astra (previously Kaldb), and the constant need for innovation in a fast-evolving space.
Beyond her technical prowess, Maude shared some insights into how she balances the demands of her role with the joys and chaos of parenthood.
Prathamesh:
How did you start your SRE journey?
Maude:
I come from a pretty traditional background. I have a computer science degree from McGill in Montreal. My first job came from my last internship. I did an internship at a fashion startup called Rent the Runway, based in New York. It was such a cool gig because it combined two of my big interests: fashion and programming. The team was great. At the time of my internship, there were only two interns, so we got to work on a variety of projects.
I was mostly doing front-end work back then. After my internship, I came back full-time and continued in a front-end role. I learned a lot and appreciated working with the team. But eventually, I started to realize that front-end wasn't what I wanted to do long-term.
Prathamesh:
Had you tried backend work before?
Maude:
Yes.
On the front end, we were primarily using Backbone.js, but we frequently had to make changes to our Ruby middleware, and I was occasionally able to dabble in our Java microservices.
We had a small engineering team, and there were quite a few moving pieces to manage. We had a team focused on building out warehouse operations software, another managing the website, an iOS team, and a data team focused on building out our recommendation engine.
Unfortunately, it was difficult to get the bandwidth we wanted to tackle tech debt and performance problems. The product itself wasn't a software product, and the leadership team wasn't very technical. I think they didn't fully understand the tradeoffs they were making.
For example, our company president wanted to add a fourth promotional banner to our website. I decided to push back and proposed a rewrite of our banner system in order to make the hierarchies easier to manage (from both a user and engineering perspective). My product manager was thankfully supportive and built some buffer into our deadlines to allow me to tackle that work.
I eventually decided I wanted to work for a company where software was the product, where hopefully I wouldn't have to push so hard to make important investments in cleaning up tech debt, etc.
I also realized that while I loved working with our talented design team, I only had so much patience for pixel-pushing.
My boyfriend at the time, now husband, was living in Seattle and we were trying to figure out how to close the distance between us. We decided to both look for new jobs in San Francisco!
After a brutal summer of sending out resumes and interviewing and getting rejected from nearly 30 jobs, I finally landed a role at Slack as a backend engineer. It's been quite a journey—almost eight years now!
Prathamesh:
Wow, that’s a long time! That was actually my next question. How have you experienced your journey at Slack over these eight years? What has changed, and what hasn’t? What do you like and dislike about it? Could you share anything about processes that have improved over time, or general changes you’ve noticed throughout the years?
Maude: Well, for starters, the engineering team has grown tenfold since I joined. That’s been a significant change! The scale at which we operate now is incredible—it’s like night and day. I was originally hired as a product engineer working on the Enterprise Grid product. We were a few months away from launch and were already plagued with performance problems. Things only escalated after GA.
Customers were investing real money in the product, so we quickly assembled a team dedicated entirely to tackling major performance problems for our largest clients. Back then, our biggest customer had about 60,000 daily active users. Just a few months before I joined, we were already grappling with performance issues at 30,000 daily active users. Fast forward to today, and we don’t even blink at supporting customers with close to 400,000 daily active users in a single Slack instance. We handle millions of simultaneous WebSocket connections without breaking a sweat!
Prathamesh: That’s a completely different scale.
Maude: Absolutely!
I’ve been the tech lead for our load testing efforts for the past four years, and it’s been such a thrill. It turns out, I really enjoy breaking things on purpose! A couple of years ago, we asked ourselves, “Can we run Slack with 2 million active users simultaneously typing messages into the same Slack team?”
So, we decided to give it a shot! We built the tools to make it happen and were eager to see if it would actually work. We had to troubleshoot a few issues along the way, but in the end, we did it! So now, we can scale to 2 million users in a single Slack instance.
As for whether we’ll ever have a customer that large—I’m not sure. Maybe one day, we’ll have all of Amazon, including everyone in their warehouses, using Slack simultaneously! But aside from the Department of Defense, I can’t think of any employer that big.
Prathamesh: Yeah, or maybe an entire country could use it!
Maude: Exactly!
With Slack Connect channels, it’s possible to have massive groups from different companies all in the same channel. So, having that much activity in one place could definitely happen eventually. But the real question is whether anyone can actually keep up with all that content and still find it useful!
The scale has changed dramatically, that’s for sure. However, one thing that hasn’t changed much is the type of people we hire and work with.
Everyone here is genuinely nice and thoughtful. I’ve learned from so many incredibly smart individuals, but what stands out is their kindness. They never make you feel bad when you say, “I don’t understand how this works; can you explain it to me like I’m five?” They’ll patiently walk you through it, which creates such a supportive environment.
Even though Stewart Butterfield, one of the co-founders and CEO, left two years ago, his customer obsession is still very much alive. We still care deeply about ensuring customers have a great experience with Slack. That’s been one of the distinguishing features of our product—enterprise software that’s actually pleasant to use.
Prathamesh: Yeah, that’s a fascinating point. Over the years, Slack has become such an integral part of people’s workflows—it’s like your day starts and ends on Slack. You finish your work, say, “Okay, let’s talk tomorrow,” and then you wrap up your day.
Regarding the load test you mentioned, how important is it to really grasp these numbers, especially at this scale? There are significant implications for infrastructure, the code you write, and how you ship it. All these factors interconnect, right? What are your thoughts on that, especially when leading a load-testing team?
How do you approach running load tests? In Site Reliability Engineering (SRE), we’re often expected to maintain tools, databases, and systems, and the key question we’re always asked is, “Will this scale?”
The natural answer is either yes or no, but it really needs to be backed by data and proof. Can you walk us through how you think about, implement, and execute a load test, and how you share the results?
Maude: Sure!
There are two main components to load testing at Slack right now. The first big piece is what we call a continuous test, which runs 24/7. It simulates about 500,000 active users, all within the same Slack instance. We chose that number because it gives us a comfortable margin above the peak usage of our largest customers. The goal here is to catch issues early—before they impact real customers—by identifying them in this testing environment first.
Right now, we deploy a new build roughly 10 times a day. It starts in the staging environment, and then it hits the load test cluster, which is where all our backend business logic code gets deployed for testing. This cluster is isolated from production traffic, though it shares data with production.
One important detail is that the load test cluster doesn’t autoscale, and that’s deliberate. The main reason is cost. If we accidentally run a larger load test than planned, it could put undue strain on the system and trigger autoscaling, which would not only be costly but also difficult to manage. So we’ve set it up this way to avoid those headaches!
We actually want to see how hot we can run those instances in most cases. Over the years, we’ve used that data for all kinds of tests—like figuring out whether it’s cheaper to run one EC2 instance type over another. Some instance types can run a bit hotter but remain just as reliable, ultimately saving costs. We’ve conducted various tests to understand these dynamics. The continuous test is especially helpful because it creates a very predictable load, mimicking the top 40 APIs in terms of the number of calls per day. And we’re always adding more to the test suite.
The Slack API has thousands of endpoints—somewhere between one and two thousand—but we focus on covering the bulk of traffic during a typical day by targeting the most-used APIs. It’s steady, predictable traffic, which allows us to make educated guesses about how new code will perform when deployed to the load test cluster, right before it hits Dogfood. Dogfood is basically where our internal Slack instance runs. After Dogfood, we move on to Canary deployments: starting with 1%, then 10%, 25%, 50%, 75%, and finally 100% of the user base.
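To make that rollout path easier to picture, here is a rough sketch in Python of the staged promotion Maude describes. It is illustrative only: the stage names, data structures, and the `deploy_and_check` helper are assumptions, not Slack's actual deployment tooling.

```python
# Illustrative sketch only -- not Slack's actual deployment tooling.
# Models the rollout path described above: staging -> load test cluster
# -> Dogfood -> canary percentages -> full production.

from dataclasses import dataclass

@dataclass
class RolloutStage:
    name: str
    traffic_pct: float  # share of real users receiving the new build

ROLLOUT_PLAN = [
    RolloutStage("staging", 0.0),
    RolloutStage("load-test-cluster", 0.0),  # isolated cluster under synthetic load
    RolloutStage("dogfood", 0.0),            # internal Slack instance
    RolloutStage("canary", 1.0),
    RolloutStage("canary", 10.0),
    RolloutStage("canary", 25.0),
    RolloutStage("canary", 50.0),
    RolloutStage("canary", 75.0),
    RolloutStage("production", 100.0),
]

def deploy_and_check(build_id: str, stage: RolloutStage) -> bool:
    """Stub: deploy the build to this stage and report whether health checks pass."""
    print(f"deploying {build_id} to {stage.name} ({stage.traffic_pct}% of users)")
    return True

def promote(build_id: str) -> None:
    """Walk a build through every stage, halting if a stage looks unhealthy."""
    for stage in ROLLOUT_PLAN:
        if not deploy_and_check(build_id, stage):
            raise RuntimeError(f"Rollout of {build_id} halted at {stage.name}")

promote("example-build-123")
```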
The window to catch issues is tight—usually about five minutes—between the time the code hits the load test cluster and when it moves to Dogfood. We’ve managed to catch a few issues this way, but there’s not much time to catch everything. And one thing I often forget to mention until late in conversations is that our load testing team is really small—just two of us right now. It used to be three, but for most of the last four years, it’s been just two people. So, the ones analyzing all those graphs are also juggling 20 other tasks at the same time. We haven’t had a lot of bandwidth for deep data analysis, but we’re working on improving that.
We’ve been automating a lot of the tooling to help us respond to incidents. Eventually, we want to stop a deployment early if something seems off on the load test cluster. We’re working closely with the reliability team on that. But since our team is small—just the two of us—we have to be careful not to become a bottleneck for production releases. If we did, it could cause unnecessary alerts or confusion, especially in the middle of the night when someone might not fully understand the data being produced by the load test.
Our goal is to act as an informative signal, not a hard blocker. The load test can sometimes generate funky metrics because people might be running other tests against it, or it might have been paused due to an incident. These things happen, and we don’t want to disrupt the process with false positives. So, we aim to influence the deployment process in a meaningful way without causing delays.
That’s how the continuous part of load testing works at Slack.
The other method we use is what we call "ad hoc" load testing. This is when teams approach us for one of two reasons: either they’re building a new feature, or they’re expanding an existing one, and they want to make sure it will scale before releasing it to our biggest customers. We’ve learned that larger companies like to be informed well in advance when Slack is planning to roll out a new feature.
Prathamesh: It's not just about the big customers—everyday users can get frustrated too. You know, sometimes even the smallest changes can have a big impact.
Maude: Exactly! That’s a perfect example. We roll out changes to our biggest customers last. They only got those new UI changes about three months ago, which was almost a year after we started the initial rollout. There’s a completely different release cycle for our largest customers when it comes to a lot of features.
In general, we release new features to them months after smaller users have had a chance to try them out. Pre-release teams usually get access to features when they’re still quite cutting-edge. Occasionally, if a big customer pushes for a feature, we’ll let them in early. But when we do, we make sure to warn them: “Look, there are going to be bugs. Don’t be mad at us when you find them—we already know they’re there!”
Prathamesh: Yeah.
Maude: So, at some point in the release process—ideally before rolling anything out, but definitely before we reach our largest customers—the feature teams come to us with what we call a load test plan. They explain the features they want to test, and we provide guidelines to help shape those plans. One of the key things we ask for is the number of connected clients they want to test for. This is crucial, especially with how Slack handles WebSockets and the WebSocket response loop.
For example, if you post a message in a channel with 10,000 active users, that one message essentially becomes 10,000 messages, as it gets propagated to everyone with an active WebSocket connection in that channel. The same thing happens with reactions—they also travel over the WebSocket connection, multiplying the traffic. So, understanding the scale of those connections is a big part of our load-testing process.
Maude: So, you can imagine how quickly things can balloon.
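To make that fan-out concrete, here is a quick back-of-the-envelope calculation; the numbers below are made up for the example, not Slack's figures.

```python
# Back-of-the-envelope fan-out math for the example above (illustrative numbers).
# One message posted to a channel is pushed to every active WebSocket connection,
# and each reaction fans out the same way.

active_connections = 10_000     # users with an open WebSocket in the channel
messages_posted = 50            # messages sent during a busy stretch
reactions_per_message = 5       # each reaction is also broadcast to everyone

outbound_events = active_connections * messages_posted * (1 + reactions_per_message)
print(f"{outbound_events:,} WebSocket events from just {messages_posted} messages")
# -> 3,000,000 WebSocket events from just 50 messages
```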
Prathamesh: Yeah, and it’s almost infinite, right? You never really know how people will interact or what they’ll do next.
Maude: Exactly!
And for some of those messages coming over the WebSocket connection, the client needs to respond accordingly. For example, there was a time when—thankfully, we don’t do this anymore—whenever someone uploaded a new custom emoji, every client had to download the entire Slack emoji set to get that new one.
Prathamesh: Oh, interesting!
Maude: Yeah, exactly.
So, if someone uploaded a bunch of custom emojis all at once, everyone connected to that team would hit the emoji list API repeatedly to fetch the updated set. Instead of waiting to dedupe everything at the end, they were fetching the entire set over and over, which caused all kinds of headaches for quite some time.
These are exactly the kinds of interactions we aim to test—especially for features with a heavy WebSocket component and many connected clients. We ask teams to estimate the number of WebSocket connections they want active and to think about how the feature will be used.
Usually, they’ll base this on data from existing feature usage or the number of customers they expect to push toward the new feature. We also encourage them to test at various thresholds—just below what they expect, at their target usage, and then about 20 to 25% above it.
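As a minimal sketch of how those three test levels might be derived from a team's estimate (the 0.8 warm-up factor and the helper itself are assumptions for illustration, not Slack's tooling):

```python
# Minimal sketch, assuming a simple multiplier-based approach (not Slack's tooling):
# derive the "just below", "at target", and "buffered" load levels from the
# team's expected concurrent usage.

def load_test_levels(expected_concurrent_users: int, headroom: float = 0.25):
    """Return (below-target, target, above-target) user counts for an ad hoc test."""
    return (
        int(expected_concurrent_users * 0.8),             # just below expected usage
        expected_concurrent_users,                         # expected peak usage
        int(expected_concurrent_users * (1 + headroom)),   # 20-25% buffer above target
    )

print(load_test_levels(200_000))  # -> (160000, 200000, 250000)
```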
Prathamesh: Just to make sure everything holds up.
Maude: Exactly. We always want to have a buffer, in case something goes wild or the feature gets used way more than we anticipated. It gives us that extra room in our infrastructure to manage things better.
One area where I’ve been trying to push back—sometimes successfully, sometimes not so much—is around this idea of reasonable usage limits. This is something Stewart always said, even until the day he left: Slack is a product designed for human communication. That’s why we have the rate limits that we do; Slack is meant for human users, not bots spamming channels.
If we keep that mindset when planning product features and system architecture, we should also be asking our PMs, "What’s the reasonable limit for this feature? At what point does it stop making sense from a human usage perspective?" I’ve been pushing back against the idea of having no limits in load test plans. For the longest time—and to some extent, we still do this—we’ve had this mentality that if engineering can support it, then why not have infinite limits?
Prathamesh: No limits for anyone!
Maude: If we can support it, that’s great, but it’s not always realistic.
Slack, as a product, is interconnected in such intricate ways, and that’s where the expertise of our team comes into play.
We understand that every feature is somehow tied to either a channel or a message—those are the two core building blocks. Since we're responsible for load testing, we have to understand the data model implications of these interactions. We're one of the few teams at Slack that have to maintain a very comprehensive understanding of the broader architecture.
We know the right questions to ask, like,
"You're testing a feature that involves files—well, files trigger a ton of WebSocket messages. We typically send an event for every edit, update, or share. Where do you expect the system to break? How many people will be editing this file at once?"
Often, even though we’re such a large product with massive usage, it’s fascinating to see engineers pause and say, "Oh, I need to rethink that." That’s part of why we’re here, to help them consider those factors. But the most important piece is the empirical data we gather.
When we run ad hoc testing, for instance, we spin up a dedicated Slack channel for every load test we kick off in our system. This helps us track everything in real-time and gather the data we need for meaningful results.
For each specific load test, we automatically generate alerts within our system, scoped to that test by name and routed to its designated channel.
For instance, if we notice a decline in API success rates, an unusually high volume of WebSocket messages being read, or if our edge cache calls suddenly start failing, these issues will automatically send alerts to the respective channel. Ideally, the team conducting the test also shares updates and discussions in real-time within the channel, creating a searchable record for future reference. This way, when someone wonders, "Oh, remember that test we ran six months ago? What broke?" they can easily look back and find the details.
Additionally, we create a Grafana snapshot of all our health metrics from the load test, allowing team members to reference this data for up to three months post-test. This enables them to conduct follow-up tests—perhaps two weeks later—to verify if their fixes worked or to explore other adjustments. This process fosters continuity, which is fantastic.
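Here is a rough sketch of what that kind of test-scoped alerting could look like, with hypothetical metric names and thresholds; none of this is Slack's internal tooling.

```python
# Rough sketch with hypothetical metric names and thresholds -- not Slack's
# internal tooling. Each ad hoc load test gets alerts scoped to it by name,
# and anything that fires is posted to that test's dedicated channel.

ALERT_RULES = {
    # metric name             -> predicate that should trigger an alert
    "api_success_rate":         lambda value: value < 0.99,
    "websocket_read_volume":    lambda value: value > 5_000_000,
    "edge_cache_error_rate":    lambda value: value > 0.01,
}

def check_metrics(test_name: str, metrics: dict) -> list:
    """Return alert messages destined for the #loadtest-<test_name> channel."""
    alerts = []
    for metric, should_fire in ALERT_RULES.items():
        if metric in metrics and should_fire(metrics[metric]):
            alerts.append(f"[{test_name}] {metric}={metrics[metric]} breached its threshold")
    return alerts

print(check_metrics("file-sharing-v2", {"api_success_rate": 0.97}))
```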
Prathamesh:
Absolutely! The importance of these load tests cannot be overstated; they serve as custodians of reliability at Slack. It’s not just about testing a feature; it’s also about discussing and enforcing reliability constraints—essentially identifying where potential failures may occur. As you pointed out, while an infinite number of bots might utilize a feature, there will always be a limited number of human users. That’s a crucial observation.
However, this also ties back to the relationship with the observability and reliability team, as well as the tools we have at our disposal for ensuring reliability. How has your experience been in this regard?
What you’ve described resonates strongly with the responsibilities of Site Reliability Engineers (SREs) in ensuring that systems operate effectively. You seem to be performing similar functions by ensuring that any new feature or capability added does not compromise existing reliability, while also safeguarding future reliability.
How does that collaboration play out? Do you consider yourself an SRE as well? The automation you’ve mentioned and many of the traits you display are quite reminiscent of SRE roles. What’s your perspective on that?
Maude:
While I’ve never held the official title of SRE, my early experiences working on performance at Slack closely resembled that role. After we launched the enterprise product, my team members were equipped with pagers. We had to ensure that at least one of us was physically present in the office by 6 a.m. Pacific Time—right when our largest East Coast-based customer would be logging in around 9 a.m. Eastern.
To manage this, we established a rotation so that someone would be available every morning to firefight and address any issues that arose with those East Coast customers.
We maintained this routine for about six weeks, implementing ad hoc fixes to build up enough headroom to ensure that the morning boot process generally went smoothly. Once we established that stability, we could shift our focus to tackling the core foundational architecture problems that were causing performance issues and bottlenecks from the ground up.
In that sense, yes, I was effectively on call during that period, responsible for ensuring that our customers could boot up successfully each morning.
Currently, my team at Slack is part of a pillar dedicated to observability and performance, with a primary focus on backend performance. We lead the development of our load testing and flame graph tools. Although we no longer own all our backend tracing libraries, we still co-manage them. These libraries were custom-built because Slack’s backend runs on Hacklang, a language developed at Meta with little adoption and limited open-source activity outside of it.
To maintain performance standards, we’ve implemented CI checks that monitor performance at the pull request stage, ensuring we don’t inadvertently introduce new database queries. This means we oversee a diverse array of tools to support our work.
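A hypothetical sketch of that kind of pull-request check, comparing an endpoint's query count against a recorded baseline; the endpoint names and numbers are purely illustrative, not Slack's actual CI.

```python
# Hypothetical sketch of a PR-stage performance check like the one described
# above: fail the build if an endpoint starts issuing more database queries
# than its recorded baseline (names and numbers are illustrative).

BASELINE_QUERY_COUNTS = {
    "POST /api/chat.postMessage": 12,
    "GET /api/conversations.history": 8,
}

def check_query_counts(measured: dict) -> list:
    """Return failure messages for endpoints whose query count grew."""
    failures = []
    for endpoint, count in measured.items():
        baseline = BASELINE_QUERY_COUNTS.get(endpoint)
        if baseline is not None and count > baseline:
            failures.append(f"{endpoint}: {count} queries (baseline {baseline})")
    return failures

print(check_query_counts({"POST /api/chat.postMessage": 15}))  # flags the regression
```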
Our sister teams are the monitoring and observability teams. I’m frequently engaged in the tracing channel and actively participate in a Slack Connect channel with the Honeycomb team.
A crucial component of our load testing involves generating traces from most, if not all, of the data we collect. While not everything gets sent to Honeycomb, the vast majority does. We collaborate closely with backend engineers to help them instrument their code and pinpoint potential performance bottlenecks. We then run tests in the load testing cluster to gather clean data, allowing us to verify and empirically compare results with production, ensuring that fixes are effectively implemented.
Additionally, we’ve built intricate Grafana dashboards to visualize all our load-testing metrics. We dedicate considerable time to curating this information because it's essential for ensuring the overall health of our systems and giving engineers the best signal.
Prathamesh:
I assume you don’t maintain Prometheus and Grafana, right? There’s a separate team for that.
Maude:
That’s right! There’s a dedicated team of seven or so people who manage everything for all of Slack. They’re fantastic, and I truly enjoy collaborating with them. They assist us in setting up metrics that automatically shut down our load tests when we reach certain thresholds. For instance, if we detect a high level of 500 errors in API responses during a continuous test, the system will automatically shut down the test.
It sends one of us a message in a Slack channel, so when we log back in the next day, we can review what happened and restart the test without needing to be paged to stop the load test manually if Slack is experiencing issues. This automation takes the burden off us, allowing for a more streamlined process. We simply retry once the timing is appropriate.
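As a minimal sketch, assuming a simple error-rate threshold (the 5% figure is an assumption, not Slack's actual setting), that kind of guard might look something like this:

```python
# Minimal sketch (assumed threshold, hypothetical wiring): pause the continuous
# load test when API 500s spike, then leave a note in the team's channel
# instead of paging anyone.

ERROR_RATE_THRESHOLD = 0.05  # assumed value, not Slack's actual setting

def maybe_pause_continuous_test(total_requests: int, server_errors: int) -> bool:
    """Pause the 24/7 load test if the 500 rate crosses the threshold."""
    if total_requests == 0:
        return False
    error_rate = server_errors / total_requests
    if error_rate >= ERROR_RATE_THRESHOLD:
        print("Pausing continuous load test: "
              f"{error_rate:.1%} of API responses are 500s")
        print("Posting a note to the load-testing channel for follow-up tomorrow")
        return True
    return False

maybe_pause_continuous_test(total_requests=120_000, server_errors=9_000)
```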
On the reliability side, there is a dedicated reliability team that falls under the infrastructure part of the organization. We're frequently bouncing ideas off each other.
Recently, we’ve been exploring ways to model important user flows and scenarios, enhancing our instrumentation to gather more data about how everything moves through our system, including identifying our single points of failure. So yes, it’s all part of a larger cohesive effort.
Prathamesh: Organizationally, it makes sense that there are sister teams focused on observability and monitoring.
As you mentioned, these teams have expanded significantly over the last eight years, so having separate departments for these areas is logical. Given your extensive experience with these tools, do you find yourself missing anything in the observability space that could enhance your work? Additionally, are there any emerging trends you believe could benefit people in this field? Have you come across anything noteworthy?
Maude:
I tend to be more of a forward-looking person. As I mentioned earlier, Slack primarily uses Hacklang on the backend, which has led us to hand-roll many of our tools to gain visibility into the backend systems.
I feel like we’re continually making improvements in that area, and there’s nothing I would want to revert to in previous versions of the libraries we created.
However, we’ve been facing several scaling challenges, particularly with Prometheus. Recently, we had the chance to meet in San Francisco with my team and our sister teams for a collaborative presentation. During this session, several of us whiteboarded the architectures of our systems, which was eye-opening. Although I was aware of the components involved in our metrics pipeline, seeing everything laid out on a single whiteboard for the first time was fascinating.
The person leading the presentation walked us through the history of how we scaled our metrics pipeline to its current state, detailing the breaking points we are now encountering and the concerns associated with them.
I realized that I had never taken the time to truly understand the reasons behind our challenges. You know how sometimes things happen around you, and you don’t pause to comprehend why? You might think, “Okay, this is happening, but I have plenty of other things to worry about,” and you just accept it. This presentation was the first time I had the opportunity to let it all sink in.
Maude:
We’ve really pushed our systems to the limit, and I think we’ve reached the edge of what they can handle. More than once, I’ve received messages from the team saying, “So, regarding that metric you added, the cardinality is a little too high.” Oops!
I don’t know how familiar you are with the Astra project.
It was formerly known as Kaldb and has recently been renamed Astra. It’s an open-source structured logging and metrics solution developed by folks at Slack—some of whom have moved on to other companies, but many are still here.
We’re actively working to migrate everything to Astra, but that’s a significant lift.
I’ve been a huge advocate of tracing as a way to gather data about what happens throughout the entire lifecycle of a user flow: from client interaction to request to downstream asynchronous jobs. I believe it’s an incredibly powerful tool for debugging and understanding the nuance behind all sorts of interactions within our system. Unfortunately, tracing hasn’t had as much adoption as we would’ve liked.
I’ve spent a lot of time discussing this with backend engineers. Why aren’t they adopting tracing more frequently? Some of it is muscle memory and habit—leaning toward technologies they’re already familiar with—but some of it has to do with ergonomics.
Our tracing libraries were primarily authored by observability engineers who learned just enough Hacklang to make something functional. Unfortunately, that means they aren’t as extensible, user-friendly, or ergonomic as they could be. The adoption curve for many backend engineers could have been much smoother, and we’re actively working on improving that experience.
The cost has also been a significant issue. We can’t afford to send every single trace to Honeycomb; it’s just too expensive. That’s where Astra comes in handy. We’ve been plugging the trace data it aggregates into Grafana, which doesn’t cost us more than our existing Grafana enterprise contract. Sure, the Honeycomb UI is a thousand times better than what Grafana offers, but for every one trace that lands in Honeycomb, we have ten in Astra. Sometimes, you need that one instance when something happened, and you’ll find it in Astra—that’s where you’ll go to look.
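A tiny sketch of that sampling split, with the ratio and routing helper assumed for illustration: every trace lands in Astra, and roughly one in ten is also forwarded to Honeycomb.

```python
# Illustrative sketch of the sampling split described above: every trace is
# kept in Astra, while only a fraction (roughly 1 in 10 here) is also sent to
# Honeycomb to keep costs down. The ratio and helper are assumptions.

import random

HONEYCOMB_SAMPLE_RATE = 0.1  # ~1 trace in Honeycomb for every 10 in Astra

def route_trace(trace: dict) -> list:
    """Decide which backends receive this trace."""
    destinations = ["astra"]                      # Astra gets everything
    if random.random() < HONEYCOMB_SAMPLE_RATE:   # head-based sampling
        destinations.append("honeycomb")
    return destinations

print(route_trace({"trace_id": "abc123", "root_span": "POST /api/chat.postMessage"}))
```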
Prathamesh:
What do you do outside of work to recharge? I know managing all of this comes with significant responsibilities. How do you disconnect from work and come back refreshed for the week ahead?
Maude:
Well, I have a two-and-a-half-year-old, so weekends are never really restful! It’s a lot of running around outside. He loves sports, so there’s a lot of golf, and he plays a ton of hockey, which shows my proud Canadian heritage! Being outside is one of the most rewarding things.
When I have some free time, I enjoy cooking and experimenting with new recipes. A couple of weeks ago, I took my birthday off and spent the whole day cooking.
Thanks a lot, Maude, for taking the time to connect with us.
Our conversation with Maude was both technically insightful and a refreshing reminder of the importance of work-life balance. Her expertise in scaling backend systems, adopting new tools like tracing with Astra, and continuously enhancing the developer experience speaks volumes about her dedication.
Equally impressive is how she recharges outside of work—whether experimenting with new recipes or chasing after her energetic toddler.
We'd love to hear from you!
Share your experiences in SRE and your thoughts on reliability, observability, or monitoring. If you know someone passionate about these topics, feel free to suggest them for an interview. Join us in the SRE Discord community!
Thank you once again, Maude, for sharing your journey with us. If you’re passionate about observability and what goes on behind the scenes for a company, connect with Maude on LinkedIn.