Lessons From The Disaster

Roblox is an extremely successful gaming platform company that makes billions of dollars and is played by over half of the children under 16 in the United States. For me, however, the greatest lessons from Roblox come not from their incredible success, but from the difficult moment they faced.

On October 28, 2021, Roblox went down and stayed down for 73 hours!

For a company of that size, one that draws nearly all of its revenue online, such an outage is a disaster that costs millions.

So what is the lesson?

There are many. Let’s start from the beginning.

Roblox relies on the HashiCorp stack ("HashiStack") to manage the large set of clusters and containers their services run on. One important part of this stack is a component called Consul, which they use for several critical functions, including service discovery, health checks, session locking, and a key-value data store. Sometime prior to the outage, this component was upgraded to a newer version that was expected to improve resource utilization, efficiency, and performance. No problem.
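To see why bundling those functions in one component matters, here is a toy conceptual sketch of the four roles in a single in-memory class. This is not Consul's real implementation or API; every name here is invented for illustration.

```python
import threading

class ControlPlane:
    """Toy sketch of the roles Consul plays for Roblox: service
    discovery, health checks, session locking, and a key-value
    store. Conceptual only; not Consul's actual API."""

    def __init__(self):
        self._services = {}   # service name -> list of addresses
        self._health = {}     # address -> "passing" | "critical"
        self._kv = {}         # key -> value
        self._locks = {}      # lock name -> threading.Lock

    # --- service discovery ---
    def register(self, name, address):
        self._services.setdefault(name, []).append(address)
        self._health[address] = "passing"

    def discover(self, name):
        # Only return instances whose health checks currently pass.
        return [a for a in self._services.get(name, [])
                if self._health.get(a) == "passing"]

    # --- health checks ---
    def report_health(self, address, status):
        self._health[address] = status

    # --- key-value store ---
    def kv_put(self, key, value):
        self._kv[key] = value

    def kv_get(self, key):
        return self._kv.get(key)

    # --- session locking ---
    def try_lock(self, name):
        # Non-blocking acquire: True if we got the lock, False if held.
        lock = self._locks.setdefault(name, threading.Lock())
        return lock.acquire(blocking=False)
```

The point of the sketch is the coupling: because all four roles live in one system, when that system degrades, discovery, health, locking, and configuration all degrade at once.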

On that fateful day of October 28, 2021, they started seeing a number of their machines running much slower than usual, eventually grinding the system to a halt.

The heroic troubleshooting, which included replacing every machine in the cluster multiple times, digging through countless pages of logs, making configuration changes, and bringing the entire system online only to take it offline again, is described in amazing detail on the company's engineering blog.

Through this incredible engineering effort and many sleepless nights, they finally discovered the root cause of the problem: the earlier Consul upgrade had introduced changes to its streaming functionality that caused resource contention under heavy read and write load. They managed to restore the service and get users back onto their system.

Reading this, I always ask myself: what would I do in the same situation? What lessons can I learn here?

In this case, the first thing that happens is panic. We need to get things online now. We need to act. It's a typical human reaction. It looks like it might be bad hardware, so let's replace it right away and see if that works.

On the other hand, a better decision would be to bring the most senior architects and developers, along with the engineers most familiar with the system in question (SREs, DC Ops), into the war room and think deeply about the problem and possible solutions.

First: what major changes happened in the last days, weeks, or months? Did we introduce major changes to the software, hardware, or configuration?

The Consul upgrade could have been the first clue. Can we revert it? Even if you are not sure the upgrade is actually the cause, since the system had been working fine since then, it is a major and relatively recent change: just roll it back and see. You can dig into the reasons at a later point.

Thinking retrospectively: is there a staging environment with CI/CD capabilities and practices where we can test the system under load (synthetically or otherwise) before pushing any major change into production? During the production rollout, is it done in a staggered fashion, one cluster at a time?
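A staggered rollout with an automatic abort might look something like the sketch below. The cluster names and the `upgrade`, `health_check`, and `rollback` callables are all hypothetical placeholders, not anything from Roblox's or HashiCorp's tooling.

```python
def staggered_rollout(clusters, upgrade, health_check, rollback):
    """Upgrade one cluster at a time; at the first failed health
    check, roll back everything touched so far and stop.
    The three callables are hypothetical placeholders."""
    upgraded = []
    for cluster in clusters:
        upgrade(cluster)
        upgraded.append(cluster)
        if not health_check(cluster):
            # Bad rollout: undo in reverse order and bail out
            # before the change reaches the remaining clusters.
            for c in reversed(upgraded):
                rollback(c)
            return False, upgraded
    return True, upgraded
```

The key property is the blast radius: a bad change stops at the first unhealthy cluster instead of reaching the whole fleet.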

Thinking about the overall system design: do we have a single point of failure or a circular dependency in our system? If a single cluster can bring down the entire company, that's a huge problem. A single system responsible for service discovery, health checks, session locking, and the key-value data store is a problem, because when it breaks, many vital functions fail at the same time, including the ability to see what's wrong and fix things. There is a reason the liver, heart, and brain are not the same organ: each is separate and performs its own equally important but independent function.

Could things have been designed better? Maybe.

Could troubleshooting (problem triangulation and mitigation), business continuity, and disaster recovery have been done better? Maybe.

But these are not the most important lessons Roblox demonstrated. Throughout the disaster, they worked together as a team and supported each other.

They worked side by side with their partners at HashiCorp, who also went above and beyond to get to the root of the problem.

Throughout those 73 hours and beyond, they exhibited their strong blameless culture and loyalty to each other and to their trusted vendor partners. They focused on solving the problem and finding gaps in their systems and procedures, and they did so with extreme transparency and respect for everyone involved: the engineers, the vendors, and their developer and user communities.

Did they lose their reputation and damage their business long term due to the painful prolonged outage?

I do not think so. I think their business is stronger as a result. I don’t think they lost trust.

Why is that?

I believe today’s consumers understand that no one is perfect. Big companies and services are designed and managed by human beings. We are all doing our best, we all make mistakes in our judgement and actions.

The character of people and organizations is revealed during those trying times. Honesty, transparency, empathy, respect and willingness to admit one’s imperfections and go above and beyond to fix those mistakes goes a long way to not only preserve but to improve the business and customer trust and loyalty.

As people we have the capacity to forgive, respect and appreciate one another.

Let’s all grow, learn and prosper!!!

Published by Yev

Happy to meet you all. I am a Technical Program Manager who is passionate about learning, teaching and mentoring.
