Do you know if your infrastructure is running as efficiently as possible? Are you burning any extra resources that you don't need? Or worse, not prepared for a spike in traffic that could take your application down? That's a big problem for every software company, but especially for startups that can have extremely volatile traffic. So how can you build an efficient system that can scale up to meet spikes but not waste money?
The solution is to take a step back and look at your system as a whole. Your infrastructure likely includes things like databases, queues, load balancers, web servers, third-party services, and more. On top of that, if you use Kubernetes or any other container orchestration tool, you likely have several internal services to manage as well. Each of those pieces has its own attributes and limits. Understanding those limits and how they work together will allow you to eliminate bottlenecks. Your system is only the sum of its parts, and in this post we will cover some of the ways we evaluate those parts to build resilient and efficient infrastructure.
The most important step in improving your system is the first one: identifying the weak points. As the maintainer of your infrastructure, you should know what load your system can bear, which errors are most common, and which services are most fragile. Nothing is perfect, and infrastructure is no different, but the goal here is resilience: being able to bounce back from failures without any catastrophic issues. Being slow is much better than being down.
You probably already know the most common soft spots in your system from repetitive errors. For example, in the past, we've had out-of-memory (OOM) errors for a specific service that would cascade into other operations. This was difficult to resolve because it didn't reveal itself until we had very high traffic, and we had to dig through several other errors to identify this OOM error as the root cause. We knew we had to be proactive in finding these kinds of issues before our customers did, so we invested much more time in load testing.
Why load testing specifically? Because it's the most tried-and-true method of pushing your system beyond regular traffic. There are a multitude of tools and services you can use based on your specific needs. Since we are mostly a Node shop and run on Kubernetes, we like k6. We also recommend storing the results of every run so you can review your findings later. The more data you have, the easier it is to find patterns. What's the point of testing if you don't know whether you're improving or not? (See how we did this in our post, Persisting load testing data with k6.)
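To make the "are we improving?" question concrete, here is a minimal sketch of comparing a new run against a stored baseline. The `RunSummary` shape, its field names, and the tolerance values are assumptions for illustration. They are not k6's output format, so you would map your own stored metrics into something like this.

```typescript
// Hypothetical shape for a stored load test summary. The field names are
// assumptions for illustration, not k6's output format.
interface RunSummary {
  label: string;
  p95LatencyMs: number; // must be greater than zero for the comparison below
  errorRate: number; // fraction of failed requests, between 0 and 1
}

type Verdict = "improved" | "regressed" | "unchanged";

// Compare a new run against a baseline. Latency only counts as changed when it
// moves by more than `latencyTolerance` (relative), and error rate only when it
// moves by more than `errorTolerance` (absolute), to ignore run-to-run noise.
function compareRuns(
  baseline: RunSummary,
  current: RunSummary,
  latencyTolerance = 0.05,
  errorTolerance = 0.01
): Verdict {
  const latencyChange =
    (current.p95LatencyMs - baseline.p95LatencyMs) / baseline.p95LatencyMs;
  const errorChange = current.errorRate - baseline.errorRate;

  // Any meaningful regression wins, even if the other metric got better.
  if (latencyChange > latencyTolerance || errorChange > errorTolerance) {
    return "regressed";
  }
  if (latencyChange < -latencyTolerance || errorChange < -errorTolerance) {
    return "improved";
  }
  return "unchanged";
}

const before: RunSummary = { label: "before fix", p95LatencyMs: 400, errorRate: 0.02 };
const after: RunSummary = { label: "after fix", p95LatencyMs: 300, errorRate: 0.02 };
console.log(after.label + ": " + compareRuns(before, after));
```

Treating any regression as the overall verdict is a deliberate choice in this sketch: a run that gets faster by failing more requests should not count as an improvement.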
Before we begin load testing, though, we want to create an isolated load testing environment. Using a pre-existing environment (like staging) is a bad idea because these environments usually don't have the same scale of infrastructure as production. Load testing will push your system's resources to their limits, so you want to mimic your production environment as closely as possible. Not to mention that crashing the environment is inevitable, and you don't want to halt developers' progress while you are testing.
One last step before the fun begins: writing your tests. The tests won't magically appear, of course. You need to write tests that hit a representative set of endpoints covering your end users' most common use cases. Running only end-to-end tests won't be granular enough. Breaking tests into common use cases, alongside end-to-end tests, allows you to isolate problems.
Finally, you can begin running your tests. You don't want to blow your system up with a billion requests right off the bat; there's no use in that. Instead, increase load gradually, starting with average traffic, and raise it by no more than 50% at each step. Keep an eye on your application and infrastructure logs for anything out of the ordinary: error messages, restarting services, slow response times, anything at all. As with our OOM error, it's likely that one error is cascading and causing many other failures, so it's important to note anything that could be a risk.
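As a rough sketch of that gradual ramp, the helper below builds a list of steps that starts at average traffic and grows by at most 50% per step. The `{ duration, target }` shape mirrors the entries of k6's `stages` option, but the helper name, the five-minute step length, and the idea of measuring "average traffic" in virtual users are our assumptions for illustration.

```typescript
// One ramp step, shaped like an entry in k6's `stages` option.
interface Stage {
  duration: string;
  target: number; // virtual users to ramp to during this step
}

// Hypothetical helper: start at average traffic and grow each step by `growth`
// (0.5 means 50%). Targets are rounded down so that no step ever exceeds the
// requested growth over the previous one.
function buildRampStages(
  averageUsers: number,
  steps: number,
  growth = 0.5,
  stepDuration = "5m" // assumed step length; tune for your system
): Stage[] {
  if (growth <= 0 || growth > 0.5) {
    throw new Error("growth must be greater than 0 and at most 0.5");
  }
  const stages: Stage[] = [];
  let target = Math.floor(averageUsers);
  for (let i = 0; i < steps; i++) {
    stages.push({ duration: stepDuration, target: target });
    target = Math.floor(target * (1 + growth));
  }
  return stages;
}

// Four steps starting from an assumed average of 100 virtual users.
const rampStages = buildRampStages(100, 4);
console.log(rampStages.map((s) => s.target).join(" -> "));
```

In a k6 script, an array like this could be passed as `options.stages`. Holding each step for a few minutes gives logs and dashboards time to show problems before the next increase.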
In our journey to resolve the OOM issue, this testing pointed out that our containers were actually oversized. Odd, right? How could having too much memory cause an OOM error? Well, after replicating the error, we found it was being thrown by the application itself. Even though our pod had 8GB of memory, our application could only use 6GB. When it passed 6GB it crashed, but the container continued to run, so scaling up was never triggered.
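The mismatch can be expressed as a simple check. This is only a sketch under a big assumption: that scale-up is triggered when memory use reaches some fraction of the container's memory. The post does not say how scaling was triggered, and the 80% threshold and all names below are hypothetical.

```typescript
// All fields are illustrative. `scaleUpAtUtilization` in particular is an
// assumption: the fraction of container memory at which scale-up would trigger.
interface MemoryConfig {
  containerLimitMb: number; // memory given to the container
  appLimitMb: number; // most memory the application itself can use
  scaleUpAtUtilization: number; // for example 0.8 for 80%
}

// Returns true when the application can actually reach the memory level that
// would trigger scaling. If it cannot, the app crashes first and the scaler
// never sees enough usage to react.
function scaleUpIsReachable(cfg: MemoryConfig): boolean {
  const triggerMb = cfg.containerLimitMb * cfg.scaleUpAtUtilization;
  return cfg.appLimitMb >= triggerMb;
}

// The situation from the post: an 8GB pod with an application capped at 6GB.
const oversized: MemoryConfig = {
  containerLimitMb: 8192,
  appLimitMb: 6144,
  scaleUpAtUtilization: 0.8, // hypothetical threshold
};
console.log("scale-up reachable: " + scaleUpIsReachable(oversized));
```

With these assumed numbers the trigger sits above the application's ceiling, which matches the symptom described: the application crashes before the container ever looks full. Shrinking the container or raising the application's limit brings the two back in line.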
Amazing! Problem solved! All done, right? WRONG.
The goal here is to identify all weak spots, resolve what you can, reinforce anything that can be a bottleneck, and set up alerts for everything else so you can fix it before it breaks. There is no perfect system, which is why the goal is to build a resilient one. The only way to do that is to test over and over, with small tweaks and adjustments, until you find all the edge cases.
In the last section we identified weak spots by looking at our live environment. Then we targeted those spots by creating tests to give us more data on them and find their limits. Once we gathered that data, we implemented a solution to prevent those errors from taking down our system again. But that was just our first pass. The key to resilient systems is to find the pebble that breaks the wheel.
Back to the top. Look for other errors or faults in the system that you can address. Or if you are still not satisfied with your solution (like us), then keep digging deeper. In our case we were left asking ourselves, why is this OOM error even happening? The load we have should not be a problem based on the number of processes running. So now with this new set of data we circled back to our application and infrastructure logs. To really find out what was going on we needed to watch the infrastructure in real time while the tests were running.
This revealed the actual root cause of our OOM errors. A few pods were handling 90% of the requests and the rest were sitting mostly idle. Those few pods would run out of memory, crash, end requests prematurely, and throw errors to our end users. We didn't yet know why this was happening, but at least we knew the weak point. Eventually we realized it was a problem with gRPC and our load balancer, which we wrote about in a blog post.
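A skew like this is easy to quantify once you have per-pod request counts. The sketch below assumes you have already collected those counts from your metrics. The `PodLoad` shape and the pod names are made up for illustration.

```typescript
// Hypothetical per-pod request count, gathered from whatever metrics you have.
interface PodLoad {
  pod: string;
  requests: number;
}

// Fraction of all requests handled by the busiest `topN` pods.
function topPodShare(loads: PodLoad[], topN: number): number {
  const counts = loads.map((l) => l.requests).sort((a, b) => b - a);
  const total = counts.reduce((sum, n) => sum + n, 0);
  if (total === 0) return 0;
  const top = counts.slice(0, topN).reduce((sum, n) => sum + n, 0);
  return top / total;
}

// Made-up numbers resembling what we saw: two busy pods, four mostly idle.
const podLoads: PodLoad[] = [
  { pod: "api-1", requests: 450 },
  { pod: "api-2", requests: 450 },
  { pod: "api-3", requests: 25 },
  { pod: "api-4", requests: 25 },
  { pod: "api-5", requests: 25 },
  { pod: "api-6", requests: 25 },
];
const share = topPodShare(podLoads, 2);
console.log("top 2 pods handled " + Math.round(share * 100) + "% of requests");
```

With evenly balanced traffic, two of six pods would handle roughly a third of the requests, so a share far above that is a signal to look at how connections are being distributed.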
You can repeat this process as many times as needed, and in fact it took us one more pass to feel we had fully resolved the issue. Something didn't sit right with us. We were seeing this error at extremely high loads, well beyond anything we had seen in the real world. So we went back to square one and analyzed the production logs that led up to the OOM errors. We quickly found a pattern: most of the time, the errors would pop up after spikes in traffic. So we tweaked our load tests to mimic real-world traffic, and finally, after all this, we were seeing exactly what we saw in production. It turned out that our pods were not scaling up fast enough and our nodes were being underutilized. We resolved this by improving our scaling process between Google Kubernetes Engine and our application infrastructure itself. We go into more detail in another blog post.
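A spike-shaped test looks quite different from a gradual ramp. The sketch below builds one, again using the `{ duration, target }` shape of k6's `stages` option. The durations and the spike multiplier are placeholder assumptions, not the values we used. Ideally you would derive them from the spikes in your own production logs.

```typescript
// One step of a load profile, shaped like an entry in k6's `stages` option.
interface Stage {
  duration: string;
  target: number; // virtual users to ramp to during this step
}

// Hypothetical helper: sit at normal traffic, jump quickly to a peak, hold it,
// then drop back. All durations are assumed placeholders.
function buildSpikeStages(baselineUsers: number, spikeMultiplier: number): Stage[] {
  const peak = Math.round(baselineUsers * spikeMultiplier);
  return [
    { duration: "5m", target: baselineUsers }, // settle at normal traffic
    { duration: "30s", target: peak }, // sudden spike
    { duration: "3m", target: peak }, // hold the peak so slow scaling is visible
    { duration: "30s", target: baselineUsers }, // drop back down
    { duration: "5m", target: baselineUsers }, // watch the system recover
  ];
}

const spikeStages = buildSpikeStages(100, 4);
console.log(spikeStages.map((s) => s.duration + " @ " + s.target).join(", "));
```

The short ramp into the peak is the important part. If new pods take longer to become ready than the spike takes to arrive, the existing pods absorb the whole burst, which is the kind of failure a slow ramp never reveals.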
Maintaining infrastructure is tough, and with modern distributed systems it has only gotten tougher. However, by combining load testing with modern practices, you can optimize your system to be low cost and highly scalable at the same time. Building a resilient system may take some effort, but it is well worth it to know that your service will hold up when you need it the most.
Have any other recommendations about building resilient systems? We'd love to hear them! Email us at firstname.lastname@example.org!