How we scaled the cloud infrastructure at a $20M+ startup

Marek Kádek
October 7, 2022

It's the end of 2021. The days were cold, dark, and short. And our cloud infrastructure was sagging with the weight of all our customers.


Let’s set the scene. 

It’s the end of 2021. Our cloud infrastructure (databases, services, caches) was happily crunching data for more than a thousand customers — but it was reaching its limits. 

Chili Piper was growing rapidly, and the way our cloud infrastructure was set up would not suffice for much longer.

These issues hit us hardest during peak business hours, when some customers would start hitting timeouts: slow responses, a sluggish user experience, sometimes an unreachable website.

In this article, I’ll explain how we went from “startup” to “scaleup” — by building a team to take our infrastructure to the next level.

Infrastructure: The silent hero of every tech company

Infrastructure is the backbone holding up your entire product. 

If the infrastructure is underprovisioned, it can cause outages. Slow load times. Even data loss. 

The longer it stays underprovisioned, the more the customer experience can decline. 

The more a company grows, the more stress is placed on the infrastructure. And problems like data loss and degraded performance become more likely. 

This is when we need experts to step in: Site Reliability Engineers (SREs).

Fun fact: Our infrastructure was initially built by backend engineers. 

This is generally fine for a startup, when your engineering team is tiny and the number of customers is limited. But as the company grows and enters scaleup mode, it's time for more specialized roles to step in. 

Getting started

Back to our story. 

It’s December 2021. The days are short, dark, and cold, and our infrastructure is sagging with the weight of all our customers. 

We rolled up our sleeves, formed a firefighter team, and got to work — by “we” I mean myself and Łukasz, a member of our engineering team who was already experienced in cloud infrastructure. 

This period was stressful. Sleepless. And one of the most challenging of my professional life.

But also one of the most rewarding.

A dangerous road ahead

Once we looked under the hood of our infrastructure, we found there were a lot of challenges waiting for us. 

Seeing the dangers in front of us, I set up an on-call rotation for Łukasz and me. 

This turned out to be the right decision. We were hit by a series of performance degradations and outages from all directions (MongoDB issues, certificate expirations in the Kubernetes control plane, service crash loops, you name it). 

This was a signal to me that we needed to bring in more experts.

We hired two more SREs — but we didn’t rush the hiring process. This is key. 

Even though we were under a lot of pressure and constantly fighting fires, we took our time with hiring to make sure we found the right people to do the job.

And we started chipping away at a long list of things we needed to do in order to increase stability and decrease technical debt:

- Scale our MongoDB and ensure each deployment in each environment has a replica: Our databases were underprovisioned and in some critical places were deployed in standalone mode with no replicas (see the first sketch after this list).

- Update the version of MongoDB: We were on version 3.2, which had reached end-of-life in September 2018. We needed to get to a more recent version for bug fixes, security patches, and improved stability and clustering. Our target was the 5.x series.

- Change our Kubernetes cluster from public to private and expose it via Cloud NAT: For both security and operational reasons, some of our larger clients required us to provide them with a fixed list of IP addresses for our machines so they could configure whitelisting on their Salesforce instances. Unless we moved to a private cluster, we couldn't upgrade or scale our cluster without disrupting their services.

- Move toward Infrastructure as Code and GitOps: Every change to infrastructure should be audited and easy to revert if required. Because changes are transparent and visible to every engineer, it's also easier to step in and help fix a problem when one occurs.

- Fix our release process: Releasing a new version had been very brittle due to a legacy in-house solution. We needed to modernize it and make it reliable, so we chose Flux to complement our GitOps tooling. 

- Move batch jobs from in-house logic to Kubernetes CronJobs.

- Add tracing and metrics support to our services: We recognized the need to understand more thoroughly what's happening at the application level (see the tracing sketch after this list).

- Configure dashboards and alerts for each environment (and maintain them as code so they are reproducible across every environment).

- Improve the logic of our clustering: We have two production clusters that are not isolated from each other, with cross-cluster dependencies (e.g. one production cluster needing to talk to the other), leading to bugs that are very hard to solve.

- Unify configuration across environments (staging, testing, production): To eliminate configuration drift and the bugs that come from very different setups.

- Support Product teams with the infrastructure required for new products: Engineers were developing new services leveraging Kafka and Postgres.

- Support the Quality Assurance department: Creating a new environment for automated tests.
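
To make a couple of these items more concrete, here are two minimal sketches. They are illustrative only, not our production code, and the hostnames, replica set name, and service name in them are placeholders.

First, a quick health check against MongoDB with pymongo. It connects to what should be a replica set, prints the server version (so we know how far we are from the 5.x target), and lists member states. On a standalone node the `replSetGetStatus` command fails, which is exactly the misconfiguration we wanted to flush out:

```python
from pymongo import MongoClient

# Placeholder hosts and replica set name -- not our real topology.
client = MongoClient(
    "mongodb://db-0.internal,db-1.internal,db-2.internal",
    replicaSet="rs0",
    readPreference="secondaryPreferred",
)

# How far are we from the 5.x target?
print("server version:", client.server_info()["version"])

# Fails on a standalone deployment -- the misconfiguration we want to catch.
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    print(member["name"], member["stateStr"])  # PRIMARY / SECONDARY / ...
```

Second, the shape of the tracing work, shown here with the OpenTelemetry Python SDK (our services aren't necessarily Python; the idea translates to any language the SDK supports). The console exporter and the `booking-service` name are stand-ins:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Register a tracer provider for this (hypothetical) service.
provider = TracerProvider(
    resource=Resource.create({"service.name": "booking-service"})
)
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Wrap a request handler in a span so we can see where the time goes.
with tracer.start_as_current_span("handle_booking_request"):
    ...  # application logic
```

In production you'd point the span processor at a real backend (an OTLP exporter instead of the console one), but even this much makes it obvious where a slow request spends its time.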

A lot of these requirements are intertwined. For example, we can't improve upon our clustering logic without making our Kubernetes clusters private first. 

And many of them were very time sensitive. An underprovisioned database was already causing customer dissatisfaction due to poor performance. Topped by the lack of a replica, it could potentially lead to full-blown outages. 

Inability to release new code reliably led to hotfixes and features being delayed, or to only partially successful deployments. 

After a lot of work, our team quickly established itself as critical to the entire organization. We’re the first responders, helping Customer Support, Quality Assurance, and Product teams on a daily basis. 

By the middle of 2022, product utilization was higher than ever. This, coupled with on-call duty and slow progress toward our goal of eliminating tech debt, led us to hire an additional engineer. 

The best is yet to come

Seeing the mountain of work we need to do can be daunting, especially knowing how critical it is. 

Joining a scaleup and reworking its legacy infrastructure is never easy. You feel the whole weight of the business on your shoulders. 

You’re responsible for the reliability of something that doesn’t follow standard practices. 

There’s a lack of documentation. 

And you’re working in a system you didn’t build yourself.

So why would you do it? 

Because there isn’t anything like the feeling you get when you solve a meaty challenge.

If that sounds like you, you should be glad to know our job is far from done — and our team is looking for new recruits.

If you’re… 

  • Ready for a challenge
  • Eager to feel full ownership of something
  • Courageous enough to perform critical upgrades under stress
  • Helpful, compassionate, and innovative
  • Ready to take ownership, adopt a growth mindset, and have fun (à la the Chili Piper values)

…then join our talent community.

Our infrastructure is already in a much healthier state than when we started, but there’s a whole new future waiting for us. And I can’t wait for it.


Marek Kádek

Marek is the Director of Engineering at Chili Piper. He enjoys digging into tough challenges and coming up with creative solutions. When he’s not coding at his computer, you can find him training judo, reading books, or playing with his parrot. For more engineering tips, follow Marek on LinkedIn.
