SRE NEWSLETTER

Issue #48 // December 10, 2021

The big news this week is, of course, the AWS outage. I actually had 3 articles to review for the SRE Newsletter from AWS re:Invent, but parts of Amazon appear to still be down.

The outage highlights a couple of things: 1) if it makes sense for your application, use multiple regions, and 2) outages happen even to the best of us.

Amazon Services Down for Thousands of Users
// bbc.co.uk
There are a lot of "AWS is down" articles this week because, well, it's kind of a big deal when the largest cloud provider has an outage. This short article lists some of the impacts, like my poor Roomba vacuum.
[Tool] The Honest AWS Health Dashboard
// stop.lying.cloud
I guess it took a while for Amazon to admit its own outage. Someone decided to create a website with one of the best URLs ever.
[Long Read] Let's Make Faster GitLab CI/CD Pipelines
// blog.nimbleways.com
I hate slow builds and deployments. With everything being pushed left, CI/CD pipelines keep getting slower. This post walks through the process of optimizing a 14-minute pipeline down to 3 minutes, exploring the most common problems, their solutions, and the trade-offs along the way.
[Long Read] Anti-Patterns When Building Container Images
// jpetazzo.github.io
The list: big images (caused by all-in-one images or including large datasets), small images, rebuilding common bases, building from the root of a giant monorepo, not using BuildKit, requiring rebuilds for every change, using custom scripts instead of existing tools, forcing things to run in containers, using overly complex tools, conflicting names for scripts and images, and building with Dockerfiles.
Why Your Cloud Infrastructure Should Be Immutable
// stroobants.dev
We've heard about immutable infrastructure for a while, but mainstream tooling is finally up to the task. Here's a post that summarizes the benefits of immutable infrastructure.
Scaling Kafka at Honeycomb
// honeycomb.io
Honeycomb has grown 10x in two years while total cost of ownership for their Kafka setup has only gone up 20%. This post describes how they've managed to scale their telemetry system without dedicated Kafka engineers (with the help of some beta AWS instances).
Microservices, Constant Iteration, and the Shelf Life of Good Solutions
// opslevel.com
In this interview, Diederik van Liere, VP of Data & Engineering at Wealthsimple (previously Shopify and Wikipedia), discusses how service ownership has changed as the company has grown. Many assume service ownership is static, but as a business grows, engineering organizations need to figure out how to manage ownership changes so they evolve with the business.
Meta Selects AWS as Strategic Cloud Provider
// businesswire.com
A short PR piece about how another major company, Meta (AKA Facebook), is going with AWS. The partnership will drive the use of PyTorch for ML models.