SRE NEWSLETTER

Issue #52 // February 23, 2024

A few big outages hit the tech world this week. We spend a lot of time thinking about fault tolerance, scalability, and monitoring. But it's important to remember that sometimes an outage is caused intentionally by an outside threat.

We must consider the relationship between reliability and security. While AT&T assures us that their issue wasn't caused by a cyber attack, Change Healthcare's and many others' were.

AT&T Cellular Service Restored After Daylong Outage
// cnbc.com
I'm sure many of us felt this one, either as an AT&T customer or while trying to call one. Service was down for 10 hours, and the causes are still unknown. Communication from the telecommunications company has been less than stellar.
US Health Tech Giant Change Healthcare Hit by Cyber Attack
// techcrunch.com
Starting on Wednesday and continuing into Friday, what was first thought to be a network issue turned out to be a cybersecurity incident.
Monitoring Latency: Cloudflare Workers vs Fly vs Koyeb vs Railway vs Render
// openstatus.dev
Want to know which cloud providers offer the lowest latency? This post compares the latency of Cloudflare Workers, Fly, Koyeb, Railway and Render.
Unpacking Linux Containers: Understanding Docker and Its Alternatives
// optimizedbyotto.com
The right container technology depends on where you intend to ship your containers. This post compares Docker, Podman, LXC and LXD.
The Billion Row Challenge [Long Read]
// questdb.io
This one is a bit obscure, but an interesting technical challenge. Write a program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows!
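For a feel of the task, here is a minimal single-threaded sketch in Python. It assumes each row looks like `station;temperature` (for example `Hamburg;12.0`); that layout and the `summarize` function name are assumptions made for illustration, not details taken from the article. It keeps only a running min, max, sum, and count per station, so memory grows with the number of stations rather than the number of rows.

```python
from typing import Dict, Iterable, List, Tuple


def summarize(lines: Iterable[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {station: (min, mean, max)} for rows shaped like 'station;temperature'.

    The row format is an assumption for this sketch.
    """
    # Per station: [min, max, running sum, row count]
    stats: Dict[str, List[float]] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        station, _, value = line.partition(";")
        temp = float(value)
        entry = stats.get(station)
        if entry is None:
            stats[station] = [temp, temp, temp, 1]
        else:
            if temp < entry[0]:
                entry[0] = temp
            if temp > entry[1]:
                entry[1] = temp
            entry[2] += temp
            entry[3] += 1
    # Convert the running totals into (min, mean, max)
    return {
        station: (low, total / count, high)
        for station, (low, high, total, count) in stats.items()
    }
```

Because a file object yields its lines lazily, `summarize(open(path))` streams the input instead of loading it all at once. A plain loop like this is only a baseline; getting through a billion rows quickly is where the real challenge starts.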
What is Database Sharding?
// pingcap.com
If you're running into database scaling issues and haven't heard of sharding, this article gives a good introduction to the options and trade-offs.
SSDs Have Become Ridiculously Fast, Except in the Cloud
// databasearchitects.blogspot.com
Despite AWS continuing to roll out new NVMe drives, cloud IO has stagnated. The author and commenters hypothesize some reasons, but if you really need to be fast, bare metal may be required.
Resend - Incident report for February 21st, 2024
// resend.com
As a counterexample to the AT&T outage, Resend shares a detailed postmortem of a recent outage caused by a database migration.