DevOps

Kubernetes, the toxic coworker my team couldn’t fire

The Slack notification arrived with the heavy, damp enthusiasm of a wet dog jumping into your lap while you are wearing a tuxedo. It was late on a Thursday, the specific hour when ambitious caffeine consumption turns into existential regret, and the message was brief.

“I don’t think I can do this anymore. Not the coding. The infrastructure. I’m out.”

This wasn’t a junior developer overwhelmed by the concept of recursion. This was my lead backend engineer. A human Swiss Army knife who had spent nine years navigating the dark alleys of distributed systems and could stare down a production outage with the heart rate of a sleeping tortoise. He wasn’t leaving because of burnout from long hours, or an equity dispute, or even because someone microwaved fish in the breakroom.

He was leaving because of Kubernetes.

Specifically, he was leaving because the tool we had adopted to “simplify” our lives had slowly morphed into a second, unpaid job that required the patience of a saint and the forensic skills of a crime scene investigator. We had turned his daily routine of shipping features into a high-stakes game of Operation, where touching the wrong YAML indentation caused the digital equivalent of a sewer backup.

It was a wake-up call that hit me harder than the realization that the Tupperware at the back of my fridge has evolved its own civilization. We treat Kubernetes like a badge of honor, a maturity medal we pin to our chests. But the dirty secret everyone is too polite to whisper at conferences is that we have invited a chaotic, high-maintenance tyrant into our homes and given it the master bedroom.

When the orchestrator becomes a lifestyle disease

We tend to talk about “cognitive load” in engineering with the same sterile detachment we use to discuss disk space or latency. It sounds clean. Manageable. But in practice, the cognitive load imposed by a raw, unabstracted Kubernetes setup is less like a hard drive filling up and more like trying to cook a five-course gourmet meal while a badger is gnawing on your ankle.

The promise was seductive. We were told that Kubernetes would be the universal adapter for the cloud. It would be the operating system of the internet. And in a way, it is. But it is an operating system that requires you to assemble the kernel by hand every morning before you can open your web browser.

My star engineer didn’t want to leave. He just wanted to write code that solved business problems. Instead, he found himself spending 40% of his week debugging ingress controllers that behaved like moody teenagers (silent, sullen, and refusing to do what they were told) and wrestling with pod eviction policies that seemed based on the whim of a vengeful god rather than logic.

We had fallen into the classic trap of Resume Driven Development. We handed application developers the keys to the cluster and told them they were now “DevOps empowered.” In reality, this is like handing a teenager the keys to a nuclear submarine because they once successfully drove a golf cart. It doesn’t empower them. It terrifies them.

(And let’s be honest, most backend developers look at a Kubernetes manifest with the same mix of confusion and horror that I feel when looking at my own tax returns.)

The archaeological dig of institutional knowledge

The problem with complexity is that it rarely announces itself with a marching band. It accumulates silently, like dust bunnies under a bed, or plaque in an artery.

When we audited our setup after the resignation, we found that our cluster had become a museum of good intentions gone wrong. We found Helm charts that were so customized they effectively constituted a new, undocumented programming language. We found sidecar containers attached to pods for reasons nobody could remember, sucking up resources like barnacles on the hull of a ship, serving no purpose other than to make the diagrams look impressive.

This is what I call “Institutional Knowledge Debt.” It represents the sort of fungal growth that occurs when you let complexity run wild. You know it is there, evolving its own ecosystem, but as long as you don’t look at it directly, you don’t have to acknowledge that it might be sentient.

The “Bus Factor” in our team (the number of people who can get hit by a bus before the project collapses) had reached a terrifying number: one. And that one person had just quit. We had built a system where deploying a hotfix required a level of tribal knowledge usually reserved for initiating members into a secret society.

YAML is just a ransom note with better indentation

If you want to understand why developers hate modern infrastructure, look no further than the file format we use to define it. YAML.

We found files in our repository that were less like configuration instructions and more like love letters written by a stalker: intense, repetitive, and terrifyingly vague about their actual intentions.

The fragility of it is almost impressive. A single misplaced space, a tab character where a space should be, or a dash that looked at you the wrong way, and the entire production environment simply decides to take the day off. It is absurd that in an era of AI assistants and quantum computing, our billion-dollar industries hinge on whether a human being pressed the spacebar two times or four times.
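To make the fragility concrete, here is a minimal sketch of a pre-flight check that flags the two classic offenders: tab characters in indentation (which YAML does not allow) and indents that are not a multiple of the expected width. The function name and the two-space convention are assumptions made for illustration. It is not a YAML parser, and it will raise false alarms inside block scalars, where odd indentation is legitimate.

```python
def find_indent_problems(text: str, indent_width: int = 2) -> list[str]:
    """Return human-readable warnings for risky leading whitespace."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.lstrip(" \t")
        if not stripped or stripped.startswith("#"):
            continue  # blank lines and comment lines carry no structure
        leading = line[: len(line) - len(stripped)]
        if "\t" in leading:
            problems.append(f"line {number}: tab in indentation")
        elif len(leading) % indent_width != 0:
            problems.append(f"line {number}: indent of {len(leading)} spaces")
    return problems


# A hypothetical manifest fragment with one tab and one three-space indent.
manifest = "spec:\n  replicas: 2\n\ttemplate:\n   metadata: {}\n"
for problem in find_indent_problems(manifest):
    print(problem)
```

Run against the fragment above, it reports line 3 for the tab and line 4 for the three-space indent. A real linter does far more, but even this much catches the "pressed the spacebar the wrong number of times" class of mistake before it reaches the cluster.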

Debugging these files is not engineering. It is hermeneutics. It is reading tea leaves. You stare at the CrashLoopBackOff error message, which is the system’s way of saying “I am unhappy, but I will not tell you why,” and you start making sacrifices to the gods of indentation.

My engineer didn’t hate the logic. He hated the medium. He hated that his intellect was being wasted on the digital equivalent of untangling Christmas lights.

We built a platform to stop the bleeding

The solution to this mess was not to hire “better” engineers who memorized the entire Kubernetes API documentation. That is a strategy akin to buying larger pants instead of going on a diet. It accommodates the problem, but it doesn’t solve it.

We had to perform an exorcism. But not a dramatic one with spinning heads. A boring, bureaucratic one.

We embraced Platform Engineering. Now, that is a buzzword that usually makes my eyes roll back into my head so far I can see my own frontal lobe, but in this case, it was the only way out. We decided to treat the platform as a product and our developers as the customers: customers who confuse easily and frighten easily.

We took the sharp objects away.

We built “Golden Paths.” In plain English, this means we created templates that work. If a developer wants to deploy a microservice, they don’t need to write a 400-line YAML manifesto. They fill out a form that asks five questions, such as: What is it called? How much memory does it need? Who do we call if it breaks?
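As a rough illustration of what sits behind such a form, the sketch below expands a few answers into a Kubernetes Deployment manifest. It is not our actual template. The function name, the `example.com/oncall` annotation key, the image parameter, and the default of two replicas are all assumptions invented for the example. It prints JSON, which `kubectl apply -f` accepts just as it accepts YAML.

```python
import json


def render_deployment(name: str, memory_mib: int, oncall: str,
                      image: str, replicas: int = 2) -> dict:
    """Expand a short form into a Deployment manifest (as a plain dict)."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {
            "name": name,
            "labels": labels,
            # Hypothetical annotation key recording who to call if it breaks.
            "annotations": {"example.com/oncall": oncall},
        },
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "resources": {
                            "requests": {"memory": f"{memory_mib}Mi"},
                            "limits": {"memory": f"{memory_mib}Mi"},
                        },
                    }],
                },
            },
        },
    }


# Hypothetical answers from the form.
form = {
    "name": "billing-api",
    "memory_mib": 512,
    "oncall": "team-payments",
    "image": "registry.example.com/billing-api:1.0.0",
}
print(json.dumps(render_deployment(**form), indent=2))
```

The point of the design is that the selector, the labels, and the resource block are filled in by the platform once, so the developer never gets the chance to make them disagree.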

We hid the Kubernetes API behind a curtain. We stopped asking application developers to care about PodDisruptionBudgets or node affinity rules. Asking a Java developer to configure node affinity is like asking a passenger on an airplane to help calibrate the landing gear. It is not their job, and if they are doing it, something has gone terribly wrong.

Boring is the only metric that matters

After three months of stripping away the complexity, something strange happened. The silence.

The Slack channel dedicated to deployment support, previously a scrolling wall of panic and “why is my pod pending?” screenshots, went quiet. Deployments became boring.

And let me tell you, in the world of infrastructure, boring is the new sexy. Boring means things work. Boring means I can sleep through the night without my phone buzzing across the nightstand like an angry hornet.

Kubernetes is a marvel of engineering. It is powerful, scalable, and robust. But it is also a dense, hostile environment for humans. It is an industrial-grade tool. You don’t put an industrial lathe in your home kitchen to slice carrots, and you shouldn’t force every developer to operate a raw Kubernetes cluster just to serve a web page.

If you are hiring brilliant engineers, you are paying for their ability to solve logic puzzles and build features. If you force them to spend half their week fighting with infrastructure, you are effectively paying a surgeon to mop the hospital floors.

So look at your team. Look at their eyes. If they look tired, not from the joy of creation but from the fatigue of fighting their own tools, you might have a problem. That star engineer isn’t planning their next feature. They are drafting their resignation letter, and it probably won’t be written in YAML.

Your cloud bill is just a mirror of your engineering soul

I have a theory that usually gets me uninvited to the best tech parties. It is a controversial opinion, the kind that makes people shift uncomfortably in their ergonomic chairs and check their phones. Here it is. AWS is not expensive. AWS is actually a remarkably fair judge of character. Most of us are just bad at using it. We are not unlucky, nor are we victims of some grand conspiracy by Jeff Bezos to empty our bank accounts. We are simply lazy in ways that we are too embarrassed to admit.

I learned this the hard way, through a process that felt less like a financial audit and more like a very public intervention.

The expensive silence of a six-figure mistake

Last year, our AWS bill crossed a number that made the people in finance visibly sweat. It was a six-figure sum appearing monthly, a recurring nightmare dressed up as an invoice. The immediate reactions from the team were predictable: a chorus of denial that sounded like a broken record. People started whispering about the insanity of cloud pricing. We talked about negotiating discounts, even though we had no leverage. There was serious talk of going multi-cloud, which is usually just a way to double your problems while hoping for a synergy that never comes. Someone even suggested going back to on-prem servers, which is the technological equivalent of moving back in with your parents because your rent is too high.

We were looking for a villain, but the only villain in the room was our own negligence. Instead of pointing fingers at Amazon, we froze all new infrastructure for two weeks. We locked the doors and audited why every single dollar existed. It was painful. It was awkward. It was necessary.

We hired a therapist for our infrastructure

What we found was not a technical failure. It was a behavioral disorder. We found that AWS was not charging us for scale. It was charging us for our profound indifference. It was like leaving the water running in every sink in the house and then blaming the utility company for the price of water.

We had EC2 instances sized “just to be safe.” This is the engineering equivalent of buying a pair of XXXL sweatpants just in case you decide to take up sumo wrestling next Tuesday. We were paying for capacity we did not need, for a traffic spike that existed only in our anxious imaginations.

We discovered Kubernetes clusters wheezing along at 15% utilization. Imagine buying a Ferrari to drive to the mailbox at the end of the driveway once a week. That was our cluster. Expensive, powerful, and utterly bored.

There were NAT Gateways chugging along in the background, charging us by the gigabyte to forward traffic that nobody remembered creating. It was like paying a toll to cross a bridge that went nowhere. We had RDS instances over-provisioned for traffic that never arrived, like a restaurant staffed with fifty waiters for a lunch crowd of three.

Perhaps the most revealing discovery was our log retention policy. We were keeping CloudWatch logs forever because “storage is cheap.” It is not cheap when you are hoarding digital exhaust like a cat lady hoarding newspapers. We had autoscaling enabled without upper bounds, which is a bit like giving your credit card to a teenager and telling them to have fun. We had Lambdas retrying silently into infinity, little workers banging their heads against a wall forever.
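The cure for workers banging their heads against a wall forever is an explicit ceiling. The sketch below shows the general shape of a bounded retry with exponential backoff, written as plain application code. It is not an AWS Lambda setting, and the function name, the three-attempt default, and the half-second base delay are assumptions chosen for illustration.

```python
import time


def call_with_retries(operation, max_attempts: int = 3,
                      base_delay: float = 0.5, sleep=time.sleep):
    """Run operation, retrying on failure, but never more than max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up loudly instead of retrying into infinity
            # Wait longer after each failure: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** (attempt - 1))


# A hypothetical job that fails twice and then succeeds.
attempts = []


def flaky_job():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("downstream unavailable")
    return "done"


print(call_with_retries(flaky_job, base_delay=0.01))
```

The same idea applies to autoscaling: the number that matters is not how fast it grows but the value at which it is forced to stop.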

None of this was AWS being greedy. This was engineering apathy. This was the result of a comforting myth that engineers love to tell themselves.

The hoarding habit of the modern engineer

“If it works, do not touch it.”

This mantra makes sense for stability. It is a lovely sentiment for a grandmother’s antique clock. It is a disaster for a cloud budget. AWS does not reward working systems. It rewards intentional systems. Every unmanaged default becomes a subscription you never canceled, a gym membership you keep paying for because you are too lazy to pick up the phone and cancel it.

Big companies can survive this kind of bad cloud usage because they can hide the waste in the couch cushions of their massive budgets. Startups cannot. For a startup, a few bad decisions can double your runway burn, force hiring freezes, and kill experimentation before it begins. I have seen companies rip out AWS, not because the technology failed, but because they never learned how to say no to it. They treated the cloud like an all-you-can-eat buffet and forgot that the bill still arrives at the end.

Denial is a terrible financial strategy

If your AWS bill feels random, you do not understand your system. If cost surprises you, your architecture is opaque. It is like finding a surprise charge on your credit card and realizing you have no idea what you bought. It is a loss of control.

We realized that if we needed a “FinOps tool” to explain our bill, our infrastructure was already too complex. We did not need another dashboard. We needed a mirror.

The boring magic of actually caring

We did not switch clouds. We did not hire expensive consultants to tell us what we already knew. We did not buy magic software to fix our mess. We did four boring, profoundly unsexy things.

First, every resource needed an owner. We stopped treating servers like communal property. If you spun it up, you fed it. Second, every service needed a cost ceiling. We put a leash on the spending. Third, every autoscaler needed a maximum limit. We stopped the machines from reproducing without permission. Fourth, every log needed a delete date. We learned to take out the trash.
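Those four rules are simple enough to check mechanically. The following is a minimal sketch of such an audit, assuming you can export your inventory into records like these. The `Resource` fields and the sample entries are hypothetical and do not correspond to any AWS API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    name: str
    owner: Optional[str] = None                 # rule 1: someone feeds it
    monthly_budget_usd: Optional[float] = None  # rule 2: a cost ceiling
    max_replicas: Optional[int] = None          # rule 3: an autoscaler maximum
    log_retention_days: Optional[int] = None    # rule 4: a delete date for logs


def audit(resource: Resource) -> list[str]:
    """Return the rules this resource breaks (empty list means compliant)."""
    findings = []
    if not resource.owner:
        findings.append("no owner")
    if resource.monthly_budget_usd is None:
        findings.append("no cost ceiling")
    if resource.max_replicas is None:
        findings.append("autoscaler has no maximum")
    if resource.log_retention_days is None:
        findings.append("logs never expire")
    return findings


# Made-up inventory for illustration.
inventory = [
    Resource("checkout-api", owner="team-payments", monthly_budget_usd=1200.0,
             max_replicas=10, log_retention_days=30),
    Resource("legacy-batch"),
]
for resource in inventory:
    findings = audit(resource)
    print(resource.name, "->", ", ".join(findings) if findings else "ok")
```

Nothing here is clever, which is the point. The check only has to exist and run often enough that an unowned resource gets noticed within days rather than quarters.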

The results were almost insulting in their simplicity. Costs dropped 43% in 30 days. There were no outages. There were no late-night heroics. We did not rewrite the core platform. We just applied a little bit of discipline.

Why this makes engineers uncomfortable

Cost optimization exposes bad decisions. It forces you to admit that you over-engineered a solution. It forces you to admit that you scaled too early. It forces you to admit that you trusted defaults because you were too busy to read the manual. It forces you to admit that you avoided the hard conversations about budget.

It is much easier to blame AWS. It is comforting to think of them as a villain. It is harder to admit that we built something nobody questioned.

The brutal honesty of the invoice

AWS is not the villain here. It is a mirror. It shows you exactly how careless or thoughtful your architecture is, and it translates that carelessness into dollars. You can call it expensive. You can call it unfair. You can migrate to another cloud provider. But until you fix how you design systems, every cloud will punish you the same way. The problem is not the landlord. The problem is how you are living in the house.

It brings me to a final question that every engineering leader should ask themselves. If your AWS bill doubled tomorrow, would you know why? Would you know exactly where the money was going? Would you know what to delete first?
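One way to make that question concrete: if you can get per-service totals for two billing periods, ranking the differences takes a few lines. This is only a sketch. The function name is an assumption, and the service totals are made-up numbers rather than figures from our bill.

```python
def explain_increase(previous: dict[str, float],
                     current: dict[str, float]) -> list[tuple[str, float]]:
    """Rank services by how much their cost grew between two periods."""
    services = set(previous) | set(current)
    deltas = {s: current.get(s, 0.0) - previous.get(s, 0.0) for s in services}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Made-up monthly totals in USD, for illustration only.
last_month = {"EC2": 40000.0, "RDS": 15000.0, "CloudWatch": 5000.0}
this_month = {"EC2": 42000.0, "RDS": 15000.0, "CloudWatch": 21000.0,
              "NAT Gateway": 9000.0}

for service, delta in explain_increase(last_month, this_month):
    print(f"{service:12} {delta:+10.2f}")
```

If the top of that list surprises you, that is the answer to the question above.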

If the answer is no, the problem is not AWS. And deep down, in the quiet moments when the invoice arrives, you already know that. This article might make some people angry. That is good. Anger is cheaper than denial. And frankly, it is much better for your bottom line.

January 11, 2026 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

How to stay employable when the tools keep changing

I was at my desk the other day attempting to achieve what passes for serenity in modern IT, which is to say I was watching a Kubernetes cluster behave like a supermarket trolley with one cursed wheel. Everything looked stable in the dashboard, which, in cloud terms, is the equivalent of a toddler saying “I am being very quiet” from the other room.

That was when a younger colleague appeared at the edge of my monitor like a pop-up window you simply cannot close.

“Can I ask you something?” he said.

This phrase is rarely followed by useful inquiries, such as “Where do you keep the biscuits?” It is invariably followed by something philosophical, the kind of question that makes you suddenly aware you have become the person other people treat as a human FAQ.

“Is it worth it?” he asked. “All of this. The studying. The certifications. The on-call shifts. With AI coming to take it all away.”

He did not actually use the phrase “robot overlords”, but it hung in the air anyway, right beside that other permanent office presence, the existential dread that arrives every Monday morning and sits down without introducing itself.

Being “senior” in the technology sector is a funny thing. It is not like being a wise mountain sage who understands the mysteries of the wind. It is more like being the only person in the room who remembers what the internet looked like before it became a shopping mall with a comment section. You are not necessarily smarter. You are simply older, and you have survived enough migrations to know that the universe is largely held together by duct tape and misunderstood configuration files.

So I looked at him, panicked slightly, and decided to tell him the truth.

The accidental trap of the perfect puzzle piece

The problem with the way we build careers, especially in engineering, is that we treat ourselves like replacement parts for a very specific machine. We spend years filing down our edges, polishing our corners, and making sure we fit perfectly into a slot labelled “Java Developer” or “Cloud Architect.”

This strategy works wonderfully right up until the moment the machine decides to change its shape.

When that happens, being a perfect puzzle piece is actually a liability. You are left holding a very specific shape in a world that has suddenly decided it prefers round holes. This brings us to the trap of the specialist. The specialist is safe, comfortable, and efficient. But the specialist is also the first thing to be replaced when the algorithm learns how to do the job faster.

The alternative sounds exhausting. It is the path of the “Generalist.”

To a logical brain that enjoys defined parameters, a generalist looks suspiciously like someone who cannot make up their mind. But in the coming years, the generalist (confusing as they may be) is the only one safe from extinction. The generalist does not ask “Where do I fit?” The generalist asks, “What am I trying to build?” and then learns whatever is necessary to build it. It is less like being a factory worker and more like being a frantic homeowner trying to fix a leak with a roll of tape and a YouTube video. It is messy, but unlike the factory worker, the homeowner cannot be automated out of existence because the problems they solve are never exactly the same twice.

The four horsemen of the career apocalypse

Once you accept that the future will not reward narrow excellence, you stumble upon an equally alarming discovery regarding the skills that actually matter. The usual list tends to circle around four eternal pillars known to induce hives in most engineers: marketing, sales, writing, and speaking.

If you work in DevOps or cloud, these words likely land with the gentle comfort of a cold spoon sliding down your back. We tend to view marketing and sales as the parts of the economy where people smile too much and perhaps use too much hair gel. Writing and public speaking, meanwhile, are often just painful reminders of that time we accidentally said “utilize” in a meeting when “use” would have sufficed.

But here is a useful reframing I have been trying to adopt.

Marketing and sales are not trickery. They are simply “the message”. They are the ability to explain to another human being why something matters. If you have ever tried to convince a Product Manager that technical debt is real and dangerous, you have done sales. If you failed, it was likely because your marketing was poor.

Writing and speaking are not performance art. They are “the medium”. In a world where AI can generate code in seconds, the ability to write clean code becomes less valuable than the ability to write a clean explanation of why we need that code. The modern career is increasingly about communicating value rather than just quietly creating it in a dark room. The “Artist” focuses on the craft. The “Sellout” focuses on the money. The goal, irritating as it may be, is to become the “Artist-Entrepreneur” who respects the craft enough to sell it properly.

The museum of ideas and the art of dissatisfaction

So how does one actually prepare for this vaguely threatening future?

The advice usually involves creating a “Vision Board” with pictures of yachts and people laughing at salads. I have always found this difficult, mostly because my vision usually extends no further than wanting my printer to work on the first try.

A far more effective tool is the “Anti-vision”.

This involves looking at the life you absolutely do not want and running in the opposite direction. It is a powerful motivator. I can quite easily visualize a future of endless Zoom meetings where we discuss the synergy of leverage, and that vision propels me to learn new skills faster than any promise of a Ferrari ever could.

This leads to the concept of curating a “Museum of Ideas”. You do not need to be a genius inventor. You just need to be a curator. You collect the ideas, people, and concepts that resonate with you, and you try to figure out why they work. It is reverse engineering, which is something we are actually good at. We do it with software all the time. Doing it with our careers feels strange, but the logic holds. You look at the result you want, and you work backward to find the source code.

This process requires you to embrace a certain amount of boredom and dissatisfaction. We usually treat boredom as a bug in the system, something to be patched immediately with scrolling or distraction. But boredom is actually a feature. It is the signal that it is time to evolve. AI does not get bored. It will happily generate generic emails until the heat death of the universe. Only a human gets bored enough to invent something better.

The currency of confidence

So, back to the colleague at my desk, who was still looking at me with the expectant face of a spaniel waiting for a treat.

I told him that yes, it is worth it. But the game has changed.

We are moving from an economy of “knowing things” (which computers do better) to an economy of “connecting things” (which is still a uniquely human mess). The future belongs to the people who can see the whole system, not just the individual lines of code.

When the output of AI becomes abundant and cheap, the value shifts to confidence. Not the loud, arrogant confidence of a television pundit, but the quiet confidence of someone who understands the trade-offs. Employers and clients will not pay you for the code; they will pay you for the assurance that this specific code is the right solution for their specific, messy reality. They pay for taste. They pay for trust.

If the robots are indeed coming for our jobs, the safest position is not to stand guard over one tiny task. It is to become the person who can see the entire ridiculous machine, spot the real problem, and explain it in plain English while everyone else is still arguing about which dashboard is lying.

That, happily, remains a very human talent.

Now, if you will excuse me, I have to start building my museum of ideas right after I figure out why my Linux kernel has decided to panic-dump in the middle of an otherwise peaceful afternoon. I suspect it, too, has been reading about the future and just wanted to feel something.

January 8, 2026 by Fernando SRE Computer Science stuff DevOps stuff

Microservices are the architectural equivalent of a midlife crisis

Someone in a zip-up hoodie has just told you that monoliths are architectural heresy. They insist that proper companies, the grown-up ones with rooftop terraces and kombucha taps in the breakroom, build systems the way squirrels store acorns. They describe hundreds of tiny, frantic caches scattered across the forest floor, each with its own API, its own database, and its own emotional baggage.

You stand there nodding along while holding your warm beer, feeling vaguely inadequate. You hide the shameful secret that your application compiles in less time than it takes to brew a coffee. You do not mention that your code lives in a repository that does not require a map and a compass to navigate. Your system runs on something scandalously simple. It is a monolith.

Welcome to the cult of small things. We have been expecting you, and we have prepared a very complicated seat for you.

The insecurity of the monolithic developer

The microservices revolution did not begin with logic. It began with envy. It started with a handful of very successful case studies that functioned less like technical blueprints and more like impossible beauty standards for teenagers.

Netflix streams billions of hours of video. Amazon ships everything from electric toothbrushes to tactical uranium (probably) to your door in two days. Their systems are vast, distributed, and miraculous. So the industry did what any rational group of humans would do. We copied their homework without checking if we were taking the same class.

We looked at Amazon’s architecture and decided that our internal employee timesheet application needed the same level of distributed complexity as a global logistics network. This is like buying a Formula 1 pit crew to help you parallel park a Honda Civic. It is technically impressive, sure. But it is also a cry for help.

Suddenly, admitting you maintained a monolith became a confession. Teams began introducing themselves at conferences by stating their number of microservices, the way bodybuilders flex biceps, or suburban dads compare lawn mower horsepower. “We are at 150 microservices,” someone would say, and the crowd would murmur approval. Nobody thought to ask if those services did anything useful. Nobody questioned whether the team spent more time debugging network calls than writing features.

The promise was flexibility. The reality became a different kind of rigidity. We traded the “spaghetti code” of the monolith for something far worse. We built a distributed bowl of spaghetti where the meatballs are hosted on different continents, and the sauce requires a security token to touch the pasta.

Debugging a murder mystery where the body keeps moving

Here is what the brochures and the Medium articles do not mention. Debugging a monolith is straightforward. You follow the stack trace like a detective following footprints in the snow.

Debugging a distributed system, however, is less like solving a murder mystery and more like investigating a haunting. The evidence vanishes. The logs are in different time zones. Requests pass through so many services that by the time you find the culprit, you have forgotten the crime.

Everything works perfectly in isolation. This is the great lie of the unit test. Your service A works fine. Your service B works fine. But when you put them together, you get a Rube Goldberg machine that occasionally processes invoices but mostly generates heat and confusion.

To solve this, we invented “observability,” which is a fancy word for hiring a digital private investigator to stalk your own code. You need a service discovery tool. Then, a distributed tracing library. Then a circuit breaker, a bulkhead, a sidecar proxy, a configuration server, and a small shrine to the gods of eventual consistency.
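Stripped of the tooling, the core trick is modest: every service stamps its log lines with a shared request ID, and something converts the timestamps to one clock before sorting. The sketch below is a much simplified stand-in for a tracing library, not a replacement for one. The field names, service names, and log entries are hypothetical.

```python
from datetime import datetime, timezone


def to_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a UTC offset and normalize it to UTC."""
    return datetime.fromisoformat(timestamp).astimezone(timezone.utc)


def trace_request(entries: list[dict], request_id: str) -> list[dict]:
    """Collect one request's log entries from all services, ordered by UTC time."""
    matching = [e for e in entries if e["request_id"] == request_id]
    return sorted(matching, key=lambda e: to_utc(e["time"]))


# Made-up log lines from three services logging in three time zones.
logs = [
    {"service": "billing", "time": "2026-01-06T04:00:02-05:00",
     "request_id": "req-42", "message": "card declined"},
    {"service": "orders", "time": "2026-01-06T10:00:01+01:00",
     "request_id": "req-42", "message": "order created"},
    {"service": "orders", "time": "2026-01-06T10:00:00+01:00",
     "request_id": "req-7", "message": "order created"},
    {"service": "gateway", "time": "2026-01-06T09:00:00+00:00",
     "request_id": "req-42", "message": "request received"},
]

for entry in trace_request(logs, "req-42"):
    print(to_utc(entry["time"]).time(), entry["service"], entry["message"])
```

Sorted by local wall-clock time, the billing failure would appear hours before the request that caused it. Sorted by UTC, the footprints in the snow come back: gateway, then orders, then billing.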

Your developer productivity begins a gentle, heartbreaking decline. A simple feature, such as adding a “middle name” field to a user profile, now requires coordinating three teams, two API version bumps, and a change management ticket that will be reviewed next Thursday. The context switching alone shaves IQ points off your day. You have solved the complexity of the monolith by creating fifty mini monoliths, each with its own deployment pipeline and its own lonely maintainer who has started talking to the linter.

Your infrastructure bill is now a novelty item

There is a financial aspect to this midlife crisis. In the old days, you rented a server. Maybe two. You paid a fixed amount, and the server did the work.

In the microservices era, you are not just paying for the work. You are paying for the coordination of the work. You are paying for the network traffic between the services. You are paying for the serialization and deserialization of data that never leaves your data center. You are paying for the CPU cycles required to run the orchestration tools that manage the containers that hold the services that do the work.
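The serialization part of that tax is easy to measure on your own machine. The sketch below compares a plain function call with the same call wrapped in the JSON encoding and decoding a service boundary would require. It deliberately leaves out the network, so it understates the real gap. The function names and the sample order are assumptions, and no timings are quoted because they vary by machine.

```python
import json
import timeit


def price_in_process(order: dict) -> int:
    """Total an order in integer cents with an ordinary function call."""
    return sum(item["qty"] * item["unit_cents"] for item in order["items"])


def price_via_service(order: dict) -> int:
    """Do the same work, plus the encoding a service hop would need."""
    request_body = json.dumps(order)       # caller serializes the request
    received = json.loads(request_body)    # callee deserializes it
    response_body = json.dumps({"total_cents": price_in_process(received)})
    return json.loads(response_body)["total_cents"]  # caller decodes the reply


# A made-up order for illustration.
order = {"items": [{"sku": f"toy-{i}", "qty": 2, "unit_cents": 499}
                   for i in range(20)]}

direct = timeit.timeit(lambda: price_in_process(order), number=20000)
wrapped = timeit.timeit(lambda: price_via_service(order), number=20000)
print(f"in-process: {direct:.3f}s  with serialization: {wrapped:.3f}s")
```

Both functions return the same total. One of them simply spends most of its time translating data into text and back, which is work you pay for on every call between services.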

It is an administrative tax. It is like hiring a construction crew where one guy hammers the nail, and twelve other guys stand around with clipboards coordinating the hammering angle, the hammer velocity, and the nail impact assessment strategy.

Amazon Prime Video found this out the hard way. In a move that shocked the industry, they published a case study detailing how they moved from a distributed, serverless architecture back to a monolithic structure for one of their core monitoring services.

The results were not subtle. They reduced their infrastructure costs by 90 percent. That is not a rounding error. That is enough money to buy a private island. Or at least a very nice yacht. They realized that sending video frames back and forth between serverless functions was the digital equivalent of mailing your socks to yourself one at a time. It was inefficient, expensive, and silly.

The myth of infinite scalability

Let us talk about that word. Scalability. It gets whispered in architectural reviews like a magic spell. “But will it scale?” someone asks, and suddenly you are drawing boxes and arrows on a whiteboard, each box a little fiefdom with its own database and existential dread.

Here is a secret that might get you kicked out of the hipster coffee shop. Most systems never see the traffic that justifies this complexity. Your boutique e-commerce site for artisanal cat toys does not need to handle Black Friday traffic every Tuesday. It could likely run on a well-provisioned server and a prayer. Using microservices for these workloads is like renting an aircraft hangar to store a bicycle.

Scalability comes in many flavors. You can scale a monolith horizontally behind a load balancer. You can scale specific heavy functions without splitting your entire domain model into atomic particles. Docker and containers gave us consistent deployment environments without requiring a service mesh so complex that it needs its own PhD program to operate.

The infinite scalability argument assumes you will be the next Google. Statistically, you will not. And even if you are, you can refactor later. It is much easier to slice up a monolith than it is to glue together a shattered vase.

Making peace with the boring choice

So what is the alternative? Must we return to the bad old days of unmaintainable codeballs?

No. The alternative is the modular monolith. This sounds like an oxymoron, but it functions like a dream. It is the architectural equivalent of a sensible sedan. It is not flashy. It will not make people jealous at traffic lights. But it starts every morning, it carries all your groceries, and it does not require a specialized mechanic flown in from Italy to change the oil.

You separate concerns inside the same codebase. You make your boundaries clear. You enforce modularity with code structure rather than network latency. When a module truly needs to scale differently, or a team truly needs autonomy, you extract it. You do this not because a conference speaker told you to, but because your profiler and your sprint retrospectives are screaming it.
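One way to enforce boundaries with code structure rather than network hops is a small check, run in CI, that complains when one module reaches into another's internals. What follows is a minimal sketch, not a prescription. The module names (billing, catalog, shipping) and the convention that each module exposes only an `api` submodule are assumptions made up for illustration.

```python
import ast

# Hypothetical modules of the monolith; by convention each one
# exposes its public surface only through `<module>.api`.
MODULES = {"billing", "catalog", "shipping"}


def boundary_violations(owner: str, source: str) -> list[str]:
    """Return imports in `source` (code owned by `owner`) that reach into
    another module's internals instead of its public `api` submodule."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue
        for name in names:
            parts = name.split(".")
            if parts[0] in MODULES and parts[0] != owner:
                # Allowed: `catalog.api` and anything beneath it.
                if len(parts) < 2 or parts[1] != "api":
                    violations.append(name)
    return violations
```

Feeding it the text of a billing file that contains `from catalog.models import Item` would flag `catalog.models.Item`, while `from catalog.api import get_item` passes. The check is deliberately strict: a bare `import catalog` is also flagged, which may or may not suit your codebase.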

Your architecture should match your team size. Three engineers do not need a service per person. They need a codebase they can understand without opening seventeen browser tabs. There is no shame in this. The shame is in building a distributed system so brittle that every deploy feels like defusing a bomb in an action movie, but without the cool soundtrack.

Epilogue

Architectural patterns are like diet fads. They come in waves, each promising total transformation. One decade, it is all about small meals, the next it is intermittent fasting, the next it is eating only raw meat like a caveman.

The truth is boring and unmarketable. Balance works. Microservices have their place. They are essential for organizations with thousands of developers who need to work in parallel without stepping on each other’s toes. They are great for systems that genuinely have distinct, isolated scaling needs.

For everything else, simplicity remains the ultimate sophistication. It is also the ultimate sanity preserver.

Next time someone tells you monoliths are dead, ask them how many incident response meetings they attended this week. The answer might be all the architecture review you need.

(Footnote: If they answer “zero,” they are either lying, or their pager duty alerts are currently stuck in a dead letter queue somewhere between Service A and Service B.)

January 6, 2026 by Fernando SRE Cloud stuff Computer Science stuff DevOps stuff SRE stuff

Kubernetes leases or the art of waiting for the bathroom

If you looked inside a running Kubernetes cluster with a microscope, you would not see a perfectly choreographed ballet of binary code. You would see a frantic, crowded open-plan office staffed by thousands of employees who have consumed dangerous amounts of espresso. You have schedulers, controllers, and kubelets all sprinting around, frantically trying to update databases and move containers without crashing into each other.

It is a miracle that the whole thing does not collapse into a pile of digital rubble within seconds. Most human organizations of this size descend into bureaucratic infighting before lunch. Yet, somehow, Kubernetes keeps this digital circus from turning into a riot.

You might assume that the mechanism preventing this chaos is a highly sophisticated, cryptographic algorithm forged in the fires of advanced mathematics. It is not. The thing that keeps your cluster from eating itself is the distributed systems equivalent of a sticky note on a door. It is called a Lease.

And without this primitive, slightly passive-aggressive little object, your entire cloud infrastructure would descend into anarchy faster than you can type kubectl delete namespace.

The sticky note of power

To understand why a Lease is necessary, we have to look at the psychology of a Kubernetes controller. These components are, by design, incredibly anxious. They want to ensure that the desired state of the world matches the actual state.

The problem arises when you want high availability. You cannot just have one controller running because if it dies, your cluster stops working. So you run three replicas. But now you have a new problem. If all three replicas try to update the same routing table or create the same pod at the exact same moment, you get a “split-brain” scenario. This is the technical term for a psychiatric emergency where the left hand deletes what the right hand just created.

Kubernetes solves this with the Lease object. Technically, it is an API resource in the coordination.k8s.io group. Spiritually, it is a “Do Not Disturb” sign hung on a doorknob.

If you look at the YAML definition of a Lease, it is almost insultingly simple. It does not ask for a security clearance or a biometric scan. It essentially asks three questions:

holderIdentity: Who are you?
leaseDurationSeconds: How long are you going to be in there?
renewTime: When was the last time you shouted that you are still alive?

Here is what one looks like in the wild:

apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: cluster-coordination-lock
  namespace: kube-system
spec:
  holderIdentity: "controller-pod-beta-09"
  leaseDurationSeconds: 15
  renewTime: "2023-10-27T10:04:05.000000Z"

In plain English, this document says: “Controller Beta-09 is holding the steering wheel. It has fifteen seconds to prove it has not died of a heart attack. If it stays silent for sixteen seconds, we are legally allowed to pry the wheel from its cold, dead fingers.”
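The arithmetic behind that sentence is small enough to sketch. The function below is a hypothetical helper, not part of any Kubernetes client. It assumes the timestamp format shown in the YAML above and a single trusted clock, which is a simplification: real clients tend to measure from when they last observed a renewal, precisely because clocks on different machines disagree.

```python
from datetime import datetime, timedelta, timezone


def lease_expired(renew_time: str, lease_duration_seconds: int, now: datetime) -> bool:
    """True once `now` is past renewTime + leaseDurationSeconds."""
    renewed = datetime.strptime(renew_time, "%Y-%m-%dT%H:%M:%S.%fZ").replace(
        tzinfo=timezone.utc
    )
    return now > renewed + timedelta(seconds=lease_duration_seconds)


# The Lease from the example above, checked 14 and 16 seconds after renewal.
renew = "2023-10-27T10:04:05.000000Z"
still_fine = lease_expired(renew, 15, datetime(2023, 10, 27, 10, 4, 19, tzinfo=timezone.utc))
too_late = lease_expired(renew, 15, datetime(2023, 10, 27, 10, 4, 21, tzinfo=timezone.utc))
```

Here `still_fine` is False and `too_late` is True: silence for fourteen seconds is forgiven, silence for sixteen is not.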

An awkward social experiment

To really grasp the beauty of this system, we need to leave the server room and enter a shared apartment with a terrible design flaw. There is only one bathroom, the lock is broken, and there are five roommates who all drank too much water.

The bathroom is the “critical resource.” In a computerized world without Leases, everyone would just barge in whenever they felt the urge. This leads to what engineers call a “race condition” and what normal people call “an extremely embarrassing encounter.”

Since we cannot fix the lock, we install a whiteboard on the door. This is the Lease.

The rules of this apartment are strict but effective. When you walk up to the door, you write your name and the current time on the board. You have now acquired the lock. As long as your name is there and the timestamp is fresh, the other roommates will stand in the hallway, crossing their legs and waiting politely.

But here is where it gets stressful. You cannot just write your name and fall asleep in the tub. The system requires constant anxiety. Every few seconds, you have to crack the door open, reach out with a marker, and update the timestamp. This is the “heartbeat.” It tells the people waiting outside that you are still conscious and haven’t slipped in the shower.

If you faint, or if the WiFi cuts out and you cannot reach the whiteboard, you stop updating the time. The roommates outside watch the clock. Ten seconds pass. Fifteen seconds. At sixteen seconds, they do not knock to see if you are okay. They assume you are gone forever, wipe your name off the board, write their own, and barge in.

It is ruthless, but it ensures that the bathroom is never left empty just because the previous occupant vanished into the void.

The paranoia of leader election

The most critical use of this bathroom logic is something called Leader Election. This is the mechanism that keeps your kube-controller-manager and kube-scheduler from turning into a bar fight.

You typically run multiple copies of these control plane components for redundancy. However, you absolutely cannot have five different schedulers trying to assign the same pod to five different nodes simultaneously. That would be like having five conductors trying to lead the same orchestra. You do not get music; you get noise and a lot of angry musicians.

So, the replicas hold an election. But it is not a democratic vote with speeches and ballots. It is a race to grab the marker.

The moment the controllers start up, they all rush toward the Lease object. The first one to write its name in the holderIdentity field becomes the Leader. The others, the candidates, do not go home. They stand in the corner, staring at the Lease, refreshing the page every two seconds, waiting for the Leader to fail.

There is something deeply human about this setup. The backup replicas are not “supporting” the leader. They are jealous understudies watching the lead actor, hoping he breaks a leg so they can take center stage.

If the Leader crashes or simply gets stuck in a network traffic jam, the renewTime stops updating. The lease expires. Immediately, the backups scramble to write their own name. The winner takes over the cluster duties instantly. It is seamless, automated, and driven entirely by the assumption that everyone else is unreliable.
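The race for the marker can be simulated in a few lines. This is a toy, in-memory sketch, not the actual Kubernetes implementation: the `Lease` class and `try_acquire_or_renew` function are invented for illustration, time is a plain number on one shared clock, and the conflict detection that the API server provides when two replicas write at once is left out entirely.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder_identity: Optional[str] = None
    lease_duration_seconds: int = 15
    renew_time: float = 0.0  # seconds on a single shared clock (assumption)


def try_acquire_or_renew(lease: Lease, identity: str, now: float) -> bool:
    """Return True if `identity` is the leader after this attempt."""
    expired = now > lease.renew_time + lease.lease_duration_seconds
    if lease.holder_identity not in (None, identity) and not expired:
        return False  # someone else holds a fresh lease; keep staring at it
    # Either the lease is free, already ours, or has expired: write our name.
    lease.holder_identity = identity
    lease.renew_time = now
    return True
```

If replica A acquires at second 0 and renews at second 10, replica B is turned away at second 20 and only gets in after second 25, once A has been silent for longer than the lease duration.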

Reducing the noise pollution

In the early days of Kubernetes, things were even messier. Nodes, the servers doing the actual work, had to prove they were alive by sending a massive status report to the API server every few seconds.

Imagine a receptionist who has to process a ten-page medical history form from every single employee every ten seconds, just to confirm they are at their desks. It was exhausting. The API server spent so much time reading these reports that it barely had time to do anything else.

Today, Kubernetes uses Leases for node heartbeats, too. Instead of the full medical report, the node just updates a Lease object. It is a quick, lightweight ping.

“I’m here.”

“Good.”

“Still here.”

“Great.”

This change reduced the computational cost of staying alive significantly. The API server no longer needs to know your blood pressure and cholesterol levels every ten seconds; it just needs to know you are breathing. It turns a bureaucratic nightmare into a simple check-in.

How to play with fire

The beauty of the Lease system is that it is just a standard Kubernetes object. You can see these invisible sticky notes right now. If you list the leases in the system namespace, you will see the invisible machinery that keeps the lights on:

kubectl get leases -n kube-system

You will see entries for the controller manager and the scheduler, among others. (The per-node heartbeat Leases live next door in their own namespace, kube-node-lease.) If you want to see who the current boss is, you can describe the lease:

kubectl describe lease kube-scheduler -n kube-system

You will see the holderIdentity. That is the name of the replica currently running the show.

Now, if you are feeling particularly chaotic, or if you just want to see the world burn, you can delete a Lease manually.

kubectl delete lease kube-scheduler -n kube-system

Please do not do this in production unless you enjoy panic attacks.

Deleting an active Lease is like ripping the “Occupied” sign off the bathroom door while someone is inside. You are effectively lying to the system. You are telling the backup controllers, “The leader is dead! Long live the new leader!”

The backups will rush in and elect a new leader. But the old leader, who was effectively just sitting there minding its own business, is still running. Suddenly, it realizes it has been fired without notice. Ideally, it steps down gracefully. But in the split second before it realizes what happened, you might have two controllers giving orders.

The system will heal itself, usually within seconds, but those few seconds are a period of profound confusion for everyone involved.

The survival of the loudest

Leases are the unsung heroes of the cloud native world. We like to talk about Service Meshes and eBPF and other shiny, complex technologies. But at the bottom of the stack, keeping the whole thing from exploding, is a mechanism as simple as a name on a whiteboard.

It works because it accepts a fundamental truth about distributed systems: nothing is reliable, everyone is going to crash eventually, and the only way to maintain order is to force components to shout “I am alive!” every few seconds.

Next time your cluster survives a node failure or a controller restart without you even noticing, spare a thought for the humble Lease. It is out there in the void, frantically renewing timestamps, protecting you from the chaos of a split-brain scenario. And that is frankly better than a lock on a bathroom door any day.

January 1, 2026 by Fernando SRE Cloud stuff DevOps stuff Kubernetes Linux Stuff SRE stuff

Managing the emotional stability of your Linux server

Thursday, 3:47 AM. Your server is named Nigel. You named him Nigel because deep down, despite the silicon and the circuitry, he feels like a man who organizes his spice rack alphabetically by the Latin name of the plant. But right now, Nigel is not organizing spices. Nigel has decided to stage a full-blown existential rebellion.

The screen is black. The network fan is humming with a tone of passive-aggressive silence. A cursor blinks in the upper-left corner with a rhythm that seems designed specifically to induce migraines. You reboot. Nigel reboots. Nothing changes. The machine is technically “on,” in the same way a teenager staring at the ceiling for six hours is technically “awake.”

At this moment, the question separating the seasoned DevOps engineer from the panicked googler is not “Why me?” but rather: Which personality did Nigel wake up with today?

This is not a technical question. It is a psychological one. Linux does not break at random; it merely changes moods. It has emotional states. And once you learn to read them, troubleshooting becomes less like exorcising a demon and more like coaxing a sulking relative out of the bathroom during Thanksgiving dinner.

The grumpy grandfather who started it all

We lived in a numeric purgatory for years. In an era when “multitasking” sounded like dangerous witchcraft and coffee came only in one flavor (scorched), Linux used a system called SysVinit to manage its temperaments. This system boiled the entire machine’s existence down to a handful of numbers, zero through six, called runlevels.

It was a rigid caste system. Each number was a dial you could turn to decide how much Nigel was willing to participate in society.

Runlevel 0 meant Nigel was checking out completely. Death. Runlevel 6 meant Nigel had decided to reincarnate. Runlevel 1 was Nigel as a hermit monk, holed up in a cave with no network, no friends, just a single shell and a vow of digital silence. Runlevel 5 was Nigel on espresso and antidepressants, graphical interface blazing, ready to party and consume RAM for no apparent reason.

This was functional, in the way a Soviet-era tractor is functional. It was also about as intuitive as a dishwasher manual written in cuneiform. You would tell a junior admin to “boot to runlevel 3,” and they would nod while internally screaming. What does three mean? Is it better than two? Is five twice as good as three? The numbers did not describe anything; they just were, like the arbitrary rules of a board game invented by someone who actively hated you.

And then there was runlevel 4. Runlevel 4 is the appendix of the Linux anatomy. It is vaguely present, historically relevant, but currently just taking up space. It was the “user-definable” switch in your childhood home that either did nothing or controlled the neighbor’s garage door. It sits there, unused, gathering digital dust.

Enter the overly organized therapist

Then came systemd. If SysVinit was a grumpy grandfather, systemd is the high-energy hospital administrator who carries a clipboard and yells at people for walking too slowly. Systemd took one look at those numbered mood dials and was appalled. “Numbers? Seriously? Even my router has a name.”

It replaced the cold digits with actual descriptive words: multi-user.target, graphical.target, rescue.target. It was as if Linux had finally gone to therapy and learned to use its words to express its feelings instead of grunting “runlevel 3” when it really meant “I need personal space, but WiFi would be nice.”

Targets are just runlevels with a humanities degree. They perform the exact same job, defining which services start, whether the GUI is invited to the party, whether networking gets a plus-one, but they do so with the kind of clarity that makes you wonder how we survived the numeric era without setting more server rooms on fire.

A Rosetta Stone for Nigel’s mood swings

Here is the translation guide that your cheat sheet wishes it had. Think of this as the DSM-5 for your server.

Runlevel 0 becomes poweroff.target
Nigel is taking a permanent nap. This is the Irish Goodbye of operating states.
Runlevel 1 becomes rescue.target
Nigel is in intensive care. Only family is allowed to visit (root user). The network is unplugged, the drives might be mounted read-only, and the atmosphere is grim. This is where you go when you have broken something fundamental and need to perform digital surgery.
Runlevel 3 becomes multi-user.target
Nigel is wearing sweatpants but answering emails. This is the gold standard for servers. Networking is up, multiple users can log in, cron jobs are running, but there is no graphical interface to distract anyone. It is a state of pure, joyless productivity.
Runlevel 5 becomes graphical.target
Nigel is in full business casual with a screensaver. He has loaded the window manager, the display server, and probably a wallpaper of a cat. He is ready to interact with a mouse. He is also consuming an extra gigabyte of memory just to render window shadows.
Runlevel 6 becomes reboot.target
Nigel is hitting the reset button on his life.
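For the script-minded, the same translation guide fits in a lookup table. The sketch below is illustrative only. The function name is made up, and the entries for runlevels 2 and 4, which the guide above skips, follow systemd's default compatibility aliases, where both point at multi-user.target.

```python
# Runlevel to systemd target, following systemd's default compatibility aliases.
RUNLEVEL_TO_TARGET = {
    0: "poweroff.target",
    1: "rescue.target",
    2: "multi-user.target",
    3: "multi-user.target",
    4: "multi-user.target",
    5: "graphical.target",
    6: "reboot.target",
}


def target_for(runlevel: int) -> str:
    """Translate a SysVinit runlevel into its systemd target name."""
    try:
        return RUNLEVEL_TO_TARGET[runlevel]
    except KeyError:
        raise ValueError(f"no such runlevel: {runlevel}") from None
```

So `target_for(3)` gives back `multi-user.target`, and asking for runlevel 7 raises a `ValueError`, which is more feedback than SysVinit ever offered.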

The command line couch

Knowing Nigel’s mood is useless unless you can change it. You need tools to intervene. These are the therapy techniques you keep in your utility belt.

To eyeball Nigel’s default personality (the one he wakes up with every morning), you ask:

systemctl get-default

This might spit back graphical.target. This means Nigel is a morning person who greets the world with a smile and a heavy user interface. If it says multi-user.target, Nigel is the coffee-before-conversation type.

But sometimes, you need to force a mood change. Let’s say you want to switch Nigel from party mode (graphical) to hermit mode (text-only) without making it permanent. You are essentially putting an extrovert in a quiet room for a breather.

systemctl isolate multi-user.target

The word “isolate” here is perfect. It is not “disable” or “kill.” It is “isolate”. It sounds less like computer administration and more like what happens to the protagonist in the third act of a horror movie involving Antarctic research stations. It tells systemd to stop everything that doesn’t belong in the new target. The GUI vanishes. The silence returns.

To switch back, because sometimes you actually need the pretty buttons:

systemctl isolate graphical.target

And to permanently change Nigel’s baseline disposition, akin to telling a chronically late friend that dinner is at 6:30 when it is really at 7:00:

systemctl set-default multi-user.target

Now Nigel will always wake up in Command Line Interface mode, even after a reboot. You can practically hear the sigh of relief from your CPU as it realizes it no longer has to render pixels.

When Nigel has a real breakdown

Let’s walk through some actual disasters, because theory is just a hobby until production goes down and your boss starts hovering behind your chair breathing through his mouth.

Scenario one: The fugue state

Nigel updated his kernel and now boots to a black screen. He is not dead; he is just confused. You reboot, interrupt the boot loader, and add systemd.unit=rescue.target to the boot parameters.

Nigel wakes up in a safe room. It is a root shell. There is no networking. There is no drama. It is just you and the config files. It is intimate, in a disturbing way. You fix the offending setting, type systemctl reboot, and Nigel reboots into his normal self, slightly embarrassed about the whole episode.

Scenario two: The toddler on espresso

Nigel’s graphical interface has started crashing like a toddler after too much sugar. Every time you log in, the desktop environment panics and dies. Instead of fighting it, you switch to multi-user.target.

Nigel is now a happy, stable server with no interest in pretty icons. Your users can still SSH in. Your automated jobs still run. Nigel just doesn’t have to perform anymore. It is like taking the toddler out of the Chuck E. Cheese and putting him in a library. The screaming stops immediately.

Scenario three: The bloatware incident

Nigel is a production web server that has inexplicably slowed to a crawl. You dig through the logs and discover that an intern (let’s call him “Not-Fernando”) installed a full desktop environment six months ago because he liked the screensaver.

This is akin to buying a Ferrari to deliver pizza because you like the leather seats. The graphical target is eating resources that your database desperately needs. You set the default to multi-user.target and reboot. Nigel comes back lean, mean, and suddenly has five hundred extra megabytes of RAM to play with. It is like watching someone shed a winter coat in the middle of July.

The mindset shift

Beginners see a black screen and ask, “Why is Nigel broken?” Professionals see a black screen and ask, “Which target is Nigel in, and which services are active?”

This is not just semantics. It is the difference between treating a symptom and diagnosing a disease. When you understand that Linux doesn’t break so much as it changes states, you stop being a victim of circumstance and start being a negotiator. You are not praying to the machine gods; you are simply asking Nigel, “Hey buddy, what mood are you in?” and then coaxing him toward a more productive state.

The panic evaporates because you know the vocabulary. You know that rescue.target is a panic room, multi-user.target is a focused work session, and graphical.target is Nigel trying to impress someone at a party.

Linux targets are not arcane theory reserved for greybeards and certification exams. They are the foundational language of state management. They are how you tell Nigel, “It is okay to be a hermit today,” or “Time to socialize,” or “Let’s check you into therapy real quick.”

Once you internalize this, boot issues stop being terrifying mysteries. They become logical puzzles. Interviews stop being interrogations. They become conversations. You stop sounding like a generic admin reading a forum post and start sounding like someone who knows Nigel personally.

Because you do. Nigel is that fussy, brilliant, occasionally melodramatic friend who just needs the right kind of encouragement. And now you have the exact words to provide it.

December 22, 2025 by Fernando SRE Cloud stuff DevOps stuff Linux Stuff SRE stuff

Docker didn’t die, it just moved to your laptop

Docker used to be the answer you gave when someone asked, “How do we ship this thing?” Now it’s more often the answer to a different question, “How do I run this thing locally without turning my laptop into a science fair project?”

That shift is not a tragedy. It’s not even a breakup. It’s more like Docker moved out of the busy downtown apartment called “production” and into a cozy suburb called “developer experience”, where the lawns are tidy, the tools are friendly, and nobody panics if you restart everything three times before lunch.

This article is about what changed, why it changed, and why Docker is still very much worth knowing, even if your production clusters rarely whisper its name anymore.

What we mean when we say Docker

One reason this topic gets messy is that “Docker” is a single word used to describe several different things, and those things have very different jobs.

Docker Desktop is the product that many developers actually interact with day to day, especially on macOS and Windows.
Docker Engine and the Docker daemon are the background machinery that runs containers on a host.
The Docker CLI and Dockerfile workflow are the human-friendly interface and the packaging format that people have built habits around.

When someone says “Docker is dying,” they usually mean “Docker Engine is no longer the default runtime in production platforms.” When someone says “Docker is everywhere,” they often mean “Docker Desktop and Dockerfile workflows are still the easiest way to get a containerized dev environment running quickly.”

Both statements can be true at the same time, which is annoying, because humans prefer their opinions to come in single-serving packages.

Docker’s rise and the good kind of magic

Docker didn’t become popular because it invented containers. Containers existed before Docker. Docker became popular because it made containers feel approachable.

It offered a developer experience that felt like a small miracle:

You could build images with a straightforward command.
You could run containers without a small dissertation on Linux namespaces.
You could push to registries and share a runnable artifact.
You could spin up multi-service environments with Docker Compose.

Docker took something that used to feel like “advanced systems programming” and turned it into “a thing you can demo on a Tuesday.”

If you were around for the era of XAMPP, WAMP, and “download this zip file, then pray,” Docker felt like a modern version of that, except it didn’t break as soon as you looked at it funny.

The plot twist in production

Here is the part where the story becomes less romantic.

Production infrastructure grew up.

Not emotionally, obviously. Infrastructure does not have feelings. It has outages. But it did mature in a very specific way: platforms started to standardize around container runtimes and interfaces that did not require Docker’s full bundled experience.

Docker was the friendly all-in-one kitchen appliance. Production systems wanted an industrial kitchen with separate appliances, separate controls, and fewer surprises.

Three forces accelerated the shift.

Licensing concerns changed the mood

Docker Desktop licensing changes made a lot of companies pause, not because engineers suddenly hated Docker, but because legal teams developed a new hobby.

The typical sequence went like this:

Someone in finance asked, “How many Docker Desktop users do we have?”
Someone in legal asked, “What exactly are we paying for?”
Someone in infrastructure said, “We can probably do this with Podman or nerdctl.”

A tool can survive engineers complaining about it. Engineers complain about everything. The real danger is when procurement turns your favorite tool into a spreadsheet with a red cell.

The result was predictable: even developers who loved Docker started exploring alternatives, if only to reduce risk and friction.

The runtime world standardized without Docker

Modern container platforms increasingly rely on runtimes like containerd and interfaces like the Container Runtime Interface (CRI).

Kubernetes is a key example. In version 1.24, Kubernetes removed dockershim, the built-in Docker Engine integration that many people depended on in earlier years, and the ecosystem moved toward CRI-native runtimes. The point was not to “ban Docker.” The point was to standardize around an interface designed specifically for orchestrators.

This is a subtle but important difference.

Docker is a complete experience: build, run, network, UX, opinions included.
Orchestrators prefer modular components, and they want to speak to a runtime through a stable interface.

The practical effect is what most teams feel today:

In many Kubernetes environments, the runtime is containerd, not Docker Engine.
Managed platforms such as ECS Fargate and other orchestrated services often run containers without involving Docker at all.

Docker, the daemon, became optional.

Security teams like control, and they do not like surprises

Security teams do not wake up in the morning and ask, “How can I ruin a developer’s day?” They wake up and ask, “How can I make sure the host does not become a piñata full of root access?”

Docker can be perfectly secure when used well. The problem is that it can also be spectacularly insecure when used casually.

Two recurring issues show up in real organizations:

The Docker socket is powerful. Expose it carelessly, and you are effectively offering a fast lane to host-level control.
The classic pattern of “just give developers sudo docker” can become a horror story with a polite ticket number.

Tools and workflows that separate concerns tend to make security people calmer.

Build tools such as BuildKit and buildah isolate image creation.
Rootless approaches, where feasible, reduce blast radius.
Runtime components can be locked down and audited more granularly.

This is not about blaming Docker. It’s about organizations preferring a setup where the sharp knives are stored in a drawer, not taped to the ceiling.

What Docker is now

Docker’s new role is less “the thing that runs production” and more “the thing that makes local development less painful.”

And that role is huge.

Docker still shines in areas where convenience matters most:

Local development environments
Quick reproducible demos
Multi-service stacks on a laptop
Cross-platform consistency on macOS, Windows, and Linux
Teams that need a simple standard for “how do I run this?”

If your job is to onboard new engineers quickly, Docker is still one of the best ways to avoid the dreaded onboarding ritual where a senior engineer says, “It works on my machine,” and the junior engineer quietly wonders if their machine has offended someone.

A small example that still earns its keep

Here is a minimal Docker Compose stack that demonstrates why Docker remains lovable for local development.

services:
  api:
    build: .
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://postgres:example@db:5432/app
    depends_on:
      - db

  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
      POSTGRES_DB: app
    ports:
      - "5432:5432"

This is not sophisticated. That is the point. It is the “plug it in and it works” power that made Docker famous.
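Part of the charm is how little the application has to know. Inside the api container, the hostname db in DATABASE_URL resolves to the db service over the network Compose creates for the stack. As a rough sketch of the application side, assuming a hypothetical helper named `db_settings` (the original stack does not say what language the api is written in), the URL can be unpacked with nothing but the standard library:

```python
from urllib.parse import urlparse


def db_settings(url: str) -> dict:
    """Split a postgres:// URL into the pieces a database driver usually wants."""
    parts = urlparse(url)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }


# The value from the Compose file above; in a real app it would come from os.environ.
settings = db_settings("postgres://postgres:example@db:5432/app")
```

The resulting host is `db` and the port is `5432`, with no IP addresses or hand-edited hosts files anywhere in sight.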

Dockerfile is not the Docker daemon

This is where the confusion often peaks.

A Dockerfile is a packaging recipe. It is widely used. It remains a de facto standard, even when the runtime or build system is not Docker.

Many teams still write Dockerfiles, but build them using tooling that does not rely on the Docker daemon on the CI runner.

Here is a BuildKit example that builds and pushes an image without treating the Docker daemon as a requirement.

buildctl build \
  --frontend dockerfile.v0 \
  --local context=. \
  --local dockerfile=. \
  --output type=image,name=registry.example.com/app:latest,push=true

You can read this as “Dockerfile lives on, but Docker-as-a-daemon is no longer the main character.”

This separation matters because it changes how you design CI.

You can build images in environments where running a privileged Docker daemon is undesirable.
You can use builders that integrate better with Kubernetes or cloud-native pipelines.
You can reduce the amount of host-level power you hand out just to produce an artifact.

What replaced Docker in production pipelines

When teams say they are moving away from Docker in production, they rarely mean “we stopped using containers.” They mean the tooling around building and running containers is shifting.

Common patterns include:

containerd as the runtime in Kubernetes and other orchestrated environments
BuildKit for efficient builds and caching
kaniko for building images inside Kubernetes without a Docker daemon
ko for building and publishing Go applications as images without a Dockerfile
Buildpacks or Nixpacks for turning source code into runnable images using standardized build logic
Dagger and similar tools for defining CI pipelines that treat builds as portable graphs of steps

You do not need to use all of these. You just need to understand the trend.

Production platforms want:

Standard interfaces
Smaller, auditable components
Reduced privilege
Reproducible builds

Docker can participate in that world, but it no longer owns the whole stage.

A Kubernetes-friendly image build example

If you want a concrete example of the “no Docker daemon” approach, kaniko is a popular choice in cluster-native pipelines.

apiVersion: batch/v1
kind: Job
metadata:
  name: build-image-kaniko
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: kaniko
          image: gcr.io/kaniko-project/executor:latest
          args:
            - "--dockerfile=Dockerfile"
            - "--context=dir:///workspace"
            - "--destination=registry.example.com/app:latest"
          volumeMounts:
            - name: workspace
              mountPath: /workspace
      volumes:
        - name: workspace
          emptyDir: {}

This is intentionally simplified. In a real setup, you would bring your own workspace, your own auth mechanism, and your own caching strategy. But even in this small example, the idea is visible: build the image where it makes sense, without turning every CI runner into a tiny Docker host.

The practical takeaway for architects and platform teams

If you are designing platforms, the question is not “Should we ban Docker?” The question is “Where does Docker add value, and where does it create unnecessary coupling?”

A simple mental model helps.

Developer laptops benefit from a friendly tool that makes local environments predictable.
CI systems benefit from builder choices that reduce privilege and improve caching.
Production runtimes benefit from standardized interfaces and minimal moving parts.

Docker tends to dominate the first category, participates in the second, and is increasingly optional in the third.

If your team still uses Docker Engine on production hosts, that is not automatically wrong. It might be perfectly fine. The important thing is that you are doing it intentionally, not because “that’s how we’ve always done it.”

Why this is actually a success story

There is a temptation in tech to treat every shift as a funeral.

But Docker moving toward local development is not a collapse. It is a sign that the ecosystem absorbed Docker’s best ideas and made them normal.

The standardization of OCI images, the popularity of Dockerfile workflows, and the expectations around reproducible environments: all of that is Docker’s legacy living in the walls.

Docker is still the tool you reach for when you want to:

start fast
teach someone new
run a realistic stack on a laptop
avoid spending your afternoon installing the same dependencies in three different ways

That is not “less important.” That is foundational.

If anything, Docker’s new role resembles a very specific kind of modern utility.

It is like Visual Studio Code.

Everyone uses it. Everyone argues about it. It is not what you deploy to production, but it is the thing that makes building and testing your work feel sane.

Docker didn’t die.

It just moved to your laptop, brought snacks, and quietly let production run the serious machinery without demanding to be invited to every meeting.

December 18, 2025 by Fernando SRE DevOps stuff Kubernetes SRE stuff

Let IAM handle the secrets you can avoid

There are two kinds of secrets in cloud security.

The first kind is the legitimate kind: a third-party API token, a password for something you do not control, a certificate you cannot simply wish into existence.

The second kind is the kind we invent because we are in a hurry: long-lived access keys, copied into a config file, then copied into a Docker image, then copied into a ticket, then copied into the attacker’s weekend plans.

This article is about refusing to participate in that second category.

Not because secrets are evil. Because static credentials are the “spare house key under the flowerpot” of AWS. Convenient, popular, and a little too generous with access for something that can be photographed.

The goal is not “no secrets exist.” The goal is no secrets live in code, in images, or in long-lived credentials.

If you do that, your security posture stops depending on perfect human behavior, which is great because humans are famously inconsistent. (We cannot all be trusted with a jar of cookies, and we definitely cannot all be trusted with production AWS keys.)

Why this works in real life

AWS already has a mechanism designed to prevent your applications from holding permanent credentials: IAM roles and temporary credentials (STS).

When your Lambda runs with an execution role, AWS hands it short-lived credentials automatically. They rotate on their own. There is nothing to copy, nothing to stash, nothing to rotate in a spreadsheet named FINAL-final-rotation-plan.xlsx.

What remains are the unavoidable secrets, usually tied to systems outside AWS. For those, you store them in AWS Secrets Manager and retrieve them at runtime. Not at build time. Not at deploy time. Not by pasting them into an environment variable and calling it “secure” because you used uppercase letters.

This gives you a practical split:

Avoidable secrets are replaced by IAM roles and temporary credentials
Unavoidable secrets go into Secrets Manager, encrypted and tightly scoped

The architecture in one picture

A simple flow to keep in mind:

A Lambda function runs with an IAM execution role
The function fetches one third-party API key from Secrets Manager at runtime
The function calls the third-party API and writes results to DynamoDB
Network access to Secrets Manager stays private through a VPC interface endpoint (when the Lambda runs in a VPC)

The best part is what you do not see.

No access keys. No “temporary” keys that have been temporary since 2021. No secrets baked into ZIPs or container layers.

What this protects you from

This pattern is not a magic spell. It is a seatbelt.

It helps reduce the chance of:

Credentials leaking through Git history, build logs, tickets, screenshots, or well-meaning copy-paste
Forgotten key rotation schedules that quietly become “never”
Overpowered policies that turn a small bug into a full account cleanup
Unnecessary public internet paths for sensitive AWS API calls

Now let’s build it, step by step, with code snippets that are intentionally sanitized.

Step 1 build an IAM execution role with tight policies

The execution role is the front door key your Lambda carries.

If you give it access to everything, it will eventually use that access, if only because your future self will forget why it was there and leave it in place “just in case.”

Keep it boring. Keep it small.

Here is an example IAM policy for a Lambda that only needs to:

write to one DynamoDB table
read one secret from Secrets Manager
decrypt using one KMS key (optional, depending on how you configure encryption)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "WriteToOneTable",
      "Effect": "Allow",
      "Action": [
        "dynamodb:PutItem",
        "dynamodb:UpdateItem"
      ],
      "Resource": "arn:aws:dynamodb:eu-west-1:111122223333:table/app-results-prod"
    },
    {
      "Sid": "ReadOneSecret",
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:eu-west-1:111122223333:secret:thirdparty/weather-api-key-*"
    },
    {
      "Sid": "DecryptOnlyThatKey",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": "arn:aws:kms:eu-west-1:111122223333:key/12345678-90ab-cdef-1234-567890abcdef",
      "Condition": {
        "StringEquals": {
          "kms:ViaService": "secretsmanager.eu-west-1.amazonaws.com"
        }
      }
    }
  ]
}

A few notes that save you from future regret:

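The secret ARN ends with -* because Secrets Manager appends a hyphen and six random characters to the secret name. If other secrets share the same name prefix, the stricter -?????? pattern matches only that six-character suffix.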
The KMS condition helps ensure the key is used only through Secrets Manager, not as a general-purpose decryption service.
You can skip the explicit kms:Decrypt statement if you use the AWS-managed key and accept the default behavior, but customer-managed keys are common in regulated environments.
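To see what the -* suffix really lets through, here is a small sketch. It uses Python's fnmatch as a rough stand-in for IAM wildcard matching, which is an approximation and not the real policy engine. The second secret name and both six-character suffixes are invented for illustration.

```python
from fnmatch import fnmatchcase

PREFIX = "arn:aws:secretsmanager:eu-west-1:111122223333:secret:"

# Hypothetical full ARNs; the random suffixes are made up.
intended = PREFIX + "thirdparty/weather-api-key-AbC123"
neighbour = PREFIX + "thirdparty/weather-api-key-staging-XyZ789"

broad = PREFIX + "thirdparty/weather-api-key-*"
tight = PREFIX + "thirdparty/weather-api-key-??????"


def matches(pattern: str, arn: str) -> bool:
    # fnmatchcase treats * and ? much like IAM does; this is only an approximation.
    return fnmatchcase(arn, pattern)


broad_hits = [matches(broad, intended), matches(broad, neighbour)]
tight_hits = [matches(tight, intended), matches(tight, neighbour)]
print(broad_hits, tight_hits)
```

In this toy case the broad pattern also matches the neighbouring secret, while the six-question-mark pattern matches only the intended one. Whether that matters depends on how your secrets are named.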

Step 2 store the unavoidable secret properly

Secrets Manager is not a place to dump everything. It is a place to store what you truly cannot avoid.

A third-party API key is a perfect example because IAM cannot replace it. AWS cannot assume a role in someone else’s SaaS.

Use a JSON secret so you can extend it later without creating a new secret every time you add a field.

{
  "api_key": "REDACTED-EXAMPLE-TOKEN"
}

If you like the CLI (and I do, because buttons are too easy to misclick), create the secret like this:

aws secretsmanager create-secret \
  --name "thirdparty/weather-api-key" \
  --description "Token for the Weatherly API used by the ingestion Lambda" \
  --secret-string '{"api_key":"REDACTED-EXAMPLE-TOKEN"}' \
  --region eu-west-1

Then configure:

encryption with a customer-managed KMS key if required
rotation if the provider supports it (rotation is amazing when it is real, and decorative when the vendor does not allow it)

If the vendor does not support rotation, you still benefit from central storage, access control, audit logging, and removing the secret from code.

Step 3 lock down secret access with a resource policy

Identity-based policies on the Lambda role are necessary, but resource policies are a nice extra lock.

Think of it like this: your role policy is the key. The resource policy is the bouncer who checks the wristband.

Here is a resource policy that allows only one role to read the secret.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowOnlyIngestionRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
      },
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*"
    },
    {
      "Sid": "DenyEverythingElse",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:PrincipalArn": "arn:aws:iam::111122223333:role/lambda-ingestion-prod"
        }
      }
    }
  ]
}

This is intentionally strict. Strict is good. Strict is how you avoid writing apology emails.
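If the interplay between the key and the bouncer feels abstract, the following toy model may help. It is a deliberately simplified sketch of same-account evaluation for this one action, ignoring SCPs, permission boundaries, session policies, and the KMS key policy. The admin role and the function name are hypothetical.

```python
INGESTION_ROLE = "arn:aws:iam::111122223333:role/lambda-ingestion-prod"

# The two statements from the resource policy above, reduced to predicates.
RESOURCE_POLICY = [
    {"Sid": "AllowOnlyIngestionRole", "Effect": "Allow",
     "applies": lambda arn: arn == INGESTION_ROLE},
    {"Sid": "DenyEverythingElse", "Effect": "Deny",
     "applies": lambda arn: arn != INGESTION_ROLE},
]


def can_read_secret(principal_arn: str, identity_policy_allows: bool) -> str:
    """Toy decision for secretsmanager:GetSecretValue on this secret."""
    matching = [s for s in RESOURCE_POLICY if s["applies"](principal_arn)]
    if any(s["Effect"] == "Deny" for s in matching):
        return "Deny"  # an explicit deny overrides every allow
    if identity_policy_allows or any(s["Effect"] == "Allow" for s in matching):
        return "Allow"
    return "ImplicitDeny"


# Hypothetical admin role whose identity policy allows everything.
admin = "arn:aws:iam::111122223333:role/admin"
decisions = {
    "ingestion": can_read_secret(INGESTION_ROLE, identity_policy_allows=True),
    "admin": can_read_secret(admin, identity_policy_allows=True),
}
print(decisions)
```

The point of the sketch is the admin row: a generous identity policy does not help once the resource policy says no. That is also why you should plan how a human will manage this secret before you apply a deny this broad.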

Step 4 keep Secrets Manager traffic private with a VPC endpoint

If your Lambda runs inside a VPC, it will not automatically have internet access. That is often the point.

In that case, you do not want the function reaching Secrets Manager through a NAT gateway if you can avoid it. NAT works, but it is like walking your valuables through a crowded shopping mall because the back door is locked.

Use an interface VPC endpoint for Secrets Manager.

Here is a Terraform example (sanitized) that creates the endpoint and limits access using a dedicated security group.

resource "aws_security_group" "secrets_endpoint_sg" {
  name        = "secrets-endpoint-sg"
  description = "Allow HTTPS from Lambda to Secrets Manager endpoint"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    security_groups = [aws_security_group.lambda_sg.id]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_vpc_endpoint" "secretsmanager" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.eu-west-1.secretsmanager"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = [aws_subnet.private_a.id, aws_subnet.private_b.id]
  private_dns_enabled = true
  security_group_ids  = [aws_security_group.secrets_endpoint_sg.id]
}

If your Lambda is not in a VPC, you do not need this step. The function will reach Secrets Manager over AWS’s managed network path by default.

If you want to go further, consider adding a DynamoDB gateway endpoint too, so your function can write to DynamoDB without touching the public internet.

Step 5 retrieve the secret at runtime without turning logs into a confession

This is where many teams accidentally reinvent the problem.

They remove the secret from the code, then log it. Or they put it in an environment variable because “it is not in the repository,” which is a bit like saying “the spare key is not under the flowerpot, it is under the welcome mat.”

The clean approach is:

store only the secret name (not the secret value) as configuration
retrieve the value at runtime
cache it briefly to reduce calls and latency
never print it, even when debugging, especially when debugging

Here is a Python example for AWS Lambda with a tiny TTL cache.

import json
import os
import time
import boto3

_secrets_client = boto3.client("secretsmanager")
_cached_value = None
_cached_until = 0

SECRET_ID = os.getenv("THIRDPARTY_SECRET_ID", "thirdparty/weather-api-key")
CACHE_TTL_SECONDS = int(os.getenv("SECRET_CACHE_TTL_SECONDS", "300"))


def _get_api_key() -> str:
    global _cached_value, _cached_until

    now = int(time.time())
    if _cached_value and now < _cached_until:
        return _cached_value

    resp = _secrets_client.get_secret_value(SecretId=SECRET_ID)
    payload = json.loads(resp["SecretString"])

    api_key = payload["api_key"]
    _cached_value = api_key
    _cached_until = now + CACHE_TTL_SECONDS
    return api_key


def lambda_handler(event, context):
    api_key = _get_api_key()

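    # Use the key without ever logging it.
    # call_weatherly_api and write_to_dynamodb are placeholders for your own
    # client code; they are not defined in this snippet.
    results = call_weatherly_api(api_key=api_key, city=event.get("city", "Seville"))

    write_to_dynamodb(results)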

    return {
        "status": "ok",
        "items": len(results) if hasattr(results, "__len__") else 1
    }

This snippet is intentionally short. The important part is the pattern:

minimal secret access
controlled cache
zero secret output

If you prefer a library, AWS provides a Secrets Manager caching client for some runtimes, and AWS Lambda Powertools can help with structured logging. Use them if they fit your stack.
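For the “zero secret output” part, one possible safety net is a logging filter that scrubs known secret values before a record is written. This is a minimal sketch using only the standard library. The logger name and the fake token are assumptions, and it only catches exact values you hand it, so it complements the habit of not logging secrets rather than replacing it.

```python
import io
import logging


class RedactSecrets(logging.Filter):
    """Replace known secret values with a placeholder before a record is emitted."""

    def __init__(self, secrets):
        super().__init__()
        # Ignore empty strings so we never replace "" with the placeholder.
        self._secrets = [s for s in secrets if s]

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with %-style args already merged
        for secret in self._secrets:
            message = message.replace(secret, "[REDACTED]")
        record.msg = message
        record.args = None
        return True  # keep the record, just with a scrubbed message


def build_logger(stream, secrets) -> logging.Logger:
    logger = logging.getLogger("ingestion-demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(RedactSecrets(secrets))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer, ["example-token-123"])  # fake value for the demo
log.info("calling API with key=%s", "example-token-123")
output = buffer.getvalue()
print(output, end="")
```

In a real function you would register the value returned by your secret lookup with the filter once it is loaded, so the “just once” debug print has a harder time escaping.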

Step 6 make security noisy with logs and alarms

Security without visibility is just hope with a nicer font.

At a minimum:

enable CloudTrail in the account
ensure Secrets Manager events are captured
alert on unusual secret access patterns

A simple and practical approach is a CloudWatch metric filter for GetSecretValue events coming from unexpected principals. Another is to build a dashboard showing:

Lambda errors
Secrets Manager throttles
sudden spikes in secret reads
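To make “unexpected principals” concrete, here is a small sketch of the check you would express in a metric filter or a scheduled job. It works on CloudTrail-style records already parsed into dictionaries. The sample events are hand-written and trimmed to the few fields the code reads, and the user name is invented.

```python
EXPECTED_ROLE_ARN = "arn:aws:iam::111122223333:role/lambda-ingestion-prod"


def caller_role_arn(event: dict) -> str:
    """Return the role ARN behind an assumed-role call, else the caller ARN."""
    identity = event.get("userIdentity", {})
    issuer = identity.get("sessionContext", {}).get("sessionIssuer", {})
    return issuer.get("arn") or identity.get("arn", "unknown")


def unexpected_secret_reads(events, expected_arn=EXPECTED_ROLE_ARN):
    """Yield GetSecretValue events whose caller is not the expected role."""
    for event in events:
        if event.get("eventSource") != "secretsmanager.amazonaws.com":
            continue
        if event.get("eventName") != "GetSecretValue":
            continue
        if caller_role_arn(event) != expected_arn:
            yield event


# Hand-written sample records, trimmed to the fields used above.
sample_events = [
    {
        "eventSource": "secretsmanager.amazonaws.com",
        "eventName": "GetSecretValue",
        "userIdentity": {
            "type": "AssumedRole",
            "arn": "arn:aws:sts::111122223333:assumed-role/"
                   "lambda-ingestion-prod/lambda-ingestion-prod",
            "sessionContext": {"sessionIssuer": {"arn": EXPECTED_ROLE_ARN}},
        },
    },
    {
        "eventSource": "secretsmanager.amazonaws.com",
        "eventName": "GetSecretValue",
        "userIdentity": {
            "type": "IAMUser",
            "arn": "arn:aws:iam::111122223333:user/curious-intern",
        },
    },
    {
        "eventSource": "kms.amazonaws.com",
        "eventName": "Decrypt",
        "userIdentity": {"type": "IAMUser",
                         "arn": "arn:aws:iam::111122223333:user/curious-intern"},
    },
]

flagged = [caller_role_arn(e) for e in unexpected_secret_reads(sample_events)]
print(flagged)
```

Only the second record is flagged here. The first comes from the expected role, and the third is not a secret read at all.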

Here is a tiny Terraform example that keeps your Lambda logs from living forever (because storage is forever, but your attention span is not).

resource "aws_cloudwatch_log_group" "lambda_logs" {
  name              = "/aws/lambda/lambda-ingestion-prod"
  retention_in_days = 14
}

Also consider:

IAM Access Analyzer to spot risky resource policies
AWS Config rules or guardrails if your organization uses them
an alarm on unexpected NAT data processing if you intended to keep traffic private

Common mistakes I have made, so you do not have to

I am listing these because I have either done them personally or watched them happen in slow motion.

Using a wildcard secret policy
secretsmanager:GetSecretValue on * feels convenient until it is a breach multiplier.
Putting secret values into environment variables
Environment variables are not evil, but they are easy to leak through debugging, dumps, tooling, or careless logging. Store secret names there, not secret contents.
Retrieving secrets at build time
Build logs live forever in the places you forget to clean. Runtime retrieval keeps secrets out of build systems.
Logging too much while debugging
The fastest way to leak a secret is to print it “just once.” It will not be just once.
Skipping the endpoint and relying on NAT by accident
The NAT gateway is not evil either. It is just an expensive and unnecessary hallway if a private door exists.

A two minute checklist you can steal

Your Lambda uses an IAM execution role, not access keys
The role policy scopes Secrets Manager access to one secret ARN pattern
The secret has a resource policy that only allows the expected role
Secrets are encrypted with KMS when required
The secret value is never stored in code, images, build logs, or environment variables
If Lambda runs in a VPC, you use an interface VPC endpoint for Secrets Manager
You have CloudTrail enabled and you can answer “who accessed this secret” without guessing

Extra thoughts

If you remove long-lived credentials from your applications, you remove an entire class of problems.

You stop rotating keys that should never have existed in the first place.

You stop pretending that “we will remember to clean it up later” is a security strategy.

And you get a calmer life, which is underrated in engineering.

Let IAM handle the secrets you can avoid.

Then let Secrets Manager handle the secrets you cannot.

And let your code do what it was meant to do: process data, not babysit keys like they are a toddler holding a permanent marker.

December 14, 2025 by Fernando SRE Cloud stuff DevOps stuff

How Dropbox saved millions by leaving AWS

Most of us treat cloud storage like a magical, bottomless attic. You throw your digital clutter into a folder (PDFs of tax returns from 2014, blurred photos of a cat that has long since passed away, unfinished drafts of novels) and you forget about it. It feels weightless. It feels ephemeral. But somewhere in a windowless concrete bunker in Virginia or Oregon, a spinning platter of rust is working very hard to keep those cat photos alive. And every time that platter spins, a meter is running.

For most of its first decade, Dropbox was essentially a very polished, user-friendly frontend for Amazon’s garage. When you saved a file to Dropbox, their servers handled the metadata (the index card that says where the file is), but the actual payload (the bytes themselves) was quietly ushered into Amazon S3. It was a brilliant arrangement. It allowed a small startup to scale without worrying about hard drives catching fire or power supplies exploding.

But then Dropbox grew up. And when you grow up, living in a hotel starts to get expensive.

By 2015, Dropbox was storing hundreds of petabytes of data. The problem wasn’t just the storage fee, which is akin to paying rent. The real killer was the “egress” and request fees. Amazon’s business model is brilliantly designed to function like the Hotel California: you can check out any time you like, but leaving with your luggage is going to cost you a fortune. Every time a user opened a file, edited a document, or synced a folder, a tiny cash register dinged in Jeff Bezos’s headquarters.

The bill was no longer just an operating expense. It was an existential threat. The unit economics were starting to look less like a software business and more like a philanthropy dedicated to funding Amazon’s R&D department.

So, they decided to do something that is generally considered suicidal in the modern software era. They decided to leave the cloud.

The audacity of building your own closet

In Silicon Valley, telling investors you plan to build your own data centers is like telling your spouse you plan to perform your own appendectomy using a steak knife and a YouTube tutorial. It is seen as messy, dangerous, and generally regressive. The prevailing wisdom is that hardware is a commodity, a utility like electricity or sewage, and you should let the professionals handle the sludge.

Dropbox ignored this. They launched a project with the internally ironic name “Magic Pocket.” The goal was to build a storage system from scratch that was cheaper than Amazon S3 but just as reliable.

To understand the scale of this bad idea, you have to understand that S3 is a miracle of engineering. It boasts “eleven nines” of durability (99.999999999%). That means if you store 10,000 files, you might lose one every 10 million years. Replicating that level of reliability requires an obsessive, almost pathological attention to detail.
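The arithmetic behind that claim is short enough to check. This sketch assumes eleven nines is an annual, per-object figure and that losses are independent, which is a simplification of how durability is actually modeled. The function name is made up for the example.

```python
from fractions import Fraction

# Eleven nines of annual durability leaves a 1-in-10^11 chance
# of losing any given object in a year (simplifying assumption).
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    expected_losses_per_year = object_count * ANNUAL_LOSS_PROBABILITY
    return 1 / expected_losses_per_year


small = years_per_lost_object(10_000)        # the 10,000-file case above
large = years_per_lost_object(10_000_000)    # ten million objects
print(int(small), int(large))
```

With 10,000 files the expected wait is ten million years. With ten million objects it drops to ten thousand years, which is the same durability figure seen from a bigger attic.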

Dropbox wasn’t just buying servers from Dell and plugging them in. They were designing their own chassis. They realized that standard storage servers were too generic. They needed density. They built a custom box nicknamed “Diskotech” (because engineers love puns almost as much as they love caffeine) that could cram up to a petabyte of storage into a rack unit that was barely deeper than a coffee table.

But hardware has a nasty habit of obeying the laws of physics, and physics is often annoying.

Good vibrations and bad hard drives

When you pack hundreds of spinning hard drives into a tight metal box, you encounter a phenomenon that sounds like a joke but is actually a nightmare: vibration.

Hard drives are mechanical divas. They consist of magnetic platters spinning at 7,200 revolutions per minute, with a read/write head hovering nanometers above the surface. If the drive vibrates too much, that head can’t find the track. It misses. It has to wait for the platter to spin around again. This introduces latency. If enough drives in a rack vibrate in harmony, the performance drops off a cliff.

The Dropbox team found that even the fans cooling the servers were causing acoustic vibrations that made the hard drives sulk. They had to become experts in firmware, dampening materials, and the resonant frequencies of sheet metal. It is the kind of problem you simply do not have when you rent space in the cloud. In the cloud, a vibrating server is someone else’s ticket. When you own the metal, it’s your weekend.

Then there was the software. They couldn’t just use off-the-shelf Linux tools. They wrote their own storage software in Rust. At the time, Rust was the new kid on the block, a language that promised memory safety without the garbage collection pauses of Go or Java. Using a relatively new language to manage the world’s most precious data was a gamble, but it paid off. It allowed them to squeeze every ounce of efficiency out of the CPU, keeping the power bill (and the heat) down.

The great migration was a stealth mission

Building the “Magic Pocket” was only half the battle. The other half was moving 500 petabytes of data from Amazon to these new custom-built caverns without losing a single byte and without any user noticing.

They adopted a strategy that I like to call the “belt, suspenders, and duct tape” approach. For a long period, they used a technique called dual writing. Every time you uploaded a file, Dropbox would save a copy to Amazon S3 (the old reliable) and a copy to their new Magic Pocket (the risky experiment).

They then spent months just verifying the data. They would ask the Magic Pocket to retrieve a file, compare it to the S3 version, and check if they matched perfectly. It was a paranoia-fueled audit. Only when they were absolutely certain that the new system wasn’t eating homework did they start disconnecting the Amazon feed.
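The dual-write-then-verify idea can be sketched in a few lines. This is an illustration of the general technique, not Dropbox's actual code: the in-memory BlobStore class, the store names, and the file contents are all hypothetical stand-ins.

```python
import hashlib


class BlobStore:
    """In-memory stand-in for a storage backend (hypothetical, for illustration)."""

    def __init__(self, name: str):
        self.name = name
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str):
        return self._blobs.get(key)

    def keys(self):
        return list(self._blobs)


def dual_write(key: str, data: bytes, primary: BlobStore, candidate: BlobStore) -> None:
    """Write to the trusted store first, then mirror to the new one."""
    primary.put(key, data)
    candidate.put(key, data)


def digest(data):
    """SHA-256 of the content, or None if the blob is missing."""
    return hashlib.sha256(data).hexdigest() if data is not None else None


def verify(primary: BlobStore, candidate: BlobStore):
    """Return keys whose content differs or is missing in the candidate store."""
    return [
        key for key in primary.keys()
        if digest(primary.get(key)) != digest(candidate.get(key))
    ]


old_store = BlobStore("old-store")
new_store = BlobStore("new-store")
dual_write("resume.pdf", b"nine years of YAML", old_store, new_store)
dual_write("cat.jpg", b"blurry", old_store, new_store)
new_store.put("cat.jpg", b"blurrx")  # simulate silent corruption in the new system

mismatches = verify(old_store, new_store)
print(mismatches)
```

The trusted store stays the source of truth until the mismatch list has been empty for long enough that you believe it.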

They treated the migration like a bomb disposal operation. They moved users over silently. One day, you were fetching your resume from an AWS server in Virginia; the next day, you were fetching it from a custom Dropbox server in Texas. The transfer speeds were often better, but nobody sent out a press release. The ultimate sign of success in infrastructure engineering is that nobody knows you did anything at all.

The savings were vulgar

The financial impact was immediate and staggering. Over the two years following the migration, Dropbox saved nearly $75 million in operating costs. Their gross margins, the holy grail of SaaS financials, jumped from a worrisome 33% to a healthy 67%.

By owning the hardware, they cut out the middleman’s profit margin. They also gained the ability to use “Shingled Magnetic Recording” (SMR) drives. These are cheaper, high-density drives that are notoriously slow at writing data because the data tracks overlap like roof shingles (hence the name). Standard databases hate them. But because Dropbox wrote their own software specifically for their own use case (write once, read many), they could use these cheap, slow drives without the performance penalty.

This is the hidden superpower of leaving the cloud: optimization. AWS has to build servers that work reasonably well for everyone, from Netflix to the CIA to a teenager running a Minecraft server. That means they are optimized for the average. Dropbox optimized for the specific. They built a suit that fit them perfectly, rather than buying a “one size fits all” poncho from the rack.

Why you should probably not do this

If you are reading this and thinking, “I should build my own data center,” please stop. Go for a walk. Drink some water.

Dropbox’s success is the exception that proves the rule. They had a very specific workload (huge files, rarely modified) and a scale (exabytes) that justified the massive R&D expense. They had the budget to hire world-class engineers who dream in Rust and understand the acoustic properties of cooling fans.

For 99% of companies, the cloud is still the right answer. The premium you pay to AWS or Google is not just for storage; it is an insurance policy against complexity. You are paying so that you never have to think about a failed power supply unit at 3:00 AM on a Sunday. You are paying so that you don’t have to negotiate contracts for fiber optic cables or worry about the price of real estate in Nevada.

However, Dropbox didn’t leave the cloud entirely. And this is the punchline.

Today, Dropbox is a hybrid. They store the files, the cold, heavy, static blocks of data, in their own Magic Pocket. But the metadata? The search functions? The flashy AI features that summarize your documents? That all still runs in the cloud.

They treat the public cloud like a utility kitchen. When they need to cook up something complex that requires thousands of CPUs for an hour, they rent them from Amazon or Google. When they just need to store the leftovers, they put them in their own fridge.

Adulthood is knowing when to rent

The story of Dropbox leaving the cloud is not really about leaving. It is about maturity.

In the early days of a startup, you prioritize speed. You pay the “cloud tax” because it allows you to move fast and break things. But there comes a point where the tax becomes a burden.

Dropbox realized that renting is great for flexibility, but ownership is the only way to build equity. They turned a variable cost (a bill that grows every time a user uploads a photo) into a fixed cost (a warehouse full of depreciating assets). It is less sexy. It requires more plumbing.

But there is a quiet dignity in owning your own mess. Dropbox looked at the cloud, with its infinite promise and infinite invoices, and decided that sometimes, the most radical innovation is simply buying a screwdriver, rolling up your sleeves, and building the shelf yourself. Just be prepared for the vibration.

December 7, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

The paranoia that keeps Netflix online

On a particularly bleak Monday, October 20, the internet suffered a collective nervous breakdown. Amazon Web Services decided to take a spontaneous nap, and the digital world effectively dissolved. Slack turned into a $27 billion paperweight, leaving office workers forced to endure the horror of unfiltered face-to-face conversation. Disney+ went dark, stranding thousands of toddlers mid-episode of Bluey and forcing parents to confront the terrifying reality of their own unsupervised children. DoorDash robots sat frozen on sidewalks like confused Daleks, threatening the national supply of lukewarm tacos.

Yet, in a suburban basement somewhere in Ohio, a teenager named Tyler streamed all four seasons of Stranger Things in 4K resolution. He did not see a single buffering wheel. He had no idea the cloud was burning down around him.

This is the central paradox of Netflix. They have engineered a system so pathologically untrusting, so convinced that the world is out to get it, that actual infrastructure collapses register as nothing more than a mild inconvenience. I spent weeks digging through technical documentation and bothering former Netflix engineers to understand how they pulled this off. What I found was not just a story of brilliant code. It is a story of institutional paranoia so profound it borders on performance art.

The paranoid bouncer at the door

When you click play on The Crown, your request does not simply waltz into the Netflix servers. It first has to get past the digital equivalent of a nightclub bouncer who suspects everyone of trying to sneak in a weapon. This is Amazon’s Elastic Load Balancer, or ELB.

Most load balancers are polite traffic cops. They see a server and wave you through. Netflix’s ELB is different. It assumes that every server is about three seconds away from exploding.

Picture a nightclub with 47 identical dance floors. The bouncer’s job is to frisk you, judge your shoes, and shove you toward the floor least likely to collapse under the weight of too many people doing the Macarena. The ELB does this millions of times per second. It does not distribute traffic evenly because “even” implies trust. Instead, it routes you to the server with the least outstanding requests. It is constantly taking the blood pressure of the infrastructure.
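Routing by least outstanding requests is simple to sketch. The class below is a toy illustration of the idea, not how the load balancer is implemented, and the server names are invented.

```python
class LeastOutstandingBalancer:
    """Send each request to the server with the fewest requests in flight."""

    def __init__(self, servers):
        self._in_flight = {server: 0 for server in servers}

    def acquire(self) -> str:
        # min() returns the first server with the lowest count on ties.
        server = min(self._in_flight, key=self._in_flight.get)
        self._in_flight[server] += 1
        return server

    def release(self, server: str) -> None:
        # Call this when the request finishes, successfully or not.
        self._in_flight[server] -= 1


balancer = LeastOutstandingBalancer(["floor-a", "floor-b", "floor-c"])
picks = [balancer.acquire() for _ in range(3)]  # one request on each floor
balancer.release("floor-b")                     # floor-b finishes first
picks.append(balancer.acquire())                # so it gets the next guest
print(picks)
```

A slow server holds on to its requests longer, so its count stays high and it naturally receives less new traffic.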

If a server takes ten milliseconds too long to respond, the ELB treats it like a contagion. It cuts it off. It ghosts it. This is the first commandment of the Netflix religion. Trust nothing. Especially not the hardware you rent by the hour from a company that also sells lawnmowers and audiobooks.

The traffic controller with a god complex

Once you make it past the bouncer, you meet Zuul.

Zuul is the API gateway, but that is a boring term for what is essentially a micromanager with a caffeine addiction. Zuul is the middle manager who insists on being copied on every single email and then rewrites them because he didn’t like your tone.

Its job is to route your request to the right backend service. But Zuul is neurotic. It operates through a series of filters that feel less like software engineering and more like airport security theater. There is an inbound filter that authenticates you (the TSA agent squinting at your passport), an endpoint filter that routes you (the air traffic controller), and an outbound filter that scrubs the response (the PR agent who makes sure the server didn’t say anything offensive).

All of this runs on the Netty server framework, which sounds cute but is actually an asynchronous, event-driven octopus capable of juggling tens of thousands of open connections without dropping a single packet. During the outage, while other companies’ gateways were choking on retries, Zuul continued to sort traffic with the cold detachment of a bureaucrat stamping forms during a fire drill.

A dysfunctional family of specialists

Inside the architecture, there is no single “Netflix” application. There is a squabbling family of thousands of microservices. These are tiny, specialized programs that refuse to speak to each other directly and communicate only through carefully negotiated contracts.

You have Uncle User Profiles, who sits in the corner nursing a grudge about that time you watched seventeen episodes of Is It Cake? at 3 AM. There is Aunt Recommendations, a know-it-all who keeps suggesting The Office because you watched five minutes of it in 2018. Then there is Cousin Billing, who only shows up when money is involved and otherwise sulks in the basement.

This family is held together by a concept called “circuit breaking.” In the old days, they used a library called Hystrix. Think of Hystrix as a court-ordered family therapist with a taser.

When a service fails, let’s say the subtitles database catches fire, most applications would keep trying to call it, waiting for a response that will never come, until the entire system locks up. Netflix does not have time for that. If the subtitle service fails, the circuit breaker pops. The therapist steps in and says, “Uncle Subtitles is having an episode and is not allowed to talk for the next thirty seconds.”

The system then serves a fallback. Maybe you don’t get subtitles for a minute. Maybe you don’t get your personalized list of “Top Picks for Fernando.” But the video plays. The application degrades gracefully rather than failing catastrophically. It is the digital equivalent of losing a limb but continuing to run the marathon because you have a really good playlist going.

Here is a simplified view of how this “fail fast” logic looks in the configuration. It is basically a list of rules for ignoring people who are slow to answer:

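hystrix:
  command:
    default:  # applies to every command without its own override
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 1000  # give up on a call after 1 second
      circuitBreaker:
        requestVolumeThreshold: 20  # minimum requests in the window before the circuit may trip
        sleepWindowInMilliseconds: 5000  # once tripped, reject calls for 5 seconds, then retry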

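Translated to human English, this configuration says: “If you take more than one second to answer me, you are dead to me. Once I have seen at least twenty requests in the current window and too many of them have failed (Hystrix’s default is half), I am going to ignore you for five seconds, then let one request through to see whether you have got your act together.”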
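For the curious, here is roughly what the therapist does, as a minimal Python sketch rather than Hystrix’s actual code. The class name, the fallback strings, and the plain counter (Hystrix uses a rolling time window) are assumptions for illustration. The 50 percent error threshold mirrors Hystrix’s documented default, and the one-second timeout is left out to keep things short.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker (hypothetical, not the Hystrix implementation)."""

    def __init__(self, volume_threshold=20, error_pct=50,
                 sleep_window_s=5.0, clock=time.monotonic):
        self.volume_threshold = volume_threshold
        self.error_pct = error_pct
        self.sleep_window_s = sleep_window_s
        self.clock = clock
        self.requests = 0
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.sleep_window_s:
                # Open: fail fast without touching the sick service.
                return fallback()
            # Sleep window is over: fall through and allow a trial call.
        try:
            result = primary()
        except Exception:
            self._record_failure()
            return fallback()
        if self.opened_at is not None:
            self._reset()  # trial call succeeded, close the circuit
        else:
            self.requests += 1
        return result

    def _record_failure(self):
        if self.opened_at is not None:
            # Trial call failed: restart the sleep window.
            self.opened_at = self.clock()
            return
        self.requests += 1
        self.failures += 1
        enough_traffic = self.requests >= self.volume_threshold
        too_many_errors = 100 * self.failures / self.requests >= self.error_pct
        if enough_traffic and too_many_errors:
            self.opened_at = self.clock()

    def _reset(self):
        self.requests = 0
        self.failures = 0
        self.opened_at = None
```

A caller would wrap the risky request and the degraded answer together, something like breaker.call(fetch_subtitles, lambda: "no subtitles"), where fetch_subtitles is whatever function talks to Uncle Subtitles.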

The digital hoarder pantry

At the scale Netflix operates, data storage is less about organization and more about controlled hoarding. They use a system that only makes sense if you have given up on the concept of minimalism.

They use Cassandra, a NoSQL database, to store user history. Cassandra is like a grandmother who saves every newspaper from 1952 because “you never know.” It is designed to be distributed, with every piece of data copied to several nodes. You can lose a node, or several if the replication factor is high enough, and Cassandra will simply shrug and serve the data from another replica.

But the real genius, and the reason they survived the apocalypse, is EVCache. This is their homemade caching system based on Memcached. It is a massive pantry where they store snacks they know you will want before you even ask for them.

Here is the kicker. They do not just cache movie data. They cache their own credentials.

When AWS went down, the specific service that failed was often IAM (Identity and Access Management). This is the service that checks if your computer is allowed to talk to the database. When IAM died, servers all over the world suddenly forgot who they were. They were having an identity crisis.

Netflix servers did not care. They had cached their credentials locally. They had pre-loaded the permissions. It is like filling your basement with canned goods, not because you anticipate a zombie apocalypse, but because you know the grocery store manager personally and you know he is unreliable. While other companies were frantically trying to call AWS to ask, “Who am I?”, Netflix’s servers were essentially lip-syncing their way through the performance using pre-recorded tapes.
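The canned-goods strategy fits in a few lines. This is a minimal Python sketch of the general “serve the stale copy when the source is down” pattern, not Netflix’s code. The class name, the fetch callback, and the one-hour refresh interval are all hypothetical, and real credentials carry their own expiry, so a stale copy only buys you time until that runs out.

```python
import time


class CredentialCache:
    """Keeps the last good credentials around (hypothetical sketch)."""

    def __init__(self, fetch, ttl_s=3600.0, clock=time.monotonic):
        self.fetch = fetch      # function that asks the identity service
        self.ttl_s = ttl_s      # how long before we try to refresh
        self.clock = clock
        self.value = None
        self.fetched_at = None

    def get(self):
        fresh = (self.fetched_at is not None
                 and self.clock() - self.fetched_at < self.ttl_s)
        if fresh:
            return self.value
        try:
            self.value = self.fetch()
            self.fetched_at = self.clock()
        except Exception:
            if self.value is None:
                # Nothing in the pantry: we really do not know who we are.
                raise
            # Identity service is down: keep serving the stale copy.
        return self.value
```

The important design choice is the except branch: a failed refresh is treated as an inconvenience, not an emergency, as long as something usable is already on the shelf.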

Hiring a saboteur to guard the vault

This is where the engineering culture goes from sensible to beautifully unhinged. Netflix employs the Simian Army.

This is not a metaphor. It is a suite of software tools designed to break things. The most famous is Chaos Monkey. Its job is to randomly shut down live production servers during business hours. It just kills them. No warning. No mercy.

Then there is Chaos Kong. Chaos Kong does not just kill a server. It simulates the destruction of an entire AWS region. It nukes the East Coast.

Let that sink in for a moment. Netflix pays engineers very high salaries to build software that attacks their own infrastructure. It is like hiring a pyromaniac to work as a fire inspector. Sure, he will find every flammable material in the building, but usually by setting it on fire first.
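The core idea behind Chaos Monkey is almost insultingly simple. Here is a minimal Python sketch of it, assuming a hypothetical list of instance IDs and a nine-to-five weekday window. It only picks the victim; the actual termination call, and everything else the real tool does, is deliberately left out.

```python
import random
from datetime import datetime


def pick_victim(instances, now, rng=random, start_hour=9, end_hour=17):
    """Return one instance to terminate, or None outside business hours.

    The business-hours rule matters: engineers should be at their desks,
    awake and caffeinated, when something dies.
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    if not (is_weekday and start_hour <= now.hour < end_hour):
        return None
    if not instances:
        return None
    return rng.choice(instances)


if __name__ == "__main__":
    fleet = ["i-aaa", "i-bbb", "i-ccc"]  # made-up instance IDs
    victim = pick_victim(fleet, datetime.now())
    print("Nobody dies today." if victim is None else f"Goodbye, {victim}.")
```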

I spoke with a former engineer who described their “region evacuation” drills. “We basically declare war on ourselves,” she told me. “At 10 AM on a Tuesday, usually after the second coffee, we decide to kill us-east-1. The first time we did it, half the company needed therapy. Now? We can evacuate a region in six minutes. It’s boring.”

This is the secret sauce. The reason Netflix stayed up is that they have rehearsed the outage so many times that it feels like a chore. While other companies were discovering their disaster recovery plans were written in crayon, Netflix engineers were calmly executing a routine they practice more often than they practice dental hygiene.

Building your own highway system

There is a final plot twist. When you hit play, the video, strictly speaking, does not come from the cloud. It comes from Open Connect.

Netflix realized years ago that the public internet is a dirt road full of potholes. So they built their own private highway. They designed physical hardware, bright red boxes packed with hard drives, and shipped them to Internet Service Providers (ISPs) all over the world.

These boxes sit inside the data centers of your local internet provider. They are like mini-warehouses. When a new season of The Queen’s Gambit comes out, Netflix pre-loads it onto these boxes at 4 AM when nobody is using the internet.

So when you stream the show, the data is not traveling from an Amazon data center in Virginia. It is traveling from a box down the street. It might travel five miles instead of two thousand.

It is an invasive, brilliant strategy. It is like Netflix insisted on installing a mini-fridge in your neighbor’s garage just to ensure your beer is three degrees colder. During the cloud outage, even if the “brain” of Netflix (the control plane in AWS) was having a seizure, the “body” (the video files in Open Connect) was fine. The content was already local. The cloud could burn, but the movie was already in the house.
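The “box down the street” decision can be sketched as a tiny routing function. This is a minimal Python illustration under loud assumptions: the dictionary fields, the box names, and the idea of ranking by distance in miles are all made up for the example, and the real steering logic is certainly more involved.

```python
def pick_source(title, appliances, origin="origin"):
    """Pick the closest healthy appliance that already holds the title.

    Falls back to the origin when no local box can serve it.
    """
    candidates = [
        box for box in appliances
        if box["healthy"] and title in box["titles"]
    ]
    if not candidates:
        return origin
    return min(candidates, key=lambda box: box["distance_miles"])["name"]


if __name__ == "__main__":
    boxes = [  # hypothetical appliances at local ISPs
        {"name": "isp-box-downtown", "distance_miles": 5,
         "healthy": True, "titles": {"The Queen's Gambit"}},
        {"name": "isp-box-suburbs", "distance_miles": 12,
         "healthy": True, "titles": {"The Queen's Gambit", "Is It Cake?"}},
    ]
    print(pick_source("The Queen's Gambit", boxes))
```

Note that nothing in this function needs the control plane to be healthy at the moment you press play, as long as the list of boxes and their contents was known beforehand.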

The beautiful absurdity of it all

The irony is delicious. Netflix is one of AWS’s biggest customers and its biggest success story. Yet they survive on AWS by fundamentally refusing to trust AWS. They cache credentials, they pre-pull images, they build their own delivery network, and they unleash monkeys to destroy their own servers just to prove they can survive the murder attempt.

They have weaponized Murphy’s Law. They built a company where the unofficial motto seems to be “Everything fails, all the time, so let’s get good at failing.”

So the next time the internet breaks and your Slack goes silent, do not panic. Just open Netflix. Somewhere in the dark, a Chaos Monkey is pulling a plug, a paranoid bouncer is shoving traffic away from a burning server, and your binge-watching will continue uninterrupted. The internet might be held together by duct tape and hubris, but Netflix has invested in really, really expensive duct tape.

November 28, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff