
Random Thoughts on Different Cloud Computing

How Dropbox saved millions by leaving AWS

Most of us treat cloud storage like a magical, bottomless attic. You throw your digital clutter into a folder: PDFs of tax returns from 2014, blurred photos of a cat that has long since passed away, unfinished drafts of novels, and you forget about them. It feels weightless. It feels ephemeral. But somewhere in a windowless concrete bunker in Virginia or Oregon, a spinning platter of rust is working very hard to keep those cat photos alive. And every time that platter spins, a meter is running.

For the first decade of its existence, Dropbox was essentially a very polished, user-friendly frontend for Amazon’s garage. When you saved a file to Dropbox, their servers handled the metadata (the index card that says where the file is), but the actual payload (the bytes themselves) was quietly ushered into Amazon S3. It was a brilliant arrangement. It allowed a small startup to scale without worrying about hard drives catching fire or power supplies exploding.

But then Dropbox grew up. And when you grow up, living in a hotel starts to get expensive.

By 2015, Dropbox was storing hundreds of petabytes of data. The problem wasn’t just the storage fee, which is akin to paying rent. The real killers were the “egress” and request fees. Amazon’s business model is brilliantly designed to function like the Hotel California: you can check out any time you like, but leaving with your luggage is going to cost you a fortune. Every time a user opened a file, edited a document, or synced a folder, a tiny cash register dinged in Jeff Bezos’s headquarters.

The bill was no longer just an operating expense. It was an existential threat. The unit economics were starting to look less like a software business and more like a philanthropy dedicated to funding Amazon’s R&D department.

So, they decided to do something that is generally considered suicidal in the modern software era. They decided to leave the cloud.

The audacity of building your own closet

In Silicon Valley, telling investors you plan to build your own data centers is like telling your spouse you plan to perform your own appendectomy using a steak knife and a YouTube tutorial. It is seen as messy, dangerous, and generally regressive. The prevailing wisdom is that hardware is a commodity, a utility like electricity or sewage, and you should let the professionals handle the sludge.

Dropbox ignored this. They launched a project with the internally ironic name “Magic Pocket.” The goal was to build a storage system from scratch that was cheaper than Amazon S3 but just as reliable.

To understand the scale of this bad idea, you have to understand that S3 is a miracle of engineering. It boasts “eleven nines” of durability (99.999999999%). That means if you store 10,000 files, you might lose one every 10 million years. Replicating that level of reliability requires an obsessive, almost pathological attention to detail.
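The arithmetic behind that claim is short enough to check yourself. Here is a minimal sketch, assuming the eleven nines are read as an annual, per-object durability and that object losses are independent of each other (both are simplifying assumptions for illustration, not a description of how S3 accounts for durability internally):

```python
# Back-of-the-envelope check of the "eleven nines" claim.
# Assumptions: durability is an annual, per-object figure, and
# object losses are independent of each other.

ANNUAL_LOSS_PROBABILITY = 1e-11  # 1 - 0.99999999999


def years_per_lost_object(object_count: int) -> float:
    """Expected number of years between single-object losses."""
    expected_losses_per_year = object_count * ANNUAL_LOSS_PROBABILITY
    return 1 / expected_losses_per_year


if __name__ == "__main__":
    for count in (10_000, 10_000_000):
        years = years_per_lost_object(count)
        print(f"{count:>12,} objects: one loss every {years:,.0f} years")
```

The interval shrinks linearly as the object count grows, which is why the same durability target gets harder to live up to the more you store.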

Dropbox wasn’t just buying servers from Dell and plugging them in. They were designing their own chassis. They realized that standard storage servers were too generic. They needed density. They built a custom box nicknamed “Diskotech” (because engineers love puns almost as much as they love caffeine) that could cram up to a petabyte of storage into a rack unit that was barely deeper than a coffee table.

But hardware has a nasty habit of obeying the laws of physics, and physics is often annoying.

Good vibrations and bad hard drives

When you pack hundreds of spinning hard drives into a tight metal box, you encounter a phenomenon that sounds like a joke but is actually a nightmare: vibration.

Hard drives are mechanical divas. They consist of magnetic platters spinning at 7,200 revolutions per minute, with a read/write head hovering nanometers above the surface. If the drive vibrates too much, that head can’t find the track. It misses. It has to wait for the platter to spin around again. This introduces latency. If enough drives in a rack vibrate in harmony, the performance drops off a cliff.

The Dropbox team found that even the fans cooling the servers were causing acoustic vibrations that made the hard drives sulk. They had to become experts in firmware, damping materials, and the resonant frequencies of sheet metal. It is the kind of problem you simply do not have when you rent space in the cloud. In the cloud, a vibrating server is someone else’s ticket. When you own the metal, it’s your weekend.

Then there was the software. They couldn’t just use off-the-shelf Linux tools. They wrote their own storage software in Rust. At the time, Rust was the new kid on the block, a language that promised memory safety without the garbage collection pauses of Go or Java. Using a relatively new language to manage the world’s most precious data was a gamble, but it paid off. It allowed them to squeeze every ounce of efficiency out of the CPU, keeping the power bill (and the heat) down.

The great migration was a stealth mission

Building the “Magic Pocket” was only half the battle. The other half was moving 500 petabytes of data from Amazon to these new custom-built caverns without losing a single byte and without any user noticing.

They adopted a strategy that I like to call the “belt, suspenders, and duct tape” approach. For a long period, they used a technique called dual writing. Every time you uploaded a file, Dropbox would save a copy to Amazon S3 (the old reliable) and a copy to their new Magic Pocket (the risky experiment).

They then spent months just verifying the data. They would ask the Magic Pocket to retrieve a file, compare it to the S3 version, and check if they matched perfectly. It was a paranoia-fueled audit. Only when they were absolutely certain that the new system wasn’t eating homework did they start disconnecting the Amazon feed.
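The dual-write-then-verify idea is easy to sketch, even if doing it at Dropbox's scale is not. What follows is a toy version with two in-memory stores standing in for S3 and Magic Pocket. The `BlobStore` class, the function names, and the choice of SHA-256 for comparison are all assumptions made for illustration; the source does not say how Dropbox actually compared copies.

```python
import hashlib
from typing import Optional


class BlobStore:
    """In-memory stand-in for a storage backend (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._blobs: dict = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> Optional[bytes]:
        return self._blobs.get(key)

    def keys(self) -> set:
        return set(self._blobs)


def dual_write(key: str, data: bytes, old: BlobStore, new: BlobStore) -> None:
    """Write every upload to both the old and the new backend."""
    old.put(key, data)
    new.put(key, data)


def digest(data: Optional[bytes]) -> Optional[str]:
    """Hash a blob; a missing blob hashes to None so it never matches."""
    return None if data is None else hashlib.sha256(data).hexdigest()


def verify(old: BlobStore, new: BlobStore) -> list:
    """Return the keys whose contents differ between the two backends."""
    all_keys = sorted(old.keys() | new.keys())
    return [k for k in all_keys if digest(old.get(k)) != digest(new.get(k))]


if __name__ == "__main__":
    s3, pocket = BlobStore("s3"), BlobStore("magic-pocket")
    dual_write("cat.jpg", b"meow", s3, pocket)
    dual_write("novel-draft.txt", b"It was a dark and stormy night", s3, pocket)
    print("mismatched keys:", verify(s3, pocket))
```

Only once `verify` keeps coming back empty would you dare stop writing to the old backend.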

They treated the migration like a bomb disposal operation. They moved users over silently. One day, you were fetching your resume from an AWS server in Virginia; the next day, you were fetching it from a custom Dropbox server in Texas. The transfer speeds were often better, but nobody sent out a press release. The ultimate sign of success in infrastructure engineering is that nobody knows you did anything at all.

The savings were vulgar

The financial impact was immediate and staggering. Over the two years following the migration, Dropbox saved nearly $75 million in operating costs. Their gross margins, the holy grail of SaaS financials, jumped from a worrisome 33% to a healthy 67%.

By owning the hardware, they cut out the middleman’s profit margin. They also gained the ability to use “Shingled Magnetic Recording” (SMR) drives. These are cheaper, high-density drives that are notoriously slow at writing data because the data tracks overlap like roof shingles (hence the name). Standard databases hate them. But because Dropbox wrote their own software specifically for their own use case (write once, read many), they could use these cheap, slow drives without the performance penalty.

This is the hidden superpower of leaving the cloud: optimization. AWS has to build servers that work reasonably well for everyone, from Netflix to the CIA to a teenager running a Minecraft server. That means they are optimized for the average. Dropbox optimized for the specific. They built a suit that fit them perfectly, rather than buying a “one size fits all” poncho from the rack.

Why you should probably not do this

If you are reading this and thinking, “I should build my own data center,” please stop. Go for a walk. Drink some water.

Dropbox’s success is the exception that proves the rule. They had a very specific workload (huge files, rarely modified) and a scale (exabytes) that justified the massive R&D expense. They had the budget to hire world-class engineers who dream in Rust and understand the acoustic properties of cooling fans.

For 99% of companies, the cloud is still the right answer. The premium you pay to AWS or Google is not just for storage; it is an insurance policy against complexity. You are paying so that you never have to think about a failed power supply unit at 3:00 AM on a Sunday. You are paying so that you don’t have to negotiate contracts for fiber optic cables or worry about the price of real estate in Nevada.

However, Dropbox didn’t leave the cloud entirely. And this is the punchline.

Today, Dropbox is a hybrid. They store the files, the cold, heavy, static blocks of data, in their own Magic Pocket. But the metadata? The search functions? The flashy AI features that summarize your documents? That all still runs in the cloud.

They treat the public cloud like a utility kitchen. When they need to cook up something complex that requires thousands of CPUs for an hour, they rent them from Amazon or Google. When they just need to store the leftovers, they put them in their own fridge.

Adulthood is knowing when to rent

The story of Dropbox leaving the cloud is not really about leaving. It is about maturity.

In the early days of a startup, you prioritize speed. You pay the “cloud tax” because it allows you to move fast and break things. But there comes a point where the tax becomes a burden.

Dropbox realized that renting is great for flexibility, but ownership is the only way to build equity. They turned a variable cost (a bill that grows every time a user uploads a photo) into a fixed cost (a warehouse full of depreciating assets). It is less sexy. It requires more plumbing.

But there is a quiet dignity in owning your own mess. Dropbox looked at the cloud, with its infinite promise and infinite invoices, and decided that sometimes, the most radical innovation is simply buying a screwdriver, rolling up your sleeves, and building the shelf yourself. Just be prepared for the vibration.

The paranoia that keeps Netflix online

On a particularly bleak Monday, October 20, the internet suffered a collective nervous breakdown. Amazon Web Services decided to take a spontaneous nap, and the digital world effectively dissolved. Slack turned into a $27 billion paperweight, leaving office workers forced to endure the horror of unfiltered face-to-face conversation. Disney+ went dark, stranding thousands of toddlers mid-episode of Bluey and forcing parents to confront the terrifying reality of their own unsupervised children. DoorDash robots sat frozen on sidewalks like confused Daleks, threatening the national supply of lukewarm tacos.

Yet, in a suburban basement somewhere in Ohio, a teenager named Tyler streamed all four seasons of Stranger Things in 4K resolution. He did not see a single buffering wheel. He had no idea the cloud was burning down around him.

This is the central paradox of Netflix. They have engineered a system so pathologically untrusting, so convinced that the world is out to get it, that actual infrastructure collapses register as nothing more than a mild inconvenience. I spent weeks digging through technical documentation and bothering former Netflix engineers to understand how they pulled this off. What I found was not just a story of brilliant code. It is a story of institutional paranoia so profound it borders on performance art.

The paranoid bouncer at the door

When you click play on The Crown, your request does not simply waltz into the Netflix servers. It first has to get past the digital equivalent of a nightclub bouncer who suspects everyone of trying to sneak in a weapon. This is Amazon’s Elastic Load Balancer, or ELB.

Most load balancers are polite traffic cops. They see a server and wave you through. Netflix’s ELB is different. It assumes that every server is about three seconds away from exploding.

Picture a nightclub with 47 identical dance floors. The bouncer’s job is to frisk you, judge your shoes, and shove you toward the floor least likely to collapse under the weight of too many people doing the Macarena. The ELB does this millions of times per second. It does not distribute traffic evenly because “even” implies trust. Instead, it routes you to the server with the least outstanding requests. It is constantly taking the blood pressure of the infrastructure.

If a server takes ten milliseconds too long to respond, the ELB treats it like a contagion. It cuts it off. It ghosts it. This is the first commandment of the Netflix religion. Trust nothing. Especially not the hardware you rent by the hour from a company that also sells lawnmowers and audiobooks.
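Stripped of the nightclub, "least outstanding requests" is a small algorithm. This is a minimal sketch of the idea rather than Amazon's implementation; the `Backend` class and its fields are made up for the example, and real load balancers decide health from probes rather than from a flag someone sets by hand.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    outstanding: int = 0   # requests sent but not yet answered
    healthy: bool = True   # flipped to False when a server is ejected


def pick_backend(backends: list) -> Backend:
    """Route to the healthy backend with the fewest in-flight requests."""
    candidates = [b for b in backends if b.healthy]
    if not candidates:
        raise RuntimeError("no healthy backends available")
    chosen = min(candidates, key=lambda b: b.outstanding)
    chosen.outstanding += 1
    return chosen


def complete(backend: Backend) -> None:
    """Call when a response comes back, freeing up a slot."""
    backend.outstanding = max(0, backend.outstanding - 1)


if __name__ == "__main__":
    pool = [Backend("a", outstanding=3), Backend("b", outstanding=1),
            Backend("c", outstanding=0, healthy=False)]
    print("routed to:", pick_backend(pool).name)
```

Note that the idle but unhealthy server never gets picked, no matter how bored it looks.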

The traffic controller with a god complex

Once you make it past the bouncer, you meet Zuul.

Zuul is the API gateway, but that is a boring term for what is essentially a micromanager with a caffeine addiction. Zuul is the middle manager who insists on being copied on every single email and then rewrites them because he didn’t like your tone.

Its job is to route your request to the right backend service. But Zuul is neurotic. It operates through a series of filters that feel less like software engineering and more like airport security theater. There is an inbound filter that authenticates you (the TSA agent squinting at your passport), an endpoint filter that routes you (the air traffic controller), and an outbound filter that scrubs the response (the PR agent who makes sure the server didn’t say anything offensive).
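The three-stage shape of that pipeline can be sketched in a few lines. Zuul itself is a Java project, so this is not its API; the function names, the route table, and the `X-Internal-` header convention below are all invented for illustration.

```python
def inbound_auth(request: dict) -> dict:
    """Inbound filter: reject requests that carry no token."""
    if not request.get("token"):
        raise PermissionError("missing token")
    return request


def endpoint_route(request: dict) -> dict:
    """Endpoint filter: choose a backend service from the path prefix."""
    routes = {"/play": "playback-service", "/profile": "profile-service"}
    for prefix, service in routes.items():
        if request["path"].startswith(prefix):
            return {"status": 200, "served_by": service,
                    "headers": {"X-Internal-Host": "10.0.0.7",
                                "Content-Type": "application/json"}}
    return {"status": 404, "served_by": None, "headers": {}}


def outbound_scrub(response: dict) -> dict:
    """Outbound filter: strip internal-only headers before replying."""
    cleaned = dict(response)
    cleaned["headers"] = {name: value
                          for name, value in response["headers"].items()
                          if not name.startswith("X-Internal-")}
    return cleaned


def handle(request: dict) -> dict:
    """Run one request through inbound, endpoint, and outbound filters."""
    try:
        request = inbound_auth(request)
    except PermissionError:
        return {"status": 401, "served_by": None, "headers": {}}
    return outbound_scrub(endpoint_route(request))


if __name__ == "__main__":
    print(handle({"path": "/play/the-crown", "token": "abc123"}))
    print(handle({"path": "/play/the-crown"}))
```

The useful property is that each stage can be added, removed, or rewritten without the other two noticing.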

All of this runs on the Netty server framework, which sounds cute but is actually an asynchronous, event-driven octopus capable of juggling tens of thousands of open connections without dropping a single packet. During the outage, while other companies’ gateways were choking on retries, Zuul continued to sort traffic with the cold detachment of a bureaucrat stamping forms during a fire drill.

A dysfunctional family of specialists

Inside the architecture, there is no single “Netflix” application. There is a squabbling family of thousands of microservices. These are tiny, specialized programs that refuse to speak to each other directly and communicate only through carefully negotiated contracts.

You have Uncle User Profiles, who sits in the corner nursing a grudge about that time you watched seventeen episodes of Is It Cake? at 3 AM. There is Aunt Recommendations, a know-it-all who keeps suggesting The Office because you watched five minutes of it in 2018. Then there is Cousin Billing, who only shows up when money is involved and otherwise sulks in the basement.

This family is held together by a concept called “circuit breaking.” In the old days, they used a library called Hystrix. Think of Hystrix as a court-ordered family therapist with a taser.

When a service fails (let’s say the subtitles database catches fire), most applications would keep trying to call it, waiting for a response that will never come, until the entire system locks up. Netflix does not have time for that. If the subtitle service fails, the circuit breaker pops. The therapist steps in and says, “Uncle Subtitles is having an episode and is not allowed to talk for the next thirty seconds.”

The system then serves a fallback. Maybe you don’t get subtitles for a minute. Maybe you don’t get your personalized list of “Top Picks for Fernando.” But the video plays. The application degrades gracefully rather than failing catastrophically. It is the digital equivalent of losing a limb but continuing to run the marathon because you have a really good playlist going.

Here is a simplified view of how this “fail fast” logic looks in the configuration. It is basically a list of rules for ignoring people who are slow to answer:

hystrix:
  command:
    default:
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 1000
      circuitBreaker:
        requestVolumeThreshold: 20
        sleepWindowInMilliseconds: 5000

Translated to human English, this configuration says: “If you take more than one second to answer me, you are dead to me. Once I have seen at least twenty requests in the current window and too many of them have failed (half, by default), I am going to ignore you for five seconds until you get your act together.”
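For readers who prefer the therapist in code form, here is a toy circuit breaker built around the same three ideas: a minimum request volume, an error percentage, and a sleep window. It is a sketch, not Hystrix. The class and method names are invented, the counters are cumulative rather than a rolling window, and the one-second timeout is left out entirely (the wrapped operation is assumed to raise when it gives up).

```python
import time


class CircuitBreaker:
    """Toy circuit breaker; counters are cumulative, not a rolling window."""

    def __init__(self, request_volume_threshold=20,
                 error_threshold_percentage=50,
                 sleep_window_seconds=5.0, clock=time.monotonic):
        self.request_volume_threshold = request_volume_threshold
        self.error_threshold_percentage = error_threshold_percentage
        self.sleep_window_seconds = sleep_window_seconds
        self._clock = clock
        self._requests = 0
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def is_open(self):
        return self._opened_at is not None

    def call(self, operation, fallback):
        """Run operation, or fallback if it fails or the circuit is open."""
        if self.is_open:
            if self._clock() - self._opened_at < self.sleep_window_seconds:
                return fallback()  # open: fail fast, leave the service alone
            return self._trial(operation, fallback)
        self._requests += 1
        try:
            return operation()
        except Exception:
            self._failures += 1
            self._trip_if_needed()
            return fallback()

    def _trial(self, operation, fallback):
        """Sleep window is over: let a single request test the waters."""
        try:
            result = operation()
        except Exception:
            self._opened_at = self._clock()  # still broken, restart the window
            return fallback()
        self._requests = self._failures = 0
        self._opened_at = None
        return result

    def _trip_if_needed(self):
        enough_traffic = self._requests >= self.request_volume_threshold
        too_many_errors = (self._failures * 100
                           >= self._requests * self.error_threshold_percentage)
        if enough_traffic and too_many_errors:
            self._opened_at = self._clock()
```

A caller would wrap the risky call and supply the degraded answer, something like `breaker.call(fetch_subtitles, lambda: [])`, where `fetch_subtitles` is whatever function talks to the flaky service.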

The digital hoarder pantry

At the scale Netflix operates, data storage is less about organization and more about controlled hoarding. They use a system that only makes sense if you have given up on the concept of minimalism.

They use Cassandra, a NoSQL database, to store user history. Cassandra is like a grandmother who saves every newspaper from 1952 because “you never know.” It is designed to be distributed. You can lose half your hard drives, and Cassandra will simply shrug and serve the data from a replica on another node.

But the real genius, and the reason they survived the apocalypse, is EVCache. This is their homemade caching system based on Memcached. It is a massive pantry where they store snacks they know you will want before you even ask for them.

Here is the kicker. They do not just cache movie data. They cache their own credentials.

When AWS went down, the specific service that failed was often IAM (Identity and Access Management). This is the service that checks if your computer is allowed to talk to the database. When IAM died, servers all over the world suddenly forgot who they were. They were having an identity crisis.

Netflix servers did not care. They had cached their credentials locally. They had pre-loaded the permissions. It is like filling your basement with canned goods, not because you anticipate a zombie apocalypse, but because you know the grocery store manager personally and you know he is unreliable. While other companies were frantically trying to call AWS to ask, “Who am I?”, Netflix’s servers were essentially lip-syncing their way through the performance using pre-recorded tapes.
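The general pattern here is "serve stale on error": keep the last good answer and fall back to it when the authority stops picking up the phone. The sketch below shows that pattern only. How Netflix actually caches credentials is not described in this post, and the class name, the five-minute TTL, and the use of `ConnectionError` to signal an outage are assumptions for the example.

```python
import time


class CachedCredentials:
    """Cache a credential and keep serving it if refreshing fails."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch          # callable that asks the identity service
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self) -> str:
        is_fresh = (self._fetched_at is not None
                    and self._clock() - self._fetched_at < self._ttl)
        if is_fresh:
            return self._value
        try:
            self._value = self._fetch()
            self._fetched_at = self._clock()
        except ConnectionError:
            if self._value is None:
                raise  # nothing cached yet, so there is nothing to fall back on
            # Identity service is down: keep serving the stale copy.
        return self._value


if __name__ == "__main__":
    service_up = [True]

    def ask_identity_service() -> str:
        if not service_up[0]:
            raise ConnectionError("identity service unreachable")
        return "token-123"

    creds = CachedCredentials(ask_identity_service, ttl_seconds=0.0)
    print(creds.get())
    service_up[0] = False
    print(creds.get())  # still answers, using the cached copy
```

The trade-off is real: a stale credential only helps for as long as whoever checks it still accepts it, so this buys time rather than immunity.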

Hiring a saboteur to guard the vault

This is where the engineering culture goes from sensible to beautifully unhinged. Netflix employs the Simian Army.

This is not a metaphor. It is a suite of software tools designed to break things. The most famous is Chaos Monkey. Its job is to randomly shut down live production servers during business hours. It just kills them. No warning. No mercy.

Then there is Chaos Kong. Chaos Kong does not just kill a server. It simulates the destruction of an entire AWS region. It nukes the East Coast.

Let that sink in for a moment. Netflix pays engineers very high salaries to build software that attacks their own infrastructure. It is like hiring a pyromaniac to work as a fire inspector. Sure, he will find every flammable material in the building, but usually by setting it on fire first.

I spoke with a former engineer who described their “region evacuation” drills. “We basically declare war on ourselves,” she told me. “At 10 AM on a Tuesday, usually after the second coffee, we decide to kill us-east-1. The first time we did it, half the company needed therapy. Now? We can evacuate a region in six minutes. It’s boring.”

This is the secret sauce. The reason Netflix stayed up is that they have rehearsed the outage so many times that it feels like a chore. While other companies were discovering their disaster recovery plans were written in crayon, Netflix engineers were calmly executing a routine they practice more often than they practice dental hygiene.

Building your own highway system

There is a final plot twist. When you hit play, the video, strictly speaking, does not come from the cloud. It comes from Open Connect.

Netflix realized years ago that the public internet is a dirt road full of potholes. So they built their own private highway. They designed physical hardware, bright red boxes packed with hard drives, and shipped them to Internet Service Providers (ISPs) all over the world.

These boxes sit inside the data centers of your local internet provider. They are like mini-warehouses. When a new season of The Queen’s Gambit comes out, Netflix pre-loads it onto these boxes at 4 AM when nobody is using the internet.

So when you stream the show, the data is not traveling from an Amazon data center in Virginia. It is traveling from a box down the street. It might travel five miles instead of two thousand.

It is an invasive, brilliant strategy. It is like Netflix insisted on installing a mini-fridge in your neighbor’s garage just to ensure your beer is three degrees colder. During the cloud outage, even if the “brain” of Netflix (the control plane in AWS) was having a seizure, the “body” (the video files in Open Connect) was fine. The content was already local. The cloud could burn, but the movie was already in the house.

The beautiful absurdity of it all

The irony is delicious. Netflix is one of AWS’s biggest customers and one of its biggest success stories. Yet they survive on AWS by fundamentally refusing to trust AWS. They cache credentials, they pre-pull images, they build their own delivery network, and they unleash monkeys to destroy their own servers just to prove they can survive the murder attempt.

They have weaponized Murphy’s Law. They built a company where the unofficial motto seems to be “Everything fails, all the time, so let’s get good at failing.”

So the next time the internet breaks and your Slack goes silent, do not panic. Just open Netflix. Somewhere in the dark, a Chaos Monkey is pulling a plug, a paranoid bouncer is shoving traffic away from a burning server, and your binge-watching will continue uninterrupted. The internet might be held together by duct tape and hubris, but Netflix has invested in really, really expensive duct tape.

November 28, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

The secret and anxious life of a data packet inside AWS

You press a finger against the greasy glass of your smartphone. You are in a café in Melbourne, the coffee is lukewarm, and you have made the executive decision to watch a video of a cat falling off a Roomba. It feels like a trivial action.

But for the data packet birthed by that tap, this is D-Day.

It is a tiny, nervous backpacker being kicked out into the digital wilderness with nothing but a destination address and a crippling fear of latency. Its journey through Amazon’s cloud infrastructure is not the clean, sterile diagram your systems architect drew on a whiteboard. It is a micro drama of hope, bureaucratic routing, and existential dread that plays out in roughly 200 milliseconds.

We tend to think of the internet as a series of tubes, but it is more accurate to think of it as a series of highly opinionated bouncers and overworked bureaucrats. To understand how your cat video loads, we have to follow this anxious packet through the gauntlet of Amazon Web Services (AWS).

The initial panic and the mapmaker with a god complex

Our packet leaves your phone and hits the cellular network. It is screaming for directions. It needs to find the server hosting the video, but it only has a name (e.g., cats.example.com). Computers do not speak English; they speak IP addresses.

Enter Route 53.

Amazon calls Route 53 a Domain Name System (DNS) service. In practice, it acts like a travel agent with a philosophy degree and multiple personality disorder. It does not just look up addresses; it judges you based on where you are standing and how healthy the destination looks.

If Route 53 is configured with Geolocation Routing, it acts like a local snob. It looks at our packet’s passport, sees “Melbourne,” and sneers. “You are not going to the Oregon server. The Americans are asleep, and the latency would be dreadful. You are going to Sydney.”

However, Route 53 is also a hypochondriac. Through Health Checks, it constantly pokes the servers to see if they are alive. It is the digital equivalent of texting a friend, “Are you awake?” every ten seconds. If the Sydney server fails to respond three times in a row, Route 53 assumes the worst (death, fire, or a kernel panic) and reroutes our packet to Singapore. This is Failover Routing, the prepared pessimist of the group.
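The "three strikes and you are rerouted" logic looks roughly like this. It is a sketch of the idea only: the `Endpoint` class and `resolve` function are hypothetical, and the recovery rule is simplified so that a single successful check restores an endpoint, which is an assumption of this example rather than a statement about Route 53.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # consecutive failed checks before failing over


@dataclass
class Endpoint:
    name: str
    consecutive_failures: int = 0

    def record_check(self, succeeded: bool) -> None:
        """Update the failure streak after one health check."""
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < FAILURE_THRESHOLD


def resolve(primary: Endpoint, secondary: Endpoint) -> str:
    """Answer with the primary unless it is considered dead."""
    return primary.name if primary.healthy else secondary.name


if __name__ == "__main__":
    sydney, singapore = Endpoint("sydney"), Endpoint("singapore")
    for attempt in range(1, 4):
        sydney.record_check(succeeded=False)
        print(f"after failed check {attempt}: send traffic to",
              resolve(sydney, singapore))
```

Requiring a streak rather than a single failure is what stops one dropped probe from sending all of Melbourne to Singapore.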

The packet doesn’t care about the logic. It just wants an address so it can stop hyperventilating in the void.

CloudFront is the desperate golden retriever of the internet

Armed with an IP address, our packet rushes toward the destination. But hopefully, it never actually reaches the main server. That would be inefficient. Instead, it runs into CloudFront.

CloudFront is a Content Delivery Network (CDN). Think of it as a network of convenience stores scattered all over the globe, so you don’t have to drive to the factory to buy milk. Or, more accurately, think of CloudFront as a Golden Retriever that wants to please you so badly it is vibrating.

Its job is caching. It memorizes content. When our packet arrives at the CloudFront “Edge Location” in Melbourne, the service frantically checks its pockets. “Do I have the cat video? I think I have the cat video. I fetched it for that guy in the corner five minutes ago!”

If it has the video (a Cache Hit), it hands it over immediately. The packet is relieved. The journey is over. Everyone goes home happy.

But if CloudFront cannot find the video (a Cache Miss), the mood turns sour. The Golden Retriever looks guilty. It now has to turn around and run all the way to the origin server to fetch the data fresh. This is the “Edge” of the network, a place that sounds like a U2 guitarist but is actually just a rack of humming metal in a secure facility near the airport.

The tragedy of CloudFront is the Time To Live (TTL). This is the expiration date on the data. If the TTL is set to 24 hours, CloudFront will proudly hand you a version of the website from yesterday, oblivious to the fact that you updated the spelling errors this morning. It is like a dog bringing you a dead bird it found last week, convinced it is still a great gift.
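The hit, the miss, and the dead bird all fall out of a few lines of caching logic. This is a minimal TTL cache in the spirit of the description above, not CloudFront's behavior in detail; the `EdgeCache` class and its counters are invented for the example.

```python
import time


class EdgeCache:
    """Tiny TTL cache sitting in front of an origin fetch function."""

    def __init__(self, fetch_from_origin, ttl_seconds=86400.0,
                 clock=time.monotonic):
        self._fetch = fetch_from_origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # path -> (body, time it was stored)
        self.hits = 0
        self.misses = 0

    def get(self, path: str) -> str:
        entry = self._entries.get(path)
        if entry is not None:
            body, stored_at = entry
            if self._clock() - stored_at < self._ttl:
                self.hits += 1
                return body  # cache hit, possibly stale relative to origin
        # Cache miss or expired entry: go all the way back to the origin.
        self.misses += 1
        body = self._fetch(path)
        self._entries[path] = (body, self._clock())
        return body


if __name__ == "__main__":
    origin = {"/index.html": "Welcom to my site"}
    cache = EdgeCache(lambda path: origin[path], ttl_seconds=86400.0)
    print(cache.get("/index.html"))
    origin["/index.html"] = "Welcome to my site"  # typo fixed at the origin
    print(cache.get("/index.html"))  # the edge still serves the old copy
```

Until the entry expires, the edge has no reason to suspect the origin changed, which is exactly the trade you make for speed.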

The security guard who judges your shoes

If our packet suffers a Cache Miss, it must travel deeper into the data center. But first, it has to get past the Web Application Firewall (WAF).

The WAF is not a firewall in the traditional sense; it is a nightclub bouncer who has had a very long shift and hates everyone. It stands at the velvet rope, scrutinizing every packet for signs of “malicious intent.”

It checks for SQL injection, which is the digital equivalent of trying to sneak a knife into the club taped to your ankle. It checks for Cross-Site Scripting (XSS), which is essentially trying to trick the club into changing its name to “Free Drinks for Everyone.”

The WAF operates on a set of rules that range from reasonable to paranoid. Sometimes, it blocks a legitimate packet just because it looks suspicious: perhaps the packet is too large, or it came from a country the WAF has decided to distrust today. The packet pleads its innocence, but the WAF is a piece of software code; it does not negotiate. It simply returns a 403 Forbidden error, which translates roughly to: “Your shoes are ugly. Get out.”

The Application Load Balancer manages the VIP list

Having survived the bouncer, our weary packet arrives at the Application Load Balancer (ALB). If the WAF is the bouncer, the ALB is the maître d’ holding the clipboard.

The ALB is obsessed with fairness and health. It stands in front of a pool of identical servers (the Target Group) and decides who has to do the work. It is trying to prevent any single server from having a nervous breakdown due to overcrowding.

“Server A is busy processing a login request,” the ALB mutters. “Server B is currently restarting because it had a panic attack. You,” it points to our packet, “you go to Server C. It looks bored.”

The ALB’s relationship with the servers is codependent and toxic. It performs health checks on them relentlessly. It demands a 200 OK status code every thirty seconds. If a server takes too long to reply or replies with an error, the ALB declares it “Unhealthy” and stops sending it friends. It effectively ghosts the server until it gets its act together.

The Origin, where the magic (and heat) happens

Finally, the packet reaches the destination. The Origin.

We like to imagine the cloud as an ethereal, fluffy place. In reality, the Origin is likely an EC2 instance, a virtual slice of a computer sitting in a windowless room in Northern Virginia or Dublin. The room is deafeningly loud with the sound of cooling fans and smells of ozone and hot plastic.

Here, the application code actually runs. The request is processed, and the server realizes it needs the actual video file. It reaches out to Amazon S3 (Simple Storage Service), which is essentially a bottomless digital bucket where the internet hoards its data.

The EC2 instance grabs the video from the bucket, processes it, and prepares to send it back.

This is the most fragile part of the journey. If the code has a bug, the server might vomit a 500 Internal Server Error. This is the server saying, “I tried, but I broke something inside myself.” If the database is overwhelmed, the request might time out.

When this happens, the failure cascades back up the chain. The ALB shrugs and tells the user “502 Bad Gateway” (translation: “The guy in the back room isn’t talking to me”). The WAF doesn’t care. CloudFront may cache the error response, so now everyone keeps seeing the error until that cached copy expires.

And somewhere, a DevOps engineer’s phone starts buzzing at 3:00 AM.

The return trip

But today, the system works. The Origin retrieves the video bytes. It hands them to the ALB, which passes them to the WAF (who checks them one last time for contraband), which hands them to CloudFront, which hands them to the cellular network.

The packet returns to your phone. The screen flickers. The cat falls off the Roomba. You chuckle, swipe up, and request the next video.

You have no idea that you just forced a tiny, digital backpacker to navigate a global bureaucracy, evade a paranoid security guard, and wake up a server in a different hemisphere, all in less time than it takes you to blink. It is a modern marvel held together by fiber optics and anxiety.

So spare a thought for the data. It has seen things you wouldn’t believe.

November 23, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

AWS Lambda SQS provisioned mode is cheaper than therapy

There is a specific flavor of nausea reserved for serverless engineering teams. It usually strikes at 2 a.m., shortly after a major product launch, when someone posts a triumphant screenshot of user traffic in Slack. While the marketing team is virtually high-fiving, CloudWatch quietly begins to draw a perfect, vertical line that looks less like a growth chart and more like a cliff edge.

Your SQS queues swell. Lambda invocations crawl. Suddenly, the phrase “fully managed service” sounds less comforting and more like a cruel punchline delivered by a distant cloud provider.

For years, the relationship between Amazon SQS and AWS Lambda has been the backbone of event-driven architecture. You wire up an event source mapping, let Lambda poll the queue, and trust the system to scale as messages arrive. Most days, this works beautifully. On the wrong day, under the wrong kind of spike, it works “eventually.”

But in the world of high-frequency trading or flash sales, “eventually” is just a polite synonym for “too late.”

With the release of AWS Lambda SQS Provisioned Mode on November 14, Amazon is finally admitting that sometimes magic is too slow. It grants you explicit control over the invisible workers that poll SQS for your function. It ensures they are already awake, caffeinated, and standing in line before the mob shows up. It allows you to trade a bit of extra planning (and money) for the guarantee that your system won’t hit the snooze button while your backlog turns into a towering monument to failure.

The uncomfortable truth about standard SQS polling

To understand why we need Provisioned Mode, we have to look at the somewhat lazy nature of the standard behavior.

Out of the box, Lambda uses an event source mapping to poll SQS on your behalf. You give it a queue and some basic configuration, and Lambda spins up pollers to check for work. You never see these pollers. They are the ghosts in the machine.

The problem with ghosts is that they are not particularly urgent. When a massive spike hits your queue, Lambda realizes it needs more pollers and more concurrent function invocations. However, it does not do this instantly. It ramps up. It adds capacity in increments, like a cautious driver merging onto a freeway.

For a steady workload, you will never notice this ramp-up. But during a viral marketing campaign or a market crash, those minutes of warming up feel like an eternity. You are essentially watching a barista who refuses to start grinding coffee beans until the line of customers has already curled around the block.

Standard SQS polling gives you tools like batch size, but it denies you direct influence over the urgency of the consumption. You cannot tell the system, “I need ten workers ready right now.” You can only stand in line and hope the algorithm notices you are drowning.

This is acceptable for background jobs like resizing images or sending emails. It is decidedly less acceptable for payment processing or fraud detection. In those cases, watching twenty thousand messages pile up while your system “automatically scales” is not an architectural feature. It is a resume-generating event.

Paying for a standing army instead of volunteers

Provisioned Mode flips the script on this reactive behavior. Instead of letting Lambda decide how many pollers to use based purely on demand, you tell it the minimum and maximum number of event pollers you want reserved for that queue.

An event poller is a dedicated worker that reads from SQS and hands batches of messages to your function. In standard mode, these pollers are summoned from a shared pool when needed. In Provisioned Mode, you are paying to keep them on retainer.

Think of it as the difference between calling a ride-share service and hiring a private driver to sit in your driveway with the engine running. One is efficient for the general public; the other is necessary if you need to leave the house in exactly three seconds.

The benefits are stark when translated into human terms.

First, you get speed. AWS advertises significantly faster scaling for SQS event source mappings in Provisioned Mode. We are talking about adding up to one thousand new concurrent invocations per minute.

Second, you get capacity. AWS’s announcement puts the ceiling at up to 20,000 concurrent invocations per SQS mapping, far higher than standard mode allows.

Third, and perhaps most importantly, you get predictability. A single poller is not just a warm body. It is a unit of throughput (handling up to 1 MB per second or 10 concurrent invokes). By setting a minimum number of pollers, you are mathematically guaranteeing a baseline of throughput. You are no longer hoping the waiters show up; you have paid their salaries in advance.
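To make that arithmetic concrete, here is a rough sketch of the baseline a given minimum buys you. It takes the per-poller figures quoted above (1 MB per second, 10 concurrent invokes) at face value. The function names are invented for illustration, and real throughput also depends on batch size and how long your function runs.

```javascript
// Per-poller figures quoted in the text above (treated here as assumptions).
const MB_PER_SECOND_PER_POLLER = 1;
const CONCURRENT_INVOKES_PER_POLLER = 10;

// Baseline capacity bought by a given minimum number of pollers.
function baselineCapacity(minimumPollers) {
  return {
    mbPerSecond: minimumPollers * MB_PER_SECOND_PER_POLLER,
    concurrentInvokes: minimumPollers * CONCURRENT_INVOKES_PER_POLLER,
  };
}

// Smallest minimum that covers both a payload rate and a concurrency target.
function pollersNeeded(targetMbPerSecond, targetConcurrentInvokes) {
  const forBytes = Math.ceil(targetMbPerSecond / MB_PER_SECOND_PER_POLLER);
  const forInvokes = Math.ceil(targetConcurrentInvokes / CONCURRENT_INVOKES_PER_POLLER);
  return Math.max(1, forBytes, forInvokes);
}

console.log(baselineCapacity(5)); // { mbPerSecond: 5, concurrentInvokes: 50 }
console.log(pollersNeeded(3, 120)); // 12
```

Whichever limit bites first (bytes or invocations) sets the number, which is why the helper takes the larger of the two.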

Configuring this without losing your mind

The good news is that Provisioned Mode is not a new service with its own terrifying learning curve. It is just a configuration toggle on the event source mapping you are already using. You can set it up in the AWS Console, the CLI, or your Infrastructure as Code tool of choice.

The interface asks for two numbers, and this is where the engineering art form comes in.

First, it asks for Minimum Pollers. This is the number of workers you always want ready.

Second, it asks for Maximum Pollers. This is the ceiling, the limit you set to ensure you do not accidentally DDoS your own database.

Choosing these numbers feels a bit like gambling, but there is a logic to it. For the minimum, pick a number that comfortably handles your typical traffic plus a standard spike. Start small. Setting this to 100 when you usually need 2 is the serverless equivalent of buying a school bus to commute to work alone.

For the maximum, look at your downstream systems. There is no point in setting a maximum that allows 5,000 concurrent Lambda functions if your relational database curls into a fetal position at 500 connections.
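One way to turn that advice into numbers is sketched below. It assumes each poller can drive up to 10 concurrent invocations and that every invocation holds one database connection. `sizePollers` and the `headroom` parameter are hypothetical, and the `MinimumPollers` / `MaximumPollers` field names follow the provisioned poller configuration on event source mappings as I understand it, so check the current API reference before copying them.

```javascript
// Assumption: one poller can drive up to 10 concurrent invocations,
// and each invocation holds one database connection.
const INVOKES_PER_POLLER = 10;

function sizePollers({ typicalConcurrency, dbConnectionLimit, headroom = 0.8 }) {
  // Leave some connections free for migrations, dashboards, and humans.
  const usableConnections = Math.floor(dbConnectionLimit * headroom);
  const maximum = Math.max(1, Math.floor(usableConnections / INVOKES_PER_POLLER));
  // Cover typical traffic, but never exceed the ceiling.
  const wanted = Math.max(1, Math.ceil(typicalConcurrency / INVOKES_PER_POLLER));
  const minimum = Math.min(maximum, wanted);
  return { MinimumPollers: minimum, MaximumPollers: maximum };
}

console.log(sizePollers({ typicalConcurrency: 20, dbConnectionLimit: 500 }));
// { MinimumPollers: 2, MaximumPollers: 40 }
```

The ceiling is derived from the database, not from the traffic you hope for, which is the whole point of having a maximum.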

Once you enable it, you need to watch your metrics. Keep an eye on queue depth (ApproximateNumberOfMessagesVisible in CloudWatch) and the age of the oldest message (ApproximateAgeOfOldestMessage). If the backlog clears too slowly, buy more pollers. If your database administrator starts sending you angry emails in all caps, reduce the maximum. The goal is not perfection on day one; it is to replace guesswork with a feedback loop.

The financial hangover

Nothing in life is free, and this applies doubly to AWS features that solve headaches.

When you enable Provisioned Mode, AWS begins charging you for “Event Poller Units.” You pay for the minimum pollers you configure, regardless of whether there are messages in the queue. You are paying for readiness.

This is a mental shift for serverless purists. The whole promise of serverless was “pay for what you use.” Provisioned Mode is “pay for what you might need.”

You are essentially renting a standing army. Most of the time, they will just stand there, playing cards and eating your budget. But when the enemy (traffic) attacks, they are already in position. Standard SQS polling is cheaper because it relies on volunteers. Volunteers are free, but they take a while to put on their boots.

From a FinOps perspective, or simply from the perspective of explaining the bill to your boss, the question is not “Is this expensive?” The question is “What is the cost of latency?”

For a background report generator, a five-minute delay costs nothing. For a high-frequency trading platform, a five-second delay costs everything. You should not enable Provisioned Mode on every queue in your account. That would be financial malpractice. You reserve it for the critical paths, the workflows where the price of slowness is measured in lost customers rather than just infrastructure dollars.

Why you should care about the fourth dial

Architecturally, Provisioned Mode gives us a new layer of control. Previously, we had three main dials in event-driven systems: how fast we write to the queue, how fast the consumers process messages, and how much concurrency Lambda is allowed.

Provisioned Mode adds a fourth dial: the aggression of the retrieval.

It allows you to reason about your system deterministically. If you know that one poller provides X amount of throughput, you can stack them to meet a specific Service Level Agreement. It turns a “best effort” system into a “calculated guarantee” system.

Serverless was sold to us as freedom from capacity planning. We were told we could just write code and let the cloud handle the undignified details of scaling. For many workloads, that promise holds true.

But as your workloads become more critical, you discover the uncomfortable corners where “just let it scale” is not enough. Latency budgets shrink. Compliance rules tighten. Customers grow less patient.

AWS Lambda SQS Provisioned Mode is a small, targeted answer to that discomfort. It allows you to say, “I want at least this much readiness,” and have the platform respect that wish, even when your traffic behaves like a toddler on a sugar high.

So, pick your most critical queue. The one that keeps you awake at night. Enable Provisioned Mode, set a modest minimum, and watch the metrics. Your future self, staring at a flat latency graph during the next Black Friday, will be grateful you decided to stop trusting in magic and started paying for physics.

November 20, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Escaping the AWS NAT Gateway toll booth

My coffee went cold. I was staring at my AWS bill, and one line item was staring back at me with a judgmental smirk: NAT Gateway: 33,01 €.

This wasn’t for compute. This wasn’t for storing terabytes of crucial data. This was for the simple, mundane privilege of letting my Lambda functions send emails and tell Stripe to charge a credit card.

Let’s talk about NAT Gateway pricing. It’s a special kind of pain.

$0.045 per hour (That’s roughly $33 a month, just for existing).
$0.045 per GB processed (You get charged for your own data).
…and that’s per Availability Zone. For High Availability, you multiply by two or three.
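For anyone who wants to plug in their own numbers, the pricing above reduces to a few lines of arithmetic. This is only a sketch: the two rates are the ones quoted in the list, the 730-hour month is an average, and `natGatewayMonthlyCost` is a made-up helper name.

```javascript
const HOURS_PER_MONTH = 730; // average month
const NAT_HOURLY_USD = 0.045; // hourly rate from the list above
const NAT_PER_GB_USD = 0.045; // per-GB processing rate from the list above

function natGatewayMonthlyCost({ azCount, gbProcessed }) {
  // You pay the standing charge once per Availability Zone.
  const standing = azCount * NAT_HOURLY_USD * HOURS_PER_MONTH;
  const traffic = gbProcessed * NAT_PER_GB_USD;
  return Number((standing + traffic).toFixed(2));
}

console.log(natGatewayMonthlyCost({ azCount: 1, gbProcessed: 0 })); // 32.85
console.log(natGatewayMonthlyCost({ azCount: 3, gbProcessed: 100 })); // 103.05
```

The first line is the "just for existing" figure; the second is what a three-zone setup looks like once a modest amount of traffic flows through it.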

I was suddenly paying more for a digital toll booth operator than I was for the actual application logic running my startup. That’s when I started asking questions. Did I really need this? What was I actually paying for? And more importantly, was there another way?

This is the story of how I hunted down that 33€ line item. By the end, you’ll know exactly if you need a NAT Gateway, or if you’re just burning money to keep the AWS machine fed.

The great NAT lie

Every AWS tutorial, every Stack Overflow answer, every “serverless best practice” blog post chants the same mantra: “If your Lambda needs to access the internet, and it’s in a VPC, you need a NAT Gateway.”

It’s presented as a law of physics. Like gravity, or the fact that DNS will always be the problem. And I, like a good, obedient engineer, followed the instructions. I clicked the button. I added the NAT. And then the bill came.

It turns out that obedience is expensive.

The gilded cage we call a VPC

Before we storm the castle, we have to understand why we built the castle in the first place. Why are our Lambdas in this mess? The answer is the Virtual Private Cloud (VPC).

By default, a Lambda function is a free spirit. It’s born with a magical, AWS-managed connection to the outside world. It can call any API it wants. It’s a social butterfly.

But then, security happens.

We have a managed database, like MongoDB Atlas. We absolutely, positively do not want this database exposed to the public internet. That’s like shouting your bank details across a crowded shopping mall. So, we rightly configure it to only accept private connections.

To let our Lambda talk to this database, we have to build a “gated community” for it. That’s our VPC. We move the Lambda inside this community and set up a “VPC Peering” connection, which is like a private, guarded footpath between our VPC and the MongoDB VPC.

Our Lambda can now securely whisper secrets to the database. The traffic never touches the public internet. We are secure. We are compliant. We are… trapped.

House arrest

We solved one problem but created a massive new one. In building this fortress to protect our database, we built it with no doors to the outside world.

Our Lambda is now on house arrest.

Sure, it can talk to the database in the adjoining room. But it can no longer call the Stripe API to process a payment. It can’t call an email service. It can’t even phone its own cousins in the AWS family, like AWS Secrets Manager or S3 (not without extra work, anyway). Any attempt to reach the internet just… times out. It’s the sound of silence.

This is the dilemma. To be secure, our Lambda must be in a VPC. But once in a VPC, it’s useless for half its job.

Enter the expensive chaperone

This is where the AWS Gospel presents its solution: the NAT Gateway.

The NAT (Network Address Translation) Gateway is, in our analogy, an extremely expensive, bonded chaperone.

You place this chaperone in a “public” part of your gated community (a public subnet). When your Lambda on house arrest needs to send a letter to the outside world (like an API call to Stripe), it gives the letter to the chaperone.

The chaperone (the NAT) takes the letter, walks it to the main gate, puts its own public return address on it, and sends it. When the reply comes back, the chaperone receives it, verifies it’s for the Lambda, and delivers it.

This works. It’s secure. The Lambda’s private address is never exposed.

But this chaperone charges you. It charges you by the hour just to be on call. It charges you for every letter it carries (data processed). And as we established, you need one per Availability Zone, so two or three of them, if you want to be properly redundant.

This is a racket.

The “Split Personality” solution

I refused to pay the toll. There had to be another way. The solution came from realizing I was trying to make one Lambda do two completely opposite jobs.

What if, instead of one “do-it-all” Lambda, I created two specialists?

The hermit: This Lambda lives inside the VPC. Its one and only job is to talk to the database. It is antisocial, secure, and has no idea the internet exists.
The messenger: This Lambda lives outside the VPC. It’s a “free-range” Lambda. Because it’s not attached to any VPC, AWS magically gives it that default internet access. It cannot talk to the database (which is good!), but it can talk to Stripe all day long.

The plan is simple: when The hermit (VPC Lambda) needs something from the internet, it invokes The messenger (Proxy Lambda). It hands it a note: “Please tell Stripe to charge $25.00.” The messenger runs the errand, gets the receipt, and passes it back to The hermit, who then safely logs the result in the database.

It’s a “split personality” architecture.

But is it safe?

I can hear you asking: “Wait. A Lambda with internet access? Isn’t that like leaving your front door wide open for attackers?”

No. And this is the most beautiful part.

A Lambda function, whether in a VPC or not, never gets a public IP address. It can make outbound calls, but nothing from the public internet can initiate a call to it.

It’s like having a phone that can only make calls, not receive them. It’s unreachable. The “Messenger” Lambda is perfectly safe to live outside the VPC, ready to do our bidding.

The secret tunnel system

So, I built it. The hermit. The messenger. I was a genius. I hit “test.”

…timeout.

Of course. I forgot. The hermit is still on house arrest. “Invoking” another Lambda is, itself, an AWS API call. It’s a request that has to leave the VPC to reach the AWS Lambda service. My Lambda couldn’t even call its own lawyer.

This is where the real solution lies. Not in a gateway, but in a series of tunnels.

They’re called VPC Endpoints.

A VPC Endpoint is not a big, expensive, public chaperone. It’s a private, secret tunnel that you build directly from your VPC to a specific AWS service, all within the AWS network.

So, I built two tunnels:

A tunnel to AWS Secrets Manager: Now my hermit Lambda can get its API keys directly, without ever leaving the house.
A tunnel to AWS Lambda: Now my hermit Lambda can use its private phone to “invoke” The messenger.

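These endpoints are not free. Secrets Manager and Lambda both use interface endpoints, which are billed per endpoint, per Availability Zone, per hour (around a cent an hour each in most regions, so roughly $7 a month apiece), plus a small per-GB processing fee. Gateway endpoints, the kind S3 and DynamoDB offer, carry no charge at all. Check current pricing for your region, but two tunnels in a single AZ still come in at less than half of one NAT Gateway. We’ve replaced a $100/mo toll road with a private footpath that costs a fraction of that.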

(A grumpy side note: annoyingly, some AWS services like Cognito don’t support VPC Endpoints. For those, you still have to use the Messenger proxy pattern. But for most, the tunnels work.)

Our glorious new contraption

Let’s look at our payment handler again. This little function needed to:

Get API keys from AWS Secrets Manager.
Call Stripe’s API.
Write the transaction to MongoDB.

Here is how our new, glorious, Rube Goldberg machine works:

Step 1: The Payment Lambda (The hermit) gets a request.
Step 2: It needs keys. It pops over to AWS Secrets Manager through its private tunnel (the VPC Endpoint). No internet needed.
Step 3: It needs to charge a card. It calls the invoke command, which goes through its other private tunnel to the AWS Lambda service, triggering The messenger.
Step 4: The messenger (Proxy Lambda), living in the free-range world, makes the outbound call to Stripe. Stripe, delighted, processes the payment and sends a reply.
Step 5: The messenger passes the success (or failure) response back to The hermit.
Step 6: The hermit, now holding the result, calmly turns and writes the transaction record to MongoDB via its private VPC Peering connection.

Everything works. Nothing is exposed. And the NAT Gateway bill is 0€.

For those who speak in code

Here is a simplified look at what our two specialist Lambdas are doing.

Payment Lambda (The hermit – INSIDE VPC)

// This Lambda is attached to your VPC
// It needs VPC Endpoints for 'lambda' and 'secretsmanager'

import { InvokeCommand, LambdaClient } from "@aws-sdk/client-lambda";
// ... (imports for Secrets Manager and Mongo)

const lambda = new LambdaClient({});

export const handler = async (event) => {
  try {
    const amountToCharge = 2500; // 25.00

    // 1. Get secrets via VPC Endpoint
    // const apiKeys = await getSecretsFromManager();
    
    // 2. Prepare to invoke the proxy
    const command = new InvokeCommand({
      FunctionName: process.env.PAYMENT_PROXY_FUNCTION_NAME,
      InvocationType: "RequestResponse",
      Payload: JSON.stringify({
        chargeDetails: { amount: amountToCharge, currency: "usd" },
      }),
    });

    // 3. Invoke the proxy Lambda via VPC Endpoint
    const response = await lambda.send(command);
    const proxyResponse = JSON.parse(
      Buffer.from(response.Payload).toString()
    );

    if (proxyResponse.status === "success") {
      // 4. Write to MongoDB via VPC Peering
      // await writePaymentRecordToMongo(proxyResponse.transactionId);
      
      return {
        statusCode: 200,
        body: `Payment succeeded! TxID: ${proxyResponse.transactionId}`,
      };
    } else {
      // Handle payment failure
      return { statusCode: 400, body: "Payment failed." };
    }
  } catch (error) {
    console.error(error);
    return { statusCode: 500, body: "Server error" };
  }
};

Proxy Lambda (The messenger – OUTSIDE VPC)

// This Lambda is NOT attached to a VPC
// It has default internet access

// ... (import for your Stripe client)
// const stripe = new Stripe(process.env.STRIPE_SECRET_KEY);

export const handler = async (event) => {
  // 1. Extract the data sent by the invoking Hermit.
  //    The JSON passed as Payload arrives as the event itself.
  const { chargeDetails } = event;

  try {
    // 2. Call the external Stripe API
    // const stripeResponse = await stripe.charges.create({
    //   amount: chargeDetails.amount,
    //   currency: chargeDetails.currency,
    //   source: "tok_visa", // Example token
    // });
   
    // Mocking the Stripe call for this example
    const stripeResponse = {
        id: `txn_${Math.random().toString(36).substring(2, 15)}`,
        status: 'succeeded'
    };


    if (stripeResponse.status === 'succeeded') {
      // 3. Return the successful result
      return {
        status: "success",
        transactionId: stripeResponse.id,
      };
    } else {
      return { status: "failed", error: "Stripe decline" };
    }
  } catch (err) {
    // 4. Return any errors
    return {
      status: "failed",
      error: `Error contacting Stripe: ${err.message}`,
    };
  }
};
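One gap in the simplified hermit code is worth a sketch. If the messenger throws instead of returning a status object, a synchronous Invoke call still comes back with HTTP 200, and the failure only shows up in the response’s FunctionError field. A small helper along these lines makes that explicit. `decodeInvokeResponse` is a hypothetical name, and the fake response stands in for whatever `lambda.send(command)` returns.

```javascript
// Decode the result of a synchronous Invoke call.
// `response` is expected to look like the SDK output:
// { StatusCode, FunctionError?, Payload? } where Payload is a Uint8Array.
function decodeInvokeResponse(response) {
  const text = response.Payload ? new TextDecoder().decode(response.Payload) : "";
  const body = text ? JSON.parse(text) : null;

  // FunctionError is set when the invoked function threw or timed out.
  if (response.FunctionError) {
    const message = (body && body.errorMessage) || "unknown error";
    throw new Error(`Proxy Lambda failed (${response.FunctionError}): ${message}`);
  }
  return body;
}

// Usage with a fake response, standing in for `await lambda.send(command)`.
const fakeResponse = {
  StatusCode: 200,
  Payload: new TextEncoder().encode(
    JSON.stringify({ status: "success", transactionId: "txn_123" })
  ),
};
console.log(decodeInvokeResponse(fakeResponse).transactionId); // txn_123
```

With this in place, the hermit’s existing catch block handles a crashed messenger the same way it handles any other error, instead of trying to read a transaction ID out of an error payload.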

Was it worth it?

And there it is. A production-grade, secure, and resilient system. Our hermit Lambda is safe in its VPC, talking to the database, our Messenger Lambda is happily running errands on the internet, and our secret tunnels are connecting everything privately.

That said, figuring all this out and integrating it into a production system takes a significant amount of time. This… this contraption of proxies and endpoints is, frankly, a headache.

If you don’t want the headache, sometimes it’s easier to just pay that damn 30€ for a NAT Gateway and move on with your life.

The purpose of this article wasn’t just to save a few bucks. It was to pull back the curtain. To show that the “one true way” isn’t the only way, and to prove that with a little bit of architectural curiosity, you can, in fact, escape the AWS NAT Gateway toll booth.

November 15, 2025 by Fernando SRE Cloud stuff SRE stuff

Your Multi-Region strategy is a fantasy

The recent failure showed us the truth: your data is stuck, and active-active failover is a fantasy for 99% of us. Here’s a pragmatic high-availability strategy that actually works.

Well, that was an intense week.

When the great AWS outage of October 2025 hit, I did what every senior IT person does: I grabbed my largest coffee mug, opened our monitoring dashboard, and settled in to watch the world burn. us-east-1, the internet’s stubbornly persistent center of gravity, was having what you’d call a very bad day.

And just like clockwork, as the post-mortems rolled in, the old, tired refrain started up on social media and in Slack: “This is why you must be multi-region.”

I’m going to tell you the truth that vendors, conference speakers, and that one overly enthusiastic junior dev on your team won’t. For 99% of companies, “multi-region” is a lie.

It’s an expensive, complex, and dangerous myth sold as a silver bullet. And the recent outage just proved it.

The “Just Be Multi-Region” fantasy

On paper, it sounds so simple. It’s a lullaby for VPs.

You just run your app in us-east-1 (Virginia) and us-west-2 (Oregon). You put a shiny global load balancer in front, and if Virginia decides to spontaneously become an underwater volcano, poof! All your traffic seamlessly fails over to Oregon. Zero downtime. The SREs are heroes. Champagne for everyone.

This is a fantasy.

It’s a fantasy that costs millions of dollars and lures development teams into a labyrinth of complexity they will never escape. I’ve spent my career building systems that need to stay online. I’ve sat in the planning meetings and priced out the “real” cost. Let me tell you, true active-active multi-region isn’t just “hard”; it’s a completely different class of engineering.

And it’s one that your company almost certainly doesn’t need.

The three killers of Multi-Region dreams

It’s not the application servers. Spinning up EC2 instances or containers in another region is the easy part. That’s what we have Infrastructure as Code for. Any intern can do that.

The problem isn’t the compute. The problem is, and always has been, the data.

Killer 1: Data has gravity, and it’s a jerk

This is the single most important concept in cloud architecture. Data has gravity.

Your application code is a PDF. It’s stateless and lightweight. You can email it, copy it, and run it anywhere. Your 10TB PostgreSQL database is not a PDF. It’s the 300-pound antique oak desk the computer is sitting on. You can’t just “seamlessly fail it over” to another continent.

To have a true seamless failover, your data must be available in the second region at the exact moment of the failure. This means you need synchronous, real-time replication across thousands of miles.

Guess what that does to your write performance? It’s like trying to have a conversation with someone on Mars. A round trip from Virginia to Oregon typically adds tens of milliseconds (figure on 60 to 80 ms) to every single database commit, and a transaction that needs several round trips pays that toll each time. Under load, the application becomes painfully slow. Every time a user clicks “save,” they have to wait for a photon to physically travel across the country and back. Your users will hate it.

“Okay,” you say, “we’ll use asynchronous replication!”

Great. Now when us-east-1 fails, you’ve lost the last 5 minutes of data. Every transaction, every new user sign-up, every shopping cart order. Vanished. You’ve traded a “Recovery Time” of zero for a “Data Loss” that is completely unacceptable. Go explain to the finance department that you purposefully designed a system that throws away the most recent customer orders. I’ll wait.

This is the trap. Your compute is portable; your data is anchored.
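The trade-off is easy to put into numbers. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the 2 ms local commit, the 70 ms cross-country round trip, and the 40 writes per second are illustrative assumptions, and both function names are invented.

```javascript
// Upper bound on sequential commits per second for one connection when
// every commit must wait for a remote acknowledgement.
function maxSerialCommitsPerSecond(localCommitMs, crossRegionRttMs) {
  return Math.floor(1000 / (localCommitMs + crossRegionRttMs));
}

// Writes that exist only in the failed region when replication lags.
function writesAtRisk(replicationLagSeconds, writesPerSecond) {
  return Math.round(replicationLagSeconds * writesPerSecond);
}

console.log(maxSerialCommitsPerSecond(2, 0));  // 500 (single region)
console.log(maxSerialCommitsPerSecond(2, 70)); // 13 (synchronous, cross-country)
console.log(writesAtRisk(300, 40));            // 12000 (5 minutes of lag at 40 writes/s)
```

Synchronous replication pays on every write; asynchronous replication pays all at once, on the worst day of the year.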

Killer 2: The astronomical cost

I was on a project once where the CTO, fresh from a vendor conference, wanted a full active-active multi-region setup. We scoped it.

Running 2x the servers was fine. The real cost was the inter-region data transfer.

AWS (and all cloud providers) charge an absolute fortune for data moving between their regions. It’s the “hotel minibar” of cloud services. Every single byte your database replicates, every log, every file transfer… cha-ching.

Our projected bill for the data replication and the specialized services (like Aurora Global Databases or DynamoDB Global Tables) was three times the cost of the entire rest of the infrastructure.

You are paying a massive premium for a fleet of servers, databases, and network gateways that are sitting idle 99.9% of the time. It’s like buying the world’s most expensive gym membership and only going once every five years to “test” it. It’s an insurance policy so expensive, you can’t afford the disaster it’s meant to protect you from.

Killer 3: The crushing complexity

A multi-region system isn’t just two copies of your app. It’s a brand new, highly complex, slightly psychotic distributed system that you now have to feed and care for.

You now have to solve problems you never even thought about:

Global DNS failover: How does Route 53 know a region is down? Health checks fail. But what if the health check itself fails? What if the health check thinks Virginia is fine, but it’s just hallucinating?
Data write conflicts: This is the fun part. What if a user in New York (writing to us-east-1) and a user in California (writing to us-west-2) update the same record at the same time? Welcome to the world of split-brain. Who wins? Nobody. You now have two “canonical” truths, and your database is having an existential crisis. Your job just went from “Cloud Architect” to “Data Therapist.”
Testing: How do you even test a full regional failover? Do you have a big red “Kill Virginia” button? Are you sure you know what will happen when you press it? On a Tuesday afternoon? I didn’t think so.
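The write-conflict problem is easier to feel in code. Below is a minimal sketch of "last write wins", a common default for resolving such conflicts. The record shape and the timestamps are invented for the example.

```javascript
// Each region accepted a write to the same record while they could not
// see each other. Timestamps are milliseconds from each region's clock.
function lastWriteWins(versionA, versionB) {
  return versionA.updatedAt >= versionB.updatedAt ? versionA : versionB;
}

const fromVirginia = {
  region: "us-east-1",
  email: "new@example.com", // this user changed the email
  address: "old street",
  updatedAt: 1000,
};
const fromOregon = {
  region: "us-west-2",
  email: "old@example.com",
  address: "new street", // this user changed the address
  updatedAt: 1001,
};

const merged = lastWriteWins(fromVirginia, fromOregon);
console.log(merged.address); // new street
console.log(merged.email);   // old@example.com (the Virginia update is silently gone)
```

Nothing crashed and nothing was logged. One user’s change simply lost a coin toss decided by a one-millisecond clock difference, which is why real multi-region systems need per-field merges, CRDTs, or a single writer per record.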

You haven’t just doubled your infrastructure; you’ve 10x’d your architectural complexity.

But we have Kubernetes because we are Cloud Native

This was my favorite part of the October 2025 outage.

I saw so many teams that thought Kubernetes would save them. They had their fancy federated K8s clusters spanning multiple regions, YAML files as far as the eye could see.

And they still went down.

Why? Because Kubernetes doesn’t solve data gravity!

Your K8s cluster in us-west-2 dutifully spun up all your application pods. They woke up, stretched, and immediately started screaming: “WHERE IS MY DISK?!”

Your persistent volumes (PVs) are backed by EBS or EFS. That ‘E’ stands for ‘Elastic,’ not ‘Extradimensional.’ An EBS volume is pinned to a single Availability Zone, and an EFS file system to a single region. That disk is physically, stubbornly attached to Virginia. Your pods in Oregon can’t mount a disk that lives 3,000 miles away.

Unless you’ve invested in another layer of incredibly complex, eye-wateringly expensive storage replication software, your “cloud-native” K8s cluster was just a collection of very expensive, very confused applications shouting into the void for a database that was currently offline.

A pragmatic high availability strategy that actually works

So if multi-region is a lie, what do we do? Just give up? Go home? Take up farming?

Yes. You accept some downtime.

You stop chasing the “five nines” (99.999%) myth and start being honest with the business. Your goal is not “zero downtime.” Your goal is a tested and predictable recovery.
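It helps to see what each "nine" actually allows. The snippet below is plain arithmetic with a hypothetical helper name: it converts an availability percentage into minutes of permitted downtime per (non-leap) year.

```javascript
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

function downtimeBudgetMinutesPerYear(availabilityPercent) {
  const allowedFraction = 1 - availabilityPercent / 100;
  return Number((allowedFraction * MINUTES_PER_YEAR).toFixed(1));
}

console.log(downtimeBudgetMinutesPerYear(99.999)); // 5.3
console.log(downtimeBudgetMinutesPerYear(99.9));   // 525.6
console.log(downtimeBudgetMinutesPerYear(99));     // 5256
```

Five nines leaves about five minutes a year, less time than it takes most teams to acknowledge the page. Three nines leaves almost nine hours, which is room for a real recovery plan.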

Here is the sane strategy.

1. Embrace Multi-AZ (The real HA)

This is what AWS actually means by “high availability.” Run your application across multiple Availability Zones (AZs) within a single region. An AZ is one or more physically separate data centers. us-east-1a and us-east-1b sit miles apart, with independent power and networking.

This is like having a backup generator for your house. Multi-region is like building an identical, fully-furnished duplicate house in another city just in case a meteor hits your first one.

Use a Multi-AZ RDS instance. Use an Auto Scaling Group that spans AZs. This protects you from 99% of common failures: a server rack dying, a network switch failing, or a construction crew cutting a fiber line. This should be your default. It’s cheap, it’s easy, and it works.

2. Focus on RTO and RPO

Stop talking about “nines” and start talking about two simple numbers:

RTO (Recovery Time Objective): How fast do we need to be back up?
RPO (Recovery Point Objective): How much data can we afford to lose?

Get a real answer from the business, not a fantasy. Is a 4-hour RTO and a 15-minute RPO acceptable? For almost everyone, the answer is yes.

3. Build a “Warm Standby” (The sane DR)

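This is the strategy that actually works. It’s the “fire drill” plan, not the “build a duplicate city” plan. (Purists will note that AWS’s own disaster recovery taxonomy calls this “backup and restore,” and reserves “warm standby” for a scaled-down copy that is always running. The name matters less than the drill.)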

Infrastructure: Your entire infrastructure is defined in Terraform or CloudFormation. You can rebuild it from scratch in any region with a single command.
Data: You take regular snapshots of your database (e.g., every 15 minutes) and automatically copy them to your disaster recovery region (us-west-2).
The plan: When us-east-1 dies, you declare a disaster. The on-call engineer runs the “Deploy-to-DR” script.

Here’s a taste of what that “sane” infrastructure-as-code looks like. You’re not paying for two of everything. You’re paying for a blueprint and a backup.

# main.tf (in your primary region module)
# This is just a normal server
resource "aws_instance" "app_server" {
  count         = 3 # Your normal production count
  ami           = "ami-0abcdef123456"
  instance_type = "t3.large"
  # ... other config
}

# dr.tf (in your DR region module)
# This server doesn't even exist... until you need it.
resource "aws_instance" "dr_app_server" {
  # This is the magic.
  # This resource is "off" by default (count = 0).
  # You flip one variable (is_disaster = true) to build it.
  count         = var.is_disaster ? 3 : 0
  provider      = aws.dr_region # Pointing to us-west-2
  ami           = "ami-0abcdef123456" # Placeholder: AMI IDs are region-specific, so use the ID of the copy in us-west-2
  instance_type = "t3.large"
  # ... other config
}

resource "aws_db_instance" "dr_database" {
  count                   = var.is_disaster ? 1 : 0
  provider                = aws.dr_region
  
  # Here it is: You build the new DB from the
  # latest snapshot you've been copying over.
  snapshot_identifier     = var.latest_db_snapshot_arn
  
  instance_class          = "db.r5.large"
  # ... other config
}

Once the DR stack is up, you flip a single DNS record in Route 53 to point all traffic to the new load balancer in us-west-2.

Yes, you have downtime (your RTO of 2–4 hours). Yes, you might lose 15 minutes of data (your RPO).

But here’s the beautiful part: it actually works, it’s testable, and it costs a tiny fraction of an active-active setup.
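"Testable" can start before the first fire drill, with a sanity check of the numbers themselves. The sketch below assumes the worst case for data loss is one full snapshot interval plus the time a snapshot takes to copy across regions. The 5-minute copy time and 120-minute rebuild are made-up inputs, and the function names are hypothetical.

```javascript
// Worst case: the region dies just before the next snapshot would have
// finished copying, so you lose one full interval plus the copy time.
function worstCaseDataLossMinutes(snapshotIntervalMinutes, copyMinutes) {
  return snapshotIntervalMinutes + copyMinutes;
}

function meetsObjectives(plan) {
  const loss = worstCaseDataLossMinutes(plan.snapshotIntervalMinutes, plan.copyMinutes);
  return {
    rpoOk: loss <= plan.rpoMinutes,
    rtoOk: plan.rebuildMinutes <= plan.rtoMinutes,
  };
}

console.log(
  meetsObjectives({
    rpoMinutes: 15,
    rtoMinutes: 240,
    snapshotIntervalMinutes: 15,
    copyMinutes: 5,
    rebuildMinutes: 120,
  })
);
// { rpoOk: false, rtoOk: true }
```

Under these assumptions a 15-minute snapshot schedule does not quite deliver a 15-minute RPO once copy time is counted. Either snapshot more often or agree on a 20-minute RPO with the business; both are fine, as long as the number you promise is the number you measured.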

The AWS outage in October 2025 wasn’t a lesson in the need for multi-region. It was a global, public, costly lesson in humility. It was a reminder to stop chasing mythical architectures that look good on a conference whiteboard and focus on building resilient, recoverable systems.

So, stop feeling guilty because your setup doesn’t span three continents. You’re not lazy; you’re pragmatic. You’re the sane one in a room full of people passionately arguing about the best way to build a teleporter for that 300-pound antique oak desk.

Let them have their complex, split-brain, data-therapy sessions. You’ve chosen a boring, reliable, testable “warm standby.” You’ve chosen to get some sleep.

November 7, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Burst traffic realities for AWS API Gateway Architects

Let’s be honest. Cloud architecture promises infinite scalability, but sometimes it feels like we’re herding cats wearing rocket boots. I learned this the hard way when my shiny serverless app, built with all the modern best practices, started hiccuping like a soda-drunk kangaroo during a Black Friday sale. The culprit? AWS API Gateway throttling under bursty traffic. And no, it wasn’t my coffee intake causing the chaos.

The token bucket, a simple idea with a sneaky side

AWS API Gateway uses a token bucket algorithm to manage traffic. Picture a literal bucket. Tokens drip into it at a steady rate, your rate limit. Each incoming request steals a token to pass through. If the bucket is empty? Requests get throttled. Simple, right? Like a bouncer checking IDs at a club.

But here’s the twist: This bouncer has a strict hourly wage. If 100 requests arrive in one second, they’ll drain the bucket faster than a toddler empties a juice box. Then, even if traffic calms down, the bucket refills slowly. Your API is stuck in timeout purgatory while tokens trickle back. AWS documents this, but it’s easy to miss until your users start tweeting about your “haunted API.”

Bursty traffic is life’s unpredictable roommate

Bursty traffic isn’t a bug; it’s a feature of modern apps. Think flash sales, mobile app push notifications, or that viral TikTok dance challenge your marketing team insisted would go viral (bless their optimism). Traffic doesn’t flow like a zen garden stream. It arrives in tsunami waves.

I once watched a client’s analytics dashboard spike at 3 AM. Turns out, their smart fridge app pinged every device simultaneously after a firmware update. The bucket emptied. Alarms screamed. My weekend imploded. Bursty traffic doesn’t care about your sleep schedule.

When bursts meet buckets, the throttling tango

Here’s where things get spicy. API Gateway’s token bucket has a burst capacity: the number of tokens the bucket can hold, configured alongside your rate limit. Say your stage has a rate of 100 requests/second and a burst of 100. Your bucket holds 100 tokens. Send 150 requests in one burst? The first 100 sail through. The next 50 get throttled, even if the average traffic is below 100/second.

It’s like a theater with 100 seats. If 150 people rush the door at once, 50 get turned away, even if half the theater is empty later. AWS isn’t being petty. It’s protecting downstream services (like your database) from sudden stampedes. But when your app is the one getting trampled? Less poetic. More infuriating.

Does this haunt all throttling types?

Good news: this quirk bites hardest at the stage and account levels. Usage plans? They run on the same token bucket mechanics, with a rate and a burst per API key, but they layer quotas on top and only affect the clients holding that key. But stage-level throttling? It’s the diva of the trio. Configure it carelessly, and it will sabotage your bursts like a jealous ex.

If you’ve layered all three throttling types (account, stage, usage plan), stage-level settings often dominate the drama. Check your stage settings first. Always.

Taming the beast, practical fixes that work

After several caffeine-fueled debugging sessions, I’ve learned a few tricks to keep buckets full and bursts happy. None requires sacrificing a rubber chicken to the cloud gods.

1. Resize your bucket
Stage-level throttling lets you set a burst limit alongside your rate limit. Double it. Triple it. Just stay under the account-level quota, which by default is a burst of 5,000 requests (and a steady 10,000 requests per second) per Region in most Regions. Calculate your peak bursts (use CloudWatch metrics!), then set burst capacity 20% higher. Safety margins are boring until they save your launch day.
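The sizing rule is simple arithmetic, sketched below. The function name, the sample peaks, and the 5,000 ceiling (taken from the figure quoted above) are assumptions; check your own account quotas before trusting the cap.

```python
ACCOUNT_BURST_LIMIT = 5000  # assumed ceiling, from the figure quoted above


def recommended_burst(peak_requests_per_second, margin_percent=20,
                      ceiling=ACCOUNT_BURST_LIMIT):
    """Largest observed one-second peak plus a safety margin, capped at `ceiling`."""
    if not peak_requests_per_second:
        raise ValueError("need at least one observed peak")
    peak = max(peak_requests_per_second)
    # Integer ceiling division avoids floating-point surprises.
    target = (peak * (100 + margin_percent) + 99) // 100
    return min(target, ceiling)


observed_peaks = [120, 340, 275, 410]  # hypothetical per-second peaks
print(recommended_burst(observed_peaks))
```

With these sample numbers the largest peak is 410, so the suggested burst is 410 plus 20%.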

2. Queue the chaos
Offload bursts to SQS or Kinesis. Front your API with a lightweight service that accepts requests instantly, dumps them into a queue, and processes them at a civilized pace. Users get a “we got this” response. Your bucket stays calm. Everyone wins. Except the throttling gremlins.
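Here is the shape of that pattern in miniature. An in-memory `deque` stands in for SQS or Kinesis, and `BurstBuffer` with its methods is a hypothetical name, so read it as a sketch of the flow rather than a production design.

```python
from collections import deque


class BurstBuffer:
    """Accept requests immediately, hand them to the backend at a fixed pace."""

    def __init__(self, max_per_tick):
        self.pending = deque()  # stand-in for a real queue such as SQS
        self.max_per_tick = max_per_tick

    def accept(self, request_id):
        """Enqueue the request and reply 202 Accepted without doing the work yet."""
        self.pending.append(request_id)
        return 202

    def drain(self):
        """Process at most `max_per_tick` queued requests, oldest first."""
        batch_size = min(self.max_per_tick, len(self.pending))
        return [self.pending.popleft() for _ in range(batch_size)]


buffer = BurstBuffer(max_per_tick=100)
statuses = [buffer.accept(f"req-{i}") for i in range(250)]  # a 250-request burst
ticks = 0
while buffer.pending:
    buffer.drain()
    ticks += 1
print(ticks)  # the burst is worked off in 3 civilized batches
```

The caller sees a fast 202 every time; the backend never sees more than 100 requests per tick.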

3. Smarter clients are your friends
Teach client apps to retry intelligently. Exponential backoff with jitter isn’t just jargon, it’s the art of politely asking “Can I try again later?” instead of spamming “HELLO?!” every millisecond. AWS SDKs bake this in. Use it.
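If you are rolling your own client rather than leaning on an SDK, the retry loop can be as small as this. It uses the “full jitter” variant of exponential backoff; `Throttled`, the base delay, and the cap are assumptions for the sketch, not values from any AWS SDK.

```python
import random
import time


class Throttled(Exception):
    """Stand-in for an HTTP 429 response; a hypothetical name for this sketch."""


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, max_attempts=5, sleep=time.sleep):
    """Run `operation`, backing off and retrying whenever it raises Throttled."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the throttle
            sleep(backoff_delay(attempt))
```

The jitter matters as much as the backoff: without it, every throttled client wakes up at the same moment and produces the next burst for you.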

4. Distribute the pain
Got multiple stages or APIs? Spread bursts across them. A load balancer or Route 53 weighted routing can turn one screaming bucket into several murmuring ones. It’s like splitting a rowdy party into smaller rooms.

5. Monitor like a paranoid squirrel
CloudWatch alarms for 429 Too Many Requests are non-negotiable. Track the 4XXError and Count metrics per stage (429s are counted as 4XX errors, and access logs tell them apart from the other client errors). Set alerts at 70% of your burst limit. Because knowing your bucket is half-empty is far better than discovering it via customer complaints.
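The 70% rule is easy to prototype against exported request counts before you wire up a real alarm. The function below is a hypothetical helper working on a plain list of per-second counts; it does not call CloudWatch.

```python
def seconds_near_burst_limit(per_second_counts, burst_limit, threshold_percent=70):
    """Return the indexes of one-second samples at or above the alert threshold."""
    threshold = burst_limit * threshold_percent / 100
    return [i for i, count in enumerate(per_second_counts) if count >= threshold]


samples = [120, 349, 350, 480]  # hypothetical requests per second
print(seconds_near_burst_limit(samples, burst_limit=500))  # [2, 3]
```

With a burst limit of 500 the threshold is 350 requests, so only the last two samples would have paged someone.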

The quiet triumph of preparedness

Cloud architecture is less about avoiding fires and more about not using gasoline as hand sanitizer. Bursty traffic will happen. Token buckets will empty. But with thoughtful configuration, you can transform throttling from a silent assassin into a predictable gatekeeper.

AWS gives you the tools. It’s up to us to wield them without setting the data center curtains ablaze. Start small. Test bursts in staging. And maybe keep that emergency coffee stash stocked. Just in case.

Your APIs deserve grace under pressure. Now go forth and throttle wisely. Or better yet, throttle less.

November 4, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

AWS control is the new technical debt

Creating your first AWS account is a modern rite of passage.

It feels like you’ve just been handed the keys to a digital kingdom, a shiny, infinitely powerful box of LEGOs. You log into that console, see that universe of 200+ services, and think, “I have control.”

In reality, you’ve just volunteered to be the kingdom’s chief plumber, electrician, structural engineer, and sanitation officer, all while juggling the royal budget. And you only wanted to build a shop to sell t-shirts.

For years, we in the tech world have accepted this as the default. We believed that “cloud-native” meant getting your hands dirty. We believed that to be a “real” engineer, you had to speak fluent IAM JSON and understand the intimate details of VPC peering.

Let’s be honest with ourselves. In 2025, meticulously managing your own raw AWS infrastructure isn’t a competitive advantage. It’s an anchor. It’s the equivalent of insisting on milling your own flour and churning your own butter just to make a sandwich.

It’s time to call it what it is: the new technical debt.

The seduction of total control

Why did we all fall for this? Because “control” is a powerfully seductive idea.

We were sold a dream of infinite knobs and levers. We thought, “If I can configure everything, I can optimize everything!” We pictured ourselves as brilliant cloud architects, seated at a vast console, fine-tuning the global engine of our application.

But this “control” is a mirage. What it really means is the freedom to spend a Tuesday afternoon debugging why a security group is blocking traffic, or the privilege of becoming an unwilling expert on data transfer pricing.

It’s not strategic control; it’s janitorial control. And it’s costing us dearly.

The three-headed monster of ‘Control’

When you sign up for that “control,” you unknowingly invite a three-headed monster to live in your office. It doesn’t ask for rent, but it feeds on your time, your money, and your sanity.

1. The labyrinth of accidental complexity

You just want to launch a simple web app. How hard can it be?

Famous last words.

To do it “properly” in a raw AWS account, your journey looks less like engineering and more like an archaeological dig.

First, you must enter the dark labyrinth of VPCs, subnets, and NAT Gateways, a plumbing job so complex it would make a Roman aqueduct engineer weep. Then, you must present a multi-page, blood-signed sacrifice to the gods of IAM, praying that your policy document correctly grants one service permission to talk to another without accidentally giving “Public” access to your entire user database.

This is before you’ve even provisioned a server. Want a database? Great. Now you’re a database administrator, deciding on instance types, read replicas, and backup schedules. Need storage? Welcome to S3, where you’re now a compliance officer, managing bucket policies and lifecycle rules.

What started as building a house has turned into you personally mining the copper for the wiring. The complexity isn’t a feature; it’s a bug.

2. The financial hemorrhage

AWS pricing is perhaps the most compelling work of high-fantasy fiction in modern times. “Pay for what you use” sounds beautifully simple.

It’s the “use” part that gets you.

It’s like a bar where the drinks are cheap, but the peanuts are $50, the barstool costs $20 an hour, and you’re charged for the oxygen you breathe.

This “control” means you are now the sole accountant for a thousand tiny, running meters. You’re paying for idle EC2 instances you forgot about, unattached EBS volumes that are just sitting there, and NAT Gateways that cheerfully process data at a price that would make a loan shark blush.

And let’s talk about data transfer. That’s the fine print, written in invisible ink, at the bottom of the contract. It’s the silent killer of cloud budgets, the gotcha that turns your profitable month into a financial horror movie.

Without a full-time “Cloud Cost Whisperer,” your bill becomes a monthly lottery where you always lose.

3. The developer’s schizophrenia

The most expensive-to-fix part of this whole charade is the human cost.

We hire brilliant software developers to build brilliant products. Then, we immediately sabotage them by demanding they also be expert network engineers, security analysts, database administrators, and billing specialists.

The modern “Full-Stack Developer” is now a “Full-Cloud-Stack-Network-Security-Billing-Analyst-Developer.” The cognitive whiplash is brutal.

One moment you’re deep in application logic, crafting an algorithm, designing a user experience, and the next, you’re yanked out to diagnose a slow-running SQL query, optimize a CI/CD pipeline, or figure out why the “simple” terraform apply just failed for the fifth time.

This isn’t “DevOps.” This is a frantic one-person show, a short-order cook trying to run a 12-station Michelin-star kitchen alone. The cost of this context-switching is staggering. It’s the death of focus. It’s how great products become mediocre.

What we were all pretending not to want

For years, we’ve endured this pain. We’ve worn our complex Terraform files and our sprawling AWS diagrams as badges of honor. It was a form of intellectual hazing.

But what if we just… stopped?

What if we admitted what we really want? We don’t want to configure VPCs. We want our app to be secure and private. We don’t want to write auto-scaling policies. We want our app to simply not fall over when it gets popular.

We don’t want to spend a week setting up a deployment pipeline. We just want to git push deploy.

This isn’t laziness. This is sanity. We’ve finally realized that the business value isn’t in the plumbing; it’s in the water coming out of the tap.

The glorious liberation of abstraction

This realization has sparked a revolution. The future of cloud computing is, thankfully, becoming gloriously boring.

The new wave of platforms (PaaS, serverless environments, and advanced, opinionated frameworks) is built to do one thing: handle the plumbing so you don’t have to.

They run on top of the same powerful AWS (or GCP, or Azure) foundation, but they present you with a contract that makes sense. “You give us code,” they say, “and we’ll run it, scale it, secure it, and patch it. Go build your business.”

This isn’t a dumbed-down version of the cloud. It’s a sane one. It’s an abstraction layer that treats infrastructure like the utility it was always supposed to be.

Think about your home’s electricity. You just plug in your toaster and it works. You don’t have to manage the power plant, check the voltage on the high-tension wires, or personally rewire the neighborhood transformer. You just want toast.

The new platforms are finally letting us just make toast.

So what’s the sane alternative

“Abstraction” is a lovely, comforting word. But it’s also vague. It sounds like magic. It isn’t. It’s just a different set of trade-offs, where you trade the janitorial control of raw AWS for the productive speed of a platform that has opinions.

And it turns out, there’s an entire ecosystem of these “sane alternatives,” each designed to cure a specific infrastructure-induced headache.

The Frontend valet service (e.g., Vercel, Netlify):
This is the “I don’t even want to know where the server is” approach. You hand them your Next.js or React repo, and they handle everything else: global CDN, CI/CD, caching, serverless functions. It’s the git push dream realized. You’re not just getting a toaster; you’re getting a personal chef who serves you perfect toast on a silver platter, anywhere in the world, in 100 milliseconds.
The backend butler (e.g., Supabase, Firebase, Appwrite):
Remember the last time you thought, “You know what would be fun? Building user authentication from scratch!”? No, you didn’t. Because it’s a nightmare. These “Backend-as-a-Service” platforms are the butlers who handle the messy stuff, database provisioning, auth, file storage, so you can focus on the actual party (your app’s features).
The “furniture, but assembled” (e.g., Render, Railway, Heroku):
This is the sweet spot for most full-stack apps. You still have your Dockerfile (you know, the “instructions”), but you’re not forced to build the furniture yourself with a tiny Allen key (that’s Kubernetes). You give them a container, they run it, scale it, and even attach the managed database for you. It’s the grown-up version of what we all wished infrastructure was.
The tamed leviathan (e.g., GKE Autopilot, EKS on Fargate):
Okay, so your company is massive. You need the raw, terrifying power of Kubernetes. Fine. But you still don’t have to build the nuclear submarine yourself. These services are the “hire a professional crew” option. You get the power of Kubernetes, but Google or Amazon’s own engineers handle the patching, scaling, and 3 AM “node-is-down” panic attacks. You get to be the Admiral, not the guy shoveling coal in the engine room.

Stop building the car and just drive

Managing your own raw AWS account in 2025 is the very definition of technical debt. It’s an unhedged, high-interest loan you took out for no good reason, and you’re paying it off every single day with your team’s time, focus, and morale.

That custom-tuned VPC you spent three weeks on? It’s not your competitive advantage. That hand-rolled deployment script? It’s not your secret sauce.

Your product is your competitive advantage. Your user experience is your secret sauce.

The industry is moving. The teams that win will be the ones that spend less time tinkering with the engine and more time actually driving. The real work isn’t building the Rube Goldberg machine; it’s building the thing the machine is supposed to make.

So, for your own sanity, close that AWS console. Let someone else manage the plumbing.

Go build something that matters.

October 31, 2025 by Fernando SRE Cloud stuff DevOps stuff

The slow unceremonious death of EC2 Autoscaling

Let’s pour one out for an old friend.

AWS recently announced a small, seemingly boring new feature for EC2 Auto Scaling: the ability to cancel a pending instance refresh. If you squinted, you might have missed it. It sounds like a minor quality-of-life update, something to make a sysadmin’s Tuesday slightly less terrible.

But this isn’t a feature. It’s a gold watch. It’s the pat on the back and the “thanks for your service” speech at the awkward retirement party.

The EC2 Auto Scaling Group (ASG), the bedrock of cloud elasticity, the one tool we all reflexively reached for, is being quietly put out to pasture.

No, AWS hasn’t officially killed it. You can still spin one up, just like you can still technically send a fax. AWS will happily support it. But its days as the default, go-to solution for modern workloads are decisively over. The battle for the future of scaling has ended, and the ASG wasn’t the winner. The new default is serverless containers, hyper-optimized Spot fleets, and platforms so abstract they’re practically invisible.

If you’re still building your infrastructure around the ASG, you’re building a brand-new house with plumbing from 1985. It’s time to talk about why our old friend is retiring and meet the eager new hires who are already measuring the drapes in its office.

So why is the ASG getting the boot?

We loved the ASG. It was a revolutionary idea. But like that one brilliant relative everyone dreads sitting next to at dinner, it was also exhausting. Its retirement was long overdue, and the reasons are the same frustrations we’ve all been quietly grumbling about into our coffee for years.

It promised automation but gave us chores

The ASG’s sales pitch was simple: “I’ll handle the scaling!” But that promise came with a three-page, fine-print addendum of chores.

It was the operational overhead that killed us. We were promised a self-driving car and ended up with a stick-shift that required constant, neurotic supervision. We became part-time Launch Template librarians, meticulously versioning every tiny change. We became health-check philosophers, endlessly debating the finer points of ELB vs. EC2 health checks.

And then… the Lifecycle Hooks.

A “Lifecycle Hook” is a polite, clinical term for a Rube Goldberg machine of desperation. It’s a panic button that triggers a Lambda, which calls a Systems Manager script, which sends a carrier pigeon to… maybe… drain a connection pool before the instance is ruthlessly terminated. Trying to debug one at 3 AM was a rite of passage, a surefire way to lose precious engineering time and a little bit of your soul.

It moves at a glacial pace

The second nail in the coffin was its speed. Or rather, the complete lack of it.

The ASG scales at the speed of a full VM boot. In our world of spiky, unpredictable traffic, that’s an eternity. It’s like pre-heating a giant, industrial pizza oven for 45 minutes just to toast a single slice of bread. By the time your new instance is booted, configured, service-discovered, and finally “InService,” the spike in traffic has already come and gone, leaving you with a bigger bill and a cohort of very annoyed users.

It’s an expensive insurance policy

The ASG model is fundamentally wasteful. You run a “warm” fleet, paying for idle capacity just in case you need it. It’s like paying rent on a 5-bedroom house for your family of three, just in case 30 cousins decide to visit unannounced.

This “scale-up” model was slow, and the “scale-down” was even worse, riddled with fears of terminating the wrong instance and triggering a cascading failure. We ended up over-provisioning to avoid the pain of scaling, which completely defeats the purpose of “auto-scaling.”

The eager interns taking over the desk

So, the ASG has cleared out its desk. Who’s moving in? It turns out there’s a whole line of replacements, each one leaner, faster, and blissfully unconcerned with managing a “fleet.”

1. The appliance: Fargate and Cloud Run

First up is the “serverless container”. This is the hyper-efficient new hire who just says, “Give me the Dockerfile. I’ll handle the rest.”

With AWS Fargate or Google’s Cloud Run, you don’t have a fleet. You don’t manage VMs. You don’t patch operating systems. You don’t even think about an instance. You just define a task, give it some CPU and memory, and tell it how many copies you want. It scales from zero to a thousand in seconds.

This is the appliance model. When you buy a toaster, you don’t worry about wiring the heating elements or managing its power supply. You just put in bread and get toast. Fargate is the toaster. The ASG was the “build-your-own-toaster” kit that came with a 200-page manual on electrical engineering.

Just look at the cognitive load. This is what it takes to get a basic ASG running via the CLI:

# The "Old Way": Just one of the many steps...
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name my-legacy-asg \
    --launch-template "LaunchTemplateName=my-launch-template,Version='1'" \
    --min-size 1 \
    --max-size 5 \
    --desired-capacity 2 \
    --vpc-zone-identifier "subnet-0571c54b67EXAMPLE,subnet-0c1f4e4776EXAMPLE" \
    --health-check-type ELB \
    --health-check-grace-period 300 \
    --tags "Key=Name,Value=My-ASG-Instance,PropagateAtLaunch=true"

You still need to define the launch template, the subnets, the load balancer, the health checks…

Now, here’s the “New Way”: the core of a Fargate task definition. It’s just a simple JSON file (and since JSON doesn’t allow comments, the label stays out here):

{
  "family": "my-modern-app",
  "containerDefinitions": [
    {
      "name": "my-container",
      "image": "nginx:latest",
      "cpu": 256,
      "memory": 512,
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80
        }
      ]
    }
  ],
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512"
}

You define what you need, and the platform handles everything else.
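The one thing the platform will not do is accept any CPU and memory pairing you dream up. A small pre-flight check like the one below can catch that before deployment. The table covers only the common sizes up to 4 vCPU, assumes plain integer values (CPU units and MiB), and `validate_task_size` is a hypothetical helper, so treat it as a sketch and confirm against the current Fargate documentation.

```python
# Task-level CPU units mapped to the memory sizes (MiB) assumed valid on Fargate.
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def validate_task_size(task_definition):
    """Return a list of problems with the task-level cpu/memory pair (empty if fine)."""
    cpu = int(task_definition.get("cpu", 0))
    memory = int(task_definition.get("memory", 0))
    if cpu not in FARGATE_MEMORY_MIB:
        return [f"cpu {cpu} is not in the table of known sizes"]
    if memory not in FARGATE_MEMORY_MIB[cpu]:
        return [f"memory {memory} MiB is not valid with cpu {cpu}"]
    return []


print(validate_task_size({"cpu": "256", "memory": "512"}))   # []
print(validate_task_size({"cpu": "256", "memory": "4096"}))  # one problem reported
```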

2. The extreme couponer: Spot fleets

For workloads that are less “instant spike” and more “giant batch job,” we have the “optimized fleet”. This is the high-stakes, high-reward world of Spot Instances.

Spot used to be terrifying. AWS could pull the plug with two minutes’ notice, and your entire workload would evaporate. But now, with Spot Fleets and diversification, it’s the smartest tool in the box. You can tell AWS, “I need 1,000 vCPUs, and I don’t care what instance types you give me, just find the cheapest ones.”

The platform then builds a diversified fleet for you across multiple instance types and Availability Zones, making it incredibly resilient to any single Spot pool termination. It’s perfect for data processing, CI/CD runners, and any batch job that can be interrupted and resumed. The ASG was always too rigid for this kind of dynamic, cost-driven scaling.
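To see why diversification helps, here is a toy allocator that splits a vCPU target across the cheapest few pools instead of betting everything on one. The prices are made up, `SpotPool` and `diversify` are hypothetical names, and real allocation strategies also weigh available capacity, so this is only the cost half of the story.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotPool:
    instance_type: str
    zone: str
    vcpus: int
    hourly_price: float  # hypothetical figure, not real market data


def diversify(pools, target_vcpus, pool_count=3):
    """Spread `target_vcpus` evenly over the `pool_count` cheapest pools per vCPU.

    Returns a mapping of (instance_type, zone) to the number of instances to request.
    """
    if not pools:
        raise ValueError("need at least one Spot pool")
    cheapest = sorted(pools, key=lambda p: p.hourly_price / p.vcpus)[:pool_count]
    share = -(-target_vcpus // len(cheapest))  # ceiling division: vCPUs per pool
    return {(p.instance_type, p.zone): -(-share // p.vcpus) for p in cheapest}


pools = [
    SpotPool("m5.large", "us-east-1a", 2, 0.040),
    SpotPool("c5.xlarge", "us-east-1b", 4, 0.068),
    SpotPool("r5.large", "us-east-1c", 2, 0.050),
    SpotPool("m5.xlarge", "us-east-1a", 4, 0.090),
]
plan = diversify(pools, target_vcpus=1000)
for pool, count in plan.items():
    print(pool, count)
```

If one of the three chosen pools is reclaimed, roughly a third of the capacity disappears instead of all of it.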

3. The paranoid security guard: MicroVMs

Then there’s the truly weird stuff: Firecracker. This is the technology that powers AWS Lambda and Fargate. It’s a “MicroVM” that gives you the iron-clad security isolation of a full virtual machine but with the lightning-fast startup speed of a container.

We’re talking boot times of as little as 125 milliseconds. This is for when you need to run thousands of tiny, separate, untrusted workloads simultaneously without them ever being able to see each other. It’s the ultimate “multi-tenant” dream, giving every user their own tiny, disposable, fire-walled VM in the blink of an eye.

4. The invisible platform: Edge runtimes

Finally, we have the platforms that are so abstract they’re “scaled to invisibility”. This is the world of Edge. Think Lambda@Edge or CloudFront Functions.

With these, you’re not even scaling in a region anymore. Your logic, your code, is automatically replicated and executed at hundreds of Points of Presence around the globe, as close to the end-user as possible. The entire concept of a “fleet” or “instance” just… disappears. The logic scales with the request.

Life after the funeral. How to adapt

Okay, the eulogy is over. The ASG is in its rocking chair on the porch. What does this mean for us, the builders? It’s time to sort through the old belongings and modernize the house.

Go full Marie Kondo on your architecture

First, you need to re-evaluate. Open up your AWS console and take a hard look at every single ASG you’re running. Be honest. Ask the tough questions:

Does this workload really need to be stateful?
Do I really need VM-level control, or am I just clinging to it for comfort?
Is this a stateless web app that I’ve just been too lazy to containerize?

If it doesn’t spark joy (or isn’t a snowflake legacy app that’s impossible to change), thank it for its service and plan its migration.

Stop shopping for engines, start shopping for cars

The most important shift is this: Pick the runtime, not the infrastructure.

For too long, our first question was, “What EC2 instance type do I need?” That’s the wrong question. That’s like trying to build a new car by starting at the hardware store to buy pistons.

The right question is, “What’s the best runtime for my workload?”

Is it a simple, event-driven piece of logic? That’s a Function (Lambda).
Is it a stateless web app in a container? That’s a Serverless Container (Fargate).
Is it a massive, interruptible batch job? That’s an Optimized Fleet (Spot).
Is it a cranky, stateful monolith that needs a pet VM? Only then do you fall back to an Instance (EC2, maybe even with an ASG).
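Those four questions are really a small decision function. Written out, with the questions asked in the same order, it might look like this; the `Workload` fields are assumptions about what you would need to know, not an official taxonomy.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    event_driven: bool = False         # short logic triggered by events
    stateless: bool = True
    containerized: bool = False
    interruptible_batch: bool = False  # can be stopped and resumed


def pick_runtime(workload):
    """Walk the questions in order and return the first runtime that fits."""
    if workload.event_driven:
        return "Function (Lambda)"
    if workload.stateless and workload.containerized:
        return "Serverless Container (Fargate)"
    if workload.interruptible_batch:
        return "Optimized Fleet (Spot)"
    return "Instance (EC2)"  # the fallback, not the starting point


print(pick_runtime(Workload(stateless=True, containerized=True)))
print(pick_runtime(Workload(stateless=False)))
```

The point of writing it down is the ordering: EC2 is what remains after every other answer has been ruled out.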

Automate logic, not instance counts

Your job is no longer to be a VM mechanic. Your team’s skills need to shift. Stop manually tuning desired_capacity and start designing event-driven systems.

Focus on scaling logic, not servers. Your scaling trigger shouldn’t be “CPU is at 80%.” It should be “The SQS queue depth is greater than 100” or “API latency just breached 200ms”. Let the platform, be it Lambda, Fargate, or a KEDA-powered Kubernetes cluster, figure out how to add more processing power.
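As a concrete example of scaling on logic rather than CPU, here is the backlog rule in a few lines. The 100-messages-per-worker target echoes the queue depth mentioned above; the function name and the min/max bounds are assumptions, and in practice the platform’s autoscaler would evaluate something like this for you.

```python
def desired_workers(queue_depth, messages_per_worker=100,
                    min_workers=0, max_workers=50):
    """One worker per `messages_per_worker` queued messages, within the given bounds."""
    if messages_per_worker <= 0:
        raise ValueError("messages_per_worker must be positive")
    needed = -(-queue_depth // messages_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))


for depth in (0, 101, 10_000):
    print(depth, desired_workers(depth))
```

An empty queue scales to zero, 101 messages asks for two workers, and a flood of 10,000 is capped at the maximum of 50.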

Was it really better in the old days?

Of course, this move to abstraction isn’t without trade-offs. We’re gaining a lot, but we’re also losing something.

The gain is obvious: We get our nights and weekends back. We get drastically reduced operational overhead, faster scaling, and for most stateless workloads, a much lower bill.

The loss is control. You can’t SSH into the host behind a Fargate task (ECS Exec will get you a shell inside the container, and that’s as far as it goes). You can’t run a custom kernel module on Lambda. For those few, truly special, high-customization legacy workloads, this is a dealbreaker. They will be the ASG’s loyal companions in the retirement home.

But for everything else? The ASG is a relic. It was a brilliant, necessary solution for the problems of 2010. But the problems of 2025 and beyond are different. The cloud has evolved to scale logic, functions, and containers, not just nodes.

The king isn’t just dead. The very concept of a throne has been replaced by a highly efficient, distributed, and slightly impersonal serverless committee. And frankly, it’s about time.

October 24, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

The great AWS Tag standoff

You tried to launch an EC2 instance. Simple task. Routine, even.

Instead, AWS handed you an AccessDenied error (EC2 spells it UnauthorizedOperation) like a parking ticket you didn’t know you’d earned.

Nobody touched the IAM policy. At least, not that you can prove.

Yet here you are, staring at a red banner while your coffee goes cold and your standup meeting starts without you.

Turns out, AWS doesn’t just care what you do; it cares what you call it.

Welcome to the quiet civil war between two IAM condition keys that look alike, sound alike, and yet refuse to share the same room: ResourceTag and RequestTag.

The day my EC2 instance got grounded

It happened on a Tuesday. Not because Tuesdays are cursed, but because Tuesdays are when everyone tries to get ahead before the week collapses into chaos.

A developer on your team ran `aws ec2 run-instances` with all the right parameters and a hopeful heart. The response? A polite but firm refusal.

The policy hadn’t changed. The role hadn’t changed. The only thing that had changed was the expectation that tagging was optional.

In AWS, tags aren’t just metadata. They’re gatekeepers. And if your request doesn’t speak their language, the door stays shut.

Meet the two Tag twins nobody told you about

Think of aws:ResourceTag as the librarian who won’t let you check out a book unless it’s already labeled “Fiction” in neat, archival ink. It evaluates tags on existing resources. You’re not creating anything, you’re interacting with something that’s already there. Want to stop an EC2 instance? Fine, but only if it carries the tag `Environment = Production`. No tag? No dice.

Now meet aws:RequestTag, the nightclub bouncer who won’t let you in unless you show up wearing a wristband that says “VIP,” and you brought the wristband yourself. This condition checks the tags you’re trying to apply when you create a new resource. It’s not about what exists. It’s about what you promise to bring into the world.

One looks backward. The other looks forward. Confuse them, and your policy becomes a riddle with no answer.
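The difference is easiest to see as two functions that look at two different sets of tags. This is a deliberately simplified model (real IAM evaluation has more moving parts), and both function names are made up for the illustration.

```python
def resource_tag_matches(existing_tags, key, value):
    """aws:ResourceTag-style check: inspects tags already on the resource."""
    return existing_tags.get(key) == value


def request_tag_matches(request_tags, key, allowed_values):
    """aws:RequestTag-style check: inspects tags sent with the create request."""
    return request_tags.get(key) in allowed_values


# Stopping an existing instance: the librarian reads the label already on the book.
instance_tags = {"Environment": "Production"}
print(resource_tag_matches(instance_tags, "Environment", "Production"))  # True

# Creating a new resource: the bouncer checks the wristband you brought.
incoming_tags = {"CostCenter": "Engineering"}
print(request_tag_matches(incoming_tags, "CostCenter", {"Marketing", "Engineering"}))  # True
print(request_tag_matches({}, "CostCenter", {"Marketing", "Engineering"}))  # False
```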

Why your policy is lying to you

Here’s the uncomfortable truth: not all AWS services play nice with these conditions.

Lambda? Mostly shrugs. S3? Cooperates, but only if you ask nicely (and include `s3:PutBucketTagging`). EC2? Oh, EC2 loves a good trap.

When you run `ec2:RunInstances`, you’re not just creating an instance. You’re also (silently) creating volumes, network interfaces, and possibly a public IP. Each of those needs tagging permissions. And if your policy only allows `ec2:RunInstances` but forgets `ec2:CreateTags`? AccessDenied. Again.

And don’t assume the AWS Console saves you. Clicking “Add tags” in the UI doesn’t magically bypass IAM. If your role lacks the right conditions, those tags vanish into the void before the resource is born.

CloudTrail won’t judge you, but it will show you exactly which tags your request claimed to send. Sometimes, the truth hurts less than the guesswork.

Building a Tag policy that doesn’t backfire

Let’s build something that works in 2025, not 2018.
Start with a simple rule: all new S3 buckets must carry `CostCenter` and `Owner`. Your policy might look like this:

{
  "Effect": "Allow",
  "Action": ["s3:CreateBucket", "s3:PutBucketTagging"],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:RequestTag/CostCenter": ["Marketing", "Engineering", "Finance"]
    },
    "Null": {
      "aws:RequestTag/CostCenter": "false",
      "aws:RequestTag/Owner": "false"
    }
  }
}

Notice the `Null` condition. It’s the unsung hero that blocks requests missing the required tags entirely. `Owner` appears only there on purpose: `StringEquals` compares literally, so a value of `*` would match nothing but an owner actually named `*`.
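If the interplay between `StringEquals` and `Null` feels slippery, a tiny evaluator helps build intuition. The sketch below handles only those two operators on `aws:RequestTag/...` keys and ignores everything else IAM does, so use it to reason about the condition block, not to predict real authorization results.

```python
PREFIX = "aws:RequestTag/"


def condition_allows(condition, request_tags):
    """Evaluate StringEquals and Null blocks against the tags sent in a request."""
    for cond_key, allowed in condition.get("StringEquals", {}).items():
        tag_value = request_tags.get(cond_key[len(PREFIX):])
        allowed_values = allowed if isinstance(allowed, list) else [allowed]
        if tag_value not in allowed_values:  # literal comparison, no wildcards
            return False
    for cond_key, expected in condition.get("Null", {}).items():
        is_missing = cond_key[len(PREFIX):] not in request_tags
        if is_missing != (expected == "true"):
            return False
    return True


condition = {
    "StringEquals": {"aws:RequestTag/CostCenter": ["Marketing", "Engineering", "Finance"]},
    "Null": {"aws:RequestTag/CostCenter": "false", "aws:RequestTag/Owner": "false"},
}
print(condition_allows(condition, {"CostCenter": "Engineering", "Owner": "platform-team"}))  # True
print(condition_allows(condition, {"CostCenter": "Engineering"}))  # False: Owner is missing
print(condition_allows(condition, {"CostCenter": "Sales", "Owner": "platform-team"}))  # False
```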

For extra credit, layer this with AWS Organizations Service Control Policies (SCPs) to enforce tagging at the account level, and pair it with Tag Policies (another AWS Organizations policy type) to standardize tag keys and values across your estate. Defense in depth isn’t paranoia, it’s peace of mind.

Testing your policy without breaking production

The IAM Policy Simulator is helpful, sure. But it won’t catch the subtle dance between `RunInstances` and `CreateTags`.

Better approach: spin up a sandbox account. Write a Terraform module or a Python script that tries to create resources with and without tags. Watch what succeeds, what fails, and, most importantly, why.

Automate these tests. Run them in CI. Treat IAM policies like code, because they are.
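A table-driven harness keeps those tests readable. In the sketch below the function that actually creates the resource is injected, so the same cases can run against a sandbox account in CI or against an offline fake. `AccessDenied`, `run_cases`, and `fake_create_bucket` are hypothetical names; in a real run you would wrap your SDK call and translate its denial error.

```python
class AccessDenied(Exception):
    """Stand-in for the error your SDK raises on a denied call."""


# (case name, tags sent with the request, should the call be allowed?)
CASES = [
    ("all required tags", {"CostCenter": "Engineering", "Owner": "platform-team"}, True),
    ("missing Owner", {"CostCenter": "Engineering"}, False),
    ("no tags at all", {}, False),
]


def run_cases(create_bucket, cases=CASES):
    """Call `create_bucket(tags)` per case; return the names of cases that misbehaved."""
    failures = []
    for name, tags, should_succeed in cases:
        try:
            create_bucket(tags)
            succeeded = True
        except AccessDenied:
            succeeded = False
        if succeeded != should_succeed:
            failures.append(name)
    return failures


def fake_create_bucket(tags):
    """Offline stand-in that mimics a policy requiring CostCenter and Owner."""
    if not {"CostCenter", "Owner"} <= set(tags):
        raise AccessDenied("required tags missing")


print(run_cases(fake_create_bucket))  # an empty list means the policy behaved as expected
```

A policy that is too permissive shows up just as clearly as one that is too strict: the “should be denied” cases land in the failure list.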

Remember: in IAM, hope is not a strategy, but a good test plan is.

The human side of tagging

Tags aren’t for machines. Machines don’t care.

Tags are for the human who inherits your account at 2 a.m. during an outage. For the finance team trying to allocate cloud spend. For the auditor who needs to prove compliance without summoning a séance.

A well-designed tagging policy isn’t about control. It’s about kindness, to your future self, your teammates, and the poor soul who has to clean up after you.

So next time you write a condition with `ResourceTag` or `RequestTag`, ask yourself: am I building a fence or a welcome mat?

Because in the cloud, even silence speaks, if you’re listening to the tags.

October 22, 2025 by Fernando SRE Cloud stuff DevOps stuff SRE stuff