Look at the calendar. It is March 2026. The deadline we have been hearing about for months has officially arrived, and across the globe, engineers are clutching their coffee mugs, staring at their terminals, and waiting for their Kubernetes clusters to spontaneously combust. There is a palpable panic in the air. Tech forums are overflowing with dramatic declarations that the internet is broken, all because a specific piece of software is officially retiring.
Take a deep breath. Your servers are not going to melt. Traffic is not going to suddenly hit a brick wall. But you do need to pack up your things and move, because the building you are living in just fired its maintenance staff.
To understand how we got here and how to get out alive, we need to stop treating this retirement like a digital Greek tragedy and start looking at it like a mundane eviction notice. We are going to peel back the layers of this particular onion, dry our eyes, and figure out how to migrate our traffic routing without breaking a sweat.
The great misunderstanding of what is actually dying
Before we start packing boxes, we need to address the rampant identity confusion that has turned a routine software lifecycle event into a source of mass hysteria. A lot of online discussion has mixed up three entirely different things, treating them like a single, multi-headed beast. Let us separate them.
First, there is NGINX. This is the web server and reverse proxy that has been moving packets around the internet since you were still excited about flip phones. NGINX is fine. Nobody is retiring NGINX. It is healthy, wealthy, and continues to route a massive chunk of the global internet.
Second, there is the Ingress API. This is the Kubernetes object you use to describe your HTTP and HTTPS routing rules. It is just a set of instructions. The Ingress API is not being removed. The Kubernetes maintainers are not going to sneak into your cluster at night and delete your YAML files.
Finally, there is the Ingress NGINX controller. This is the community-maintained piece of software that reads your Ingress API instructions and configures NGINX to actually execute them. This specific controller, maintained by a group of incredibly exhausted volunteers, is the thing that is retiring. As of right now, March 2026, it is no longer receiving updates, bug fixes, or security patches.
Keep that distinction in mind and most of the confusion evaporates. The bouncer at the door of your nightclub is retiring, but the nightclub itself is still open, and the rules of who gets in remain the same. You just need to hire a new bouncer.
Why the bouncer finally walked off the job
To understand why the community Ingress NGINX controller is packing its bags, you have to look at what we forced it to do. For years, this controller has been the stoic bouncer at the entrance of your Kubernetes cluster. It stood in the rain, checked the TLS certificates, and decided which request got into the VIP pod and which one got thrown out into the alley.
But the Ingress API itself was fundamentally limited. It only understood the basics. It knew about hostnames and paths, but it had no idea how to handle anything complex, like weighted canary deployments, custom header manipulation, or rate limiting.
Because we developers are needy creatures who demand complex routing, we found a workaround. We started using annotations. We slapped sticky notes all over the bouncer’s forehead. We wrote cryptic instructions on these notes, telling the controller to inject custom configuration snippets directly into the underlying NGINX engine.
Eventually, the bouncer was walking around completely blinded by thousands of contradictory sticky notes. Maintaining this chaotic system became a nightmare for the open-source volunteers. They were basically performing amateur dental surgery in the dark, trying to patch security holes in a system entirely built out of user-injected string workarounds. The technical debt became a mountain, and the maintainers rightly decided they had had enough.
The terrifying reality of unpatched edge components
If the controller is not going to suddenly stop working today, you might be tempted to just leave it running. This is a terrible idea.
Leaving an obsolete, unmaintained Ingress controller facing the public internet is exactly like leaving the front door of your house under the strict protection of a scarecrow. The crows might stay away for the first week. But eventually, the local burglars will realize your security system is made of straw and old clothes.
Edge proxies are the absolute favorite targets for attackers. They sit right on the boundary between the wild, unfiltered internet and your soft, vulnerable application data. When a new vulnerability is discovered next month, there will be no patch for your retired Ingress NGINX controller. Attackers will scan the internet for that specific outdated signature, and they will walk right past your scarecrow. Do not be the person explaining to your boss that the company data was stolen because you did not want to write a few new YAML files.
Meet the new security firm known as Gateway API
If Ingress was a single bouncer overwhelmed by sticky notes, the new standard, known as Gateway API, is a professional security firm with distinct departments.
The core problem with Ingress was that it forced the infrastructure team and the application developers to fight over the same file. The platform engineer wanted to manage the TLS certificates, while the developer just wanted to route traffic to their new shopping cart service.
Gateway API fixes this by splitting the responsibilities into different objects. You have a GatewayClass (the type of security firm), a Gateway (the physical building entrance managed by the platform team), and an HTTPRoute (the specific room VIP lists managed by the developers). It is structured, it is typed, and most importantly, it drastically reduces the need for those horrible annotation sticky notes.
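As a rough sketch of that split, the platform team might own something like the following Gateway, while developers attach their own routes to it. The class name, namespace, and certificate reference here are placeholders, not a prescription:

```yaml
# A platform-owned Gateway; names and certificate refs are illustrative
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: shop-gateway
  namespace: infra
spec:
  gatewayClassName: example-gateway-class
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: shop-tls-cert
      allowedRoutes:
        namespaces:
          from: All   # which namespaces may attach HTTPRoutes
```

The platform team controls the listener, the port, and the TLS material. Developers never need to touch this object.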
You do not have to migrate to the Gateway API. You can simply switch to a different, commercially supported Ingress controller that still reads your old files. But if you are going to rip off the bandage and change your routing infrastructure, you might as well upgrade to the modern standard.
A before-and-after infomercial for your YAML files
Let us look at a practical example. Has this ever happened to you? Are your YAML files bloated, confusing, and causing you physical pain to read? Look at this disastrous piece of legacy Ingress configuration.
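Something like this, assuming a hypothetical shop service and an ingress-nginx canary setup:

```yaml
# Legacy Ingress relying on controller-specific annotations
# (hostnames and service names are illustrative)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: shop-canary
  annotations:
    kubernetes.io/ingress.class: "nginx"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-Env: canary";
spec:
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /cart(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: cart-v2
                port:
                  number: 80
```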
This is not a configuration. This is a hostage note. You are begging the controller to understand regex rewrites and canary deployments by passing simple strings through annotations.
Now, wipe away those tears and look at the clean, structured beauty of an HTTPRoute in the Gateway API world.
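The sketch below expresses the same intent — a weighted canary split and a prefix rewrite — with native fields. The gateway and service names are placeholders:

```yaml
# Gateway API HTTPRoute: weights and rewrites are typed, structured fields
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: shop-cart
spec:
  parentRefs:
    - name: shop-gateway          # the Gateway managed by the platform team
  hostnames:
    - shop.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /cart
      filters:
        - type: URLRewrite
          urlRewrite:
            path:
              type: ReplacePrefixMatch
              replacePrefixMatch: /
      backendRefs:
        - name: cart-v1
          port: 80
          weight: 90              # 90% of traffic to the stable version
        - name: cart-v2
          port: 80
          weight: 10              # 10% to the canary
```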
Look at that. No sticky notes. No injected server snippets. The routing weights and the URL rewrites are native, structured fields. Your linter can actually read this and tell you if you made a typo before you deploy it and take down the entire production environment.
A five-phase rehabilitation program for your cluster
You cannot just delete the old controller on a Friday afternoon and hope for the best. You need a controlled rehabilitation program for your cluster. Treat this as a serious infrastructure project.
Phase 1: The honest inventory
You need to look at yourself in the mirror and figure out exactly what you have deployed. Find every single Ingress object in your cluster. Document every bizarre annotation your developers have added over the years. You will likely find routing rules for services that were decommissioned three years ago.
Phase 2: Choosing your new guardian
Evaluate the replacements. If you want to stick with NGINX, look at the official F5 NGINX Ingress Controller. If you want something modern, look at Envoy-based solutions like Gateway API implementations from Cilium, Istio, or Contour. Deploy your choice into a sandbox environment.
Phase 3: The great translation
Start converting those sticky notes. Take your legacy Ingress objects and translate them into Gateway API routes, or at least clean them up for your new controller. This is the hardest part. You will have to decipher what nginx.ingress.kubernetes.io/configuration-snippet actually does in your specific context.
Phase 4: The side-by-side test
Run the new controller alongside the retiring community one. Use a test domain. Throw traffic at it. Watch the metrics. Ensure that your monitoring dashboards and alerting rules still work, because the new controller will likely expose different metric names and formats.
Phase 5: The DNS switch
Slowly move your DNS records from the old load balancer to the new one. Do this during business hours when everyone is awake and heavily caffeinated, not at 2 AM on a Sunday.
The final word on not panicking
If you need a message to send to your management team today, keep it simple. Tell them the community ingress-nginx controller is now officially unmaintained. Assure them the website is not down, but inform them that staying on this software is a ticking security time bomb. You need time and resources to move to a modern implementation.
The real lesson here is not that Kubernetes is unstable. It is that the software world relies heavily on the unpaid labor of open-source maintainers. When a critical project no longer has enough volunteers to hold back the tide of technical debt, responsible engineering teams do not sit around complaining on internet forums. They say thank you for the years of free service, they roll up their sleeves, and they migrate before the lack of maintenance becomes an incident report.
Most AWS migrations begin with a noble ambition and a faintly ridiculous problem.
The ambition is to modernise an estate, reduce risk, tidy the architecture, and perhaps, if fortune smiles, stop paying for three things nobody remembers creating.
The ridiculous problem is that before you can migrate anything, you must first work out what is actually there.
That sounds straightforward until you inherit an AWS account with the accumulated habits of several teams, three naming conventions, resources scattered across regions, and the sort of IAM sprawl that suggests people were granting permissions with the calm restraint of a man feeding pigeons. At that point, architecture gives way to archaeology.
I do not work for AWS, and this is not a sponsored love letter to a shiny console feature. I am an AWS and GCP architect working in the industry, and I have used AWSMap when assessing environments ahead of migration work. The reason I am writing about it is simple enough. It is one of those practical tools that solves a very real problem, and somehow remains less widely known than it deserves.
AWSMap is a third-party command-line utility that inventories AWS resources across regions and services, then lets you explore the result through HTML reports, SQL queries, and plain-English questions. In other words, it turns the early phase of a migration from endless clicking into something closer to a repeatable assessment process.
That does not make it perfect, and it certainly does not replace native AWS services. But in the awkward first hours of understanding an inherited environment, it can be remarkably useful.
The migration problem before the migration
A cloud migration plan usually looks sensible on paper. There will be discovery, analysis, target architecture, dependency mapping, sequencing, testing, cutover, and the usual brave optimism seen in project plans everywhere.
In reality, the first task is often much humbler. You are trying to answer questions that should be easy and rarely are.
What is running in this account?
Which regions are actually in use?
Are there old snapshots, orphaned EIPs, forgotten load balancers, or buckets with names that sound important enough to frighten everyone into leaving them alone?
Which workloads are genuinely active, and which are just historical luggage with a monthly invoice attached?
You can answer those questions from the AWS Management Console, of course. Given enough tabs, enough patience, and a willingness to spend part of your afternoon wandering through services you had not planned to visit, you will eventually get there. But that is not a particularly elegant way to begin a migration.
This is where AWSMap becomes handy. Instead of treating discovery as a long guided tour of the console, it treats it as a data collection exercise.
What AWSMap does well
At its core, AWSMap scans an AWS environment and produces an inventory of resources. The current public package description on PyPI claims coverage of more than 150 AWS services, while the v1.5.0 release notes mention 140-plus, a good reminder that the coverage evolves. The important point is not the exact number on a given Tuesday morning, but that it covers a broad enough slice of the estate to be genuinely useful in early assessments.
What makes the tool more interesting is what it does after the scan.
It can generate a standalone HTML report, store results locally in SQLite, let you query the inventory with SQL, run named audit queries, and translate plain-English prompts into database queries without sending your infrastructure metadata off to an LLM service. The release notes for v1.5.0 describe local SQLite storage, raw SQL querying, named queries, typo-tolerant natural language questions, tag filtering, scoped account views, and browsable examples.
That combination matters because migrations are rarely single, clean events. They are usually a series of discoveries, corrections, and mildly awkward conversations. Having the inventory preserved locally means the account does not need to be rediscovered from scratch every time someone asks a new question two days later.
The report you can actually hand to people
One of the surprisingly practical parts of AWSMap is the report output.
The tool can generate a self-contained HTML report that opens locally in a browser. That sounds almost suspiciously modest, but it is useful precisely because it is modest. You can attach it to a ticket, share it with a teammate, or open it during a workshop without building a whole reporting pipeline first. The v1.5.0 release notes describe the report as a single, standalone HTML file with filtering, search, charts, and export options.
That makes it suitable for the sort of migration meeting where someone says, “Can we quickly check whether eu-west-1 is really the only active region?” and you would rather not spend the next ten minutes performing a slow ritual through five console pages.
A simple scan might look like this:
awsmap -p client-prod
If you want to narrow the blast radius a bit and focus on a few services that often matter early in migration discovery, you could do this:
awsmap -p client-prod -s ec2,rds,elb,lambda,iam
And if the account is a thicket of shared infrastructure, the tag filtering described in the release notes can help reduce the noise.
That kind of filtering is helpful when the account contains equal parts business workload and historical clutter, which is to say, most real accounts.
Why SQLite is more important than it sounds
The feature I like most is not the report. It is the local SQLite database.
Every scan can be stored locally, so the inventory becomes queryable over time instead of vanishing the moment the terminal output scrolls away. The default local database path is ‘~/.awsmap/inventory.db’, and the scan results from different runs can accumulate there for later analysis.
This changes the character of the tool quite a bit. It stops being a disposable scanner and becomes something closer to a field notebook.
Suppose you scan a client account today, then return to the same work three days later, after someone mentions an old DR region nobody had documented. Without persistence, you start from scratch. With persistence, you ask the database.
That is a much more civilised way to work.
A query for the busiest services in the collected inventory might look like this:
awsmap query "SELECT service, COUNT(*) AS total
FROM resources
GROUP BY service
ORDER BY total DESC
LIMIT 12"
And a more migration-focused query might be something like:
awsmap query "SELECT account, region, service, name
FROM resources
WHERE service IN ('ec2', 'rds', 'elb', 'lambda')
ORDER BY account, region, service, name"
Neither query is glamorous, but migrations are not built on glamour. They are built on being able to answer dull, important questions reliably.
Security and hygiene checks without the acrobatics
AWSMap also includes named queries for common audit scenarios, which is useful for two reasons.
First, most people do not wake up eager to write SQL joins against IAM relationships. Second, migration assessments almost always drift into security checks sooner or later.
The public release notes describe named queries for scenarios such as admin users, public S3 buckets, unencrypted EBS volumes, unused Elastic IPs, and secrets without rotation.
That means you can move from “What exists?” to “What looks questionable?” without much ceremony.
Those are not, strictly speaking, migration-only questions. But they are precisely the kind of questions that surface during migration planning, especially when the destination design is meant to improve governance rather than merely relocate the furniture.
Asking questions in plain English
One of the nicer additions in the newer version is the ability to ask plain-English questions.
That is the sort of feature that normally causes one to brace for disappointment. But here the approach is intentionally local and deterministic. This functionality is a built-in parser rather than an LLM-based service, which means no API keys, no network calls to an external model, and no need to ship resource metadata somewhere mysterious.
That matters in enterprise environments where the phrase “just send the metadata to a third-party AI service” tends to receive the warm reception usually reserved for wasps.
Some examples:
awsmap ask show me lambda functions by region
awsmap ask list databases older than 180 days
awsmap ask find ec2 instances without Owner tag
Even when the exact wording varies, the basic idea is appealing. Team members who do not want to write SQL can still interrogate the inventory. That lowers the barrier for using the tool during workshops, handovers, and review sessions.
Where AWSMap fits next to AWS native services
This is the part worth stating clearly.
AWSMap is useful, but it is not a replacement for AWS Resource Explorer, AWS Config, or every other native mechanism you might use for discovery, governance, and inventory.
AWS Resource Explorer can search across supported resource types and, since 2024, can also discover all tagged AWS resources using the ‘tag:all’ operator. AWS documentation also notes an important limitation for IAM tags in Resource Explorer search.
AWS Config, meanwhile, continues to expand the resource types it can record, assess, and aggregate. AWS has announced multiple additions in 2025 and 2026 alone, which underlines that the native inventory and compliance story is still moving quickly.
So why use AWSMap at all?
Because its strengths are slightly different.
It is local.
It is quick to run.
It gives you a portable HTML report.
It stores results in SQLite for later interrogation.
It lets you query the inventory directly without setting up a broader governance platform first.
That makes it particularly handy in the early assessment phase, in consultancy-style discovery work, or in those awkward inherited environments where you need a fast baseline before deciding what the more permanent controls should be.
The weak points worth admitting
No serious article about a tool should pretend the tool has descended from heaven in perfect condition, so here are the caveats.
First, coverage breadth is not the same thing as universal depth. A tool can support a large number of services and still provide uneven detail between them. That is true of almost every inventory tool ever made.
Second, the quality of the result still depends on the credentials and permissions you use. If your access is partial, your inventory will be partial, and no amount of cheerful HTML will alter that fact.
Third, local storage is convenient, but it also means you should be disciplined about how scan outputs are handled on your machine, especially if you are working with client environments. Convenience and hygiene should remain on speaking terms.
Fourth, for organisation-wide governance, compliance history, managed rules, and native integrations, AWS services such as Config still have an obvious place. AWSMap is best seen as a sharp assessment tool, not a universal control plane.
That is not a criticism so much as a matter of proper expectations.
A practical workflow for migration discovery
If I were using AWSMap at the start of a migration assessment, the workflow would be something like this.
First, run a broad scan of the account or profile you care about.
After that, ask targeted questions in either SQL or plain English.
awsmap ask list load balancers by region
awsmap ask show databases with no backup tag
awsmap query "SELECT region, COUNT(*) AS total
FROM resources
WHERE service='ec2'
GROUP BY region
ORDER BY total DESC"
And finally, keep the HTML report and local inventory as a baseline for later design discussions.
That is where the tool earns its keep. It gives you a reasonably fast, reasonably structured picture of an estate before the migration plan turns into a debate based on memory, folklore, and screenshots in old slide decks.
When the guessing stops
There is a particular kind of misery in cloud work that comes from being asked to improve an environment before anyone has properly described it.
Tools do not eliminate that misery, but some of them reduce it to a more manageable size.
AWSMap is one of those.
It is not the only way to inventory AWS resources. It is not a substitute for native governance services. It is not magic. But it is practical, fast to understand, and surprisingly helpful when the first job in a migration is simply to stop guessing.
That alone makes it worth knowing about.
And in cloud migrations, a tool that helps replace guessing with evidence is already doing better than half the room.
Run ‘kubectl apply -f pod.yaml’ and Kubernetes has the good manners to make it look simple. You hand over a neat little YAML file, press Enter, and for a brief moment, it feels as if you have politely asked the cluster to start a container.
That is not what happened.
What you actually did was file a request with a distributed bureaucracy. Several components now need to validate your paperwork, record your wishes for posterity, decide where your Pod should live, prepare networking and storage, ask a container runtime to do the heavy lifting, and keep watching the whole arrangement in case it misbehaves. Kubernetes is extremely good at hiding all this. It has the same talent as a hotel lobby. Everything looks calm and polished, while somewhere behind the walls, people are hauling luggage, changing sheets, arguing about room allocation, and trying not to let anything catch fire.
This article follows that process from the moment you submit a manifest to the moment the Pod disappears again. To keep the story tidy, I will use a standalone Pod. In real production environments, Pods are usually created by higher-level controllers such as Deployments, Jobs, or StatefulSets. The Pod is still the thing that ultimately gets scheduled and runs, so it remains the most useful unit to study when you want to understand what Kubernetes is really doing.
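Here is the sort of minimal manifest I will assume for the rest of the article. The image is just a placeholder; any long-running container will do:

```yaml
# pod.yaml — a minimal standalone Pod
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: web
      image: nginx:1.27
      ports:
        - containerPort: 80
```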
When you apply this file, the request goes to the Kubernetes API server. That is the front door of the cluster. Nothing important happens without passing through it first.
The API server does more than nod politely and stamp the form. It checks authentication and authorization, validates the object schema, and sends the request through admission control. Admission controllers can modify or reject the request based on policies, quotas, defaults, or security rules. Only when that process is complete does the API server persist the desired state in etcd, the key-value store Kubernetes uses as its source of truth.
At that point, the Pod officially exists as an object in the cluster.
That does not mean it is running.
It means Kubernetes has written down your intentions in a very serious ledger and is now obliged to make reality catch up.
The scheduler looks for a home
Once the Pod exists but has no node assigned, the scheduler takes interest. Its job is not to run the Pod. Its job is to decide where the Pod should run.
This is less mystical than it sounds and more like trying to seat one extra party in a crowded restaurant without blocking the fire exit.
The scheduler first filters out nodes that cannot host the Pod. A node may be ruled out because it lacks CPU or memory, does not match nodeSelector labels, has taints the Pod does not tolerate, violates affinity or anti-affinity rules, or fails other placement constraints.
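Several of those constraints come straight from the Pod spec itself. A sketch, with illustrative labels, taints, and resource figures:

```yaml
# Spec fields the scheduler filters on (labels and values are illustrative)
spec:
  nodeSelector:
    disktype: ssd          # only nodes carrying this label survive filtering
  tolerations:
    - key: workload        # allows scheduling onto nodes tainted workload=batch
      operator: Equal
      value: batch
      effect: NoSchedule
  containers:
    - name: web
      image: nginx:1.27
      resources:
        requests:
          cpu: 250m        # nodes without this much free CPU are ruled out
          memory: 256Mi
```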
From the nodes that survive this round of rejection, the scheduler scores the viable candidates and picks one. Different scoring plugins influence the choice, including resource balance and topology preferences. Kubernetes is not asking, “Which node feels lucky today?” It is performing a structured selection process, even if the result arrives so quickly that it looks like instinct.
When the decision is made, the scheduler updates the Pod object with the chosen node.
That is all.
It does not pull images, start containers, mount storage, or wave a wand. It points at a node and says, in effect, “This one. Good luck to everyone involved.”
The kubelet picks up the job
Each node runs an agent called the kubelet. The kubelet watches the API server and notices when a Pod has been assigned to its node.
This is where the abstract promise turns into physical work.
The kubelet reads the Pod specification and starts coordinating with the local container runtime, such as ‘containerd’, to make the Pod real. If there are volumes to mount, secrets to project, environment variables to inject, or images to fetch, the kubelet is the one making sure those steps happen in the correct order.
The kubelet is not glamorous. It is the floor manager. It does not write the policies, it does not choose the table, and it does not get invited to keynote conferences. It simply has to make the plan work on an actual machine with actual limits. That makes it one of the most important components in the whole affair.
The sandbox appears before the containers do
Before your application container starts, Kubernetes prepares a Pod sandbox.
This is one of those wonderfully unglamorous details that turns out to matter a great deal. A Pod is not just “a container.” It is a small execution environment that may contain one or more containers sharing networking and, often, storage.
To build that environment, several things need to happen.
First, the container runtime may need to pull the image from a registry if it is not already cached on the node. This step alone can keep a Pod waiting for longer than people expect, especially when the image is huge, the registry is slow, or somebody has built an image as if hard disk space were a personal insult.
Second, networking must be prepared. Kubernetes relies on a CNI plugin to create the Pod’s network namespace and assign an IP address. All containers in the same Pod share that network namespace, which is why they can communicate over ‘localhost’. This is convenient and occasionally dangerous, much like sharing a flat with someone who assumes every shelf in the fridge belongs to them.
Third, volumes are mounted. If the Pod references ‘emptyDir’, ‘configMap’, ‘secret’, or persistent volumes, those mounts have to be prepared before the containers can use them.
There is also a small infrastructure container, commonly called the ‘pause’ container, whose job is to hold the Pod’s shared namespaces in place. It is not famous, but it is essential. The ‘pause’ container is a bit like the quiet relative at a family gathering who does no storytelling, makes no dramatic entrance, and is nevertheless the reason the chairs are still standing.
Only after this setup is complete can the application containers begin.
Watching the lifecycle from the outside
You can observe part of this process with a few simple commands:
kubectl apply -f pod.yaml
kubectl get pod demo-pod -w
kubectl describe pod demo-pod
The watch output often gives the first visible clue that the cluster is busy doing considerably more than the neatness of YAML would suggest.
A Pod typically moves through a small set of phases:
‘Pending’ means the Pod has been accepted but is still waiting for scheduling, image pulls, volume setup, or other preparation.
‘Running’ means the Pod has been bound to a node and at least one container is running or starting.
‘Succeeded’ means all containers completed successfully and will not be restarted.
‘Failed’ means all containers finished, but at least one exited with an error.
‘Unknown’ means the control plane cannot reliably determine the Pod state, usually because communication with the node has gone sideways.
These phases are useful, but they do not tell the whole story. One of the more common sources of confusion is ‘CrashLoopBackOff’. That is not a Pod phase. It is a container state pattern shown in ‘kubectl get pods’ output when a container keeps crashing, and Kubernetes backs off before trying again.
This matters because people often stare at ‘Running’ and assume everything is fine. Kubernetes, meanwhile, is quietly muttering, “Technically yes, but only in the way a car is technically functional while smoke comes out of the bonnet.”
Running is not the same as ready
Another detail worth understanding is that a Pod can be running without being ready to receive traffic.
This distinction matters in real systems because applications often need a few moments to warm up, load configuration, establish database connections, or otherwise stop acting like startled wildlife.
A readiness probe tells Kubernetes when the container is actually prepared to serve requests. Until that probe succeeds, the Pod should not be considered a healthy backend for a Service.
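A minimal readiness probe in the container spec might look like this; the path and timings are illustrative, not recommendations:

```yaml
containers:
  - name: web
    image: nginx:1.27
    readinessProbe:
      httpGet:
        path: /healthz     # assumed health endpoint in the application
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
```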
With this in place, the container may be running, but Kubernetes will wait before routing traffic to it. This is one of those details that prevents very expensive forms of optimism.
Deletion is a polite process until it is not
Now, let us look at the other end of the Pod’s life.
When you run the following command, the Pod does not vanish in a puff of administrative smoke:
kubectl delete pod demo-pod
Instead, the API server marks the Pod for deletion and sets a grace period. The Pod enters a terminating state. The kubelet on the node sees that instruction and begins shutdown.
The normal sequence looks like this:
Kubernetes may first stop sending new traffic to the Pod if it is behind a Service and no longer considered ready.
A ‘preStop’ hook runs if one has been defined.
The kubelet asks the runtime to send ‘SIGTERM’ to the container’s main process.
Kubernetes waits for the grace period, which is ‘30’ seconds by default and controlled by ‘terminationGracePeriodSeconds’.
If the process still refuses to exit, Kubernetes sends ‘SIGKILL’ and ends the discussion.
That grace period exists for good reasons. Applications may need time to flush logs, finish requests, close connections, write buffers, or otherwise clean up after themselves. Production systems tend to appreciate this courtesy.
Here is a small example of a graceful shutdown configuration:
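Something along these lines, where the sleep in the preStop hook is just an illustrative buffer to let in-flight connections drain:

```yaml
spec:
  terminationGracePeriodSeconds: 45   # extend the default 30 seconds
  containers:
    - name: web
      image: nginx:1.27
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 10"]  # drain before SIGTERM handling
```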
Once the containers stop, Kubernetes cleans up the sandbox, releases network resources, unmounts volumes as needed, and frees the node’s CPU and memory.
If the Pod was managed by a Deployment, a replacement Pod will usually be created to maintain the desired replica count. This is an important point. In Kubernetes, individual Pods are disposable. The desired state is what matters. Pods come and go. The controller remains stubborn.
Why this matters in the real world
Understanding this lifecycle is not trivia for people who enjoy suffering through conference diagrams. It is practical.
If a Pod is stuck in ‘Pending’, you need to know whether the issue is scheduling, image pulling, volume attachment, or policy rejection.
If a container is ‘CrashLoopBackOff’, you need to know that the Pod object exists, has probably been scheduled, and that the failure is happening later in the chain.
If traffic is not reaching the application, you need to remember that ‘Running’ and ‘Ready’ are not the same thing.
If shutdowns are ugly, logs are truncated, or users get errors during rollout, you need to inspect readiness probes, ‘preStop’ hooks, and grace periods rather than blaming Kubernetes in the abstract, which it will survive, but your incident report may not.
This is also where commands like these become genuinely useful:
kubectl get pod demo-pod -o wide
kubectl describe pod demo-pod
kubectl logs demo-pod
kubectl get events --sort-by=.metadata.creationTimestamp
Those commands let you inspect node placement, container events, log output, and recent cluster activity. Most Kubernetes troubleshooting starts by figuring out which stage of the Pod lifecycle has gone wrong, then narrowing the problem from there.
The quiet machinery behind a simple command
The next time you type ‘kubectl apply -f pod.yaml’, it is worth remembering that you are not merely starting a container. You are triggering a chain of decisions and side effects across the control plane and a worker node.
The API server validates and records the request. The scheduler finds a suitable home. The kubelet coordinates the local work. The runtime pulls images and starts containers. The CNI plugin wires up networking. Volumes are mounted. Probes decide whether the Pod is truly ready. And when the time comes, Kubernetes tears the whole thing down with the brisk professionalism of hotel staff clearing a room before the next guest arrives.
Which is impressive, really.
Particularly when you consider that from your side of the terminal, it still looks as though you only asked for one modest little Pod.
Let us talk about healthcare data pipelines. Running high volume payer processing pipelines is a lot like hosting a mandatory potluck dinner for a group of deeply eccentric people with severe and conflicting dietary restrictions. Each payer behaves with maddening uniqueness. One payer bursts through the door, demanding an entire roasted pig, which they intend to consume in three minutes flat. This requires massive, short-lived computational horsepower. Another payer arrives with a single boiled pea and proceeds to chew it methodically for the next five hours, requiring a small but agonizingly persistent trickle of processing power.
On top of this culinary nightmare, there are strict rules of etiquette. You absolutely must digest the member data before you even look at the claims data. Eligibility files must be validated before anyone is allowed to touch the dessert tray of downstream jobs. The workload is not just heavy. It is incredibly uneven and delightfully complicated.
Buying folding chairs for a banquet
On paper, the managed auto scaling mechanisms from Amazon Web Services should fix this problem. They are designed to look at a growing pile of work and automatically hire more help. But applying generic auto scaling to healthcare pipelines is like a restaurant manager seeing a line out the door and solving the problem by buying fifty identical plastic folding chairs.
The manager does not care that one guest needs a high chair and another requires a reinforced steel bench. Auto scaling reacts to the generic brute force of the system load. It cannot look at a specific payer and tailor the compute shape to fit their weird eating habits. It cannot enforce the strict social hierarchy of job priorities. It scales the infrastructure, but it completely fails to scale the intention.
This is why we abandoned the generic approach and built our own dynamic EC2 provisioning system. Instead of maintaining a herd of generic servers waiting around for something to do, we create bespoke servers on demand based on a central configuration table.
The ruthless nightclub bouncer of job scheduling
Let us look at how this actually works regarding prioritization. Our system relies on that central configuration table to dictate order. Think of this table as the guest list at an obnoxiously exclusive nightclub. Our scheduler acts as the ruthless bouncer.
When jobs arrive at the queue, the bouncer checks the list. Member data? Right this way to the VIP lounge, sir. Claims data? Stand on the curb behind the velvet rope until the members are comfortably seated. Generic auto scaling has no native concept of this social hierarchy. It just sees a mob outside the club and opens the front doors wide. Our dynamic approach gives us perfect, tyrannical control over who gets processed first, ensuring our pipelines execute in a beautifully deterministic way. We spin up exactly the compute we specify, exactly when we want it.
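The bouncer's core logic can be sketched with an ordinary priority queue; the priority table here is a hypothetical stand-in for the central configuration table:

```python
import heapq

# Hypothetical priority table: lower number means processed first.
PRIORITY = {"member": 0, "eligibility": 1, "claims": 2}

class JobQueue:
    """The ruthless bouncer: members first, claims wait on the curb."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker keeps FIFO order within a priority level

    def push(self, job_type, payload):
        heapq.heappush(self._heap, (PRIORITY[job_type], self._seq, job_type, payload))
        self._seq += 1

    def pop(self):
        _, _, job_type, payload = heapq.heappop(self._heap)
        return job_type, payload
```

Pushing a claims job before a member job changes nothing: the member job still comes off the queue first.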
Leaving your car running in the garage
Then there is the financial absurdity of warm pools. Standard auto scaling often relies on keeping a baseline of idle instances warm and ready, just in case a payer decides to drop a massive batch of files at two in the morning.
Keeping idle servers running is the technological equivalent of leaving your car engine idling in the closed garage all night just in case you get a sudden craving for a carton of milk at dawn. It is expensive, it is wasteful, and it makes you look a bit foolish when the AWS bill arrives.
Our dynamic system operates with a baseline of zero. We experience one hundred percent burst efficiency because we only pay for the exact compute we use, precisely when we use it. Cost savings happen naturally when you refuse to pay for things that are sitting around doing nothing.
A delightfully brutal server lifecycle
The operational model we ended up with is almost comically simple compared to traditional methods. A generic scaling group requires complex scaling policies, tricky cooldown periods, and endless tweaking of CloudWatch alarms. It is like managing a highly sensitive, moody teenager.
Our dynamic EC2 model is wonderfully ruthless. We create the instance and inject it with a single, highly specific purpose via a startup script. The instance wakes up, processes the healthcare data with absolute precision, and then politely self destructs so it stops billing us. They are the mayflies of the cloud computing world. They live just long enough to do their job, and then they vanish. There are no orphaned instances wandering the cloud.
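A sketch of that launch step with boto3, with hypothetical AMI, script path, and job names; setting the instance's shutdown behavior to terminate is what makes the self-destruct work:

```python
def build_worker_request(ami_id, instance_type, job_id):
    # Hypothetical one-shot worker: the startup script runs exactly one job,
    # then powers off. With shutdown behavior set to "terminate", powering
    # off destroys the instance and stops the billing.
    user_data = f"""#!/bin/bash
/opt/pipeline/run-job.sh {job_id}
shutdown -h now
"""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceInitiatedShutdownBehavior": "terminate",
        "UserData": user_data,
    }

def launch_worker(ami_id, instance_type, job_id):
    import boto3  # imported here so the builder above stays testable offline
    ec2 = boto3.client("ec2")
    return ec2.run_instances(**build_worker_request(ami_id, instance_type, job_id))
```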
This dynamic provisioning model has fundamentally altered how we digest payer workloads. We have somehow achieved a weird but perfect holy grail of cloud architecture. We get the granular flexibility of serverless functions, the raw, unadulterated horsepower of dedicated EC2 instances, and the stingy cost efficiency of a pure event-driven design.
If your processing jobs vary wildly from payer to payer, and if you care deeply about enforcing priorities without burning money on idle metal, building a disposable compute army might be exactly what your architecture is missing. We said goodbye to our idle servers, and honestly, we do not miss them at all.
The shortcuts I use on every project now, after learning that scale mostly changes the bill, not the mistakes.
Let me tell you how this started. I used to measure my productivity by how many AWS services I could haphazardly stitch together in a single afternoon. Big mistake.
One night, I was deploying what should have been a boring, routine feature. Nothing fancy. Just basic plumbing. Six hours later, I was still babysitting the deployment, clicking through the AWS console like a caffeinated lab rat, re-running scripts, and manually patching up tiny human errors.
That is when the epiphany hit me like a rogue server rack. I was not slow because AWS is a labyrinth of complexity. I was slow because I was doing things manually that AWS already knows how to do in its sleep.
The patterns below did not come from sanitized tutorials. They were forged in the fires of shipping systems under immense pressure and desperately wanting my weekends back.
Event-driven everything and absolutely no polling
If you are polling, you are essentially paying Jeff Bezos for the privilege of wasting your own time. Polling is the digital equivalent of sitting in the backseat of a car and constantly asking, “Are we there yet?” every five seconds.
AWS is an event machine. Treat it like one. Instead of writing cron jobs that anxiously ask the database if something changed, just let AWS tap you on the shoulder when it actually happens.
Where this shines:
File uploads
Database updates
Infrastructure state changes
Cross-account automation
Example of reacting to an S3 upload instantly:
def lambda_handler(event, context):
    for record in event['Records']:
        bucket_name = record['s3']['bucket']['name']
        object_key = record['s3']['object']['key']

        # Stop asking if the file is there. AWS just handed it to you.
        trigger_completely_automated_workflow(bucket_name, object_key)
No loops. No waiting. Just action.
Pro tip: Event-driven systems fail less frequently simply because they do less work. They are the lazy geniuses of the cloud world.
Immutable deployments or nothing
SSH is not a deployment strategy. It is a desperate cry for help.
If your deployment plan involves SSH, SCP, or uttering the cursed phrase “just this one quick change in production”, you do not have a system. You have a fragile ecosystem built on hope and duct tape. I stopped “fixing” servers years ago. Now, I just murder them and replace them with fresh clones.
The pattern is brutally simple:
Build once
Deploy new
Destroy old
Example of launching a new EC2 version programmatically:
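A sketch of the pattern with boto3, with hypothetical AMI and tag values; the request-builder is split out so the actual AWS calls stay at the edge:

```python
def build_launch_request(ami_id, version, instance_type="t3.micro"):
    # Immutable deploy: every release is a brand-new instance tagged
    # with its version, never a patched old one.
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "TagSpecifications": [{
            "ResourceType": "instance",
            "Tags": [{"Key": "Version", "Value": version}],
        }],
    }

def deploy_new_version(ami_id, version, old_instance_ids):
    import boto3  # imported here so the builder above stays testable offline
    ec2 = boto3.client("ec2")
    # Build once, deploy new...
    new = ec2.run_instances(**build_launch_request(ami_id, version))
    # ...then destroy old. No SSH, no in-place patching.
    if old_instance_ids:
        ec2.terminate_instances(InstanceIds=old_instance_ids)
    return new["Instances"][0]["InstanceId"]
```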
It is like doing open-heart surgery. Instead of trying to fix the heart while the patient is running a marathon, just build a new patient with a healthy heart and disintegrate the old one. When something breaks, I do not debug the server. I debug the build process. That is where the real parasites live.
Infrastructure as code for the forgettable things
Most teams only use IaC for the big, glamorous stuff. VPCs. Kubernetes clusters. Massive databases.
This is completely backwards. It is like wearing a bespoke tuxedo but forgetting your underwear. The small, forgettable resources are the ones that will inevitably bite you when you least expect it.
If it matters in production, it lives in code. No exceptions.
Let Step Functions own the flow
Early in my career, I crammed all my business logic into Lambdas. Retries, branching, timeouts, bizarre edge cases. I treated them like a digital junk drawer.
I do not do that anymore. Lambdas should be as dumb and fast as a golden retriever chasing a tennis ball.
The new rule: One Lambda equals one job. If you need a workflow, use Step Functions. They are the micromanaging middle managers your architecture desperately needs.
This separation makes debugging highly visual, makes retries explicit, and makes onboarding the new guy infinitely less painful. Your future self will thank you.
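A minimal sketch of what such a state machine might look like, with hypothetical Lambda ARNs; note that the retries live in the workflow definition, not buried inside a Lambda:

```json
{
  "Comment": "Hypothetical two-step workflow: each Lambda does exactly one job",
  "StartAt": "FetchOrder",
  "States": {
    "FetchOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fetch-order",
      "Retry": [
        {"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 2, "MaxAttempts": 3}
      ],
      "Next": "ChargeCard"
    },
    "ChargeCard": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-card",
      "End": true
    }
  }
}
```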
Kill cron jobs and use managed schedulers
Cron jobs are perfectly fine until they suddenly are not.
They are the ghosts of your infrastructure. They are completely invisible until they fail, and when they do fail, they die in absolute silence like a ninja with a sudden heart condition. AWS gives you managed scheduling. Just use it.
Why this is fundamentally faster:
Central visibility
Built-in retries
IAM-native permissions
Example of creating a scheduled rule:
import boto3

eventbridge = boto3.client("events")

eventbridge.put_rule(
    Name="TriggerNightlyChaos",
    ScheduleExpression="cron(0 2 * * ? *)",
    State="ENABLED",
    Description="Wakes up the system when nobody is looking"
)
Automation should be highly observable. Cron jobs are just waiting in the dark to ruin your Tuesday.
Bake cost controls into automation
Speed without cost awareness is just a highly efficient way to bankrupt your employer. The fastest teams I have ever worked with were not just shipping fast. They were failing cheaply.
What I automate now with the ruthlessness of a debt collector:
Budget alerts
Resource TTLs
Auto-shutdowns for non-production environments
Example of tagging resources with an expiration date:
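A sketch of the tagging step with boto3; the tag keys and the nightly-reaper convention are assumptions, not an AWS standard:

```python
from datetime import datetime, timedelta, timezone

def expiration_tags(ttl_days, owner):
    # Every resource gets an owner and a delete-after date; a hypothetical
    # nightly reaper job can then terminate anything past its ExpiresOn date.
    expires = datetime.now(timezone.utc) + timedelta(days=ttl_days)
    return [
        {"Key": "Owner", "Value": owner},
        {"Key": "ExpiresOn", "Value": expires.strftime("%Y-%m-%d")},
    ]

def tag_instance(instance_id, ttl_days, owner):
    import boto3  # imported here so the helper above stays testable offline
    ec2 = boto3.client("ec2")
    ec2.create_tags(Resources=[instance_id], Tags=expiration_tags(ttl_days, owner))
```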
Leaving resources without an owner or an expiration date is like leaving the stove on, except this stove bills you by the millisecond. Anything without a TTL is just technical debt waiting to invoice you.
A quote I live by: “Automation does not cut costs by magic. It cuts costs by quietly preventing the expensive little mistakes humans call normal.”
The death of the cloud hero
These patterns did not make me faster because they are particularly clever. They made me faster because they completely eliminated the need to make decisions.
Less clicking. Less remembering. Absolutely zero heroics.
If you want to move ten times faster on AWS, stop asking what to build next. Once automation is in charge, real speed usually arrives as work you no longer have to remember.
I was staring at our AWS bill at two in the morning, nursing my third cup of coffee, when I realized something that should have been obvious months earlier. We were paying more to distribute our traffic than to process it. Our Application Load Balancer, that innocent-looking service that simply forwards packets from point A to point B, was consuming $3,900 every month. That is $46,800 a year. For a traffic cop. A very expensive traffic cop that could not even handle our peak loads without breaking into a sweat.
The particularly galling part was that we had accepted this as normal. Everyone uses AWS load balancers, right? They are the standard, the default, the path of least resistance. It is like paying rent for an apartment you only use to store your shoes. Technically functional, financially absurd.
So we did what any reasonable engineering team would do at that hour. We started googling. And that is how we discovered IPVS, a technology so old that half our engineering team had not been born when it was first released. IPVS stands for IP Virtual Server, which sounds like something from a 1990s hacker movie, and honestly, that is not far off. It was written in 1998 by a fellow named Wensong Zhang, who presumably had no idea that twenty-eight years later, a group of bleary-eyed engineers would be using his code to save more than forty-six thousand dollars a year.
The expensive traffic cop
To understand why we were so eager to jettison our load balancer, you need to understand how AWS pricing works. Or rather, how it accumulates like barnacles on the hull of a ship, slowly dragging you down until you wonder why you are moving so slowly.
An Application Load Balancer costs $0.0225 per hour. That sounds reasonable, about sixteen dollars a month. But then there are LCUs, or Load Balancer Capacity Units, which charge you for every new connection, every rule evaluation, every processed byte. It is like buying a car and then discovering you have to pay extra every time you turn the steering wheel.
In practice, this meant our ALB was consuming fifteen to twenty percent of our entire infrastructure budget. Not for compute, not for storage, not for anything that actually creates value. Just for forwarding packets. It was the technological equivalent of paying a butler to hand you the remote control.
The ALB also had some architectural quirks that made us scratch our heads. It terminated TLS, which sounds helpful until you realize we were already terminating TLS at our ingress. So we were decrypting traffic, then re-encrypting it, then decrypting it again. It was like putting on a coat to go outside, then taking it off and putting on another identical coat, then finally going outside. The security theater was strong with this one.
A trip to 1999
I should confess that when we started this project, I had no idea what IPVS even stood for. I had heard it mentioned in passing by a colleague who used to work at a large Chinese tech company, where apparently everyone uses it. He described it with the kind of reverence usually reserved for vintage wine or classic cars. “It just works,” he said, which in engineering terms is the highest possible praise.
IPVS, I learned, lives inside the Linux kernel itself. Not in a container, not in a microservice, not in some cloud-managed abstraction. In the actual kernel. This means when a packet arrives at your server, the kernel looks at it, consults its internal routing table, and forwards it directly. No context switches, no user-space handoffs, no “let me ask my manager” delays. Just pure, elegant packet forwarding.
The first time I saw it in action, I felt something I had not felt in years of cloud engineering. I felt wonder. Here was code written when Bill Clinton was president, when the iPod was still three years away, when people used modems to connect to the internet. And it was outperforming a service that AWS charges thousands of dollars for. It was like discovering that your grandfather’s pocket watch keeps better time than your smartwatch.
How the magic happens
Our setup is almost embarrassingly simple. We run a DaemonSet called ipvs-router on dedicated, tiny nodes in each Availability Zone. Each pod does four things, and it does them with the kind of efficiency that makes you question everything else in your stack.
First, it claims an Elastic IP using kube-vip, a CNCF project that lets Kubernetes pods take ownership of spare EIPs. No AWS load balancer required. The pod simply announces “this IP is mine now”, and the network obliges. It feels almost rude how straightforward it is.
Second, it programs IPVS in the kernel. IPVS builds an L4 load-balancing table that forwards packets at line rate. No proxies, no user-space hops. The kernel becomes your load balancer, which is a bit like discovering your car engine can also make excellent toast. Unexpected, but delightful.
Third, it syncs with Kubernetes endpoints. A lightweight controller watches for new pods, and when one appears, IPVS adds it to the rotation in less than a hundred milliseconds. Scaling feels instantaneous because, well, it basically is.
But the real trick is the fourth thing. We use something called Direct Server Return, or DSR. Here is how it works. When a request comes in, it travels from the client to IPVS to the pod. But the response goes directly from the pod back to the client, bypassing the load balancer entirely. The load balancer never sees response traffic. That is how we get ten times the throughput. It is like having a traffic cop who only directs cars into the city but does not care how they leave.
The code that makes it work
Here is what our DaemonSet looks like. I have simplified it slightly for readability, but this is essentially what runs in our production cluster:
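A sketch of its essential shape, with hypothetical image and node label names:

```yaml
# Hypothetical ipvs-router DaemonSet, trimmed to the parts that matter
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ipvs-router
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: ipvs-router
  template:
    metadata:
      labels:
        app: ipvs-router
    spec:
      hostNetwork: true            # share the host's network stack
      nodeSelector:
        role: ipvs-router          # pin to the dedicated tiny nodes
      containers:
        - name: router
          image: example.com/ipvs-router:latest   # hypothetical image
          securityContext:
            capabilities:
              add: ["NET_ADMIN"]   # needed to program kernel IPVS tables
```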
The key here is hostNetwork: true, which gives the pod direct access to the host’s network stack. Combined with the NET_ADMIN capability, this allows IPVS to manipulate the kernel’s routing tables directly. It requires a certain level of trust in your containers, but then again, so does running a load balancer in the first place.
We also use a custom controller to sync Kubernetes endpoints with IPVS. Here is the core logic:
# Simplified endpoint sync logic
import subprocess

def sync_endpoints(service_name, namespace):
    # Get current endpoints from Kubernetes
    endpoints = k8s_client.list_namespaced_endpoints(
        namespace=namespace,
        field_selector=f"metadata.name={service_name}"
    )

    # Extract pod IPs
    pod_ips = []
    for subset in endpoints.items[0].subsets:
        for address in subset.addresses:
            pod_ips.append(address.ip)

    # Build IPVS rules using ipvsadm
    for ip in pod_ips:
        subprocess.run([
            "ipvsadm", "-a", "-t",
            f"{VIP}:443", "-r", f"{ip}:443", "-g"
        ])  # the -g flag enables Direct Server Return (DSR)

    return len(pod_ips)
The numbers that matter
Let me tell you about the math, because the math is almost embarrassing for AWS. Our old ALB took about five milliseconds to set up a new connection. IPVS takes less than half a millisecond. That is not an improvement. That is a different category of existence. It is the difference between walking to the shops and being teleported there.
While our ALB would start getting nervous around one hundred thousand concurrent connections, IPVS just does not. It could handle millions. The only limit is how much memory your kernel has, which in our case meant we could have hosted the entire internet circa 2003 without breaking a sweat.
In terms of throughput, our ALB topped out around 2.5 gigabits per second. IPVS saturates the 25-gigabit NIC on our c7g.medium instances. That is ten times the throughput, for those keeping score at home. The load balancer stopped being the bottleneck, which was refreshing because previously it had been like trying to fill a swimming pool through a drinking straw.
But the real kicker is the cost. Here is the breakdown. We run one c7g.medium spot instance per availability zone, three zones total. Each costs about $0.017 per hour. That is $0.051 per hour for compute. We also have three Elastic IPs at $0.005 per hour each, which is $0.015 per hour. With Direct Server Return, outbound transfer costs are effectively zero because responses bypass the load balancer entirely.
The total? A mere $0.066 per hour. Spread across three availability zones, that works out to roughly $0.022 per hour per zone. A little over two cents. Let’s not call it optimization, let’s call it a financial exorcism. We went from shelling out $3,900 a month to a modest $48. The savings alone could probably fund a very capable engineer’s caffeine habit.
But what about L7 routing
At this point, you might be raising a valid objection. IPVS is dumb L4. It does not inspect HTTP headers, it does not route based on gRPC metadata, and it does not care about your carefully crafted REST API conventions. It just forwards packets based on IP and port. It is the postal worker of the networking world. Reliable, fast, and utterly indifferent to what is in the envelope.
This is where we layer in Envoy, because intelligence should live where it makes sense. Here is how the request flow works. A client connects to one of our Elastic IPs. IPVS forwards that connection to a random healthy pod. Inside that pod, an Envoy sidecar inspects the HTTP/2 headers or gRPC metadata and routes to the correct internal service.
The result is L4 performance at the edge and L7 intelligence at the pod. We get the speed of kernel-level packet forwarding combined with the flexibility of modern service mesh routing. It is like having a Formula 1 engine in a car that also has comfortable seats and a good sound system. Best of both worlds. Our Envoy configuration looks something like this:
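A heavily trimmed sketch of that sidecar config, with hypothetical port and cluster names:

```yaml
# Hypothetical Envoy sidecar: route by path prefix to internal clusters
static_resources:
  listeners:
    - address:
        socket_address: { address: 0.0.0.0, port_value: 8443 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO   # handles HTTP/1.1 and HTTP/2 (gRPC) alike
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
                route_config:
                  virtual_hosts:
                    - name: backend
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/billing" }
                          route: { cluster: billing-service }   # hypothetical clusters
                        - match: { prefix: "/" }
                          route: { cluster: default-service }
```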
I should mention that our first attempt did not go smoothly. In fact, it went so poorly that we briefly considered pretending the whole thing had never happened and going back to our expensive ALBs.
The problem was DNS. We pointed our api.ourcompany.com domain at the new Elastic IPs, and then we waited. And waited. And nothing happened. Traffic was still going to the old ALB. It turned out that our DNS provider had a TTL of one hour, which meant that even after we updated the record, most clients were still using the old IP address for, well, an hour.
But that was not the real problem. The real problem was that we had forgotten to update our health checks. Our monitoring system was still pinging the old ALB’s health endpoint, which was now returning 404s because we had deleted the target group. So our alerts were going off, our pagers were buzzing, and our on-call engineer was having what I can only describe as a difficult afternoon.
We fixed it, of course. Updated the health checks, waited for DNS to propagate, and watched as traffic slowly shifted to the new setup. But for about thirty minutes, we were flying blind, which is not a feeling I recommend to anyone who values their peace of mind.
Deploying this yourself
If you are thinking about trying this yourself, the good news is that it is surprisingly straightforward. The bad news is that you will need to know your way around Kubernetes and be comfortable with the idea of pods manipulating kernel networking tables. If that sounds terrifying, perhaps stick with your ALB. It is expensive, but it is someone else’s problem.
Here is the deployment process in a nutshell. First, deploy the DaemonSet. Then allocate some spare Elastic IPs in your subnet; the pods will auto-claim them using kube-vip. There is a particular quirk in AWS networking that can ruin your afternoon: the source/destination check. By default, EC2 instances reject traffic that does not match their assigned IP address. Since this setup explicitly relies on handling traffic for IP addresses that the instance does not technically ‘own’ (our Virtual IPs), AWS treats that as suspicious activity and drops the packets. You must disable the source/destination check on any instance running these router pods. It is a simple checkbox in the console, but forgetting it is the difference between a working load balancer and a black hole. Also, ensure your worker node IAM roles have permission to reassociate Elastic IPs, or your pods will shout into the void without anyone listening. Update your DNS to point at the new IPs, using latency-based routing if you want to be fancy. Then watch your ALB target group drain, and delete the ALB next week after you are confident everything is working.
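For reference, that checkbox has a CLI equivalent as well (the instance ID here is hypothetical):

```shell
# Disable the source/destination check so the instance is allowed to
# forward traffic for Virtual IPs it does not technically own.
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --no-source-dest-check
```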
The whole setup takes about three hours the first time, and maybe thirty minutes if you do it again. Three hours of work for $46,000 per year in savings. That is $15,000 per hour, which is not a bad rate by anyone’s standards.
What we learned about Cloud computing
Three months after we made the switch, I found myself at an AWS conference, listening to a presentation about their newest managed load balancing service. It was impressive, all machine learning and auto-scaling and intelligent routing. It was also, I calculated quietly, about four hundred times more expensive than our little IPVS setup.
I did not say anything. Some lessons are better learned the hard way. And as I sat there, sipping my overpriced conference coffee, I could not help but smile.
AWS managed services are built for speed of adoption and lowest-common-denominator use cases. They are not built for peak efficiency, extreme performance, or cost discipline. For foundational infrastructure like load balancing, a little DIY unlocks exponential gains.
The embarrassing truth is that we should have done this years ago. We were so accustomed to reaching for managed services that we never stopped to ask whether we actually needed them. It took a 2 AM coffee-fueled bill review to make us question the assumptions we had been carrying around.
Sometimes the future of cloud computing looks a lot like 1999. And honestly, that is exactly what makes it beautiful. There is something deeply satisfying about discovering that the solution to your expensive modern problem was solved decades ago by someone working on a much simpler internet, with much simpler tools, and probably much more sleep.
Wensong Zhang, wherever you are, thank you. Your code from 1998 is still making engineers happy in 2026. That is not a bad legacy for any piece of software.
The author would like to thank his patient colleagues who did not complain (much) during the DNS propagation incident, and the kube-vip maintainers who answered his increasingly desperate questions on Slack.
I have spent the better part of three years wrestling with Google Cloud Platform, and I am still not entirely convinced it wasn’t designed by a group of very clever people who occasionally enjoy a quiet laugh at the rest of us. The thing about GCP, you see, is that it works beautifully right up until the moment it doesn’t. Then it fails with such spectacular and Byzantine complexity that you find yourself questioning not just your career choices but the fundamental nature of causality itself.
My first encounter with Cloud Build was typical of this experience. I had been tasked with setting up a CI/CD pipeline for a microservices architecture, which is the modern equivalent of being told to build a Swiss watch while someone steadily drops marbles on your head. Jenkins had been our previous solution, a venerable old thing that huffed and puffed like a steam locomotive and required more maintenance than a Victorian greenhouse. Cloud Build promised to handle everything serverlessly, which is a word that sounds like it ought to mean something, but in practice simply indicates you won’t know where your code is running and you certainly won’t be able to SSH into it when things go wrong.
The miracle, when it arrived, was decidedly understated. I pushed some poorly written Go code to a repository and watched as Cloud Build sprang into life like a sleeper agent receiving instructions. It ran my tests, built a container, scanned it for vulnerabilities, and pushed it to storage. The whole process took four minutes and cost less than a cup of tea. I sat there in my home office, the triumph slowly dawning, feeling rather like a man who has accidentally trained his cat to make coffee. I had done almost nothing, yet everything had happened. This is the essential GCP magic, and it is deeply unnerving.
The vulnerability scanner is particularly wonderful in that quietly horrifying way. It examines your containers and produces a list of everything that could possibly go wrong, like a pilot’s pre-flight checklist written by a paranoid witchfinder general. On one memorable occasion, it flagged a critical vulnerability in a library I wasn’t even aware we were using. It turned out to be nested seven dependencies deep, like a Russian doll of potential misery. Fixing it required updating something else, which broke something else, which eventually led me to discover that our entire authentication layer was held together by a library last maintained in 2018 by someone who had subsequently moved to a commune in Oregon. The scanner was right, of course. It always is. It is the most anxious and accurate employee you will ever meet.
Google Kubernetes Engine or how I learned to stop worrying and love the cluster
If Cloud Build is the efficient butler, GKE is the robot overlord you find yourself oddly grateful to. My initial experience with Kubernetes was self-managed, which taught me many things, primarily that I do not have the temperament to manage Kubernetes. I spent weeks tuning etcd, debugging network overlays, and developing what I can only describe as a personal relationship with a persistent volume that refused to mount. It was less a technical exercise and more a form of digitally enhanced psychotherapy.
GKE’s Autopilot mode sidesteps all this by simply making the nodes disappear. You do not manage nodes. You do not upgrade nodes. You do not even, strictly speaking, know where the nodes are. They exist in the same conceptual space as socks that vanish from laundry cycles. You request resources, and they materialise, like summoning a very specific and obliging genie. The first time I enabled Autopilot, I felt I was cheating somehow, as if I had been given the answers to an exam I had not revised for.
The real genius is Workload Identity, a feature that allows pods to access Google services without storing secrets. Before this, secret management was a dark art involving base64 encoding and whispered incantations. We kept our API keys in Kubernetes secrets, which is rather like keeping your house keys under the doormat and hoping burglars are too polite to look there. Workload Identity removes all this by using magic, or possibly certificates, which are essentially the same thing in cloud computing. I demonstrated it to our security team, and their reaction was instructive. They smiled, which security people never do, and then they asked me to prove it was actually secure, which took another three days and several diagrams involving stick figures.
Istio integration completes the picture, though calling it integration suggests a gentle handshake when it is more like being embraced by a very enthusiastic octopus. It gives you observability, security, and traffic management at the cost of considerable complexity and a mild feeling that you have lost control of your own architecture. Our first Istio deployment doubled our pod count and introduced latency that made our application feel like it was wading through treacle. Tuning it took weeks and required someone with a master’s degree in distributed systems and the patience of a saint. When it finally worked, it was magnificent. Requests flowed like water, security policies enforced themselves with silent efficiency, and I felt like a man who had tamed a tiger through sheer persistence and a lot of treats.
Cloud Deploy and the gentle art of not breaking everything
Progressive delivery sounds like something a management consultant would propose during a particularly expensive lunch, but Cloud Deploy makes it almost sensible. The service orchestrates rollouts across environments with strategies like canary and blue-green, which are named after birds and colours because naming things is hard, and DevOps engineers have a certain whimsical desperation about them.
My first successful canary deployment felt like performing surgery on a patient who was also the anaesthetist. We routed 5 percent of traffic to the new version and watched our metrics like nervous parents at a school play. When errors spiked, I expected a frantic rollback procedure involving SSH and tarballs. Instead, I clicked a button, and everything reverted in thirty seconds. The old version simply reappeared, fully formed, like a magic trick performed by someone who actually understands magic. I walked around the office for the rest of the day with what my colleagues described as a smug grin, though I prefer to think of it as the justified expression of someone who has witnessed a minor miracle.
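Cloud Deploy and the load balancer handle the traffic split for you, but the idea underneath a 5 percent canary is just deterministic bucketing: hash something stable about the request and compare it against a threshold. A toy sketch of the principle, not Cloud Deploy's actual mechanism:

```python
import hashlib

def canary_bucket(user_id: str, canary_percent: int) -> str:
    """Deterministically route a user to 'canary' or 'stable'.

    Hashing the user id (rather than rolling dice per request) keeps each
    user pinned to one version, so they don't flap between old and new
    behaviour mid-session.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return "canary" if bucket < canary_percent else "stable"

# Roughly 5 percent of a large population should land on the canary.
hits = sum(canary_bucket(f"user-{i}", 5) == "canary" for i in range(10_000))
print(hits / 10_000)
```

The rollback is then just the threshold going back to zero, which is why it can happen in thirty seconds: no pods are rebuilt, traffic simply stops arriving at the new version.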
The integration with Cloud Build creates a pipeline so smooth it is almost suspicious. Code commits trigger builds, builds trigger deployments, deployments trigger monitoring alerts, and alerts trigger automated rollbacks. It is a closed loop, a perpetual motion machine of software delivery. I once watched this entire chain execute while I was making a sandwich. By the time I had finished my ham and pickle on rye, a critical bug had been introduced, detected, and removed from production without any human intervention. I was simultaneously impressed and vaguely concerned about my own obsolescence.
Artifact Registry where containers go to mature
Storing artifacts used to involve a self-hosted Nexus repository that required weekly sacrifices of disk space and RAM. Artifact Registry is Google’s answer to this, a fully managed service that stores Docker images, Helm charts, and language packages with the solemnity of a wine cellar for code.
The vulnerability scanning here is particularly thorough, examining every layer of your container with the obsessive attention of someone who alphabetises their spice rack. It once flagged a high-severity issue in a base image we had been using for six months. The vulnerability allowed arbitrary code execution, which is the digital equivalent of leaving your front door open with a sign saying “Free laptops inside.” We had to rebuild and redeploy forty services in two days. The scanner, naturally, had known about this all along but had been politely waiting for us to notice.
Geo-replication is another feature that seems obvious until you need it. Our New Zealand team was pulling images from a European registry, which meant every deployment involved sending gigabytes of data halfway around the world. This worked about as well as shouting instructions across a rugby field during a storm. Moving to a regional registry in New Zealand cut our deployment times by half and our egress fees by a third. It also taught me that cloud networking operates on principles that are part physics, part economics, and part black magic.
Cloud Operations Suite or how I learned to love the machine that watches me
Observability in GCP is orchestrated by the Cloud Operations Suite, formerly known as Stackdriver. The rebranding was presumably because Stackdriver sounded too much like a dating app for developers, which is a missed opportunity if you ask me.
The suite unifies logs, metrics, traces, and dashboards into a single interface that is both comprehensive and bewildering. The first time I opened Cloud Monitoring, I was presented with more graphs than a hedge fund’s annual report. CPU, memory, network throughput, disk IOPS, custom metrics, uptime checks, and SLO burn rates. It was beautiful and terrifying, like watching the inner workings of a living organism that you have created but do not fully understand.
Setting up SLOs felt like writing a promise to my future self. “I, a DevOps engineer of sound mind, do hereby commit to maintaining 99.9 percent availability.” The system then watches your service like a particularly judgmental deity and alerts you the moment you transgress. I once received a burn rate alert at 2 AM because a pod had been slightly slow for ten minutes. I lay in bed, staring at my phone, wondering whether to fix it or simply accept that perfection was unattainable and go back to sleep. I fixed it, of course. We always do.
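The arithmetic behind that 2 AM page is mercifully simple: a burn rate is your observed error rate divided by your error budget, which is one minus the SLO. A sketch (the 14.4 figure is the commonly cited fast-burn paging threshold for a one-hour window against a monthly 99.9 percent SLO, not anything specific to our setup):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast you are eating the error budget.

    A burn rate of 1.0 means you will spend exactly your budget over the
    SLO window; 14.4 sustained for an hour is a commonly used fast-burn
    paging threshold for a 99.9% monthly SLO.
    """
    budget = 1.0 - slo
    return error_rate / budget

print(burn_rate(0.001, 0.999))   # errors exactly at budget -> 1.0
print(burn_rate(0.0144, 0.999))  # the 2 AM page -> 14.4
```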
The integration with BigQuery for long-term analysis is where things get properly clever. We export all our logs and run SQL queries to find patterns. This is essentially data archaeology, sifting through digital sediment to understand why something broke three weeks ago. I discovered that our highest error rates always occurred on Tuesdays between 2 and 3 PM. Why? A scheduled job that had been deprecated but never removed, running on a server everyone had forgotten about. Finding it felt like discovering a Roman coin in your garden, exciting but also slightly embarrassing that you hadn’t noticed it before.
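The query itself lives in BigQuery, but the archaeology is an ordinary group-by over timestamps. A sketch of the same dig over synthetic log data, with the forgotten Tuesday job baked in:

```python
from collections import Counter
from datetime import datetime, timedelta

def peak_error_slot(timestamps):
    """Group error timestamps by (weekday, hour) and return the worst slot."""
    slots = Counter((t.strftime("%A"), t.hour) for t in timestamps)
    return slots.most_common(1)[0]

# Synthetic stand-in for the exported logs: a background hum of errors,
# plus a deprecated job firing every Tuesday between 2 and 3 PM.
base = datetime(2026, 3, 2)  # a Monday
errors = [base + timedelta(hours=i * 7) for i in range(100)]
errors += [base + timedelta(days=1, hours=14, minutes=m) for m in range(0, 60, 2)]
print(peak_error_slot(errors))
```

Against real exports the grouping is one `GROUP BY` clause, but the shape of the investigation is identical: bucket, count, and stare suspiciously at the outlier.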
Cloud Monitoring and Logging the digital equivalent of a nervous system
Cloud Logging centralises petabytes of data from services that generate logs with the enthusiasm of a teenager documenting their lunch. Querying this data feels like using a search engine that actually works, which is disconcerting when you’re used to grep and prayer.
I once spent an afternoon tracking down a memory leak using Cloud Profiler, a service that shows you exactly where your code is being wasteful with RAM. It highlighted a function that was allocating memory like a government department allocates paper clips, with cheerful abandon and no regard for consequences. The function turned out to be logging entire database responses for debugging purposes, in production, for six months. We had archived more debug data than actual business data. The developer responsible, when confronted, simply shrugged and said it had seemed like a good idea at the time. This is the eternal DevOps tragedy. Everything seems like a good idea at the time.
Uptime checks are another small miracle. We have probes hitting our endpoints from locations around the world, like a global network of extremely polite bouncers constantly asking, “Are you open?” When Mumbai couldn’t reach our service but London could, it led us to discover a regional DNS issue that would have taken days to diagnose otherwise. The probes had saved us, and they had done so without complaining once, which is more than can be said for the on-call engineer who had to explain it to management at 6 AM.
Cloud Functions and Cloud Run, where code goes to hide
Serverless computing in GCP comes in two flavours. Cloud Functions are for small, event-driven scripts, like having a very eager intern who only works when you clap. Cloud Run is for containerised applications that scale to zero, which is an economical way of saying they disappear when nobody needs them and materialise when they do, like an introverted ghost.
I use Cloud Functions for automation tasks that would otherwise require cron jobs on a VM that someone has to maintain. One function resizes GKE clusters based on Cloud Monitoring alerts. When CPU utilisation exceeds 80 percent for five minutes, the function spins up additional nodes. When it drops below 20 percent, it scales down. This is brilliant until you realise you’ve created a feedback loop and the cluster is now oscillating between one node and one hundred nodes every ten minutes. Tuning the thresholds took longer than writing the function, which is the serverless way.
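The cure for the oscillation was hysteresis: keep a dead zone between the scale-up and scale-down thresholds, and clamp the result so the cluster can neither vanish nor multiply without limit. A sketch of one resize step, with illustrative numbers rather than our actual tuning:

```python
def desired_nodes(current: int, cpu_util: float,
                  scale_up_at: float = 0.8, scale_down_at: float = 0.2,
                  min_nodes: int = 1, max_nodes: int = 100) -> int:
    """One step of the resizing logic, with the guard rails we learned
    to add: a dead zone between the thresholds so the cluster stops
    oscillating, and hard min/max bounds."""
    if cpu_util > scale_up_at:
        target = current * 2       # grow aggressively under load
    elif cpu_util < scale_down_at:
        target = current // 2      # shrink gently when idle
    else:
        target = current           # inside the dead zone: do nothing
    return max(min_nodes, min(max_nodes, target))

print(desired_nodes(10, 0.9))   # 20
print(desired_nodes(10, 0.5))   # 10 (dead zone: no change)
print(desired_nodes(10, 0.1))   # 5
print(desired_nodes(80, 0.95))  # 100 (clamped at the ceiling)
```

The wide gap between 0.2 and 0.8 is the whole trick: a cluster that just scaled up lands in the dead zone instead of immediately qualifying for a scale-down.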
Cloud Run hosts our internal tools, the dashboards, and debug interfaces that developers need but nobody wants to provision infrastructure for. Deploying is gloriously simple. You push a container, it runs. The cold start time is sub-second, which means Google has solved a problem that Lambda users have been complaining about for years, presumably by bargaining with physics itself. I once deployed a debugging tool during an incident response. It was live before the engineer who requested it had finished describing what they needed. Their expression was that of someone who had asked for a coffee and been given a flying saucer.
Terraform and Cloud Deployment Manager arguing with machines about infrastructure
Infrastructure as Code is the principle that you should be able to rebuild your entire environment from a text file, which is lovely in theory and slightly terrifying in practice. Terraform, using the GCP provider, is the de facto standard. It is also a source of endless frustration and occasional joy.
The state file is the heart of the problem. It is a JSON representation of your infrastructure that Terraform keeps in Cloud Storage, and it is the single source of truth until someone deletes it by accident, at which point the truth becomes rather more philosophical. We lock the state during applies, which prevents conflicts but also means that if an apply hangs, everyone is blocked. I have spent afternoons staring at a terminal, watching Terraform ponder the nature of a load balancer, like a stoned philosophy student contemplating a spoon.
Deployment Manager is Google’s native IaC tool, which uses YAML and is therefore slightly less powerful but considerably easier to read. I use it for simple projects where Terraform would be like using a sledgehammer to crack a nut, if the sledgehammer required you to understand graph theory. The two tools coexist uneasily, like cats who tolerate each other for the sake of the humans.
Drift detection is where things get properly philosophical. Terraform tells you when reality has diverged from your code, which happens more often than you’d think. Someone clicks something in the console, a service account is modified, a firewall rule is added for “just a quick test.” The plan output shows these changes like a disappointed teacher marking homework in red pen. You can either apply the correction or accept that your infrastructure has developed a life of its own and is now making decisions independently. Sometimes I let the drift stand, just to see what happens. This is how accidents become features.
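At heart, drift detection is a three-way diff between what the code declares and what actually exists. A toy version of what `terraform plan` prints in red and yellow, with hypothetical attribute names:

```python
def drift(desired: dict, actual: dict) -> dict:
    """Compare the declared state with reality, plan-style.

    Returns attributes to add, change, or destroy -- a toy of the
    red-pen homework marking that `terraform plan` performs.
    """
    report = {"add": {}, "change": {}, "destroy": {}}
    for key, want in desired.items():
        if key not in actual:
            report["add"][key] = want
        elif actual[key] != want:
            report["change"][key] = (actual[key], want)
    for key in actual:
        if key not in desired:
            report["destroy"][key] = actual[key]
    return report

code = {"allow_http": False, "machine_type": "e2-medium"}
reality = {"allow_http": True, "machine_type": "e2-medium",
           "just_a_quick_test_rule": "0.0.0.0/0"}
print(drift(code, reality))
```

The "just a quick test" firewall rule shows up under destroy, which is exactly the moment you decide whether to enforce the code or quietly adopt the drift.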
IAM and Cloud Asset Inventory, the endless game of who can do what
Identity and Access Management in GCP is both comprehensive and maddening. Every API call is authenticated and authorised, which is excellent for security but means you spend half your life granting permissions to service accounts. A service account, for the uninitiated, is a machine pretending to be a person so it can ask Google for things. They are like employees who never take a holiday but also never buy you a birthday card.
Workload Identity Federation allows these synthetic employees to impersonate each other across clouds, which is identity management crossed with method acting. We use it to let our AWS workloads access GCP resources, a process that feels rather like introducing two friends who are suspicious of each other and speak different languages. When it works, it is seamless. When it fails, the error messages are so cryptic they may as well be in Linear B.
Cloud Asset Inventory catalogues every resource in your organisation, which is invaluable for audits and deeply unsettling when you realise just how many things you’ve created and forgotten about. I once ran a report and discovered seventeen unused load balancers, three buckets full of logs from a project that ended in 2023, and a Cloud SQL instance that had been running for six months with no connections. The bill was modest, but the sense of waste was profound. I felt like a hoarder being confronted with their own clutter.
For European enterprises, the GDPR compliance features are critical. We export audit logs to BigQuery and run queries to prove data residency. The auditors, when they arrived, were suspicious of everything, which is their job. They asked for proof that data never left the europe-west3 region. I showed them VPC Service Controls, which are like digital border guards that shoot packets trying to cross geographical boundaries. They seemed satisfied, though one of them asked me to explain Kubernetes, and I saw his eyes glaze over in the first thirty seconds. Some concepts are simply too abstract for mortal minds.
Eventarc and Cloud Scheduler the nervous system of the cloud
Eventarc routes events from over 100 sources to your serverless functions, creating event-driven architectures that are both elegant and impossible to debug. An event is a notification that something happened, somewhere, and now something else should happen somewhere else. It is causality at a distance, action at a remove.
I have an Eventarc trigger that fires when a vulnerability is found, sending a message to Pub/Sub, which fans out to multiple subscribers. One subscriber posts to Slack, another creates a ticket, and a third quarantines the image. It is a beautiful, asynchronous ballet that I cannot fully trace. When it fails, it fails silently, like a mime having a heart attack. The dead-letter queue catches the casualties, which I check weekly like a coroner reviewing unexplained deaths.
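The pattern is easy to sketch even if the production version is impossible to trace: deliver the event to every subscriber, and instead of letting a failure die silently, drop it in the dead-letter pile for the weekly coroner's review. Subscriber names and the payload here are hypothetical:

```python
def fan_out(event: dict, subscribers) -> list:
    """Deliver one event to every subscriber; failures are captured
    in a dead-letter list instead of vanishing silently."""
    dead_letter = []
    for name, handler in subscribers:
        try:
            handler(event)
        except Exception as exc:
            dead_letter.append({"subscriber": name, "event": event,
                                "error": str(exc)})
    return dead_letter

notified = []

def post_to_slack(event):
    notified.append(("slack", event["image"]))

def create_ticket(event):
    notified.append(("ticket", event["image"]))

def quarantine_image(event):
    raise RuntimeError("registry timeout")  # the mime's heart attack, made loud

# Hypothetical vulnerability event; all values are placeholders.
event = {"image": "api:1.4.2", "severity": "critical"}
dlq = fan_out(event, [("slack", post_to_slack),
                      ("ticket", create_ticket),
                      ("quarantine", quarantine_image)])
print(len(notified), len(dlq))  # 2 1
```

In the real architecture Pub/Sub does the fan-out and the dead-letter topic does the collecting; the sketch just shows why the weekly check matters, since nothing upstream ever notices the failed delivery.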
Cloud Scheduler handles cron jobs, which are the digital equivalent of remembering to take the bins out. We have schedules that scale down non-production environments at night, saving money and carbon. I once set the timezone incorrectly and scaled down the production cluster at midday. The outage lasted three minutes, but the shame lasted considerably longer. The team now calls me “the scheduler whisperer,” which is not the compliment it sounds like.
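The midday outage was pure timezone arithmetic: a schedule fires at midnight in whatever timezone the job is configured with, and if that silently ends up as UTC while your users live in Auckland, midnight UTC is lunchtime locally. The conversion is easy to check:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A "midnight" schedule interpreted in UTC, landing on a winter day in
# New Zealand (no daylight saving, UTC+12). The date is illustrative.
fire_utc = datetime(2026, 6, 15, 0, 0, tzinfo=timezone.utc)
local = fire_utc.astimezone(ZoneInfo("Pacific/Auckland"))
print(local.strftime("%H:%M"))  # midnight UTC is midday in Auckland
```

Cloud Scheduler lets you set the time zone per job; the lesson is simply to set it deliberately rather than inheriting a default.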
The real power comes from chaining these services. A Monitoring alert triggers Eventarc, which invokes a Cloud Function, which checks something via Scheduler, which then triggers another function to remediate. It is a Rube Goldberg machine built of code, more complex than it needs to be, but weirdly satisfying when it works. I have built systems that heal themselves, which is either the pinnacle of DevOps achievement or the first step towards Skynet. I prefer to think it is the former.
The map we all pretend to understand
Every DevOps journey, no matter how anecdotal, eventually requires what consultants call a “high-level architecture overview” and what I call a desperate attempt to comprehend the incomprehensible. During my second year on GCP, I created exactly such a diagram to explain to our CFO why we were spending $47,000 a month on something called “Cross-Regional Egress.” The CFO remained unmoved, but the diagram became my Rosetta Stone for navigating the platform’s ten core services.
I’ve reproduced it here partly because I spent three entire afternoons aligning boxes in Lucidchart, and partly because even the most narrative-driven among us occasionally needs to see the forest’s edge while wandering through the trees. Consider it the technical appendix you can safely ignore, unless you’re the poor soul actually implementing any of this.
There it is, in all its tabular glory. Five rows that represent roughly fifteen thousand hours of human effort, and at least three separate incidents involving accidentally deleted production namespaces. The arrows are neat and tidy, which is more than can be said for any actual implementation.
I keep a laminated copy taped to my monitor, not because I consult it (I have the contents memorised, along with the scars that accompany each service), but because it serves as a reminder that even the most chaotic systems can be reduced to something that looks orderly on PowerPoint. The real magic lives in the gaps between those tidy boxes, where service accounts mysteriously expire, where network policies behave like quantum particles, and where the monthly bill arrives with numbers that seem generated by a random number generator with a grudge.
A modest proposal for surviving GCP
That table represents the map. What follows is the territory, with all its muddy bootprints and unexpected cliffs.
After three years, I have learned that the best DevOps engineers are not the ones with the most certificates. They are the ones who have learned to read the runes, who know which logs matter and which can be ignored, who have developed an intuitive sense for when a deployment is about to fail and can smell a misconfigured IAM binding at fifty paces. They are part sysadmin, part detective, part wizard.
The platform makes many things possible, but it does not make them easy. It is infrastructure for grown-ups, which is to say it trusts you to make expensive mistakes and learn from them. My advice is to start small, automate everything, and keep a sense of humour. You will need it the first time you accidentally delete a production bucket and discover that the undo button is marked “open a support ticket and wait.”
Store your manifests in Git and let Cloud Deploy handle the applying. Define SLOs and let the machines judge you. Tag resources for cost allocation and prepare to be horrified by the results. Replicate artifacts across regions because the internet is not as reliable as we pretend. And above all, remember that the cloud is not magic. It is simply other people’s computers running other people’s code, orchestrated by APIs that are occasionally documented and frequently misunderstood.
We build on these foundations because they let us move faster, scale further, and sleep slightly better at night, knowing that somewhere in a data centre in Belgium, a robot is watching our servers and will wake us only if things get truly interesting.
That is the theory, anyway. In practice, I still keep my phone on loud, just in case.
The Slack notification arrived with the heavy, damp enthusiasm of a wet dog jumping into your lap while you are wearing a tuxedo. It was late on a Thursday, the specific hour when ambitious caffeine consumption turns into existential regret, and the message was brief.
“I don’t think I can do this anymore. Not the coding. The infrastructure. I’m out.”
This wasn’t a junior developer overwhelmed by the concept of recursion. This was my lead backend engineer. A human Swiss Army knife who had spent nine years navigating the dark alleys of distributed systems and could stare down a production outage with the heart rate of a sleeping tortoise. He wasn’t leaving because of burnout from long hours, or an equity dispute, or even because someone microwaved fish in the breakroom.
He was leaving because of Kubernetes.
Specifically, he was leaving because the tool we had adopted to “simplify” our lives had slowly morphed into a second, unpaid job that required the patience of a saint and the forensic skills of a crime scene investigator. We had turned his daily routine of shipping features into a high-stakes game of Operation, where touching the wrong YAML indentation caused the digital equivalent of a sewer backup.
It was a wake-up call that hit me harder than the realization that the Tupperware at the back of my fridge has evolved its own civilization. We treat Kubernetes like a badge of honor, a maturity medal we pin to our chests. But the dirty secret everyone is too polite to whisper at conferences is that we have invited a chaotic, high-maintenance tyrant into our homes and given it the master bedroom.
When the orchestrator becomes a lifestyle disease
We tend to talk about “cognitive load” in engineering with the same sterile detachment we use to discuss disk space or latency. It sounds clean. Manageable. But in practice, the cognitive load imposed by a raw, unabstracted Kubernetes setup is less like a hard drive filling up and more like trying to cook a five-course gourmet meal while a badger is gnawing on your ankle.
The promise was seductive. We were told that Kubernetes would be the universal adapter for the cloud. It would be the operating system of the internet. And in a way, it is. But it is an operating system that requires you to assemble the kernel by hand every morning before you can open your web browser.
My star engineer didn’t want to leave. He just wanted to write code that solved business problems. Instead, he found himself spending 40% of his week debugging ingress controllers that behaved like moody teenagers (silent, sullen, and refusing to do what they were told) and wrestling with pod eviction policies that seemed based on the whim of a vengeful god rather than logic.
We had fallen into the classic trap of Resume Driven Development. We handed application developers the keys to the cluster and told them they were now “DevOps empowered.” In reality, this is like handing a teenager the keys to a nuclear submarine because they once successfully drove a golf cart. It doesn’t empower them. It terrifies them.
(And let’s be honest, most backend developers look at a Kubernetes manifest with the same mix of confusion and horror that I feel when looking at my own tax returns.)
The archaeological dig of institutional knowledge
The problem with complexity is that it rarely announces itself with a marching band. It accumulates silently, like dust bunnies under a bed, or plaque in an artery.
When we audited our setup after the resignation, we found that our cluster had become a museum of good intentions gone wrong. We found Helm charts that were so customized they effectively constituted a new, undocumented programming language. We found sidecar containers attached to pods for reasons nobody could remember, sucking up resources like barnacles on the hull of a ship, serving no purpose other than to make the diagrams look impressive.
This is what I call “Institutional Knowledge Debt.” It represents the sort of fungal growth that occurs when you let complexity run wild. You know it is there, evolving its own ecosystem, but as long as you don’t look at it directly, you don’t have to acknowledge that it might be sentient.
The “Bus Factor” in our team (the number of people who can get hit by a bus before the project collapses) had reached a terrifying number: one. And that one person had just quit. We had built a system where deploying a hotfix required a level of tribal knowledge usually reserved for initiating members into a secret society.
YAML is just a ransom note with better indentation
If you want to understand why developers hate modern infrastructure, look no further than the file format we use to define it. YAML.
We found files in our repository that were less like configuration instructions and more like love letters written by a stalker: intense, repetitive, and terrifyingly vague about their actual intentions.
The fragility of it is almost impressive. A single misplaced space, a tab character where a space should be, or a dash that looked at you the wrong way, and the entire production environment simply decides to take the day off. It is absurd that in an era of AI assistants and quantum computing, our billion-dollar industries hinge on whether a human being pressed the spacebar two times or four times.
Debugging these files is not engineering. It is hermeneutics. It is reading tea leaves. You stare at the CrashLoopBackOff error message, which is the system’s way of saying “I am unhappy, but I will not tell you why,” and you start making sacrifices to the gods of indentation.
My engineer didn’t hate the logic. He hated the medium. He hated that his intellect was being wasted on the digital equivalent of untangling Christmas lights.
We built a platform to stop the bleeding
The solution to this mess was not to hire “better” engineers who memorized the entire Kubernetes API documentation. That is a strategy akin to buying larger pants instead of going on a diet. It accommodates the problem, but it doesn’t solve it.
We had to perform an exorcism. But not a dramatic one with spinning heads. A boring, bureaucratic one.
We embraced Platform Engineering. Now, that is a buzzword that usually makes my eyes roll back into my head so far I can see my own frontal lobe, but in this case, it was the only way out. We decided to treat the platform as a product and our developers as the customers, customers who are easily confused and frighten easily.
We took the sharp objects away.
We built “Golden Paths.” In plain English, this means we created templates that work. If a developer wants to deploy a microservice, they don’t need to write a 400-line YAML manifesto. They fill out a short form that asks the basics: What is it called? How much memory does it need? Who do we call if it breaks?
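A golden path is, concretely, a function from a short form to a long manifest. A sketch of the idea; every default, image registry, and name below is illustrative, not our actual platform:

```python
def golden_path(name: str, memory: str, oncall: str) -> dict:
    """Turn the short form into a full Deployment manifest.

    The developer answers the basics; the platform fills in the hundreds
    of lines of opinions (replicas, limits, labels) they should never
    have to see. Defaults here are illustrative, not gospel.
    """
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {
            "name": name,
            "labels": {"app": name, "oncall": oncall},
        },
        "spec": {
            "replicas": 2,  # platform opinion: never run a single pod
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": name,
                        # registry.example.com is a placeholder
                        "image": f"registry.example.com/{name}:latest",
                        "resources": {
                            "requests": {"memory": memory},
                            "limits": {"memory": memory},
                        },
                    }],
                },
            },
        },
    }

m = golden_path("payments-api", "512Mi", "team-payments")
print(m["spec"]["replicas"],
      m["spec"]["template"]["spec"]["containers"][0]["resources"]["limits"]["memory"])
```

The point is not the specific defaults but who owns them: the platform team changes the template once, and every service deployed through it inherits the fix.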
We hid the Kubernetes API behind a curtain. We stopped asking application developers to care about PodDisruptionBudgets or node affinity rules. Asking a Java developer to configure node affinity is like asking a passenger on an airplane to help calibrate the landing gear. It is not their job, and if they are doing it, something has gone terribly wrong.
Boring is the only metric that matters
After three months of stripping away the complexity, something strange happened. The silence.
The Slack channel dedicated to deployment support, previously a scrolling wall of panic and “why is my pod pending?” screenshots, went quiet. Deployments became boring.
And let me tell you, in the world of infrastructure, boring is the new sexy. Boring means things work. Boring means I can sleep through the night without my phone buzzing across the nightstand like an angry hornet.
Kubernetes is a marvel of engineering. It is powerful, scalable, and robust. But it is also a dense, hostile environment for humans. It is an industrial-grade tool. You don’t put an industrial lathe in your home kitchen to slice carrots, and you shouldn’t force every developer to operate a raw Kubernetes cluster just to serve a web page.
If you are hiring brilliant engineers, you are paying for their ability to solve logic puzzles and build features. If you force them to spend half their week fighting with infrastructure, you are effectively paying a surgeon to mop the hospital floors.
So look at your team. Look at their eyes. If they look tired, not from the joy of creation but from the fatigue of fighting their own tools, you might have a problem. That star engineer isn’t planning their next feature. They are drafting their resignation letter, and it probably won’t be written in YAML.
I have a theory that usually gets me uninvited to the best tech parties. It is a controversial opinion, the kind that makes people shift uncomfortably in their ergonomic chairs and check their phones. Here it is. AWS is not expensive. AWS is actually a remarkably fair judge of character. Most of us are just bad at using it. We are not unlucky, nor are we victims of some grand conspiracy by Jeff Bezos to empty our bank accounts. We are simply lazy in ways that we are too embarrassed to admit.
I learned this the hard way, through a process that felt less like a financial audit and more like a very public intervention.
The expensive silence of a six-figure mistake
Last year, our AWS bill crossed a number that made the people in finance visibly sweat. It was a six-figure sum appearing monthly, a recurring nightmare dressed up as an invoice. The immediate reactions from the team were predictable: a chorus of denial that sounded like a broken record. People started whispering about the insanity of cloud pricing. We talked about negotiating discounts, even though we had no leverage. There was serious talk of going multi-cloud, which is usually just a way to double your problems while hoping for a synergy that never comes. Someone even suggested going back to on-prem servers, which is the technological equivalent of moving back in with your parents because your rent is too high.
We were looking for a villain, but the only villain in the room was our own negligence. Instead of pointing fingers at Amazon, we froze all new infrastructure for two weeks. We locked the doors and audited why every single dollar existed. It was painful. It was awkward. It was necessary.
We hired a therapist for our infrastructure
What we found was not a technical failure. It was a behavioral disorder. We found that AWS was not charging us for scale. It was charging us for our profound indifference. It was like leaving the water running in every sink in the house and then blaming the utility company for the price of water.
We had EC2 instances sized “just to be safe.” This is the engineering equivalent of buying a pair of XXXL sweatpants just in case you decide to take up sumo wrestling next Tuesday. We were paying for capacity we did not need, for a traffic spike that existed only in our anxious imaginations.
We discovered Kubernetes clusters wheezing along at 15% utilization. Imagine buying a Ferrari to drive to the mailbox at the end of the driveway once a week. That was our cluster. Expensive, powerful, and utterly bored.
There were NAT Gateways chugging along in the background, charging us by the gigabyte to forward traffic that nobody remembered creating. It was like paying a toll to cross a bridge that went nowhere. We had RDS instances over-provisioned for traffic that never arrived, like a restaurant staffed with fifty waiters for a lunch crowd of three.
Perhaps the most revealing discovery was our log retention policy. We were keeping CloudWatch logs forever because “storage is cheap.” It is not cheap when you are hoarding digital exhaust like a cat lady hoarding newspapers. We had autoscaling enabled without upper bounds, which is a bit like giving your credit card to a teenager and telling them to have fun. We had Lambdas retrying silently into infinity, little workers banging their heads against a wall forever.
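The "storage is cheap" fallacy is easy to quantify: with no delete date, the pile of stored gigabytes grows without bound, while a TTL makes it plateau. A back-of-the-envelope sketch with made-up volumes:

```python
def stored_gb(daily_gb: float, day: int, retention_days=None) -> float:
    """GB sitting in log storage on a given day.

    With no retention policy the pile grows forever; with a TTL it
    plateaus once the oldest logs start being deleted.
    """
    if retention_days is None:
        return daily_gb * day
    return daily_gb * min(day, retention_days)

# 50 GB/day of debug logs, two years in, "because storage is cheap":
print(stored_gb(50, 730))      # GB stored with no delete date
print(stored_gb(50, 730, 30))  # GB stored with a 30-day TTL
```

Multiply the difference by the per-GB-month price and you get the line item nobody remembers requesting.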
None of this was AWS being greedy. This was engineering apathy. This was the result of a comforting myth that engineers love to tell themselves.
The hoarding habit of the modern engineer
“If it works, do not touch it.”
This mantra makes sense for stability. It is a lovely sentiment for a grandmother’s antique clock. It is a disaster for a cloud budget. AWS does not reward working systems. It rewards intentional systems. Every unmanaged default becomes a subscription you never canceled, a gym membership you keep paying for because you are too lazy to pick up the phone and cancel it.
Big companies can survive this kind of bad cloud usage because they can hide the waste in the couch cushions of their massive budgets. Startups cannot. For a startup, a few bad decisions can double your runway burn, force hiring freezes, and kill experimentation before it begins. I have seen companies rip out AWS, not because the technology failed, but because they never learned how to say no to it. They treated the cloud like an all-you-can-eat buffet, forgetting that someone still has to pay the bill.
Denial is a terrible financial strategy
If your AWS bill feels random, you do not understand your system. If cost surprises you, your architecture is opaque. It is like finding a surprise charge on your credit card and realizing you have no idea what you bought. It is a loss of control.
We realized that if we needed a “FinOps tool” to explain our bill, our infrastructure was already too complex. We did not need another dashboard. We needed a mirror.
The boring magic of actually caring
We did not switch clouds. We did not hire expensive consultants to tell us what we already knew. We did not buy magic software to fix our mess. We did four boring, profoundly unsexy things.
First, every resource needed an owner. We stopped treating servers like communal property. If you spun it up, you fed it. Second, every service needed a cost ceiling. We put a leash on the spending. Third, every autoscaler needed a maximum limit. We stopped the machines from reproducing without permission. Fourth, every log needed a delete date. We learned to take out the trash.
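The last two rules (autoscaler ceilings and log delete dates) are easy to check mechanically. Here is a minimal Python sketch of that kind of audit; the dict shapes are invented for illustration and are not a real AWS API response, but the logic is the whole point: "no retention policy" and "no max size" are the defaults you have to hunt down.

```python
# Hypothetical audit sketch. The data shapes below are assumptions
# for this example, not a real AWS API response format.

def audit_log_groups(log_groups, max_retention_days=30):
    """Return names of log groups kept forever, or longer than the ceiling."""
    return [
        g["name"]
        for g in log_groups
        if g.get("retention_days") is None  # None means "keep forever"
        or g["retention_days"] > max_retention_days
    ]

def audit_autoscalers(groups):
    """Return names of autoscaling groups with no upper bound."""
    return [g["name"] for g in groups if g.get("max_size") is None]

flagged = audit_log_groups([
    {"name": "api-prod", "retention_days": None},  # hoarded forever
    {"name": "worker", "retention_days": 14},      # fine
])
print(flagged)  # → ['api-prod']
```

In practice you would feed this from your cloud provider's describe APIs and run it in CI or a nightly job, so a missing ceiling fails loudly instead of quietly compounding on the invoice.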
The results were almost insulting in their simplicity. Costs dropped 43% in 30 days. There were no outages. There were no late night heroics. We did not rewrite the core platform. We just applied a little bit of discipline.
Why this makes engineers uncomfortable
Cost optimization exposes bad decisions. It forces you to admit that you over engineered a solution. It forces you to admit that you scaled too early. It forces you to admit that you trusted defaults because you were too busy to read the manual. It forces you to admit that you avoided the hard conversations about budget.
It is much easier to blame AWS. It is comforting to think of them as a villain. It is harder to admit that we built something nobody questioned.
The brutal honesty of the invoice
AWS is not the villain here. It is a mirror. It shows you exactly how careless or thoughtful your architecture is, and it translates that carelessness into dollars. You can call it expensive. You can call it unfair. You can migrate to another cloud provider. But until you fix how you design systems, every cloud will punish you the same way. The problem is not the landlord. The problem is how you are living in the house.
It brings me to a final question that every engineering leader should ask themselves. If your AWS bill doubled tomorrow, would you know why? Would you know exactly where the money was going? Would you know what to delete first?
If the answer is no, the problem is not AWS. And deep down, in the quiet moments when the invoice arrives, you already know that. This article might make some people angry. That is good. Anger is cheaper than denial. And frankly, it is much better for your bottom line.
Someone in a zip-up hoodie has just told you that monoliths are architectural heresy. They insist that proper companies, the grown-up ones with rooftop terraces and kombucha taps in the breakroom, build systems the way squirrels store acorns. They describe hundreds of tiny, frantic caches scattered across the forest floor, each with its own API, its own database, and its own emotional baggage.
You stand there nodding along while holding your warm beer, feeling vaguely inadequate. You hide the shameful secret that your application compiles in less time than it takes to brew a coffee. You do not mention that your code lives in a repository that does not require a map and a compass to navigate. Your system runs on something scandalously simple. It is a monolith.
Welcome to the cult of small things. We have been expecting you, and we have prepared a very complicated seat for you.
The insecurity of the monolithic developer
The microservices revolution did not begin with logic. It began with envy. It started with a handful of very successful case studies that functioned less like technical blueprints and more like impossible beauty standards for teenagers.
Netflix streams billions of hours of video. Amazon ships everything from electric toothbrushes to tactical uranium (probably) to your door in two days. Their systems are vast, distributed, and miraculous. So the industry did what any rational group of humans would do. We copied their homework without checking if we were taking the same class.
We looked at Amazon’s architecture and decided that our internal employee timesheet application needed the same level of distributed complexity as a global logistics network. This is like buying a Formula 1 pit crew to help you parallel park a Honda Civic. It is technically impressive, sure. But it is also a cry for help.
Suddenly, admitting you maintained a monolith became a confession. Teams began introducing themselves at conferences by stating their number of microservices, the way bodybuilders flex biceps or suburban dads compare lawn mower horsepower. “We are at 150 microservices,” someone would say, and the crowd would murmur approval. Nobody thought to ask if those services did anything useful. Nobody questioned whether the team spent more time debugging network calls than writing features.
The promise was flexibility. The reality became a different kind of rigidity. We traded the “spaghetti code” of the monolith for something far worse. We built a distributed bowl of spaghetti where the meatballs are hosted on different continents, and the sauce requires a security token to touch the pasta.
Debugging a murder mystery where the body keeps moving
Here is what the brochures and the Medium articles do not mention. Debugging a monolith is straightforward. You follow the stack trace like a detective following footprints in the snow.
Debugging a distributed system, however, is less like solving a murder mystery and more like investigating a haunting. The evidence vanishes. The logs are in different time zones. Requests pass through so many services that by the time you find the culprit, you have forgotten the crime.
Everything works perfectly in isolation. This is the great lie of the unit test. Your service A works fine. Your service B works fine. But when you put them together, you get a Rube Goldberg machine that occasionally processes invoices but mostly generates heat and confusion.
To solve this, we invented “observability,” which is a fancy word for hiring a digital private investigator to stalk your own code. You need a service discovery tool. Then, a distributed tracing library. Then a circuit breaker, a bulkhead, a sidecar proxy, a configuration server, and a small shrine to the gods of eventual consistency.
Your developer productivity begins a gentle, heartbreaking decline. A simple feature, such as adding a “middle name” field to a user profile, now requires coordinating three teams, two API version bumps, and a change management ticket that will be reviewed next Thursday. The context switching alone shaves IQ points off your day. You have solved the complexity of the monolith by creating fifty mini monoliths, each with its own deployment pipeline and its own lonely maintainer who has started talking to the linter.
Your infrastructure bill is now a novelty item
There is a financial aspect to this midlife crisis. In the old days, you rented a server. Maybe two. You paid a fixed amount, and the server did the work.
In the microservices era, you are not just paying for the work. You are paying for the coordination of the work. You are paying for the network traffic between the services. You are paying for the serialization and deserialization of data that never leaves your data center. You are paying for the CPU cycles required to run the orchestration tools that manage the containers that hold the services that do the work.
It is an administrative tax. It is like hiring a construction crew where one guy hammers the nail, and twelve other guys stand around with clipboards coordinating the hammering angle, the hammer velocity, and the nail impact assessment strategy.
Amazon Prime Video found this out the hard way. In a move that shocked the industry, they published a case study detailing how they moved from a distributed, serverless architecture back to a monolithic structure for one of their core monitoring services.
The results were not subtle. They reduced their infrastructure costs by 90 percent. That is not a rounding error. That is enough money to buy a private island. Or at least a very nice yacht. They realized that sending video frames back and forth between serverless functions was the digital equivalent of mailing your socks to yourself one sock at a time. It was inefficient, expensive, and silly.
The myth of infinite scalability
Let us talk about that word. Scalability. It gets whispered in architectural reviews like a magic spell. “But will it scale?” someone asks, and suddenly you are drawing boxes and arrows on a whiteboard, each box a little fiefdom with its own database and existential dread.
Here is a secret that might get you kicked out of the hipster coffee shop. Most systems never see the traffic that justifies this complexity. Your boutique e-commerce site for artisanal cat toys does not need to handle Black Friday traffic every Tuesday. It could likely run on a well-provisioned server and a prayer. Using microservices for these workloads is like renting an aircraft hangar to store a bicycle.
Scalability comes in many flavors. You can scale a monolith horizontally behind a load balancer. You can scale specific heavy functions without splitting your entire domain model into atomic particles. Docker and containers gave us consistent deployment environments without requiring a service mesh so complex that it needs its own PhD program to operate.
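To make “scale a monolith horizontally” concrete: you run identical replicas of the same application and put one dispatcher in front. Here is a toy round-robin balancer in Python; the replica names are invented for illustration, and a real deployment would use a load balancer product rather than hand-rolled code, but the mechanism really is this simple.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Toy dispatcher: identical monolith replicas, served in rotation.

    Replica names are hypothetical; in production this role is played
    by an off-the-shelf load balancer, not application code.
    """

    def __init__(self, replicas):
        self._replicas = cycle(replicas)

    def route(self):
        """Pick the next replica in rotation for the incoming request."""
        return next(self._replicas)

lb = RoundRobinBalancer(["monolith-1", "monolith-2", "monolith-3"])
print([lb.route() for _ in range(4)])
# → ['monolith-1', 'monolith-2', 'monolith-3', 'monolith-1']
```

No service mesh, no sidecar, no distributed tracing required: every replica can answer every request, so capacity is just a replica count.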
The infinite scalability argument assumes you will be the next Google. Statistically, you will not. And even if you are, you can refactor later. It is much easier to slice up a monolith than it is to glue together a shattered vase.
Making peace with the boring choice
So what is the alternative? Must we return to the bad old days of unmaintainable codeballs?
No. The alternative is the modular monolith. This sounds like an oxymoron, but it functions like a dream. It is the architectural equivalent of a sensible sedan. It is not flashy. It will not make people jealous at traffic lights. But it starts every morning, it carries all your groceries, and it does not require a specialized mechanic flown in from Italy to change the oil.
You separate concerns inside the same codebase. You make your boundaries clear. You enforce modularity with code structure rather than network latency. When a module truly needs to scale differently, or a team truly needs autonomy, you extract it. You do this not because a conference speaker told you to, but because your profiler and your sprint retrospectives are screaming it.
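“Enforce modularity with code structure” can be as lightweight as a CI check that flags imports reaching past another module’s public surface. Below is a minimal sketch using Python’s `ast` module; the package layout and the rule “anything deeper than a package root is internal” are assumptions chosen for this example, not a standard.

```python
import ast

def boundary_violations(source, own_module, packages):
    """Return imported names that reach inside a foreign package.

    Assumed convention for this sketch: a module may import another
    package's root (its public API) but not its dotted internals.
    """
    violations = []
    for node in ast.walk(ast.parse(source)):
        targets = []
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            targets = [node.module]
        for name in targets:
            root = name.split(".")[0]
            # Foreign package + dotted path = reaching into internals.
            if root in packages and root != own_module and "." in name:
                violations.append(name)
    return violations

code = "import billing.db.models\nfrom billing import api\n"
print(boundary_violations(code, "shipping", {"billing", "shipping"}))
# → ['billing.db.models']
```

Run across the codebase, a check like this gives you the boundary discipline microservices promise, minus the network hop, and it tells you exactly which module has earned extraction when the violations pile up.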
Your architecture should match your team size. Three engineers do not need a service per person. They need a codebase they can understand without opening seventeen browser tabs. There is no shame in this. The shame is in building a distributed system so brittle that every deploy feels like defusing a bomb in an action movie, but without the cool soundtrack.
Epilogue
Architectural patterns are like diet fads. They come in waves, each promising total transformation. One decade, it is all about small meals, the next it is intermittent fasting, the next it is eating only raw meat like a caveman.
The truth is boring and unmarketable. Balance works. Microservices have their place. They are essential for organizations with thousands of developers who need to work in parallel without stepping on each other’s toes. They are great for systems that genuinely have distinct, isolated scaling needs.
For everything else, simplicity remains the ultimate sophistication. It is also the ultimate sanity preserver.
Next time someone tells you monoliths are dead, ask them how many incident response meetings they attended this week. The answer might be all the architecture review you need.
(Footnote: If they answer “zero,” they are either lying, or their PagerDuty alerts are currently stuck in a dead letter queue somewhere between Service A and Service B.)