July 15, 2026

Blog NivelEpsilon

AWS Proton or how to stop developers from burning down the infrastructure

Writing code is a clean, almost intellectual pursuit. You sit in a quiet room, sip your beverage of choice, and arrange logic into a beautiful digital tapestry. If you do your job well, the application works perfectly on your laptop. But then comes the moment when you must share your creation with the rest of the world. This is where the poetry ends, and the manual labor begins.

Suddenly, you are no longer a software creator. You are an amateur construction worker trying to pave a highway while driving on it. You find yourself wrestling with security configurations, arguing with network routing protocols, and praying that your cloud deployment pipelines do not collapse under the weight of a single misplaced space in a configuration file. For many developers, managing the cloud feels like buying a brand-new television, only to discover that you have to personally run copper wire to the local power plant just to turn it on.

While looking into how modern cloud infrastructure operates, I spent some time investigating AWS Proton. The philosophy behind this service is fascinating because it tackles one of the oldest, most polite cold wars in the history of office environments, the constant struggle between the people who build software and the people who keep the servers from catching fire.

The high cost of giving matches to creative people

When a software company is small, cloud infrastructure is a domestic affair. You might have two developers and a single cloud account. If someone needs to deploy a new feature, they simply log in, click a few buttons in a console, and hope for the best. It is chaotic, but it is a cozy kind of chaos.

Once a company grows, however, teams multiply. If you leave developers to their own devices without any central coordination, they will inevitably invent their own highly creative, deeply eccentric ways to deploy their code.

One team might rely on custom scripts that only run on a specific laptop currently sitting under a coffee-stained desk. Another team might build a labyrinth of configuration files that are so complex they resemble ancient runic spells. A third team might simply copy and paste outdated templates they found on an internet forum, hoping that nobody notices the glaring security vulnerabilities hidden inside.

Before you know it, your corporate cloud architecture looks less like a modern facility and more like a crowded public pool where nobody is paying attention to the lifeguards. The challenge is no longer about writing good code. It is about preventing the sheer variety of deployment methods from driving your operations team to physical and emotional exhaustion.

Dividing the kitchen between the chefs and the safety inspectors

This is where AWS Proton enters the room, holding a clipboard and looking very serious. The easiest way to understand the service is to look at how a professional restaurant kitchen operates.

If you let every line cook design their own stove, choose their own gas pressures, and source their own fire extinguishers, the restaurant will burn down before the first appetizer is served. Instead, a master chef designs the kitchen layout once, sets up the safety parameters, and ensures the prep stations are stocked. The line cooks can then focus entirely on cooking the food without having to worry about plumbing or municipal gas lines.

AWS Proton does exactly this for cloud deployments by separating your engineering department into two distinct, cooperative camps.

The platform engineers act as the safety inspectors. They define the corporate standards, write the reusable infrastructure templates, and establish secure delivery pipelines. They build a safe sandbox with very tall, very soft walls.

The developers act as the creative chefs. Instead of writing custom deployment configurations from scratch, they simply log into a self-service portal, select an approved template, and deploy their applications. They do not have to know how the network routing works under the hood. They just need to know that their code has a safe place to run.

The magic of filling out a form without crying

To see how this works in practice, we can look at what this separation actually looks like on a file level.

First, the platform team defines what a standard, secure service should look like. They write a schema file using YAML, which is the industry-standard language for telling computers how to build virtual networks.

Here is a simplified example of what a platform engineer might write to define an environment template:

schema:
  format:
    version: "1"
proton: EnvironmentTemplate
index:
  name: "secure-ecs-fargate-environment"
  version: "1.0"
  description: "A standard environment with sensible defaults so nobody accidentally exposes our database to the open internet"

This template is stored centrally in AWS Proton. It acts as an official blueprint.

When a developer wants to deploy a new microservice, they do not need to read through hundreds of lines of infrastructure code. They do not need to learn how to configure an AWS load balancer. Instead, they write a very simple specification file that only asks them for the details that actually matter to their application.

Here is what the developer’s configuration file looks like:

proton: ServiceSpec
  spec:
    inputs:
      image_tag: "v2.1.0"
      container_port: 8080
      cpu: "512"
      memory: "1024"
      billing_tag: "marketing-campaign"

The developer only has to specify the basic dimensions of their application, such as how much memory it needs and which port it uses. AWS Proton takes this small file, combines it with the platform team’s secure blueprint, and builds the entire system automatically. The developer gets their application deployed in minutes, and the platform team can sleep at night knowing that nobody used insecure settings.

The industrialization of the digital assembly line

If you are just learning the basics of cloud computing or building a personal website to display pictures of your cat, AWS Proton is almost certainly more tool than you need. It is the industrial equivalent of buying a commercial cement mixer to repair a crack in your driveway.

But if you look at where the wider technology industry is heading, services like AWS Proton represent a massive cultural shift. For the past decade, the industry told developers that they needed to know everything. They were told to write the code, configure the networks, manage the databases, and monitor the security alerts. We called this DevOps, and while the intentions were noble, it often resulted in highly skilled programmers spending half their week acting as frustrated system administrators.

Companies are starting to realize that cognitive overload is real. If you force a developer to become an expert in cloud networking, they will have less energy to spend on making your product actually work.

The rise of platform engineering is a quiet admission that we need specialists. We need people who are incredibly good at building secure, stable platforms, and we need to let everyone else use those platforms without having to understand the underlying physics of the cloud.

Some parting thoughts on staying warm without catching fire

The more you look at modern cloud architectures, the more you realize that the hardest problems are rarely technical. Computers will almost always do exactly what we tell them to do, provided we format our instructions correctly. The real friction exists in the human systems we build around those computers.

AWS Proton is an attempt to reduce that human friction. By turning infrastructure into a collaborative, template-driven system, it allows different teams to work together without constantly stepping on each other’s toes.

If you are currently studying cloud technologies or preparing for a career in platform engineering, understanding these patterns is incredibly valuable. The future of software development is not about making systems more complex. It is about building elegant interfaces that keep us from burning down the very things we are trying to build.

The sandbox finally gets its own apartment

There is a particular kind of optimism that appears in architecture diagrams. It is usually represented by a neat rectangle labelled “sandbox”, placed between “user code” and “production systems”, as if malicious input could be discouraged by tidy geometry.

For years, we have treated this rectangle with a level of trust usually reserved for elevators, bridges, and the one person in the meeting who says, “This should be simple.” Inside the box went user scripts, notebook cells, plugin code, CI jobs, vulnerability scanners, browser IDE sessions and, more recently, AI agents with the confidence of a junior developer and the impulse control of a caffeinated raccoon.

The problem was never running code. Computers are embarrassingly eager to do that. The problem was trusting the code not to bite the furniture.

AWS Lambda MicroVMs are interesting because they arrive at the exact moment when everyone is trying to run more untrusted code, more often, with fewer humans staring nervously at terminal output. They are not just another Lambda feature wearing a party hat. They are AWS trying to give each user, job, or agentic session its own small, isolated room, complete with VM-level walls, preserved state, and a serverless landlord who handles the plumbing.

That is worth paying attention to.

The old sandbox had suspicious drywall

Containers made modern platform engineering possible. They package applications beautifully, start quickly and behave well enough that we have built entire civilizations on top of them, some of which even survive Monday mornings.

But containers share the host kernel. That is not a moral failure. It is simply the bargain. You get speed and density, but the isolation boundary is not the same as a virtual machine boundary. For normal workloads, this is often a perfectly reasonable trade. For hostile or unknown code, it becomes a slightly more awkward dinner party.

Running untrusted code in containers usually requires extra hardening. Seccomp profiles, AppArmor or SELinux policies, rootless runtimes, careful filesystem restrictions, capability dropping, network controls, resource limits, image hygiene, and a collection of internal documents with titles like “please do not run customer code as root again”.

This does not mean containers are bad. It means they have spent years being asked to babysit workloads they would probably rather not meet in a dark data center.

Virtual machines have always had a stronger answer. Each workload gets a separate kernel and a more substantial isolation boundary. Security teams like that sort of thing. They sleep better when the boundary is not made entirely of namespace optimism.

The problem is that traditional VMs are heavier. They take longer to boot, require more lifecycle management and make interactive platforms feel like they are starting a small moon every time someone clicks “run”. If you are building a browser IDE, a notebook platform, or an AI coding assistant, waiting minutes for an execution environment is not a product experience. It is a hostage situation with a spinner.

So teams built their own sandboxes. Some used hardened containers. Some orchestrated pools of VMs. Some built elaborate custom schedulers. Many discovered that safely running other people’s code is less a feature and more a lifestyle choice involving pager alerts.

Firecracker was already doing the quiet work

The interesting part is that AWS already had a lightweight virtualization story hiding in plain sight. Firecracker has powered Lambda and Fargate-style isolation patterns for years. It was designed to provide lightweight MicroVMs with fast startup and strong boundaries, without dragging the full emotional luggage of traditional VM management into every request.

Lambda MicroVMs take that idea and expose it as a managed primitive for a very specific class of applications. These are not ordinary Lambda functions that got promoted. They are separate resources with a distinct lifecycle and a different purpose.

Regular Lambda functions are still excellent for event-driven work. A request arrives, a function runs, a result returns, and the function disappears back into the mist like a responsible woodland creature. That model works beautifully for stateless tasks.

Lambda MicroVMs are for the messier cases. A user opens an interactive coding environment. An AI agent needs to inspect files, run code, call tools, and maintain context. A data analytics session installs packages, loads working files, and waits while the human operator goes to lunch, a meeting, or an unexpected argument with Kubernetes.

These sessions need state. They need isolation. They need to start quickly. They need to pause without losing their belongings.

That combination used to be annoying. Now it has a first-class shape.

Lambda functions answer the doorbell, MicroVMs let the guest stay

The lifecycle is the real story.

You create a MicroVM image from a Dockerfile and a code artifact. AWS initializes the environment and captures a Firecracker snapshot of disk and memory. Later, when you launch a MicroVM from that image, it starts from the pre-initialized snapshot instead of beginning from scratch.

This is the difference between cooking dinner from raw ingredients and reheating yesterday’s lasagna. One is noble. The other is faster and, under the correct emotional conditions, superior.

Snapshots matter because many interactive environments spend a surprising amount of time getting ready to do the thing the user actually asked for. Installing packages, loading runtimes, preparing tools, arranging directories, initializing services, warming models, importing dependencies, and opening files. All of this can become part of the prepared environment.

For AI agents, this is especially useful. The environment can already contain the tools, SDKs, interpreters, and supporting processes the agent needs. When a session starts or resumes, it does not have to assemble itself from a cold and resentful universe.

There is a catch, because distributed systems are legally required to have at least one. Snapshotting state means engineers must think carefully about what is safe to preserve and clone. Unique identifiers, network connections, ephemeral credentials, time-sensitive tokens, and initialization-time randomness may all need special handling.

A snapshot is not just a performance trick. It is a memory of a machine. And memories, as anyone who has debugged a stale cache knows, can lie with great confidence.

Eight hours change the serverless conversation

The 8-hour runtime limit is one of the details that make this more than a faster function wrapper.

Traditional serverless thinking has usually pushed us toward short, stateless execution. Do a job, return a result, vanish. This is clean, scalable, and architecturally pleasing, which means real users immediately invent workflows that do not fit it.

Interactive sessions are not short. Vulnerability scans can run for a while. Data analysis can wander. AI agents can start with “inspect this repository” and somehow end up creating four temporary files, two package installs, and a philosophical dispute with a linter.

Lambda MicroVMs support longer-running sessions while preserving memory, disk, and process state. During idle periods, the environment can be suspended and later resumed. That matters because many workloads are not constantly busy. Humans pause. Agents wait. Browsers sleep. Developers get pulled into meetings where someone says “alignment” until the will to compute leaves the room.

Idle suspension changes the economics and the experience. The user can return to a session that still has its working state, but the platform does not have to pay as if the machine were actively sweating the entire time.

This is still serverless, but not the old cartoon version where everything is tiny, stateless, and over before the coffee cools. It is serverless with a memory.

That sentence would have terrified architects ten years ago. Now it may be exactly what agent platforms need.

AI agents need walls, not vibes

Agentic AI has made sandboxing fashionable in the same way kitchen fires make extinguishers fashionable.

An AI agent can generate code, run commands, inspect files, call APIs, and modify state. Sometimes it does this brilliantly. Sometimes it does this like a raccoon found a shell prompt and decided to modernize procurement.

The security model for agents must assume mistakes. It must assume prompt injection. It must assume unexpected tool use. It must assume that the agent is not evil, just helpful in the way a toddler with admin privileges might be helpful.

This is where per-user or per-session MicroVM isolation becomes important. If each agent run gets its own dedicated execution environment, the blast radius becomes easier to reason about. One confused agent should not be able to interfere with another user’s session. One malicious script should not get a friendly wave from a shared kernel and an invitation to explore the neighborhood.

But this is not magic safety dust. VM-level isolation is a stronger primitive, not a complete architecture.

You still need tight IAM roles. You still need network egress controls. You still need to decide whether the MicroVM can reach the internet, internal APIs, metadata services, package repositories or that one ancient endpoint nobody admits owns the business process. You still need secrets management, audit trails, logging, quotas, timeouts and cleanup.

The agent may now have its own padded serverless studio. You still need to check what tools you left in the drawers.

The security story is better, but not finished

The most dangerous sentence in cloud security is “the service handles that”.

Sometimes the service does handle that. Sometimes it handles one very important layer, and everything around it remains your problem, wearing a fake moustache.

Lambda MicroVMs improve the isolation layer. A dedicated MicroVM with no shared kernel between users is a meaningful architectural improvement for untrusted execution. It gives platform teams a stronger foundation than a basic container sandbox. It also reduces the amount of custom virtualization machinery teams need to build themselves.

But the surrounding design still decides whether the result is secure or just expensive theater.

Before production, architects should ask some unfashionable questions.

What can the MicroVM reach on the network?

What credentials exist inside the environment?

Can the workload call internal APIs?

Can it download arbitrary packages?

How is user code authenticated before execution?

How are sessions terminated?

What happens to files after suspension?

What gets written to logs?

How are snapshots built, versioned, and retired?

How are malicious or runaway workloads detected?

What prevents one customer from turning your lovely sandbox platform into a cryptocurrency-themed space heater?

These questions are not reasons to avoid the service. These are the reasons to use it properly. The stronger the primitive, the more tempting it is to stop thinking. Resist this. Architecture diagrams are very good at looking innocent.

The bill is also part of the architecture

Serverless has trained us to associate managed services with financial elegance. Sometimes this is true. Sometimes the invoice arrives carrying a clipboard and asking to speak with the adult who designed the retry policy.

Lambda MicroVMs have a cost model that deserves actual modeling. You pay for baseline resources while a MicroVM is running, and for additional active usage when the workload exceeds that baseline. That is reasonable, but it means usage patterns matter.

A short-lived scan that starts, does work, and terminates has one shape. An interactive coding session that runs, idles, suspends, resumes, and idles again has another. A swarm of AI agents that each receive their own environment has a third shape, one that may require finance to breathe into a paper bag if nobody adds quotas.

The point is not that this will be expensive. The point is that “serverless” does not automatically mean “cheap”. It means the operational model changes. Billing still obeys physics, usage, and the quiet vengeance of concurrency.

If you are building a platform on top of Lambda MicroVMs, model the boring details. Average session duration. Idle time. Resume frequency. Memory size. CPU requirements. Disk usage. Regional availability. Expected concurrency. Abuse scenarios. Free-tier fantasies, if your product team has recently been exposed to those.

Cloud bills are where optimistic architecture diagrams go to receive adult supervision.

Platform engineering gets a new building block

For platform teams, the most promising part of Lambda MicroVMs is not that they make one workload easier. It is that they could simplify a recurring pattern.

Many organizations need a safe way to run untrusted or semi-trusted code. Internal developer platforms need execution sandboxes. AI products need agent workspaces. Security platforms need isolated scanning environments. Data platforms need notebook-like sessions. Education platforms need per-student runtimes. CI systems need disposable workers.

Historically, building this well meant assembling a small private cloud inside your cloud. VM pools, schedulers, cleanup jobs, image builders, network isolation, capacity management, logging pipelines, and emergency scripts named things like final_cleanup_v7_really_final.sh.

Lambda MicroVMs do not remove all platform work. They move the boundary.

You still need golden images. You still need lifecycle policies. You still need guardrails. You still need quotas and tenant mapping. You still need observability. You still need deployment pipelines. You still need to decide how MicroVMs integrate with Lambda functions, S3, CloudWatch, IAM, and whatever control plane runs your product.

But you may not need to own as much of the virtualization substrate yourself. That is not glamorous. It is better. Glamour is what vendors put in keynote slides before someone asks about day two operations.

A sensible pattern might use regular Lambda functions for the event-driven control plane, APIs, and orchestration logic. Lambda MicroVMs handle the isolated execution path, where user code or AI-generated code actually runs. S3 stores build artifacts. CloudWatch captures logs. IAM defines what each environment can do. Network controls define what each environment can reach.

That division feels clean. Not simple, because nothing involving untrusted code is simple. But clean enough to be useful.

The uncomfortable production checklist

Before adopting Lambda MicroVMs, it is worth being deliberately boring. Boring is underrated in architecture. Boring systems sleep through the night.

Ask which workloads actually need VM-level isolation. Some do. Some are perfectly fine as regular Lambda functions, ECS tasks, or containers in Kubernetes. Using a stronger primitive everywhere can become expensive and operationally awkward.

Ask how long sessions really run. Not how long the demo runs. Not how long the product manager hopes they run. How long do users actually keep them alive while opening seven browser tabs and going to lunch?

Ask what state must survive suspension. Then ask what state must never survive suspension. Secrets, temporary credentials, one-time tokens, and sensitive intermediate files need careful treatment.

Ask how images are built and patched. A MicroVM image is still software. It needs supply chain controls, vulnerability scanning, versioning, and retirement. Yesterday’s golden image can become tomorrow’s antique security incident.

Ask what happens when a workload wants more. More CPU. More memory. More disk. GPU acceleration. x86 compatibility. A different region. A longer session. Some requests will fit the service. Some will not. That is not a failure. That is architecture doing its job before production does it for you with alarms.

Ask how abuse is stopped. Per-user quotas, concurrency limits, session expiration, network restrictions, and cost controls should be designed early. Waiting until after the first runaway agent is a traditional learning method, but not a relaxing one.

A better sandbox is not a finished safety plan

Lambda MicroVMs arrive at a useful moment. The industry is asking applications to execute more code that the application developer did not write. Some of that code comes from users. Some comes from AI systems. Some comes from a collaboration between both, which is how you get an incident report with a surprisingly emotional timeline.

The old choices were uncomfortable. Containers were fast but needed serious hardening for hostile workloads. Traditional VMs were safer but heavy. Regular serverless functions were elegant but not designed for long, stateful, interactive sessions.

Lambda MicroVMs offer a new middle ground. VM-level isolation. Fast launch and resume from snapshots. Stateful sessions. Idle suspension. Serverless lifecycle management. A dedicated environment per user, job, or agentic session.

That combination will not remove the need for security architecture. It will not absolve anyone from IAM, networking, secrets, observability, patching, or cost controls. It will not make AI agents safe by virtue of being placed inside a branded rectangle.

But it gives the sandbox real walls.

And that matters. Because the sandbox did not become safer because someone wrote “AI” on the diagram. It became safer because the walls got real.

June 28, 2026 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Why multi cloud architecture is overrated

Somewhere along the way, multi cloud stopped being an architecture decision and became a personality trait. It shows up on slide decks the way kale shows up at weddings, nobody is quite sure who asked for it, but removing it now would feel like a confession. Say the words out loud in a meeting and watch the room nod with the solemn confidence of people who have read the headline but not the footnotes.

The pitch is seductive. Spread your workloads across two or three providers and you get freedom, resilience, and a vague aura of technical adulthood. What you actually get, in a surprising number of cases, is a second pager rotation and a networking bill that reads like a ransom note.

This is not an argument that multi cloud is always wrong. It is an argument that it is rarely the default answer to a question nobody bothered to ask. A lot of cloud decisions get made the way people buy treadmills, for the version of themselves they intend to become, not the one currently standing in the room.

What the multi cloud pitch is really selling

Strip away the architecture diagrams, and the multi cloud sales pitch is mostly emotional. It promises escape from vendor lock in, that primal fear of being trapped by a provider who knows it. It offers the warm feeling of resilience, the sense that if one cloud falls over, your business will calmly stroll to the next one. And it projects maturity, the impression that your team has graduated from amateur hour to serious infrastructure people.

These are real desires. The problem is that multi cloud answers them roughly the way buying a second house answers your fear of a leaky roof. Technically, you now have options. You also have two roofs.

Where the trouble starts

The fantasy version of multi cloud assumes a clean layer of abstraction sitting neatly on top of every provider, so your team writes once and runs anywhere. The practical version is that each cloud is its own country, with its own language, customs, and bewildering opinions about what a load balancer should be called.

Total abstraction almost never survives contact with production. Your engineers do not get to learn one platform deeply. They get to learn two platforms simultaneously, plus the seam between them, which is where the genuinely interesting bugs live. The architecture becomes harder to document, harder to operate, and harder to evolve, which is a polite way of saying nobody on the team fully understands all of it anymore, including the person who built it.

The costs nobody put on the slide

Here is where the war story usually turns grim. The expensive part of multi cloud is rarely the compute. It is everything wrapped around it.

You duplicate your observability stack, because the dashboards that work beautifully in one cloud are politely useless in the other. Your IAM and governance model doubles in surface area, since every permission, role, and policy now needs a twin that behaves identically and never quite does. Network egress, cross cloud replication, and data transfer charges accumulate quietly in the background like a subscription you forgot to cancel.

And then there is the human cost, which never appears in the cost calculator. Multi cloud raises the seniority floor of your entire team. The junior engineer who could safely ship in a single, well governed environment now needs to understand two of everything before they can be trusted near production. You do not just pay for more infrastructure. You pay for more expertise to keep that infrastructure from quietly drifting apart.

Security and compliance, now in stereo

Security people have a particular look they get when you mention multi cloud, the expression of someone who has been asked to childproof two houses for the price of one.

Every additional cloud is another attack surface, another set of policies to keep synchronized, and another generous opportunity for a misconfiguration to go unnoticed. Auditing becomes an exercise in translation, because a control that means one thing in the first provider means something subtly different in the second. The realistic end state is not airtight redundancy. It is an uneven security posture, strong where your team has invested attention and quietly soft where they ran out of hours.

Operations and the art of the unsolvable incident

Incidents are where multi cloud presents its invoice. At three in the morning, when something is on fire, you do not want your logs, metrics, and traces scattered across two providers like a crime scene split between jurisdictions.

Correlating an incident across clouds is slow, and slow is expensive when customers are watching. Mean time to detection creeps up. Mean time to resolution creeps up with it. Postmortems acquire an extra paragraph that always begins with some variation of “we lost time understanding which cloud was actually responsible.” To stitch the picture back together, you end up leaning on yet another external platform whose only job is to make your two clouds look like one, which, if you squint, is a strange amount of effort to undo a decision you made on purpose.

Portability you have versus portability you imagine

The intellectual cornerstone of multi cloud is portability, the comforting belief that you could pack up and move providers whenever you liked. In practice, portability tends to be the gym membership of architecture, fully paid for, rarely used, and quietly aspirational.

Most systems end up leaning on provider specific services anyway, because those services are good and saying no to them on principle is a luxury few teams can afford. Kubernetes and containers genuinely help, they smooth the edges, but they do not magically erase the dependency underneath. Real migrations are still slow, still costly, and still the kind of project that gets proposed with enthusiasm and abandoned with relief.

The honest goal is not total independence. It is a reasonable exit, an architecture where leaving would be painful but possible, rather than one where leaving is theoretically free and practically unthinkable. Designing for a sensible escape route beats promising a freedom you will never exercise.

When multi cloud actually earns its keep

None of this means multi cloud is a mistake. It means it is a tool with a narrow, legitimate set of jobs, and it deserves to be used for those rather than worn as a badge.

It earns its place when regulation or data sovereignty leaves you no choice, when the data legally must live in particular places. It makes sense when your organization is large enough that real negotiating leverage over providers translates into serious money. It is justified when the business risk is so high that genuine cross cloud redundancy is worth its considerable price. It is reasonable when a specific service on another cloud offers an advantage you cannot replicate elsewhere. And it works, crucially, only when the organization is mature enough to operate it, with the teams, processes, and discipline already in place rather than hopefully on order.

Notice that every one of these starts with a concrete business reason, not a vibe.

What I would do instead

If you handed me a blank slate and a reasonable budget, I would pick one primary cloud and govern it properly, because a single environment run well beats two environments run anxiously almost every time.

I would design for enough portability to sleep at night, not for the fantasy of frictionless migration. I would automate deployments and infrastructure from day one, before the shortcuts calcify into tradition. I would standardize observability, security, and cost management so the whole thing stays legible to the people who did not build it. And I would revisit the single cloud decision on a schedule, honestly, looking for a real reason to expand rather than a fashionable one.

That last part matters. The goal is not loyalty to one provider. It is refusing to add a second one until something other than anxiety is asking for it.

The question worth asking

Multi cloud can be genuinely useful. It just should not be the reflex, the thing you reach for because the alternative feels insufficiently ambitious. Far too often, what gets sold as strategy is simply complexity wearing a nicer outfit.

Cloud maturity is not measured in the number of providers on your invoice. It is measured by how well you use the one you actually need. The most sophisticated architecture in the room is frequently the one that resisted the urge to be impressive.

So the question is not how many clouds you run. It is whether each one earns its place.

June 16, 2026 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

Your Ingress resource is living on borrowed time

There is a special kind of grief reserved for infrastructure that works fine. Nobody writes eulogies for the broken stuff; that gets deleted with enthusiasm. The painful goodbyes are for the things that still do their job every day, quietly, while the rest of the industry has already decided they belong in a museum. Your Ingress resources are in that category now. They route traffic, they terminate TLS, and they have not paged you in months. And they are, officially and by design, a dead end.

The Kubernetes project has been remarkably polite about this. Ingress is “frozen”, which is the standards body equivalent of moving someone to a nice farm upstate. No new features, no spec evolution, no fixes for the design decisions everyone now regrets. The replacement is called Gateway API, it reached general availability back in 2023, and it is one of those rare cases where the new thing is not just the old thing with more YAML. It actually fixes the organizational problem that made Ingress miserable, which, as we will see, was never really a technical problem at all.

The Ingress spec was always a rough draft

Here is the part of the story that usually gets left out. When Ingress shipped in 2015, the Kubernetes maintainers did not believe they had solved HTTP routing. They believed, correctly, that they had no idea what HTTP routing should look like, and they shipped a minimal spec on purpose. Host, path, backend service. That was essentially it. Everything else, the maintainers figured, could be handled by annotations until the community figured out what it actually wanted.

The community figured out what it wanted, all right. It wanted everything, and it wanted it via annotations.

If you have ever operated an nginx ingress controller in production, you know the genre. nginx.ingress.kubernetes.io/rewrite-target. nginx.ingress.kubernetes.io/canary-weight. nginx.ingress.kubernetes.io/configuration-snippet, which is the annotation equivalent of a hole in the wall that you push raw nginx config through and hope for the best. Traefik grew its own dialect. HAProxy grew another. At some point, the nginx controller alone supported well over a hundred proprietary annotations, each one a small confession that the spec underneath could not do the job.

The practical consequence is one that every platform engineer has lived. Your routing configuration is portable in theory and welded to your controller in practice. Migrating from nginx to anything else means translating a folklore of annotations by hand, and some of them have no translation, because they were never features of Kubernetes. They were features of one specific reverse proxy, smuggled in through a string field.

None of this makes Ingress bad design. It makes Ingress an honest admission, in 2015, that nobody agreed on what routing should look like. Gateway API is what happened after roughly eight years of arguing, when they finally agreed.

Three resources instead of one, and that is the whole upgrade

Gateway API replaces the single Ingress object with three, and before your YAML fatigue kicks in, stay with me, because the count is not the point. The ownership is.

GatewayClass is the template. It declares what kind of gateway infrastructure your cluster offers (Envoy, Cilium, or a cloud load balancer), and it gets written approximately once, by whoever runs the platform, and then mostly forgotten.

Gateway is a running instance of that template. It is the actual listener, the thing with an IP address and open ports, and it lives in an infrastructure namespace where application developers cannot poke it.

HTTPRoute is the routing rule. It says “traffic for this hostname and this path goes to this service”, and it lives in the application’s own namespace, right next to the Deployment it serves, owned by the team that owns the app.

That is the entire model. Three objects, three different owners, three different namespaces if you want them. Every interesting thing about Gateway API follows from that separation, which brings us to the actual argument.

The hallway belongs to the platform team, and the door belongs to the app team

Think about what an Ingress object actually is, organizationally. It is one resource that contains both infrastructure concerns (TLS certificates, load balancer behavior, controller tuning) and application concerns (which path goes to which service). One object, two very different audiences, and Kubernetes RBAC can only draw permission lines around whole objects.

So every organization running Ingress at scale ends up choosing between two bad options. Option one, the platform team owns all Ingress resources, and application teams file tickets to change a path rule, which is a magnificent way to turn a thirty-second change into a three-day wait. Option two, application teams own their Ingress resources, which means application teams can now set controller-level annotations, and somewhere in your cluster, there is a configuration snippet written by an intern in 2022 that nobody dares to remove. Both options are workarounds for the same flaw. The spec crammed two jobs into one object, and org charts do not bend that way.

Gateway API splits the object along exactly the line where your teams already split. The platform engineer provisions the Gateway in the infra namespace. They decide which ports are open, which TLS policy applies, and, crucially, which namespaces are allowed to attach routes to it. The application developer writes an HTTPRoute in their own namespace that says, in effect, “attach me to the gateway named external-web”. The route references the gateway by name; the gateway grants permission by policy. Cross-namespace routing is not a hack here, it is the core mechanic of the spec, with an explicit handshake on both sides.

If you read my past RBAC article, this will feel familiar, because it is the same principle wearing a different hat. Least privilege stopped being just about who can “kubectl delete” things and started applying to the network path itself. App teams get exactly the surface they need (their routes, their namespace) and nothing else. The platform team stops being a ticket-processing bottleneck and goes back to doing platform work. Nobody negotiates over annotations in a Slack thread at 6 p.m. on a Friday, which I am told does wonders for retention.

There is also a quieter benefit that only shows up in the postmortem. When routing rules live next to the application, the blast radius of a bad change is the application. When everything lives in one shared Ingress layer, a typo in one team’s path rule can take an unrelated team’s traffic with it. Separation of concerns is usually sold as elegance. In production, it is mostly sold as smaller incidents.

What Ingress made you beg your controller to do

Now for the features, briefly, because the features are genuinely less interesting than the reframe behind them.

Take canary deployments. With Ingress on nginx, weight-based traffic splitting means creating a second Ingress object, blessing it with ‘canary: “true”’ and ‘canary-weight: “10”’ annotations, and trusting that the controller interprets your strings correctly. With Gateway API, an HTTPRoute simply lists two backends with weights, 90 and 10, as ordinary structured fields. The API server validates them. Your canary rollout is now plain YAML instead of an incantation, and you did not have to install a service mesh to get it.

Header-based routing gets the same treatment. Routing requests with ‘x-beta-user: true’ to a different backend is a match condition in the spec, not a regex pasted into a controller-specific snippet. URL rewriting is a filter. Request mirroring, the trick where you copy live traffic to a new version without affecting real responses, is a filter too. Timeouts, header manipulation, traffic redirection, all first-class citizens with schemas.

Here is the reframe. None of these capabilities are new. Your reverse proxy could do all of this in 2016; reverse proxies are old and wise. What was missing was a portable way to ask for it. Under Ingress, every feature beyond host-and-path routing required learning the proprietary annotation dialect of whichever controller you happened to inherit, and your hard-won fluency in nginx annotations was worth exactly nothing the day someone migrated to Traefik. Gateway API moves those features into the spec itself, where they are typed, validated, and identical across implementations. The knowledge finally transfers. So do the manifests.

GatewayClass is the new vendor coupling point, and that is a better deal

Time for the honest section, because every article praising a new standard owes you one.

Gateway API does not eliminate vendor lock-in, and anyone telling you otherwise is selling a controller. The GatewayClass is where you commit. You pick Cilium, or Envoy Gateway, or Istio, or nginx-gateway-fabric, and from that moment your gateways run on that implementation’s machinery, with that implementation’s performance profile and that implementation’s extension features. Conformance across implementations is real but not absolute; the spec has core features everyone must support and extended ones they may.

What changed is the geometry of the coupling. With Ingress, the vendor dependency was smeared across your entire estate, hiding inside opaque annotation strings on every single routing object. You could not see it, measure it, or contain it; you discovered its true size on migration day, which is the worst possible day to discover anything. With Gateway API, the coupling is compressed into one object type. Everything above the GatewayClass (your routes, your matches, your filters, your weights) is portable standard YAML. Everything below it is the vendor’s problem. Swapping implementations becomes “change the GatewayClass and re-test”, not “translate three hundred annotations from one dialect to another and pray”.

The ecosystem, for the record, is not a science fair. Cilium ships a Gateway implementation on eBPF. Envoy Gateway is the CNCF’s straightforward Envoy packaging. Istio treats Gateway API as its preferred configuration surface these days. nginx-gateway-fabric exists for the sizable demographic that would like to keep nginx but lose the annotations. All of these run in production at companies whose outages would make the news.

You do not need to migrate everything to start

The best property of Gateway API for anyone with an existing cluster is that it demands nothing of your existing cluster. Gateway API and Ingress run side by side indefinitely. The controllers do not fight, the resources do not overlap, and your hundred working Ingress objects can keep working while you experiment two namespaces away.

The sensible entry point is not a migration project (migration projects are where enthusiasm goes to file status reports). It is one new service, or one feature branch, routed through an HTTPRoute while everything else stays put. You get a feel for the model, your platform team writes its first Gateway, and the canary feature gets a real audition on something low-stakes.

Whether your cluster is already prepared takes one command to find out.

kubectl get crds | grep gateway.networking.k8s.io

If that returns a list of CRDs, the welcome mat is already out; managed offerings like GKE ship them preinstalled. If it returns nothing, the installation is a single manifest from the Gateway API releases page, and then the welcome mat is out.

Ingress will keep working for years. Frozen APIs in Kubernetes enjoy long, comfortable retirements, and nobody is coming to delete your manifests. But every new routing feature, every new controller capability, and increasingly every new piece of documentation is being written for the other API now. Borrowed time is still time. It is just no longer the kind you should be building on.

June 10, 2026 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff

RBAC is not least privilege, and your cluster is the proof

Your security scanner ran last night. It came back green. RBAC is configured, there are no critical findings, and you closed the tab with the quiet satisfaction of someone who has done the responsible thing. The cluster is locked down. You can go to lunch.

Here is the uncomfortable part. A green scanner answers the question “Is access controlled?” It does not answer the question “Is access minimal?” Those are different questions, and most teams conflate them because the first one is easy to check and the second one requires reading things nobody wants to read on a Tuesday.

RBAC answers the first. Least privilege requires answering both. And a perfectly valid RBAC configuration can be, at the very same time, a perfectly generous one. The scanner has no opinion about generosity.

The ClusterRole you inherited from a Helm chart in March

Kubernetes ships three aggregated ClusterRoles out of the box (admin, edit, view), and they have a quietly alarming property. They absorb permissions. Any ClusterRole carrying the label ‘rbac.authorization.k8s.io/aggregate-to-edit: “true”’ gets automatically folded into ‘edit’, with no human in the loop and no diff to review.

This is convenient right up until it is not. When you installed that operator back in March, its Helm chart shipped a CRD and a ClusterRole with the aggregation label attached, because that is the polite, idiomatic way to do it. From the moment ‘helm install’ finished, every subject bound to ‘edit’ in your cluster silently gained permissions over a brand new resource type. Nobody approved it. Nobody saw it. The controller did exactly what it was designed to do, which is the part that should worry you.

So the RoleBinding still says ‘edit’. The word has not changed. What it grants has, several times, across several chart upgrades, and the only record of the expansion is scattered across ClusterRole objects nobody has opened since they were applied.

The takeaway is small and annoying: every time you install a chart, check what it aggregated. ‘kubectl get clusterrole -l rbac.authorization.k8s.io/aggregate-to-edit=true’ is two minutes of your life and occasionally a genuine surprise.

That ServiceAccount reads secrets, all of them, probably

Consider a ServiceAccount with ‘get’ on secrets in a single namespace. On paper, this looks narrow and tidy. The reviewer who approved it was right to approve it. The problem is that RBAC grants do not live in isolation; they live next to whatever else is running in that namespace.

If that namespace also hosts External Secrets Operator, a Vault Agent sidecar, or a CSI secrets driver, the secrets sitting there are not application trivia. They are the synced, materialized credentials that those tools pulled from somewhere more important. A grant that reads “can view secrets in ‘team-a’” can, depending on the architecture around it, mean “can read the cloud provider credentials that External Secrets faithfully copied into ‘team-a’ thirty seconds ago.”

Nothing here is broken. Every component is behaving as documented. That is exactly why it slips past review: each piece is reasonable, and the risk only exists in the seam between them, where no single Role definition is looking.

So when you audit a secrets grant, do not read the Role. Read the room. Ask what else lives in that namespace and what those neighbors keep in their pockets.

Creating a Pod sometimes creates a root shell on the node

This is the one people refuse to believe until you show them.

If Pod Security Admission is not enforced in ‘restricted’ mode, a subject with ‘create’ on pods is, functionally, a subject with a path to the node. They can define a pod that mounts the host root filesystem as a volume, sets ‘hostPID: true’, runs ‘privileged: true’, or maps a host port to quietly intercept traffic. From inside that pod, the node is no longer a node; it is a directory.

None of this is a vulnerability. There is no CVE to patch, because Kubernetes is doing precisely what the spec permits. The escalation lives in the gap between two true statements: “we have RBAC” and “nobody can reach the node.” Both can be accurate. Together, they can still be a hole you could drive a cluster through.

The fix is not more RBAC. It is admission control. Enforce PSA ‘restricted’ as the namespace default, and treat every exception as a decision someone wrote down and owns, rather than a default nobody chose.

Three commands that will ruin your afternoon

Theory is comfortable. Here is the part where you actually look.

‘kubectl-who-can’ answers the blunt question: who can perform this verb on this resource, right now. ‘kubectl who-can create pods -n production’ is a fast way to find out that the list is longer than you remembered.

‘rakkess’ produces a full access matrix for a given subject, so you can stare at an entire grid of green checkmarks belonging to a ServiceAccount that, in principle, only needed to read a config map.

‘rbac-tool lookup’ lists everything a specific subject can do across the whole cluster, which is the tool you run when you have a name and a bad feeling.

I will set an honest expectation. The first time you run any of these against a cluster older than a year, you will find at least one thing nobody intended, and there is a decent chance it will be something you granted. This is not a moral failing. It is entropy. Permissions accrete the same way junk drawers do, one reasonable decision at a time.

The scanner will still be green, that is no longer the point

Here is where I am supposed to hand you a fix that makes the scary parts go away. I cannot, because least privilege in Kubernetes is not a configuration state you reach and then defend. It is a process you keep doing, slightly grudgingly, forever.

Start subjects at zero and grant only what the audit log proves they actually use. Tools like ‘audit2rbac’ can generate tight RBAC from real API server audit events, which is to say from evidence rather than from optimism. Enforce PSA ‘restricted’ by default. Audit aggregated ClusterRoles every time you install a chart. Rotate ServiceAccount tokens, because a credential that never expires is just a future incident with good patience.

Do all of that, and run the scanner again. It will still be green. It was always going to be green. The result has not changed at all. The only thing that has changed is the question you now know to ask, and that, inconveniently, was the whole job.

There is no universal answer here, only better-informed trade-offs, and the faint suspicion that your next audit will find something too. It usually does.

June 6, 2026 by Fernando SRE DevOps stuff Kubernetes SRE stuff

Why Kubernetes struggles with scheduled workloads

The day the cron job fought back

It was 3 AM on a Sunday when the primary database finally gave up. The alerts did not scream about a traffic spike or a rogue deployment. Instead, the culprit was a seemingly innocent Kubernetes CronJob named report-generator-weekly. Thanks to a subtle daylight saving time shift and a misunderstanding of how the kubelet interprets timezones, the cluster decided to spawn twenty-five concurrent instances of a highly unoptimized SQL aggregation query. The database wept, the API went down, and a dozen pagers ruined a perfectly good weekend.

Scheduled jobs can seem deceptively simple until they disrupt production. We tend to treat Kubernetes as a universal solvent for all our operational problems. If it runs containers, it should handle a little task scheduling, right? Sadly, the reality is far more complicated. Behind the scenes, the timezone handling often defaults to the kubelet’s local clock, concurrency policies are treated as polite suggestions under heavy control plane load, and failed jobs accumulate in your cluster like dirty dishes in a shared office sink. Observability is fractured across transient pods, ephemeral events, and missing logs. These “small” details are precisely how a benign script turns into an incident vector.

Resource contention and the illusion of isolation

Your scheduled job does not care about your Service Level Agreements. When you deploy a CronJob into a shared cluster, you are introducing a highly unpredictable workload into the same ring where your latency-sensitive APIs reside. Kubernetes tries its best to balance the scales, but the scheduler is only as good as the resource requests you provide.

Most scheduled jobs are spiky. They sleep for 23 hours, wake up, devour every megabyte of RAM they can reach, and then vanish. If you underprovision the memory requests to save money, the OutOfMemory killer will brutally terminate your job midway through a critical payment reconciliation. If you overprovision to stay safe, you end up paying for idle compute across your worker nodes all day long. We have all seen clusters where a five-minute data cleanup task was given the same IAM permissions and CPU priority as the core stateful application simply because “it is just a cron”. Taints, tolerations, and node affinity help, but they often create a false sense of isolation while increasing scheduling complexity.

Managed schedulers and workflow patterns to the rescue

The smartest way to run a CronJob in Kubernetes is usually to not run it in Kubernetes at all. If you are already building on a major cloud provider, you have access to tools that were specifically designed for this exact headache.

Consider decoupling the “schedule” from the “work”. Services like Cloud Scheduler in GCP, EventBridge in AWS, or Azure Logic Apps are practically bulletproof when it comes to firing a trigger at a specific time. You can wire these services to invoke a Cloud Run service, a Lambda function, or an ECS Fargate task. Yes, there is a minor trade-off regarding vendor lock-in. However, the operational peace of mind you gain by offloading scheduling logic to a managed service usually pays for itself during the first month.

For more complex scenarios where jobs depend on each other, you should look toward delay queues like SQS or Pub/Sub, or dedicated workflow engines like Temporal. When a scheduled task fails halfway through, a pure Kubernetes CronJob will either restart from scratch or die entirely. A proper workflow engine provides built-in backoff logic, idempotent execution tracking, and distributed state management. This pattern beats the traditional cron daemon every time you need reliable retries.

Hardening your jobs and the beauty of boring virtual machines

Of course, there are times when you absolutely must run scheduled jobs inside the cluster. I have also copied-pasted a CronJob YAML file at 2 AM out of sheer necessity (we have all been there). If you find yourself in this situation, you need to establish some defensive boundaries.

Start by enforcing resource quotas per cron namespace to contain the blast radius. Make idempotency checks mandatory in your application code, ensuring that a job running twice by accident does not charge a customer twice. Implement structured logging with clear correlation IDs so you can actually trace failures, and use preStop hooks for graceful shutdowns. The Kubernetes documentation clearly states that a CronJob might run a job zero times or multiple times under certain edge cases, so your code must defend itself against the infrastructure.

Alternatively, we should revive our appreciation for systemd timers on basic Virtual Machines. If you have a low-frequency, highly predictable job that is critical to your infrastructure, a lightweight VM is a fantastic home for it. Systemd timers are exceptionally reliable, the cost is completely transparent, and the blast radius is physically isolated from your stateless microservices. Reserving Kubernetes for your scalable workloads while putting the boring, critical timers on a VM is not a step backward. It is just good engineering.

A plea for architectural humility

Kubernetes is a brilliant piece of software. It is an exceptional orchestrator for stateless, auto-scaling microservices. However, it is not a magical alarm clock, and treating it as a universal cron daemon is a recipe for operational misery.

Architectural humility means picking the right tool for the job, rather than the trendiest one. The next time someone proposes adding a heavy, stateful CronJob to the production cluster, take a moment to ask the hard questions about resource profiles, failure tolerance, and observability. Sometimes, the most advanced cloud architecture decision you can make is to let a dedicated scheduling service handle the calendar, leaving your cluster free to do what it does best.

May 31, 2026 by Fernando SRE DevOps stuff Kubernetes SRE stuff

Your CI/CD pipeline just became an accomplice to a robbery

There is a special kind of morning reserved for DevOps teams. The coffee is still too hot, Slack is already too loud, and somewhere in the dependency tree, a package you have never consciously chosen has decided to become a tiny criminal enterprise.

Not a glamorous one. Not the cinematic kind with laser grids, violin music, and a morally complicated mastermind in a black turtleneck. This one wore the traditional uniform of modern software crime, a ‘package.json’ file, a lifecycle hook, and the quiet confidence of something that knows your CI/CD pipeline will execute almost anything if it arrives through the correct registry.

The Mini Shai-Hulud attack against the AntV npm ecosystem was not frightening because it was exotic. It was frightening because it was ordinary. A compromised maintainer account. A burst of malicious package versions. A ‘preinstall’ hook. A build server with secrets lying around like biscuits in a meeting room.

That is the part worth sitting with for a moment. Your pipeline did not fail because it was stupid. It failed because it behaved exactly as designed.

The morning npm trusted a stranger

On May 19, a maintainer account named ‘atool’, associated with the AntV visualization ecosystem and several widely used utility packages, was compromised. In a short automated burst, malicious versions were published across more than 300 npm packages. Some reports counted 314 packages tied to the compromised maintainer. Others counted a slightly broader set, depending on the package universe being measured. Either way, this was not a polite disturbance. It was an npm fire drill with the alarm wired directly into your build system.

The affected ecosystem included packages such as ‘size-sensor’, ‘echarts-for-react’, ‘timeago.js’, and many ‘@antv’ packages. Collectively, the package set represented roughly sixteen million weekly downloads. That number has the calm, bureaucratic feel of a spreadsheet cell, which is unfortunate, because the spreadsheet cell is quietly screaming.

The payload was not a kernel exploit. It was not a secret zero-day whispered into existence by a nation-state intern with excellent dental insurance. It was a preinstall hook that executed an obfuscated Bun script before the application had even reached the part of the day where tests pretend they are in charge.

That is the insult. The thief did not pick the lock. The thief rang the bell, wore a delivery jacket, and your pipeline said, “Of course, please come in. The cloud credentials are near the snacks.”

Why did your pipeline not see it coming?

Most CI/CD pipelines are optimized for speed, repeatability, and the pleasant fiction that dependencies are small sealed boxes of usefulness. A typical workflow clones the repository, restores a cache, runs ‘npm ci’, then moves on to tests, linters, SAST tools, dependency scanners, container builds, and finally deployment.

That order feels reasonable. It is also the problem.

The malicious ‘preinstall’ hook runs during dependency installation. It runs before your tests. Before your linter. Before the container image scanner gets to put on its tiny detective hat. Before most of the tools you bought, integrated, configured, and proudly presented in a security maturity slide deck have even entered the room.

By the time your scanner examines the artifact, the install phase may already have executed hostile code inside your build environment. The patient is now wearing the doctor’s coat.

This is the architectural blind spot. We often talk about CI/CD as plumbing, as if pipelines merely transport code from Git to production with the emotional depth of a garden hose. In practice, the build environment is one of the most privileged pieces of compute in the company.

It can read source code. It can fetch dependencies. It can publish artifacts. It can assume cloud roles. It can push containers. It can sign releases. It may have access to deployment tokens, package registry tokens, GitHub tokens, npm tokens, cloud credentials, vault credentials, and enough environment variables to make a compliance auditor age visibly.

Then, in the middle of that privileged environment, we run arbitrary community code as a normal business process.

We do this every day. We call it productivity because “ritualized trust falls with strangers” was apparently less attractive in Jira.

When your EC2 instance becomes a credential vending machine

The build server is only one part of the blast radius. Many organizations still run Node.js applications directly on EC2 instances, virtual machines, shared development servers, bastion hosts, or old pets with sentimental names and systemd units no one wants to touch.

If a malicious dependency runs during an install on one of those machines, the question becomes brutally simple. What can that machine see?

Mini Shai-Hulud style payloads are designed to ask exactly that. They look for AWS credentials in environment variables and local credential files. They probe cloud metadata services. They search for Kubernetes service account tokens mounted in predictable paths. They hunt for GitHub personal access tokens, npm tokens, HashiCorp Vault tokens, SSH keys, database connection strings, and local password manager material.

This is where the story stops being a malware story and becomes an architecture story.

The problem is not merely that the script is clever. The problem is that many machines are already arranged like vending machines for secrets. Insert malicious lifecycle hook. Receive access keys. Enjoy your snack.

If your EC2 user data script runs ‘npm install’ during bootstrap, you have given install-time code a front-row seat to the instance identity. If developers SSH into a shared VM and run package installs manually, you have blended local development, shared infrastructure, and cloud access into a smoothie with bits of glass in it. If a bastion host has credentials on disk because “it was only temporary”, congratulations, you have discovered the half-life of temporary infrastructure. It is forever, unless audited.

The uncomfortable lesson is not that EC2 is unsafe. EC2 is a perfectly respectable building block. The trouble begins when long-lived compute accumulates credentials the way kitchen drawers accumulate mysterious cables. After enough time, nobody knows what they are for, but everyone is afraid to throw them away.

The SaaS services you thought were sandboxed

Managed build platforms are not magically exempt from this pattern. Vercel, Netlify, Railway, Render, AWS Amplify, Google Cloud Build, and similar services often run dependency installation on your behalf. They do it in ephemeral containers, which sounds reassuring, because ephemeral is one of those cloud words that makes everything feel rinsed and hygienic.

But ephemeral does not mean harmless.

Those containers may still receive environment variables. They may still hold deployment credentials. They may still have API keys, database URLs, webhook secrets, third-party tokens, and production-adjacent configuration. A malicious ‘preinstall’ hook does not need a permanent server. It only needs a few seconds with the things you carefully injected into the build because the deployment would not work without them.

This is where the boundary between build time and runtime starts to look theatrical. We like to pretend they are separate kingdoms with guards and flags and polite customs inspections. In reality, build time often has enough access to affect runtime, and runtime secrets often leak backward into build time because somebody needed a preview deployment to talk to a real database “just for testing”.

The SaaS provider may provide isolation. It may provide clean containers. It may even provide excellent defaults. But your build environment is still your environment. You configured the secrets. You selected the dependencies. You allowed the install scripts. The sandbox is not a moral force. It is a container with permissions.

And containers, bless them, do not experience shame.

When the green badge smiles at the robber

The most unsettling part of Mini Shai-Hulud was not just credential theft. It was the way the attack interacted with modern supply chain trust.

Some malicious packages were observed with valid Sigstore and SLSA provenance signals. In plain English, the pipeline identity could be used to produce cryptographic evidence that looked legitimate. The signature was real. The attestation was real. The code was malicious.

This is a deeply unpleasant sentence for anyone who has spent the last few years building policies around signed artifacts, provenance, and supply chain gates.

Those controls still matter. They are not useless. But this attack is a reminder that provenance is not a spell. It tells you something about how an artifact was built, and sometimes where it was built. It does not automatically tell you that the person, process, maintainer account, or CI identity involved was trustworthy at that moment.

A green badge can prove that the robbery happened in a certified room with excellent lighting.

For cloud architects, that distinction matters. If your policy says “only deploy signed artifacts”, you have improved the baseline. If your mental model says “signed means safe”, the attacker has just found a very comfortable chair in your control plane.

The right question is not only whether an artifact is signed. It is whether the identity that signed it should have been allowed to sign it, whether the workflow that produced it was protected, whether the release path was expected, whether the maintainer account had strong controls, and whether the dependency version appeared with the behavior of a normal release or with the body language of a raccoon in a data center.

Signatures are evidence. They are not character witnesses.

What to change before the next deployment

There is no single magic fix, which is irritating, because single magic fixes are much easier to put on a roadmap. What you can do is reduce the number of places where arbitrary install-time code meets valuable credentials.

Start with the obvious rule that is somehow still controversial. Do not run npm install in production on long-lived machines. Build once in a controlled environment. Bake dependencies into immutable images or artifacts. Promote those artifacts across environments. Production should receive the finished meal, not a bag of groceries and a stranger with a knife.

Use lockfiles with discipline. Treat changes to ‘package-lock.json’, ‘pnpm-lock.yaml’, or ‘yarn.lock’ as meaningful code changes. Review them. Pin dependencies where it matters. Avoid allowing automatic minor or patch upgrades in privileged CI jobs without human review or a quarantine window. Freshly published packages are not necessarily fresh bread. Sometimes they are bread with a tiny radio transmitter inside.

Disable install scripts where you can. For many CI validation jobs, ‘npm ci –ignore-scripts’ is a reasonable default. When lifecycle scripts are genuinely required, make that an explicit exception rather than a silent assumption. Exceptions should feel slightly annoying. That is how you know they are doing their job.

Separate build secrets from runtime secrets. A build job should not need direct access to production databases. It should not carry cloud admin credentials. It should not have permission to do everything because it is easier than discovering the three actions it actually needs. Use short-lived credentials through OIDC where possible, scoped narrowly to the job, the repository, the branch, and the environment.

Treat the build environment as hostile until proven otherwise. Run builds in ephemeral, isolated environments. Avoid reusing caches between trusted and untrusted contexts. Restrict egress where practical. Monitor unusual outbound traffic from CI runners, especially to metadata endpoints, GitHub APIs, unknown domains, and places where stolen secrets go to begin their new life.

On AWS, enforce IMDSv2 and restrict access to instance metadata. Do not let random processes on a host treat the metadata service like a neighborhood tapas bar. On Kubernetes, avoid mounting default service account tokens into pods that do not need them. If a pod has no business speaking to the Kubernetes API, do not give it a tiny passport and a laminated badge.

Finally, treat developer workstations as part of the production risk surface. This is annoying because developers are humans, and humans enjoy installing things. But if a developer runs npm install on a laptop that has AWS SSO sessions, GitHub tokens, package registry credentials, SSH keys, and password manager integrations, that laptop is not merely a laptop. It is a small branch office with stickers.

The uncomfortable truth about convenience

The cloud industry has spent more than a decade optimizing for developer velocity. We made dependency installation fast. We made CI/CD pipelines automatic. We made SaaS build platforms beautifully simple. We taught ourselves to trust registries because the alternative was slow, manual, and socially unpopular.

Mini Shai-Hulud is not the end of that model. It is the invoice.

The convenience of ‘npm install’ is not free. It is a line of credit against your security posture, and the interest rate just went up.

This does not mean we should retreat into caves and compile everything by candlelight, although some incident response teams have looked into it. It means we need to stop treating dependency installation as a harmless clerical step. It is code execution. It happens early. It happens often. It happens in places where secrets live.

That is the part that should make every DevOps engineer, platform engineer, and cloud architect feel a small chill behind the neck. Not panic. Panic is noisy and usually produces dashboards. A chill is more useful. A chill asks better questions.

Why does this build job have access to production credentials?

Why can this runner reach the metadata service?

Why are install scripts enabled by default?

Why are we deploying from a machine where somebody also tests packages manually?

Why did the green badge make us stop thinking?

Modern DevOps was already a strange job. You were part sysadmin, part release engineer, part therapist for YAML, part barista for impatient microservices. Now, occasionally, you must also check whether your pipeline has become an accomplice to a robbery.

It will not look guilty. Pipelines never do. They fail with clean logs, pass with suspicious confidence, and continue brewing coffee while a stranger quietly empties the safe.

May 30, 2026 by Fernando SRE Cloud stuff DevOps stuff SRE stuff

The social awakening of the Kubernetes scheduler

Human beings are notoriously bad at coordination, but we like to think our machines are better. They are not. For over a decade, Kubernetes, the undisputed king of cloud orchestration, has behaved like a blind restaurant host with a severe case of short-term memory loss.

If you arrived at this restaurant with a party of eight, the host would not look for a table of eight. Instead, they would grab the first person in your group, lead them to a random single stool in the corner, and tell them to wait. Then they would grab the second person and squeeze them between two strangers at the bar. If the remaining guests could not find seats, the host would simply shrug. The first seven would sit there forever, nursing their half-empty glasses of water, while the last person stood shivering in the rain outside.

In computer science, we call this tragedy a scheduling deadlock. In Kubernetes, it is just another Tuesday. But with the release of version 1.36, the system is finally learning some manners through a set of features known as workload-aware scheduling.

The tragedy of scheduling one shoe at a time

Historically, Kubernetes was designed to think in terms of individual pods. To the scheduler, a pod is a single, solitary unit of life, like a lonely left shoe. It does not know or care if there is a right shoe waiting in the queue. It just wants to put the left shoe on a foot, even if the owner of that foot has no legs.

This single-minded approach works beautifully for simple web servers. If you need ten copies of an application, they do not need to know each other. They do not talk, they do not share secrets, and they certainly do not need to hold hands.

But modern workloads, particularly those driving artificial intelligence, machine learning, and massive mathematical calculations, are different. They do not run on lonely, independent pods. They run on highly codependent troupes of containers that must work together or not at all. If you are running an eight-GPU training job, you might need all eight nodes to start at exactly the same microsecond. If seven show up and the eighth is stuck in the hallway because a node ran out of memory, the entire operation grinds to a halt. The active pods just sit there, chewing up expensive processor cycles and doing absolutely nothing useful.

To fix this, the open-source community decided to give Kubernetes some social intelligence. They wanted to teach the system how to recognize a group of friends and seat them all together.

Enter the PodGroup, a unit of social cohesion

To bring order to this chaos, Kubernetes v1.36 introduces a clever piece of psychological separation. It splits the concept of a multi-pod job into two distinct entities, namely a static blueprint called the Workload API and an active, fast-moving runtime object called the PodGroup API.

The separation is brilliant in its boringness. Imagine trying to coordinate a huge family reunion. The Workload is the official invitation list, a static piece of paper detailing who should theoretically show up. The PodGroup is the group text message where everyone argues in real-time about who is actually arriving, who is running late, and who went to the wrong address.

If the scheduler had to update the master blueprint every single time a single pod changed its status, the central API server would suffer the digital equivalent of a massive nervous breakdown. By keeping the blueprint quiet and letting the temporary PodGroup handle the frantic, fast-moving status updates, the system avoids data congestion. It is the architectural equivalent of having a calm office manager who handles the contracts while an assistant runs around screaming with a clipboard.

A basic PodGroup declaration is surprisingly simple, containing just enough information to tell the scheduler how many members actually make a quorum.

apiVersion: scheduling.k8s.io/v1alpha2
kind: PodGroup
metadata:
  name: neural-training-crew
spec:
  minMember: 8

In this little snippet, we are telling Kubernetes that unless all eight of our digital family members can be seated at the table at the exact same time, nobody gets seated at all. The scheduler takes one clean snapshot of the system and commits the whole gang, or nothing. It is, quite literally, collective bargaining for containers.

The art of polite eviction

Of course, life in the cloud is rarely empty. Most of the time, your cluster is already full of small, low-priority pods doing things like sending promotional emails or logging the temperature of the server room.

When your giant, expensive AI training workload arrives at the door, it needs space immediately. In the old days, the scheduler would look at the crowded room, see that there was no space for a group of eight pods, and simply give up.

With workload-aware preemption, the scheduler gains a more assertive personality. Instead of looking at individual pods, it evaluates the entire PodGroup as a single, powerful entity. If the group cannot fit, the scheduler can look at the low-priority pods currently occupying the nodes and decide to evict them.

Crucially, this is controlled by a setting called the disruptionMode. You can configure your PodGroup so that if it must be interrupted, it happens as an all-or-nothing event. Your pods can either be evicted one by one, or they can refuse to leave unless the entire group is taken down together, holding hands in a dramatic show of solidarity. This prevents a situation where half of your training job is evicted, leaving the remaining half running uselessly and burning through your cloud budget.

Putting the family in the same neighborhood

There is one final piece to this scheduling puzzle. In the world of high-performance computing, physical distance matters. If your pods are communicating constantly, placing half of them in an Oregon data center and the other half in Virginia is a recipe for terrible latency. It is like trying to have a conversation where every sentence takes three seconds to travel across the room.

To solve this, Kubernetes v1.36 introduces topology-aware workload scheduling. This ensures that the scheduler does not just find enough seats for your pods, but actually finds them close to one another, preferably on the same network switch, the same rack, or even the same physical machine.

It is the equivalent of booking hotel rooms for your family reunion and ensuring that everyone is on the third floor, rather than scattered across five different buildings in different zip codes.

A short conclusion for the caffeinated reader

We have spent years treating containers like isolated, disposable little boxes. We launched them, forgot about them, and let them fend for themselves. But as our software grows more complex and our artificial intelligence models require more computational power, we are discovering that our containers need to cooperate.

The changes in Kubernetes v1.36 are not just minor performance tweaks. They represent a fundamental shift in how the system understands work. By teaching the scheduler how to recognize groups, respect their physical proximity, and evict them gracefully, Kubernetes is growing up. It is no longer just a system for running individual applications. It is becoming a highly sophisticated, socially aware coordinator for the most complex computational tasks on the planet. And that is definitely worth raising a coffee mug to.

May 22, 2026 by Fernando SRE DevOps stuff Kubernetes SRE stuff

Cheating the continuous learning meat grinder with AI

Let us consider the cranium of the modern Cloud Architect. It is a finite biological container, roughly the size of a cantaloupe, filled with a squishy mass of fat and water. Yet, the tech industry operates under the hallucination that this cantaloupe can effortlessly absorb the entire AWS service catalog updates before your morning coffee. Trying to ingest the sheer volume of new DevOps tooling is a lot like watching a python try to swallow a double-door refrigerator. It is structurally impossible, deeply uncomfortable to witness, and usually ends with someone needing medical attention. We are practically obligated to evolve constantly, but our neurological hard drives have strict, unyielding limits.

The biological absurdity of keeping up with the CNCF landscape

The concept of “continuous improvement” in IT often feels less like an inspirational corporate poster and more like a slightly sadistic evolutionary mandate. You finally understand the esoteric routing logic of your Kubernetes networking setup. Your heart rate settles. You feel peace. Then, a cheerful newsletter arrives to inform you that your setup is obsolete and someone has thrown a brand new service mesh at your head.

The exhaustion you feel is not a character flaw. It is a standard biological response to an ecosystem that mutates faster than a flu virus in a crowded airport. Our brains were optimized for remembering which berries are poisonous, not for tracking the depreciation schedule of Helm charts.

Stop eating the trendy vegetables you hate

Then there is the fear of missing out, or FOMO, which drives otherwise rational engineers to do deeply irrational things. Let us be brutally honest here. If you absolutely despise Javascript or feel a physical wave of nausea when looking at a shiny new frontend framework, do not force yourself to learn them just because they are trending on Hacker News.

Trying to master disciplines outside your actual interests is like forcing a housecat to take up scuba diving. The cat will hate it, it will do a terrible job, and everyone involved will end up bleeding. Protect your cognitive load with ruthless aggression. As a DevOps professional, you have permission to focus solely on the infrastructure pipelines and Linux kernel quirks that actually bring you joy. Leave the trendy stuff to the people who actually like it.

Enter the hyperactive, infinitely patient robot intern

This brings us to the survival strategy. Artificial intelligence is often pitched as an omniscient overlord coming for our jobs. Right now, however, it is much more useful to view it as a hyperactive, infinitely patient intern. These LLMs exist to do the dirty work our cantaloupe brains reject.

They can read the soul-crushing, poorly translated documentation you desperately want to avoid. You can feed a brutal 50-page technical manual on IAM policies into an AI tool and instruct it to spit out a concise summary directly in your terminal. Or better yet, tell it to explain the concepts to you like you are a tired sysadmin who just wants to go home and play with their Mac. It saves hours of mental decay.

Curating your own survival kit

The trick is learning how to interrogate the AI properly. You do not just ask it “what is new in Terraform.” You demand it to extract the protein from the learning material and throw away the useless fat. You can ask it to summarize release notes, generate highly specific flashcards, or even act as a mock interviewer to test your knowledge on specific CI/CD pipelines before a migration. You are outsourcing the most painful parts of the learning curve to a machine that cannot feel pain or boredom.

The fine art of ignoring things

Ultimately, surviving this industry requires a liberating realization. You simply cannot know everything, and attempting to do so is a biological folly. To truly master the fine art of ignoring things, you need to implement a few practical, slightly ruthless habits.

First, practice strategic amnesia. Stop trying to memorize syntax. If an AI can generate the boilerplate YAML for a Kubernetes deployment in three seconds, your brain should actively refuse to store that information. Treat syntax like a disposable coffee cup; use it once and throw it away.

Second, stop hoarding documentation and start hoarding prompts. Your personal knowledge base should not be a graveyard of unread PDFs. It should be a collection of highly tuned, tested instructions that you can feed into an LLM to get exactly what you need, when you need it. Think of them as spells to summon your robot intern.

Third, politely decline the buffet. When a vendor announces a revolutionary new tool that solves a problem you do not actually have, just nod, smile, and walk away. Your cognitive load is precious cargo. Do not fill the cargo bay with garbage.

The ultimate architectural achievement is not memorizing every obscure command line flag. It is building a well structured mind that understands the core principles and knows exactly how to extract the rest of the answers from an AI assistant. Let the machines hold the heavy encyclopedias. We need our brain space for the truly important mysteries, like figuring out why the production database just mysteriously vanished.

May 17, 2026 by Fernando SRE Computer Science stuff

Chronicle of a death foretold for the EFK stack in high demand environments

Your monthly cloud infrastructure bill arrives in your inbox. You open the PDF document, and suddenly your left eyelid starts twitching uncontrollably. The finance department has started leaving passive-aggressive sticky notes on your monitor. You realize you are spending the equivalent of a small nation’s gross domestic product just to store text files that repeat “INFO: User logged in” three billion times a day. Welcome to the modern logging crisis.

For years, the Kubernetes logging ecosystem was basically on autopilot. You installed the EFK stack (Elasticsearch, Fluentd, and Kibana), and it just worked. It was the safest default in the industry. But as we navigate through 2026, something has fundamentally ruptured. EFK did not suddenly become toxic waste overnight. It simply became the victim of its own architecture in an era where log volumes have mutated into unrecognizable monsters.

The shift away from EFK is not driven by shiny object syndrome. It is driven by raw economics, hardware exhaustion, and the very human desire not to wake up sweating at 3 AM because a logging cluster ran out of disk space.

The golden retriever in the sausage factory

Let us start with Fluentd. Fluentd is incredibly stable, highly flexible, and has served the community well. However, it is written in Ruby.

Under moderate loads, Fluentd is a perfectly polite guest. But when you expose it to the high-demand environments of modern microservices, Fluentd exhibits the same impulse control as an unsupervised Golden Retriever locked inside a sausage factory. It just eats all your available CPU and RAM until it physically cannot hold any more, burps an Out Of Memory error, and then politely demands that you scale it horizontally.

This operational overhead becomes exhausting. The industry needed something leaner. Enter the OpenTelemetry Collector. Written in Go, it processes telemetry data with the cold, calculated efficiency of an IRS auditor. It handles metrics, traces, and logs in a unified pipeline without treating your server’s memory like an all you can eat buffet.

Here is what a modern, lightweight pipeline configuration looks like today, completely devoid of Ruby overhead:

# OpenTelemetry Collector routing logs without eating your RAM
receivers:
  filelog:
    include: [ /var/log/pods/*/*/*.log ]
exporters:
  clickhouse:
    endpoint: tcp://clickhouse-server:9000
    database: observability
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [clickhouse]

Packing your socks in industrial hangars

The real villain in your cloud bill, however, is not the collector. It is the storage layer. Elasticsearch is an absolute marvel of engineering if you are trying to build a complex search engine for an e-commerce website. But using it exclusively to store application logs is an architectural tragedy.

Storing logs in Elasticsearch is like packing a single pair of socks in an individual cardboard box, wrapping that box in three layers of industrial bubble wrap, and attaching a GPS tracker to it. Yes, the inverted index structure guarantees that you will find those specific socks at the speed of light. But your luggage now occupies three entire aviation hangars, and the monthly rent is absurd. The indexing process creates massive data bloat, multiplying your storage footprint and your anxiety levels simultaneously.

The bouillon cube of observability

This is where ClickHouse enters the scene and aggressively rewrites the rules. ClickHouse looks at your three hangars full of bubble-wrapped socks, throws them into an industrial shredder, and compresses the resulting mass into a super dense data bouillon cube.

ClickHouse relies on columnar storage and sparse indexes. It does not index every single word of your log lines. Instead, it compresses the data so tightly that your storage footprint shrinks to a fraction of what EFK required. And because developers already dream in SQL, querying this massive block of data feels entirely natural.

Instead of wrestling with Kibana’s proprietary query language just to find out why a payment failed, your team can simply run a query like this:

-- Finding errors without going bankrupt
SELECT
    toStartOfMinute(timestamp) AS minute,
    count() AS total_errors,
    dictGet('services', 'name', service_id) AS service_name
FROM application_logs
WHERE level = 'ERROR' AND timestamp > now() - INTERVAL 1 HOUR
GROUP BY minute, service_name
ORDER BY minute DESC;

Grafana sits on top of this SQL engine like a happy gargoyle, providing the exact same dashboarding capabilities you used to get from Kibana, but with the added benefit of seamlessly linking your logs directly to your OpenTelemetry metrics and traces.

Swapping tires on the highway

Now, a word of caution. The worst thing you can do after reading this is to march into your office and delete your Elasticsearch cluster.

Transitioning from EFK to the OpenTelemetry and ClickHouse stack overnight is the IT equivalent of trying to change your car tires while driving at 120 miles per hour down the highway. You will almost certainly lose the chassis in the process.

A migration requires a gradual cutover. You must deploy the OpenTelemetry Collector alongside your existing Fluentd setup. Route a small subset of non critical logs to ClickHouse. Compare the ingestion rates. Let your team practice writing SQL queries to find errors. Only when you are confident that the bouillon cube is holding its shape should you start decommissioning the old, expensive hangars.

When to completely ignore my advice

To be perfectly fair, EFK is not dead for everyone. If your daily log volume fits comfortably on a standard thumb drive, or if your company enjoys setting fire to piles of corporate cash to keep the server room warm, EFK remains a wonderfully easy solution. If your team has zero experience managing relational databases and relies heavily on managed Elasticsearch services, moving to ClickHouse might introduce more friction than it resolves.

But for the rest of the world, the verdict is clear. Do not migrate just because it is trendy. Migrate because your current system has become a financial bottleneck. If your Elasticsearch bill is the fastest growing metric in your entire company, that is your signal. Run the numbers, evaluate the OpenTelemetry stack, and stop paying hangar prices for your socks.

May 9, 2026 by Fernando SRE Cloud stuff DevOps stuff Kubernetes SRE stuff