Cloud Computing

S3 Access Points explained

Don’t you feel like your data in the cloud is a bit too… exposed? Like you’ve got a treasure chest full of valuable information (your S3 bucket), but it’s just sitting there, practically begging for unwanted attention? You wouldn’t leave your valuables out in the open in the real world, would you? Well, the same logic applies to your data in the cloud.

This is where AWS S3 Access Points come in. They act like bouncers for your data, ensuring only the right people get in. And for those of you with data scattered across the globe, we’ve got something even fancier: Multi-Region Access Points (MRAPs). They’re like the global positioning system for your data, ensuring fast access no matter where you are.

So buckle up, and let’s explore the fascinating world of S3 Access Points and MRAPs. Let’s try to make it fun.

The problem: a single bucket policy has to cover everyone

Think of an S3 bucket as a giant storage locker in the cloud. New buckets are locked down by default, but as soon as several teams and applications need a way in, every rule ends up crammed into one bucket policy. One overly broad statement or one misconfiguration, and anyone who knows where the locker is can waltz in and take a peek, or worse, start messing with your stuff.

This might be fine if you’re just storing cat memes, but what if you have sensitive customer data, financial records, or top-secret project files? You need a way to control who gets in and what they can do.

The solution is Access Points, your data’s bouncers

Imagine Access Points as the bouncers standing guard at the entrance of your storage locker. They check IDs, make sure everyone’s on the guest list, and only let in the people you’ve authorized.

In more technical terms, an Access Point is a unique hostname that you create to enforce distinct permissions and network controls for any request made through it. You can configure each Access Point with its own IAM policy, tailored to specific use cases.

Why you need Access Points. It’s all about control

Here’s the deal:

  • Granular Access Control: You can create different Access Points for different applications or teams, each with its own set of permissions. Maybe your marketing team only needs read access to product images, while your developers need full read and write access to application logs. Access Points make this a breeze.
  • Simplified Policy Management: Instead of one giant, complicated bucket policy, you can have smaller, more manageable policies for each Access Point. It’s like having a separate rule book for each group that needs access.
  • Enhanced Security: By restricting access through specific Access Points, you reduce the risk of accidental data exposure or unauthorized modification. It’s like having multiple layers of security for your precious data.
  • Compliance Made Easier: Many industries have strict regulations about data access and security (think GDPR, HIPAA). Access Points help you meet these requirements by providing a clear and auditable way to control who can access what.

Let’s get practical with an Access Point policy example

Okay, let’s see how this works in practice. Here’s an example of an Access Point policy that grants one IAM user read and write access (no deleting!) to objects under the pending-documentation/ prefix:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/Alice"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:us-west-2:123456789012:accesspoint/my-access-point/object/pending-documentation/*"
    }
  ]
}

Explanation:

  • Version: Specifies the policy language version.
  • Statement: An array of permission statements.
  • Effect: “Allow” means this statement grants permission.
  • Principal: This specifies who is granted access. In this case, it’s the IAM user “Alice” (you’d replace this with the actual ARN of your user or role).
  • Action: The S3 actions allowed. Here, it’s s3:GetObject (read) and s3:PutObject (write).
  • Resource: This is the crucial part. It specifies what the policy applies to. Here, it’s any object whose key starts with pending-documentation/, reached through the “my-access-point” Access Point. The /* at the end means all objects under that prefix.

Delegating access control to the Access Point (Bucket Policy)

You also need to configure your S3 bucket policy to delegate access control to the Access Point. Here’s an example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::my-bucket/*",
      "Condition": {
        "StringEquals": {
          "s3:DataAccessPointArn": "arn:aws:s3:us-west-2:123456789012:accesspoint/my-access-point"
        }
      }
    }
  ]
}

  • This policy allows any principal (“AWS”: “*”) to perform any S3 action (“s3:*”), but only if the request goes through the specified Access Point ARN.
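
With both policies in place, clients stop addressing the bucket directly and use the Access Point ARN instead. Here’s a minimal boto3 sketch of what that looks like; the ARN and object key simply reuse the hypothetical values from the policies above, and the code assumes it runs with Alice’s credentials:

import boto3

# Run this with credentials for a principal the Access Point policy allows (Alice above)
s3 = boto3.client("s3", region_name="us-west-2")

ACCESS_POINT_ARN = "arn:aws:s3:us-west-2:123456789012:accesspoint/my-access-point"

# Upload through the Access Point: the ARN goes where the bucket name usually does
s3.put_object(
    Bucket=ACCESS_POINT_ARN,
    Key="pending-documentation/report.pdf",
    Body=b"draft contents",
)

# Read it back (s3:GetObject is allowed by the Access Point policy)
obj = s3.get_object(Bucket=ACCESS_POINT_ARN, Key="pending-documentation/report.pdf")
print(obj["Body"].read())

# A delete_object call here would fail with AccessDenied, since the policy never grants s3:DeleteObject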

Taking it global, Multi-Region Access Points (MRAPs)

Now, let’s say your data is spread across multiple AWS regions. Maybe you have users all over the world, and you want them to have fast access to your data, no matter where they are. This is where Multi-Region Access Points (MRAPs) come to the rescue!

Think of an MRAP as a smart global router for your data. It’s a single endpoint that automatically routes requests to the closest copy of your data in one of your S3 buckets across multiple regions.

Why Use MRAPs? Think speed and resilience

  • Reduced Latency: MRAPs ensure that users are always accessing the data from the nearest region, minimizing latency and improving application performance. It’s like having a fast-food restaurant in every country, so customers get their orders faster.
  • High Availability: If one region becomes unavailable, MRAPs automatically route traffic to another region, ensuring your application stays up and running. It’s like having a backup generator for your data.
  • Simplified Management: Instead of managing multiple endpoints for different regions, you have one MRAP to rule them all.

MRAPs vs. Regular Access Points, what’s the difference?

While both are about controlling access, MRAPs take it to the next level:

  • Scope: Regular Access Points are regional; MRAPs are multi-regional.
  • Focus: Regular Access Points primarily focus on security and access control; MRAPs add performance and availability to the mix.
  • Complexity: MRAPs are a bit more complex to set up because you’re dealing with multiple regions.

When to unleash the power of Access Points and MRAPs

  • Data Lakes: Use Access Points to create secure “zones” within your data lake, granting different teams access to only the data they need.
  • Content Delivery: MRAPs can accelerate content delivery to users around the world by serving data from the nearest region.
  • Hybrid Cloud: Access Points can help integrate your on-premises applications with your S3 data in a secure and controlled manner.
  • Compliance: Meeting regulations like GDPR or HIPAA becomes easier with the fine-grained access control provided by Access Points.
  • Global Applications: If you have a globally distributed application, MRAPs are essential for delivering a seamless user experience.

Lock down your data and speed up access

AWS S3 Access Points and Multi-Region Access Points are powerful tools for managing access to your data in the cloud. They provide the security, control, and performance that modern applications demand.

Practical guide to DNS Records in AWS Route 53

Your browser instantly connects you to your desired website when you type in its address and hit enter. It’s a seamless experience we often take for granted. But behind this seemingly simple action lies a complex system that makes it all possible: the Domain Name System (DNS). Think of DNS as the internet’s global directory, translating human-readable domain names into the numerical IP addresses that computers use to communicate. And when managing DNS with reliability and scalability, AWS Route 53 takes center stage. Route 53 is Amazon’s highly available and scalable DNS service, designed to route traffic to your application’s resources with remarkable precision and minimal latency. In this guide, we’ll demystify the most common DNS record types and show you how to use them effectively with Route 53, using practical examples.

Let’s jump into DNS records by breaking them down into simple, relatable examples and exploring real-world use cases. We’ll see how they work together, like a well-orchestrated symphony, to make the internet navigable.

The basics of DNS Records

DNS records are like traffic signs for the internet, directing users to the right destinations. But instead of physical signs, they’re digital entries that guide web browsers and other services. Route 53 makes managing these records straightforward. Here are the most common types:

A Record (Address Record)

Think of an A Record as the street address for your website. It maps a domain name (e.g., example.com) to an IPv4 address (e.g., 192.0.2.1). It’s the most basic thing. It just tells the internet where your website lives.

  • Purpose: Directs traffic to web servers or other IPv4 resources.
  • Analogy: Imagine telling a friend to visit you at your home address, that’s what an A Record does for websites. It’s like saying, “Hey, if you’re looking for example.com, it’s over at this IP address.”
  • Use Case: Hosting a website like example.com on an EC2 instance or an on-premises server.
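
If you manage records from code rather than the console, creating (or updating) that A Record is a single API call. Here’s a rough boto3 sketch, where the hosted zone ID and the IP address are placeholders rather than real values:

import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0123456789EXAMPLE"  # placeholder: the hosted zone for example.com

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Point example.com at the web server",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or overwrite it if it already exists
                "ResourceRecordSet": {
                    "Name": "example.com",
                    "Type": "A",
                    "TTL": 300,
                    "ResourceRecords": [{"Value": "192.0.2.1"}],  # your server's IPv4 address
                },
            }
        ],
    },
)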

CNAME Record (Canonical Name)

A CNAME Record is like a nickname for your domain. It maps an alias domain name (e.g., www.example.com) to another “canonical” domain name (e.g., example.com).

  • Purpose: Simplifies management by allowing multiple domains to point to the same resource. It’s like having various roads leading to the same destination.
  • Analogy: It’s like calling your friend “Bob” instead of “Robert.” Both names point to the same person.
  • Use case: Scaling applications by mapping api.example.com to an Application Load Balancer’s DNS name, such as app-load-balancer-456.amazonaws.com. You point your CNAME to the load balancer, and the load balancer handles distributing traffic to your servers.

AAAA Record (Quad A Record)

For the modern internet, AAAA Records map domain names to IPv6 addresses (e.g., 2001:db8::1).

  • Purpose: Ensures compatibility with IPv6 resources, which is becoming increasingly important as the internet grows.
  • Analogy: Think of this as an upgrade to a new address system for the internet, ready for the future. It’s like moving from a local phone system to a global one.
  • Use case: Enabling access to your website via IPv6. This ensures your site is reachable by devices using the newer IPv6 standard.

MX Record (Mail Exchange)

MX Records ensure emails sent to your domain arrive at the correct mail server.

  • Purpose: Routes emails to the appropriate mail server.
  • Analogy: Like sorting mail at a post office to send it to the right address. Each piece of mail (email) needs to be directed to the correct recipient (mail server).
  • Use case: Configuring email for domains with Google Workspace or Microsoft 365. This ensures your emails are handled by the right service.

NS Record (Name Server)

NS Records delegate a domain or subdomain to specific name servers.

  • Purpose: Specifies which servers are authoritative for answering DNS queries for a domain. In other words, they know all the A records, CNAME records, etc., for that domain.
  • Analogy: It’s like asking a specific guide for directions within a city. That guide knows the specific area inside and out.
  • Use case: Delegating subdomains like dev.example.com to a different DNS provider, perhaps for testing purposes.

TXT Record (Text Record)

TXT Records store arbitrary text data, often used for domain verification or email security configurations (e.g., SPF, DKIM).

  • Purpose: Provides information to external systems.
  • Analogy: Think of it as posting a sign with instructions outside your door. This sign might say, “To verify you own this house, please show this specific code.”
  • Use case: Adding SPF, DKIM, and DMARC records to prevent email spoofing and improve email deliverability. This helps ensure your emails don’t end up in spam folders.

Alias Record

Exclusive to AWS, Alias Records map domain names to AWS resources like S3 buckets or CloudFront distributions without needing an IP address.

  • Purpose: Reduces costs and simplifies DNS management, especially within the AWS ecosystem.
  • Analogy: A direct shortcut to AWS resources without the extra steps. Think of it as a secret tunnel directly to your destination, bypassing traffic.
  • Use case: Mapping example.com to a CloudFront distribution for CDN integration. This allows for faster content delivery to users around the world. Or, say you have a static website hosted on S3. An Alias record can point your domain directly to the S3 bucket, without needing a separate web server.
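
In code, an Alias record looks like a normal record set with an AliasTarget block instead of a TTL and value. Here’s a hedged sketch pointing example.com at a hypothetical CloudFront distribution; the AliasTarget hosted zone ID shown is the fixed one used for CloudFront targets, while the distribution domain and your own hosted zone ID are placeholders:

import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",  # placeholder: your hosted zone for example.com
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "example.com",
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": "Z2FDTNDATAQYW2",  # fixed zone ID for CloudFront targets
                        "DNSName": "d111111abcdef8.cloudfront.net",  # hypothetical distribution
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ]
    },
)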

Putting it all together

Let’s look at how these records work in harmony to power your website. See? It’s not so complicated when you break it down. Each record has its job, and they all work together like a well-oiled machine.

Hosting a scalable website

  1. Register your domain: Let’s say you register example.com using Route 53.
  2. Create an A Record: You map example.com to an EC2 instance’s IP address where your website is hosted.
  3. Add a CNAME Record: For www.example.com, you create a CNAME pointing to example.com. This way, both addresses lead to your site.
  4. Utilize Alias Records: To speed up content delivery, you create an Alias record connecting example.com to a CloudFront distribution. This caches your website content at edge locations closer to your users. And shall we use another Alias Record to connect static.example.com to an S3 bucket, to serve your images faster? Why not.
  5. Implement TXT Records: You add TXT records for email authentication (SPF, DKIM) to ensure your emails are trusted and delivered reliably.
  6. Enable health checks: Route 53 can automatically monitor the health of your EC2 instances and route traffic away from unhealthy ones, ensuring your site stays up even if a server has issues. Route 53 can even automatically remove unhealthy instances from your DNS records.

This setup ensures high availability, scalability, and secure communication. But what makes Route 53 special? It’s not just about creating these records; it’s about doing it reliably and efficiently. Route 53 is designed for high availability and low latency. It uses a global network of DNS servers to ensure your website is always reachable, even if one server or region has problems. That means faster loading times for your users, no matter where they are.

Closing thoughts

AWS Route 53 isn’t just about creating DNS records, it’s about building robust, scalable, and secure internet infrastructure. It’s about making sure your website is always available to your users, no matter what. It’s like having a team of incredibly efficient digital postal workers who know exactly how to deliver each data packet to its correct destination. And what’s fascinating is that, like a well-designed metro system, Route 53 operates on multiple levels: it can direct traffic based on latency, geolocation, or even the health status of your services. Consider for a moment the massive scale at which services like Netflix or Amazon operate, keeping their platforms running smoothly with millions of simultaneous users. Part of that magic happens thanks to services like Route 53.
The beauty of it all lies in its apparent simplicity for the end user, everything works seamlessly, but behind the scenes, there’s a complex orchestration of systems working in perfect harmony. It’s like a symphony where each DNS record is a different instrument, and Route 53 is the conductor ensuring everything sounds exactly as it should.

Choosing the Right AWS Reserved Instance Regional or Zonal

Let’s talk buffets. You’ve got your “all-access” pass. The one that lets you roam freely, sampling a bit of everything the dining hall offers. That’s your “regional” pass. Then you’ve got the “specialist” pass, unlimited servings, but only at that one table with the perfectly cooked prime rib. This, my friends, is the heart of the matter when we’re talking about Regional and Zonal Reserved Instances (RIs) in the world of Amazon Web Services (AWS). Let’s break it down.

Think of Reserved Instances (RIs) as pre-paid meal tickets for your cloud computing needs. You commit to using a certain amount of computing power for a year or three, and in return, Amazon gives you a hefty discount compared to paying by the hour (on-demand pricing). It’s like saying, “Hey Amazon, I’m gonna need a lot of computing power. Can you give me a better price if I promise to use it?”

Now, within this world of RIs, you have two main flavors: Regional and Zonal.

Regional RIs, the flexible diners

These are your “roam around the buffet” passes. They’re not tied to a specific table (Availability Zone or AZ, in AWS lingo).

  • AZ flexibility: You can use your computing power in any AZ within a specific region. If one table is full, no problem, just move to another. If your application can work in any part of the region, it’s all good.
  • Instance size flexibility: This is like saying you can use your meal ticket for a large plate, a medium one, or even just a small snack, as long as it’s from the same food group (instance family). A t3.large reservation, for instance, can be applied to a t3.medium or even a t3.xlarge; AWS uses a normalization factor to do the math (there’s a quick sketch after this list).
  • Automatic discount: The discount applies automatically to any instance in the region that matches the attributes of your RI. You don’t have to do any special configurations.
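
To make the size-flexibility math concrete, here’s a tiny sketch using the published normalization factors for the t3 family (medium = 2, large = 4, xlarge = 8):

# Normalization factors for sizes within one family (t3 shown here)
FACTORS = {"t3.medium": 2, "t3.large": 4, "t3.xlarge": 8}

# One t3.large Regional RI buys 4 "units" of discounted capacity per hour
reserved_units = 1 * FACTORS["t3.large"]

# Those units fully cover two running t3.medium instances...
print(reserved_units / FACTORS["t3.medium"])  # 2.0

# ...or half of a t3.xlarge (the uncovered half is billed at on-demand rates)
print(reserved_units / FACTORS["t3.xlarge"])  # 0.5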

But there’s a catch (isn’t there always?). Regional RIs don’t guarantee you a seat at any specific table. If it’s a popular buffet (a busy AZ), and you need a seat there, you might be out of luck.

Zonal RIs, the reserved table crowd

These are for those who know exactly what they want and where they want it.

  • Capacity reservation in a specific AZ: You’re reserving a specific table at the buffet. You’re guaranteed to have a seat (computing power) in that particular AZ.
  • No size flexibility: You need to choose exactly your plate size. Your reservation only applies to the exact instance type and size you picked. If you reserved a table for roast beef, you can’t use it for the pasta, sadly.
  • Discount locked to your AZ: Your discount only works at your reserved table, in the specific AZ you’ve chosen.

So, when do you pick one over the other?

Go Regional when:

  • Your app is flexible: It can run happily in any AZ within a region. You care more about the discount than about being tied to a specific location. You like flexibility.
  • You want maximum savings: You want to squeeze every penny of savings by taking advantage of instance size flexibility.
  • You like things simple: Easier management, no need to juggle reservations across different AZs.
  • Use cases: Think web applications with load balancing, development, and testing environments, or batch processing jobs. They don’t care too much where they are located, just that they have the power to do what they have to do.

Go Zonal when:

  • You need guaranteed capacity: You absolutely, positively need computing power in a specific AZ. For example, maybe your app needs to be close to your users in a certain area of the world.
  • Your app is picky about location: Some apps need to be in a specific AZ for latency, compliance, or architectural reasons. Maybe you have a database that needs to be super close to your application server.
  • You know your needs: You have a good handle on your future computing needs in that specific AZ.
  • Use cases: Think primary databases that need to be close to the application layer, mission-critical applications that demand high availability in a single AZ.

A real example to chew on

Imagine you’re running a popular online game. Your player base is spread across a whole region. You use Regional RIs for your game servers because they’re load-balanced and can handle players connecting from anywhere in the region. You take advantage of the Regional flexibility.

But your game’s main database? That needs to be rock-solid and always available in a specific AZ for the lowest latency. For that, you’d use a Zonal RI, reserving capacity to ensure it’s always there when your players need it.

The Bottom Line

Choosing between Regional and Zonal RIs is about understanding your application’s needs and your priorities. It’s like choosing between a flexible buffet pass or a reserved table. Both can be great, it just depends on what you’re hungry for. If you want flexibility and maximum savings, go Regional. If you need guaranteed capacity in a specific location, go Zonal.

So, there you have it. Hopefully, this makes the world of AWS Reserved Instances a bit clearer, and perhaps a bit more appetizing. Now, if you’ll excuse me, all this talk of food has made me hungry. I’m off to find a buffet… I mean, to optimize some cloud instances. 🙂

Advanced strategies with AWS CloudWatch

Suppose you’re constructing a complex house. You wouldn’t just glance at the front door to check if everything is fine, you’d inspect the foundation, wiring, plumbing, and how everything connects. Modern cloud applications demand the same thoroughness, and AWS CloudWatch acts as your sophisticated inspector. In this article, let’s explore some advanced features of CloudWatch that often go unnoticed but can transform your cloud observability.

The art of smart alerting with composite alarms

Think back to playing with building blocks as a kid. You could stack them to build intricate structures. CloudWatch’s composite alarms work the same way. Instead of triggering an alarm every time one metric exceeds a threshold, you can combine multiple conditions to create smarter, context-aware alerts.

For instance, in a critical web application, high CPU usage alone might not indicate an issue; it could just be handling a traffic spike. But combine high CPU with increasing error rates and climbing response times, and you’ve got a red flag. Here’s an example:

CompositeAlarm:
  - Condition: CPU Usage > 80% for 5 minutes
  AND
  - Condition: Error Rate > 1% for 3 minutes
  AND
  - Condition: Response Time > 500ms for 3 minutes
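
Here’s roughly how that composite alarm could be wired up with boto3. It assumes the three child alarms already exist under made-up names, and the SNS topic ARN is a placeholder:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Only page someone when CPU, errors, and latency misbehave at the same time
cloudwatch.put_composite_alarm(
    AlarmName="web-app-degraded",
    AlarmRule=(
        "ALARM(high-cpu-usage) "
        "AND ALARM(high-error-rate) "
        "AND ALARM(slow-response-time)"
    ),
    AlarmDescription="CPU, error rate, and response time are all unhealthy",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:on-call"],  # placeholder SNS topic
)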

Take this a step further with Anomaly Detection. Instead of rigid thresholds, Anomaly Detection learns your system’s normal behavior patterns and adjusts dynamically. It’s like having an experienced operator who knows what’s normal at different times of the day or week. To enable it, you select a metric, turn on Anomaly Detection, and CloudWatch models the expected range from the metric’s historical data.

Exploring Step Functions and CloudWatch Insights

Now, let’s dive into a less-discussed yet powerful feature: monitoring AWS Step Functions. Think of Step Functions as a recipe, each step must execute in the right order. But how do you ensure every step is performing as intended?

CloudWatch provides detective-level insights into Step Functions workflows:

  • Tracing State Flows: Each state transition is logged, letting you see what happened and when.
  • Identifying Bottlenecks: Use CloudWatch Logs Insights to query logs and find steps that consistently take too long.
  • Smart Alerting: Set alarms for patterns, like repeated state failures.

Here’s a sample query to analyze Step Functions performance:

fields @timestamp, @message
| filter type = "TaskStateEntered"
| stats avg(duration) as avg_duration by stateName
| sort avg_duration desc
| limit 5

Armed with this information, you can optimize workflows, addressing bottlenecks before they impact users.
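
You can also run that query from code and feed the results into a report or dashboard. A rough boto3 sketch, assuming a hypothetical log group that your state machine writes to:

import time
import boto3

logs = boto3.client("logs")

QUERY = """
fields @timestamp, @message
| filter type = "TaskStateEntered"
| stats avg(duration) as avg_duration by stateName
| sort avg_duration desc
| limit 5
"""

# Hypothetical log group the state machine ships its history to
query_id = logs.start_query(
    logGroupName="/aws/vendedlogs/states/order-pipeline",
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString=QUERY,
)["queryId"]

# Logs Insights queries run asynchronously, so poll until the query finishes
results = logs.get_query_results(queryId=query_id)
while results["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    results = logs.get_query_results(queryId=query_id)

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})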

Managing costs with CloudWatch optimization

Let’s face it, unexpected cloud bills are never fun. While CloudWatch is powerful, it can be expensive if misused. Here are some strategies to optimize costs:

1. Smart metric collection

Categorize metrics by importance:

  • Critical metrics: Collect at 1-minute intervals.
  • Important metrics: Use 5-minute intervals.
  • Nice-to-have metrics: Collect every 15 minutes.

This approach can significantly lower costs without compromising critical insights.

2. Log retention policies

Treat logs like your photo library: keep only what’s valuable. For instance:

  • Security logs: Retain for 1 year.
  • Application logs: Retain for 3 months.
  • Debug logs: Retain for 1 week.

Set these policies in CloudWatch Log Groups to automatically delete old data.
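
Applying those tiers is one API call per log group. A small sketch with hypothetical log group names:

import boto3

logs = boto3.client("logs")

# Hypothetical log groups mapped to the retention tiers above (values in days)
RETENTION = {
    "/app/security-audit": 365,  # security logs: 1 year
    "/app/api": 90,              # application logs: 3 months
    "/app/debug": 7,             # debug logs: 1 week
}

for log_group, days in RETENTION.items():
    logs.put_retention_policy(logGroupName=log_group, retentionInDays=days)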

3. Metric filter optimization

Avoid creating a separate metric for every log event. Use metric filters to extract multiple insights from a single log entry, such as response times, error rates, and request counts.
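
For instance, if your application writes structured JSON log lines, a couple of metric filters on the same log group can turn each entry into both a latency metric and an error count. A sketch with hypothetical names, assuming fields like latency_ms and status exist in your logs:

import boto3

logs = boto3.client("logs")
LOG_GROUP = "/app/api"  # hypothetical log group receiving JSON log lines

# Publish the logged latency value as a metric
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="response-time",
    filterPattern="{ $.latency_ms >= 0 }",
    metricTransformations=[{
        "metricName": "ResponseTimeMs",
        "metricNamespace": "MyApp",
        "metricValue": "$.latency_ms",  # use the value from the log line itself
    }],
)

# Count server errors from the very same log entries
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="server-errors",
    filterPattern="{ $.status >= 500 }",
    metricTransformations=[{
        "metricName": "ServerErrors",
        "metricNamespace": "MyApp",
        "metricValue": "1",  # each matching line counts as one error
    }],
)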

Exploring new frontiers with Container Insights and Cross-Account Monitoring

Container Insights

If you’re using containers, Container Insights provides deep visibility into your containerized environments. What makes this stand out? You can correlate application-specific metrics with infrastructure metrics.

For example, track how application error rates relate to container restarts or memory spikes:

MetricFilters:
  ApplicationErrors:
    Pattern: "ERROR"
    Correlation:
      - ContainerRestarts
      - MemoryUtilization

Cross-Account monitoring

Managing multiple AWS accounts can be a complex challenge, especially when trying to maintain a consistent monitoring strategy. Cross-account monitoring in CloudWatch simplifies this by allowing you to centralize your metrics, logs, and alarms into a single monitoring account. This setup provides a “single pane of glass” view of your AWS infrastructure, making it easier to detect issues and streamline troubleshooting.

How it works:

  1. Centralized Monitoring Account: Designate one account as your primary monitoring hub.
  2. Sharing Metrics and Dashboards: Link each source account to the monitoring account with CloudWatch cross-account observability, so metrics, logs, and dashboards from those accounts show up in the central one.
  3. Cross-Account Alarms: Set up alarms that monitor metrics from multiple accounts, ensuring you’re alerted to critical issues regardless of where they occur.

Example: Imagine an organization with separate accounts for development, staging, and production environments. Each account collects its own CloudWatch data. By consolidating this information into a single account, operations teams can:

  • Quickly identify performance issues affecting the production environment.
  • Correlate anomalies across environments, such as a sudden spike in API Gateway errors during a new staging deployment.
  • Maintain unified dashboards for senior management, showcasing overall system health and performance.

Centralized monitoring not only improves operational efficiency but also strengthens your governance practices, ensuring that monitoring standards are consistently applied across all accounts. For large organizations, this approach can significantly reduce the time and effort required to investigate and resolve incidents.

How CloudWatch ServiceLens provides deep insights

Finally, let’s talk about ServiceLens, a feature that integrates CloudWatch with X-Ray traces. Think of it as X-ray vision for your applications. It doesn’t just tell you a request was slow, it pinpoints where the delay occurred, whether in the database, an API, or elsewhere.

Here’s how it works: ServiceLens combines traces, metrics, and logs into a unified view, allowing you to correlate performance issues across different components of your application. For example, if a user reports slow response times, you can use ServiceLens to trace the request’s path through your infrastructure, identifying whether the issue stems from a database query, an overloaded Lambda function, or a misconfigured API Gateway.

Example: Imagine you’re running an e-commerce platform. During a sale event, users start experiencing checkout delays. Using ServiceLens, you quickly notice that the delay correlates with a spike in requests to your payment API. Digging deeper with X-Ray traces, you discover a bottleneck in a specific DynamoDB query. Armed with this insight, you can optimize the query or increase the DynamoDB capacity to resolve the issue.

This level of integration not only helps you diagnose problems faster but also ensures that your monitoring setup evolves with the complexity of your cloud applications. By proactively addressing these bottlenecks, you can maintain a seamless user experience even under high demand.

Takeaways

AWS CloudWatch is more than a monitoring tool, it’s a robust observability platform designed to meet the growing complexity of modern applications. By leveraging its advanced features like composite alarms, anomaly detection, and ServiceLens, you can build intelligent alerting systems, streamline workflows, and maintain tighter control over costs.

A key to success is aligning your monitoring strategy with your application’s specific needs. Rather than tracking every metric, focus on those that directly impact performance and user experience. Start small, prioritizing essential metrics and alerts, then incrementally expand to incorporate advanced features as your application grows in scale and complexity.

For example, composite alarms can reduce alert fatigue by correlating multiple conditions, while ServiceLens provides unparalleled insights into distributed applications by unifying traces, logs, and metrics. Combining these tools can transform how your team responds to incidents, enabling faster resolution and proactive optimization.

With the right approach, CloudWatch not only helps you prevent costly outages but also supports long-term improvements in your application’s reliability and cost efficiency. Take the time to explore its capabilities and tailor them to your needs, ensuring that surprises are kept at bay while your systems thrive.

AWS Batch essentials for high-efficiency data processing

Suppose you’re conducting an orchestra where musicians can appear and disappear at will. Some charge premium rates, while others offer discounted performances but might leave mid-symphony. That’s essentially what orchestrating AWS Batch with Spot Instances feels like. Sounds intriguing. Let’s explore the mechanics of this symphony together.

What is AWS Batch, and why use it?

AWS Batch is a fully managed service that enables developers, scientists, and engineers to efficiently run hundreds, thousands, or even millions of batch computing jobs. Whether you’re processing large datasets for scientific research, rendering complex animations, or analyzing financial models, AWS Batch lets you focus on your work while it manages the compute resources for you.

One of the most compelling features of AWS Batch is its ability to integrate seamlessly with Spot Instances, On-Demand Instances, and other AWS services like Step Functions, making it a powerful tool for scalable and cost-efficient workflows.

Optimizing costs with Spot instances

Here’s something that often gets overlooked: using Spot Instances in AWS Batch isn’t just about cost-saving, it’s about using them intelligently. Think of your job queues as sections of the orchestra. Some musicians (On-Demand instances) are reliable but costly, while others (Spot Instances) are economical but may leave during the performance.

For example, we had a data processing pipeline that was costing a fortune. By implementing a hybrid approach with AWS Batch, we slashed costs by 70%. Here’s how:

computeEnvironment:
  type: MANAGED
  computeResources:
    type: SPOT
    allocationStrategy: SPOT_CAPACITY_OPTIMIZED
    instanceTypes:
      - optimal
    minvCpus: 0
    maxvCpus: 256

The magic happens when you set up automatic failover to On-Demand instances for critical jobs:

jobQueuePriority:
  spotQueue: 100
  onDemandQueue: 1
jobRetryStrategy:
  attempts: 2
  evaluateOnExit:
    - action: RETRY
      onStatusReason: "Host EC2*"

This hybrid strategy ensures that your workloads are both cost-effective and resilient, making the most out of Spot Instances while safeguarding critical jobs.
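
In code, that retry behavior travels with each job submission. A hedged boto3 sketch, with hypothetical queue and job definition names:

import boto3

batch = boto3.client("batch")

batch.submit_job(
    jobName="nightly-data-crunch",
    jobQueue="SpotQueue",           # hypothetical Spot-backed queue
    jobDefinition="DataProcessor",  # hypothetical job definition
    retryStrategy={
        "attempts": 2,
        "evaluateOnExit": [
            {
                "onStatusReason": "Host EC2*",  # status reason produced when a Spot host is reclaimed
                "action": "RETRY",
            }
        ],
    },
)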

Managing complex workflows with Step Functions

AWS Step Functions acts as the conductor of your data processing symphony, orchestrating workflows that use AWS Batch. It ensures that tasks are executed in parallel, retries are handled gracefully, and failures don’t derail your entire process. By visualizing workflows as state machines, Step Functions not only make it easier to design and debug processes but also offer powerful features like automatic retry policies and error handling. For example, it can orchestrate diverse tasks such as pre-processing, batch job submissions, and post-processing stages, all while monitoring execution states to ensure smooth transitions. This level of control and automation makes Step Functions an indispensable tool for managing complex, distributed workloads with AWS Batch.

Here’s a simplified pattern we’ve used repeatedly:

{
  "StartAt": "ProcessBatch",
  "States": {
    "ProcessBatch": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "ProcessDataSet1",
          "States": {
            "ProcessDataSet1": {
              "Type": "Task",
              "Resource": "arn:aws:states:::batch:submitJob",
              "Parameters": {
                "JobName": "ProcessDataSet1",
                "JobQueue": "SpotQueue",
                "JobDefinition": "DataProcessor"
              },
              "End": true
            }
          }
        }
      ]
    }
  }
}

This setup scales seamlessly and keeps the workflow running smoothly, even when Spot Instances are interrupted. The resilience of Step Functions ensures that the “show” continues without missing a beat.

Achieving zero-downtime updates

One of AWS Batch’s underappreciated capabilities is performing updates without downtime. The trick? A modified blue-green deployment strategy:

  1. Create a new compute environment with updated configurations.
  2. Create a new job queue linked to both the old and new compute environments.
  3. Gradually shift workloads by adjusting the order of compute environments.
  4. Drain and delete the old environment once all jobs are complete.

Here’s an example:

aws batch create-compute-environment \
    --compute-environment-name MyNewEnvironment \
    --type MANAGED \
    --state ENABLED \
    --compute-resources file://new-compute-resources.json

aws batch create-job-queue \
    --job-queue-name MyNewQueue \
    --priority 100 \
    --state ENABLED \
    --compute-environment-order order=1,computeEnvironment=MyNewEnvironment \
    order=2,computeEnvironment=MyOldEnvironment

Enhancing efficiency with multi-stage builds

Batch processing efficiency often hinges on container start-up times. We’ve seen scenarios where jobs spent more time booting up than processing data. Multi-stage builds and container reuse offer a powerful solution to this problem. By breaking down the container build process into stages, you can separate dependency installation from runtime execution, reducing redundancy and improving efficiency. Additionally, reusing pre-built containers ensures that only incremental changes are applied, which minimizes build and deployment times. This strategy not only accelerates job throughput but also optimizes resource utilization, ultimately saving costs and enhancing overall system performance.

Here’s a Dockerfile that cut our start-up times by 80%:

# Build stage
FROM python:3.9 AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

# Runtime stage
FROM python:3.9-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY . .
ENV PATH=/root/.local/bin:$PATH

This approach ensures your containers are lean and quick, significantly improving job throughput.

Final thoughts

AWS Batch is like a well-conducted orchestra: its efficiency lies in the harmony of its components. By combining Spot Instances intelligently, orchestrating workflows with Step Functions, and optimizing container performance, you can build a robust, cost-effective system.

The goal isn’t just to process data, it’s to process it efficiently, reliably, and at scale. AWS Batch empowers you to handle fluctuating workloads, reduce operational overhead, and achieve significant cost savings. By leveraging the flexibility of Spot Instances, the precision of Step Functions, and the speed of optimized containers, you can transform your workflows into a seamless and scalable operation.

Think of AWS Batch as a toolbox for innovation, where each component plays a crucial role. Whether you’re handling terabytes of genomic data, simulating financial markets, or rendering complex animations, this service provides the adaptability and resilience to meet your unique needs.

AWS microservices development using Event-Driven architecture

Microservices are all the rage these days, and for good reason. They offer a more flexible and scalable way to build applications compared to the old monolithic approach. However, with many independent services running around, things can get complex very quickly. This is where event-driven architecture shines, providing a robust way to manage and orchestrate microservices for better scalability, resilience, and agility.

1. Introduction

Imagine your application as a bustling city. In the past, we built applications like massive skyscrapers, monolithic structures that housed everything in one place. But, just like cities evolve, so does software development. Modern development is more like constructing a city filled with smaller, specialized buildings that each have a specific purpose. These buildings communicate and collaborate to get things done efficiently.

This microservices approach is crucial because it allows developers to build more complex and scalable applications while remaining agile and responsive to changes. Event-driven microservices, in particular, add flexibility by enabling communication through events, allowing services to act independently and asynchronously.

2. Fundamentals of microservices architecture

2.1 Core characteristics

Think of microservices as a well-coordinated team. Each member, or service, has a specific role:

  • Small, focused services: Each service is specialized, doing one thing well.
  • Autonomy and loose coupling: Services operate independently and communicate through well-defined interfaces, like team members collaborating on a shared task.
  • Independent data management: Each service manages its own data, ensuring data isolation and consistency.
  • Team ownership: Teams take ownership of the entire lifecycle of a service, from development to deployment and maintenance.
  • Resilient design: Services are designed to handle failures gracefully, preventing cascading failures and maintaining overall system stability.

2.2 Key advantages

This approach provides several benefits:

  • Agile development and deployment: Smaller services are easier to develop, test, and deploy, allowing rapid iterations and responsiveness to market demands.
  • Independent scalability: Each service can scale independently, optimizing resource utilization and reducing costs.
  • Enhanced fault tolerance: If one service fails, the rest of the system can continue operating, ensuring high availability.
  • Technological flexibility: Each service can use the most suitable technology, allowing teams to adopt the latest tools without being restricted by previous technology choices.
  • Alignment with DevOps: Microservices work well with modern practices like DevOps and Continuous Integration/Continuous Delivery (CI/CD), enabling faster and more reliable releases.

3. Communication patterns in microservices

3.1 API Gateway

The API Gateway is like the central hub of our city, directing all communication traffic smoothly. It provides a single entry point for requests, manages authentication, and routes requests to the appropriate services. It also helps with cross-cutting concerns like rate limiting and caching.

3.2 Communication strategies

Microservices can communicate in various ways:

  • Synchronous communication (REST/HTTP): This is like a direct phone call between services, one service makes a request to another and waits for a response. It’s straightforward but can lead to bottlenecks and dependencies.
  • Asynchronous communication (Message Queues): This is akin to sending a letter, one service sends a message to a queue, and the receiver processes it at its own pace. This promotes loose coupling and improves resilience.
  • Events and streaming: Like a public announcement system, one service publishes an event, and interested services subscribe and respond. This allows for real-time, scalable communication and is a key concept in event-driven architecture.

4. Event-Driven Architecture

Event-driven architecture is like a well-choreographed dance, where services react to events and trigger actions, each one moving in perfect synchrony without stepping on the toes of another. Just as dancers respond to cues, these services pick up signals and perform their designated tasks, creating a seamless flow of information and actions. This ensures that every service is aware of what it needs to do without a central authority dictating every move, allowing for flexibility and real-time responsiveness which is crucial in modern, dynamic applications.

4.1 Choreography vs Orchestration

  • Choreography: Imagine a group of dancers responding to each other’s moves without a central conductor. Each dancer is attuned to the others, watching for subtle shifts in movement and adjusting their own steps accordingly. In this approach, services listen for events and react independently, much like dancers who intuitively adapt to the rhythm and flow of the music around them. There is no central authority giving instructions, yet the performance feels harmonious and coordinated. This decentralized system allows each service to be agile, responding quickly to changes without the overhead of a central controller, making it ideal for complex environments where flexibility and adaptability are key.
  • Orchestration: Now picture an orchestra led by a conductor. The conductor signals each musician on when to start, how fast to play, and when to stop. In the same way, a central orchestrator manages the workflow, telling each service what to do and when. This level of centralized control can ensure that everything happens in the correct sequence, avoiding chaos and making sure all services are well synchronized. However, just like an orchestra depends heavily on the conductor, this approach introduces a potential single point of failure. If the orchestrator fails, the entire flow can come to a halt, making resilience planning critical in this setup. To mitigate this, redundancy and failover mechanisms are essential to maintain reliability.

The choice between choreography and orchestration depends on your specific needs. Choreography offers greater flexibility, allowing services to react independently and adapt quickly to changes, but it comes with less centralized control, which can make coordination challenging in more complex workflows. On the other hand, orchestration provides a high level of oversight, with a central authority ensuring all tasks happen in the right sequence. This can simplify the management of dependencies but at the cost of added complexity and potential bottlenecks. Ultimately, the decision hinges on the trade-off between autonomy and control, as well as the nature of the system’s requirements.

4.2 Event streaming

Event streaming can be thought of as a live news feed, providing a continuous stream of data that services can tap into. This enables real-time processing, allowing applications to respond to changes as they happen, such as fraud detection, personalized recommendations, or IoT analytics.

Example with AWS: Using Amazon Kinesis, you can create a streaming pipeline where data is continuously ingested, processed, and analyzed in real time. Imagine an online retail platform that needs to process user activity data, such as clicks, searches, and purchases. Amazon Kinesis acts like a real-time news broadcast where every click or search is an event being transmitted live. Different microservices listen to this data stream simultaneously. One service might update personalized recommendations based on what a user has searched for, another service might monitor suspicious activity in real-time to detect fraud, and yet another might aggregate data for business analytics, such as identifying popular products or customer behavior trends. By using Amazon Kinesis, these services can work concurrently on the same data stream, turning raw data into actionable insights immediately, much like how a news broadcast informs different departments (such as marketing, sales, and security) to take distinct actions based on the same information. This ensures that business demands are met proactively and services can adapt quickly to changing conditions.
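
On the producer side, feeding that stream takes just a few lines. A minimal boto3 sketch, assuming a hypothetical stream called user-activity:

import json
import boto3

kinesis = boto3.client("kinesis")

event = {
    "user_id": "u-42",
    "action": "search",
    "query": "wireless headphones",
}

# Every click, search, or purchase becomes one record on the shared stream;
# the partition key keeps a given user's events ordered on the same shard.
kinesis.put_record(
    StreamName="user-activity",  # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)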

5. Failure handling and resilience

No system is immune to failures, and that’s why effective failure-handling mechanisms are vital. Imagine a traffic signal failure in a busy city intersection, without a plan, it could lead to chaos, but with traffic officers stepping in, the flow is managed, minimizing the impact. In event-driven microservices, disruptions can lead to cascading failures if not managed correctly. Implementing robust failure handling strategies ensures that individual services can fail without bringing down the entire system, ultimately making the architecture more resilient and maintaining user trust. Designing for failure from the start helps maintain high availability, supports graceful degradation, and keeps the application responsive even under adverse conditions.

5.1 Fault tolerance strategies

  • Circuit breakers: Similar to an electrical fuse, they prevent cascading failures by stopping requests to a service that is currently failing.
  • Retry patterns: If a request fails, the system retries later, assuming the issue is temporary.
  • Dead letter queues (DLQs): When a message can’t be processed, it is placed in a DLQ for later inspection and troubleshooting.

5.2 Idempotency

Idempotency ensures that an operation can be safely retried without adverse effects. It means that no matter how many times the same operation is performed, the outcome will always be the same, provided that the input remains unchanged. This concept is crucial in distributed systems because failures can lead to retries or repeated messages. Without idempotency, these repetitions could result in unintended consequences like duplicated records, inconsistent data states, or faulty processing.

To achieve idempotency, operations must be designed in such a way that their result remains consistent even when performed multiple times. For example, an operation that deducts from an account balance must first check if it has already processed a particular request to avoid double deductions.

This is essential for handling repeated events and ensuring consistency in distributed systems.

Example: In AWS Lambda, you can use an idempotent function to guarantee that event replays from Amazon SQS won’t alter data incorrectly. By using unique transaction IDs or checking existing state before performing actions, Lambda functions can maintain consistency and prevent unintended side effects.
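
Here’s one way that idea can look in practice: a sketch of a Lambda handler that records each transaction ID in a DynamoDB table with a conditional write, so replayed SQS messages get detected and skipped. The table and field names are hypothetical:

import json
import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "processed-transactions"  # hypothetical table with partition key "transaction_id"

def process_order(body):
    # Placeholder for the real business logic (update the balance, ship the order, etc.)
    print("processing", body["transaction_id"])

def handler(event, context):
    for record in event["Records"]:  # SQS delivers messages in batches
        body = json.loads(record["body"])
        try:
            # The conditional write succeeds only the first time this ID is seen
            dynamodb.put_item(
                TableName=TABLE,
                Item={"transaction_id": {"S": body["transaction_id"]}},
                ConditionExpression="attribute_not_exists(transaction_id)",
            )
        except dynamodb.exceptions.ConditionalCheckFailedException:
            continue  # duplicate delivery: already processed, safe to skip

        process_order(body)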

6. Cloud implementation

The cloud provides an ideal platform for building event-driven microservices, offering scalability, resilience, and flexibility that traditional infrastructures often lack. AWS, in particular, has a rich ecosystem of services designed to support event-driven architectures, making it easier to deploy, manage, and scale microservices. By leveraging these cloud-native tools, developers can focus on business logic while benefiting from built-in reliability and automated scaling.

6.1 Serverless computing

Serverless computing is like renting an apartment instead of owning a house, you don’t have to worry about maintenance or management. AWS Lambda is perfect for microservices because it allows you to focus purely on the business logic without managing infrastructure. It also scales automatically with the volume of requests.

6.2 AWS services for Event-Driven microservices

AWS provides a variety of services to implement event-driven microservices:

  • Amazon SQS: A message queuing service for decoupling components and handling large volumes of requests.
  • Amazon SNS: A pub/sub messaging service for delivering notifications and distributing messages to multiple recipients.
  • Amazon Kinesis: A real-time data streaming service for analyzing and reacting to events in real-time.
  • AWS Lambda: A serverless compute service to run code in response to events, perfect for event-driven designs.
  • Amazon API Gateway: A fully managed service to create and manage APIs that can trigger AWS Lambda functions.

Practical Example: Imagine an e-commerce application where a new order triggers a Lambda function via Amazon SNS. This function processes the order, updates inventory through a microservice, and sends a notification using SNS, creating a fully automated, event-driven workflow.
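
A rough sketch of both sides of that workflow, with a hypothetical topic ARN and message shape: the checkout service publishes the order event to SNS, and a subscribed Lambda picks it up and carries on the processing:

import json
import boto3

sns = boto3.client("sns")
ORDER_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:new-orders"  # hypothetical topic

def publish_order(order_id, items):
    """Called by the checkout service when an order is placed."""
    sns.publish(
        TopicArn=ORDER_TOPIC_ARN,
        Message=json.dumps({"order_id": order_id, "items": items}),
        Subject="NewOrder",
    )

def handler(event, context):
    """Lambda subscribed to the topic: process the order and update inventory."""
    for record in event["Records"]:
        order = json.loads(record["Sns"]["Message"])
        print(f"processing order {order['order_id']} with {len(order['items'])} items")
        # ...call the inventory microservice, send the confirmation email, and so on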

7. Best practices and considerations

Building successful microservices requires careful design and planning. It involves understanding both the business requirements and technical constraints to create modular, scalable, and maintainable systems. Proper planning helps in defining service boundaries, selecting appropriate communication patterns, and ensuring each microservice is resilient and independently deployable.

7.1 Design and architecture

  • Optimal service size: Keep services small and focused on a single responsibility. This helps maintain simplicity and efficiency.
  • Data storage patterns: Choose the right data storage solution per service, whether it’s relational databases, NoSQL, or in-memory storage, based on consistency, performance, and scalability needs.
  • Versioning strategies: Use proper versioning to handle changes and maintain compatibility between services.

7.2 Operations

  • Monitoring and logging: Comprehensive logging and monitoring are crucial to track performance and identify issues. Think of it as keeping an eye on every moving part of a machine. Use AWS CloudWatch Logs to collect and analyze service logs, giving you insights into how each component is behaving. Meanwhile, AWS X-Ray helps you trace requests as they move through your microservices, much like following the path of a parcel as it moves through various distribution centers. This visibility allows you to detect bottlenecks, identify performance issues, and understand system behavior in real time, enabling faster troubleshooting and optimization.
  • Continuous deployment: Automate your CI/CD pipeline to deploy updates quickly and reliably. Use AWS CodePipeline in combination with Lambda to ensure new features are shipped efficiently. Continuous Deployment is about making sure that every change, once tested and verified, gets into production seamlessly. By integrating services like AWS CodeBuild, CodeDeploy, and leveraging automated testing, you create a streamlined flow from commit to deployment. This approach not only improves efficiency but also reduces human error, ensuring that your system stays up to date and can adapt to new business requirements without manual intervention.
  • Configuration management: Keep configuration out of your code and manage it centrally, for example with AWS Systems Manager Parameter Store or AWS AppConfig. Each microservice reads its settings at startup or at runtime, so you can change behavior, rotate credentials, or tune feature flags across environments without rebuilding and redeploying services.

8. Final Thoughts

Event-driven microservices represent a powerful way to build scalable, resilient, and highly agile applications. By adopting AWS services, such as Lambda, SNS, and Kinesis, you can simplify the complexities of distributed systems, allowing your team to focus more on the innovations that drive value rather than the intricacies of inter-service communication.

The future of software development lies in embracing distributed architectures and event-driven designs. These approaches empower teams to decouple services, enabling each one to evolve independently while maintaining harmony across the entire system. The ability to respond to events in real time allows for dynamic, adaptable systems that can handle unpredictable workloads and changing user demands. Staying ahead of the curve means not only adopting new technologies but also adapting the mindset of continuous improvement, which ensures that your applications remain robust and competitive in the ever-changing digital landscape.

Embrace the challenge with the tools AWS provides, such as serverless capabilities and event streaming, and watch as your microservices evolve into the backbone of a truly agile, modern, and resilient application ecosystem. By leveraging these tools effectively, you’ll not only simplify operations but also unlock new possibilities for rapid scaling and enhanced fault tolerance, ultimately providing the stability and flexibility needed to thrive in today’s tech world.

How many pods fit on an AWS EKS node?

Managing Kubernetes workloads on AWS EKS (Elastic Kubernetes Service) is much like managing a city, you need to know how many “tenants” (Pods) you can fit into your “buildings” (EC2 instances). This might sound straightforward, but a bit more is happening behind the scenes. Each type of instance has its characteristics, and understanding the limits is key to optimizing your deployments and avoiding resource headaches.

Why Is there a pod limit per node in AWS EKS?

Imagine you want to deploy several applications as Pods across several instances in AWS EKS. You might think, “Why not cram as many as possible onto each node?” Well, there’s a catch. Every EC2 instance in AWS has a limit on networking resources, which ultimately determines how many Pods it can support.

Each EC2 instance supports a certain number of Elastic Network Interfaces (ENIs), and each ENI can hold a certain number of IPv4 addresses. But not all of these IP addresses are available for Pods: the first address on each ENI is reserved for the ENI itself, while essential services like the AWS CNI (Container Network Interface) and kube-proxy run on the node’s own address to maintain connectivity and communication across your cluster.

Think of each ENI like an apartment building, and the IPv4 addresses as individual apartments. Not every apartment is available to your “tenants” (Pods), because AWS keeps some for maintenance. So, when calculating the maximum number of Pods for a specific instance type, you need to take this into account.

For example, a t3.medium instance has a maximum capacity of 17 Pods. A slightly bigger t3.large can handle up to 35 Pods. The difference depends on the number of ENIs and how many apartments (IPv4 addresses) each ENI can hold.

Formula to calculate Max pods per EC2 instance

To determine the maximum number of Pods that an instance type can support, you can use the following formula:

Max Pods = Number of ENIs × (IPv4 addresses per ENI – 1) + 2

(The – 1 accounts for each ENI’s own reserved address; the + 2 covers the CNI and kube-proxy Pods, which run on the node’s primary IP instead of consuming a secondary one.)

Let’s apply this to a t2.medium instance:

  • Number of ENIs: 3
  • IPv4 addresses per ENI: 6

Using these values, we get:

Max Pods = 3 × (6 – 1) + 2

Max Pods = 15 + 2

Max Pods = 17

So, a t2.medium instance in EKS can support up to 17 Pods. It’s important to understand that this number isn’t arbitrary, it reflects the way AWS manages networking to keep your cluster running smoothly.
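
If you want to sanity-check an instance type yourself, the formula fits in a couple of lines of Python (the ENI and IP figures below are the t3.medium and t3.large values mentioned earlier):

def max_pods(enis: int, ips_per_eni: int) -> int:
    # One IP per ENI is reserved for the ENI itself, and the CNI and
    # kube-proxy Pods run on the host network, hence the +2.
    return enis * (ips_per_eni - 1) + 2

print(max_pods(3, 6))   # t3.medium -> 17
print(max_pods(3, 12))  # t3.large  -> 35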

Why does this matter?

Knowing the limits of your EC2 instances can be crucial when planning your Kubernetes workloads. If you exceed the maximum number of Pods, some of your applications might fail to deploy, leading to errors and downtime. On the other hand, choosing an instance that’s too large might waste resources, costing you more than necessary.

Suppose you’re running a city, and you need to decide how many tenants each building can support comfortably. You don’t want buildings overcrowded with tenants, nor do you want them half-empty. Similarly, you need to find the sweet spot in AWS EKS, enough Pods to maximize efficiency, but not so many that your node runs out of resources.

The apartment analogy

Consider an m5.large instance. It has 3 ENIs, and each ENI can hold 10 IPv4 addresses. But AWS keeps one apartment (the primary IP address) in each building (ENI) for maintenance staff (essential services). Using our formula, we can estimate how many Pods (tenants) we can fit.

  • Number of ENIs: 3
  • IPv4 addresses per ENI: 10

Max Pods = 3 × (10 – 1) + 2

Max Pods = 27 + 2

Max Pods = 29

So, an m5.large can support up to 29 Pods. This limit helps ensure that the building (instance) doesn’t get overwhelmed and that the essential services can function without issues.

Automating the calculation

Manually calculating these limits can be tedious, especially if you’re managing multiple instance types or scaling dynamically. Thankfully, AWS provides tools and scripts to help automate these calculations. You can use the kubectl describe node command to get insights into your node’s capacity or refer to AWS documentation for Pod limits by instance type. Automating this step saves time and helps you avoid deployment issues.
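For example, here are two quick ways to see the limit your nodes are actually reporting (the node name is a placeholder):

# List every node with the Pod capacity it advertises to the scheduler.
kubectl get nodes -o custom-columns=NAME:.metadata.name,MAXPODS:.status.capacity.pods

# Or inspect a single node in detail and look at "Capacity" and "Allocatable".
kubectl describe node my-node-name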

Best practices for scaling

When planning the architecture of your EKS cluster, consider these best practices:

  • Match instance type to workload needs: If your application requires many Pods, opt for an instance type with more ENIs and IPv4 capacity.
  • Consider cost efficiency: Sometimes, using fewer large instances can be more cost-effective than using many smaller ones, depending on your workload.
  • Leverage autoscaling: AWS allows you to set up autoscaling for both your Pods and your nodes. This can help ensure that you have the right amount of capacity during peak and off-peak times without manual intervention.

Key takeaways

Understanding the Pod limits per EC2 instance in AWS EKS is more than just a calculation; it’s about ensuring your Kubernetes workloads run smoothly and efficiently. By thinking of ENIs as buildings and IP addresses as apartments, you can simplify the complexity of AWS networking and better plan your deployments.

Like any good city planner, you want to make sure there’s enough room for everyone, but not so much that you’re wasting space. AWS gives you the tools; you just need to know how to use them.

Container deployment in AWS with ECS, EKS, and Fargate

How do the apps you use daily get built, shipped, and scaled so smoothly? A lot of it has to do with the magic of containers. Think of containers like neat little LEGO blocks, self-contained, portable, and ready to snap together to build something awesome. In the tech world, these blocks hold all the essential bits and pieces of an application, making it super easy to move them around and run them anywhere.

Imagine you’ve got a bunch of these LEGO blocks, each representing a different part of your app. You’ll need a good way to organize them, right? That’s where container orchestration comes in. It’s like having a master builder who knows how to put those blocks together, make sure they’re all playing nicely, and even create more blocks when things get busy.

And guess what? AWS, the cloud superhero, has a whole toolkit to help you with this container adventure. 

AWS container services toolkit

AWS offers a variety of services that work together like a well-oiled machine to help you build, deploy, and manage your containerized applications.

Amazon Elastic Container Registry (ECR) – Your container garage

Think of ECR as your very own garage for storing container images. It’s a fully managed service that allows you to store, share, and deploy your container images securely. ECR is like a safe and organized space where you keep all your valuable LEGO creations. You can easily control who has access to your images, making sure only the right people can use them. Plus, it integrates seamlessly with other AWS services, making it a breeze to include in your workflows.
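Getting a LEGO creation into the garage takes only a handful of commands. Here’s a hedged sketch of the usual workflow; the account ID, region, and repository name are placeholders:

# Create a repository, log Docker in to ECR, then tag and push the image.
aws ecr create-repository --repository-name my-app

aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com

docker tag my-app:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest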

Amazon Elastic Container Service (ECS) – Your container playground

Once you’ve got your container images stored safely in ECR, what’s next? Meet ECS, your container playground! ECS is a highly scalable and high-performance container orchestration service that allows you to run and manage your containers on a cluster of Amazon EC2 instances. It’s like having a dedicated play area where you can arrange your LEGO blocks, build amazing structures, and even add or remove blocks as needed. ECS takes care of all the heavy lifting, so you can focus on what matters most: building awesome applications.

Amazon Elastic Kubernetes Service (EKS) – Your Kubernetes command center

For those of you who prefer the Kubernetes way of doing things, AWS has you covered with EKS. It’s a managed Kubernetes service that makes it easy to run Kubernetes on AWS without having to worry about managing the underlying infrastructure. Kubernetes is like a super-sophisticated set of instructions for building complex LEGO structures. EKS takes care of all the complexities of managing Kubernetes so that you can focus on building and deploying your applications.
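To see how little it takes to get a managed control plane running, here’s a minimal sketch using eksctl, the open-source CLI commonly used with EKS; the cluster name, region, and node sizing are only illustrative:

# Create an EKS cluster with a small managed node group.
eksctl create cluster \
  --name my-cluster \
  --region us-east-1 \
  --nodegroup-name workers \
  --node-type t3.medium \
  --nodes 2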

EC2 vs. Fargate – Choosing your foundation

Now, let’s talk about the foundation of your container playground. You have two main options: EC2 and Fargate.

EC2-based container deployment – The DIY approach

With EC2, you get full control over the underlying infrastructure. It’s like building your own LEGO table from scratch. You choose the size, shape, and color of the table, and you’re responsible for keeping it clean and tidy. This gives you a lot of flexibility, but it also means you have more responsibilities.

AWS Fargate – The hassle-free option

Fargate, on the other hand, is like having a magical LEGO table that appears whenever you need it. You don’t have to worry about building or maintaining the table; you just focus on playing with your LEGOs. Fargate is a serverless compute engine for containers, meaning you don’t have to manage any servers. It’s a great option if you want to simplify your operations and reduce your overhead.

Making the right choice

So, which option is right for you? Well, it depends on your specific needs and preferences. If you need full control over your infrastructure and want to optimize costs by managing your own servers, EC2 might be a good choice. But if you prefer a serverless approach and want to avoid the hassle of managing servers, Fargate is the way to go.
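To make the difference concrete, here’s a hedged sketch of the Fargate path with ECS: no instances to create or patch, just a cluster and a task. The task definition and subnet are placeholders you’d have registered and created beforehand:

# Create a cluster and run a task on Fargate (no EC2 instances involved).
aws ecs create-cluster --cluster-name playground

aws ecs run-task \
  --cluster playground \
  --launch-type FARGATE \
  --task-definition my-task:1 \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],assignPublicIp=ENABLED}"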

AWS Container Services Compared

To make things easier, here’s a quick comparison of ECS, EKS, and Fargate:

  • ECS: Managed container orchestration for EC2 instances. Use case: great for full control over infrastructure.
  • EKS: Managed Kubernetes service. Use case: ideal for teams with Kubernetes expertise.
  • Fargate: Serverless compute engine for ECS or EKS. Use case: simplifies operations, no infrastructure management.

Best practices and security for building a secure and reliable playground

Just like any playground, your container environment needs to be safe and secure. AWS provides a range of tools and best practices to help you build a reliable and secure container playground.

Security best practices for keeping your LEGOs safe

AWS offers a variety of security features to help you protect your container environment. You can use IAM to control access to your resources, implement network security measures (like Security Groups and NACLs) to protect your containers from unauthorized access, and scan your container images for vulnerabilities with tools like Amazon Inspector.

High availability for ensuring your playground is always open

To ensure your applications are always available, you can use AWS’s high-availability features. This includes deploying your containers across multiple availability zones, configuring load balancing to distribute traffic across your containers, and implementing disaster recovery measures to protect your applications from unexpected events.

Monitoring and troubleshooting for keeping an eye on your playground

AWS provides comprehensive monitoring and troubleshooting tools to help you keep your container environment running smoothly. You can use CloudWatch to monitor your containers’ performance, set up detailed alarms to catch issues before they escalate, and use CloudWatch Logs to dive deep into the activity of your applications. Additionally, AWS X-Ray helps you trace requests as they travel through your application, giving you a granular view of where bottlenecks or failures may occur. These tools together allow for proactive monitoring, quick detection of anomalies, and effective root-cause analysis, ensuring that your container environment is always optimized and functioning properly.

DevOps integration for automating your LEGO creations

AWS container services integrate seamlessly with your DevOps workflows, allowing you to automate deployments, ensure consistent environments, and streamline the entire development lifecycle. By integrating services like CodeBuild, CodeDeploy, and CodePipeline, AWS enables you to create end-to-end CI/CD pipelines that automate testing, building, and releasing your containerized applications. This integration helps teams release features faster, reduce errors due to manual processes, and maintain a high level of consistency across different environments.

CI/CD pipeline integration for building and deploying automatically

You can use AWS CodePipeline to create a continuous integration and continuous delivery (CI/CD) pipeline that automatically builds, tests, and deploys your containerized applications. This allows you to release new features and updates quickly and efficiently. Imagine using CodePipeline as an automated assembly line for your LEGO creations.

Cost optimization for saving money on your LEGOs

AWS offers a variety of cost optimization tools to help you save money on your container deployments. You can use ECR lifecycle policies to manage your container images efficiently, choose the right instance types for your workloads, and leverage AWS’s pricing models to optimize your costs. Additionally, AWS provides Savings Plans and Spot Instances, which allow you to significantly reduce costs when running containerized workloads with flexible scheduling. Utilizing the AWS Compute Optimizer can also help identify opportunities to downsize or modify your infrastructure to be more cost-effective, ensuring you’re always operating in a lean and optimized manner.

Real-world implementation for bringing your LEGO creations to life

Deploying containerized applications in a production environment requires careful planning and execution. This involves assessing your infrastructure, understanding resource requirements, and preparing for potential scaling needs. AWS provides a range of tools and best practices, such as infrastructure templates, automated deployment scripts, and monitoring solutions, to help ensure that your applications are deployed successfully. Additionally, AWS recommends using blue-green deployments to minimize downtime and risk, as well as leveraging autoscaling to maintain performance under varying loads.

Production deployment checklist for your pre-flight check

Before deploying your applications, it’s important to consider a few key factors, such as your application’s requirements, your infrastructure needs, and your security and compliance requirements. AWS provides a comprehensive checklist to help you ensure your applications are ready for production.

Common challenges and solutions for troubleshooting your LEGO creations

Deploying and managing containerized applications can present some challenges, such as dealing with scaling complexities, managing network configurations, or troubleshooting performance bottlenecks. However, AWS provides a wealth of resources and support to help you overcome these challenges. You can find solutions to common problems, troubleshooting tips, and best practices in the AWS documentation, community forums, and even through AWS Support Plans, which offer access to technical experts. Additionally, tools like AWS Trusted Advisor can help identify potential issues before they impact your applications, while the AWS Well-Architected Framework offers guidance on optimizing your container deployments for reliability, performance, and cost-efficiency.

Choosing the right tools for the job

AWS offers a comprehensive suite of container services to help you build, deploy, and manage your applications. By understanding the different services and their capabilities, you can choose the right tools for your specific needs and build a secure, reliable, and cost-effective container environment.

The key is to choose the right tools for the job and follow best practices to ensure your applications are secure, reliable, and scalable.

Helm or Kustomize for deploying to Kubernetes?

Choosing the right tool for continuous deployments is a big decision. It’s like picking the right vehicle for a road trip. Do you go for the thrill of a sports car or the reliability of a sturdy truck? In our world, the “cargo” is your application, and we want to ensure it reaches its destination smoothly and efficiently.

Two popular tools for this task are Helm and Kustomize. Both help you manage and deploy applications on Kubernetes, but they take different approaches. Let’s dive in, explore how they work, and help you decide which one might be your ideal travel buddy.

What is Helm?

Imagine Helm as a Kubernetes package manager, similar to apt or yum if you’ve worked with Linux before. It bundles all your application’s Kubernetes resources (like deployments, services, etc.) into a neat Helm chart package. This makes installing, upgrading, and even rolling back your application straightforward.

Think of a Helm chart as a blueprint for your application’s desired state in Kubernetes. Instead of manually configuring each element, you have a pre-built plan that tells Kubernetes exactly how to construct your environment. Helm provides a command-line tool, helm, to create these charts. You can start with a basic template and customize it to suit your needs, like a pre-fabricated house that you can modify to match your style. Here’s what a typical Helm chart looks like:

mychart/
  Chart.yaml        # Describes the chart
  templates/        # Contains template files
    deployment.yaml # Template for a Deployment
    service.yaml    # Template for a Service
  values.yaml       # Default configuration values

Helm makes it easy to reuse configurations across different projects and share your charts with others, providing a practical way to manage the complexity of Kubernetes applications.
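Working with a chart day to day boils down to a handful of commands. A minimal sketch; the chart and release names are placeholders:

helm create mychart                    # scaffold a chart like the one above
helm install my-release ./mychart      # deploy it to the cluster
helm upgrade my-release ./mychart --set replicaCount=3   # roll out a change
helm history my-release                # list past revisions of the release
helm rollback my-release 1             # go back to revision 1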

What is Kustomize?

Now, let’s talk about Kustomize. Imagine Kustomize as a powerful customization tool for Kubernetes, a versatile toolkit designed to modify and adapt existing Kubernetes configurations. It provides a way to create variations of your deployment without having to rewrite or duplicate configurations. Think of it as having a set of advanced tools to tweak, fine-tune, and adapt everything you already have. Kustomize allows you to take a base configuration and apply overlays to create different variations for various environments, making it highly flexible for scenarios like development, staging, and production.

Kustomize works by applying patches and transformations to your base Kubernetes YAML files. Instead of duplicating the entire configuration for each environment, you define a base once, and then Kustomize helps you apply environment-specific changes on top. Imagine you have a basic configuration, and Kustomize is your stencil and spray paint set, letting you add layers of detail to suit different environments while keeping the base consistent. Here’s what a typical Kustomize project might look like:

base/
  deployment.yaml
  service.yaml

overlays/
  dev/
    kustomization.yaml
    patches/
      deployment.yaml
  prod/
    kustomization.yaml
    patches/
      deployment.yaml

The structure is straightforward: you have a base directory that contains your core configurations, and an overlays directory that includes different environment-specific customizations. This makes Kustomize particularly powerful when you need to maintain multiple versions of an application across different environments, like development, staging, and production, without duplicating configurations.

Kustomize shines when you need to maintain variations of the same application for multiple environments, such as development, staging, and production. This helps keep your configurations DRY (Don’t Repeat Yourself), reducing errors and simplifying maintenance. By keeping base definitions consistent and only modifying what’s necessary for each environment, you can ensure greater consistency and reliability in your deployments.
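Rendering and applying those variants is built into kubectl itself. A minimal sketch, assuming the directory layout shown above:

kubectl kustomize overlays/dev    # render the dev variant to stdout
kubectl apply -k overlays/dev     # build and apply the dev variant
kubectl apply -k overlays/prod    # same base, different overlay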

Helm vs Kustomize, different approaches

Helm uses templating to generate Kubernetes manifests. It takes your chart’s templates and values, combines them, and produces the final YAML files that Kubernetes needs. This templating mechanism allows for a high level of flexibility, but it also adds a level of complexity, especially when managing different environments or configurations. With Helm, the user must define various parameters in values.yaml files, which are then injected into templates, offering a powerful but sometimes intricate method of managing deployments.

Kustomize, by contrast, uses a patching approach, starting from a base configuration and applying layers of customizations. Instead of generating new YAML files from scratch, Kustomize allows you to define a consistent base once, and then apply overlays for different environments, such as development, staging, or production. This means you do not need to maintain separate full configurations for each environment, making it easier to ensure consistency and reduce duplication. Kustomize’s patching mechanism is particularly powerful for teams looking to maintain a DRY (Don’t Repeat Yourself) approach, where you only change what’s necessary for each environment without affecting the shared base configuration. This also helps minimize configuration drift, keeping environments aligned and easier to manage over time.

Ease of use

Helm can be a bit intimidating at first due to its templating language and chart structure. It’s like jumping straight onto a motorcycle, whereas Kustomize might feel more like learning to ride a bike with training wheels. Kustomize is generally easier to pick up if you are already familiar with standard Kubernetes YAML files.

Packaging and reusability

Helm excels when it comes to packaging and distributing applications. Helm charts can be shared, reused, and maintained, making them perfect for complex applications with many dependencies. Kustomize, on the other hand, is focused on customizing existing configurations rather than packaging them for distribution.

Integration with kubectl

Both tools integrate well with Kubernetes’ command-line tool, kubectl. Helm has its own CLI, helm, which works alongside kubectl, while Kustomize is built into kubectl and can be used directly via the -k flag.

Declarative vs. Imperative

Kustomize follows a declarative model: you describe what you want, and it figures out how to get there. Helm can be used both declaratively and imperatively, offering more flexibility but also more complexity if you want to take a hands-on approach.

Release history management

Helm provides built-in release management, keeping track of the history of your deployments so you can easily roll back to a previous version if needed. Kustomize lacks this feature, which means you need to handle versioning and rollback strategies separately.

CI/CD integration

Both Helm and Kustomize can be integrated into your CI/CD pipelines, but their roles and strengths differ slightly. Helm is frequently chosen for its ability to package and deploy entire applications. Its charts encapsulate all necessary components, making it a great fit for automated, repeatable deployments where consistency and simplicity are key. Helm also provides versioning, which allows you to manage releases effectively and roll back if something goes wrong, which is extremely useful for CI/CD scenarios.

Kustomize, on the other hand, excels at adapting deployments to fit different environments without altering the original base configurations. It allows you to easily apply changes based on the environment, such as development, staging, or production, by layering customizations on top of the base YAML files. This makes Kustomize a valuable tool for teams that need flexibility across multiple environments, ensuring that you maintain a consistent base while making targeted adjustments as needed.

In practice, many DevOps teams find that combining both tools provides the best of both worlds: Helm for packaging and managing releases, and Kustomize for environment-specific customizations. By leveraging their unique capabilities, you can build a more robust, flexible CI/CD pipeline that meets the diverse needs of your application deployment processes.

Helm and Kustomize together

Here’s an interesting twist: you can use Helm and Kustomize together! For instance, you can use Helm to package your base application, and then apply Kustomize overlays for environment-specific customizations. This combo allows for the best of both worlds, standardized base configurations from Helm and flexible customizations from Kustomize.
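Here’s a hedged sketch of one way to wire that up: render the chart with Helm, then let Kustomize layer environment-specific patches on top. It assumes the base kustomization.yaml lists all.yaml as a resource; names and paths are placeholders:

# Render the Helm chart into the Kustomize base, then apply an overlay.
helm template my-release ./mychart > base/all.yaml
kubectl apply -k overlays/prod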

Use cases for combining Helm and Kustomize

  • Environment-Specific customizations: Use Kustomize to apply environment-specific configurations to a Helm chart. This allows you to maintain a single base chart while still customizing for development, staging, and production environments.
  • Third-Party Helm charts: Instead of forking a third-party Helm chart to make changes, Kustomize lets you apply those changes directly on top, making it a cleaner and more maintainable solution.
  • Secrets and ConfigMaps management: Kustomize allows you to manage sensitive data, such as secrets and ConfigMaps, separately from Helm charts, which can help improve both security and maintainability.

Final thoughts

So, which tool should you choose? The answer depends on your needs and preferences. If you’re looking for a comprehensive solution to package and manage complex Kubernetes applications, Helm might be the way to go. On the other hand, if you want a simpler way to tweak configurations for different environments without diving into templating languages, Kustomize may be your best bet.

My advice? If the application is for internal use within your organization, use Kustomize. If the application is to be distributed to third parties, use Helm.

Understanding and using AWS EC2 status checks

Picture yourself running a restaurant. Every morning before opening, you would check different things: Are the refrigerators working? Is there power in the building? Does the kitchen equipment function properly? These checks ensure your restaurant can serve customers effectively. Similarly, Amazon Web Services (AWS) performs various checks on your EC2 instances to ensure they’re running smoothly. Let’s break this down in simple terms.

What are EC2 status checks?

Think of EC2 status checks as your instance’s health monitoring system. Just like a doctor checks your heart rate, blood pressure, and temperature, AWS continuously monitors different aspects of your EC2 instances. These checks happen automatically every minute, and best of all, they are free!

The three types of status checks

1. System status checks as the building inspector

System status checks are like a building inspector. They focus on the infrastructure rather than what is happening in your instance. These checks monitor:

  • The physical server’s power supply
  • Network connectivity
  • System software
  • Hardware components

When a system status check fails, it is usually an issue outside your control. It is akin to when your apartment building loses power – there’s not much you can do personally to fix it. In these cases, AWS is responsible for the repairs.

What can you do if it fails?

  • Wait for AWS to fix the underlying problem (similar to waiting for the power company to restore electricity).
  • You can move your instance to a new “building” by stopping and starting it (note: this is different from simply rebooting).

2. Instance status checks as your personal space monitor

Instance status checks are like having a smart home system that monitors what is happening inside your apartment. These checks look at:

  • Your instance’s operating system
  • Network configuration
  • Software settings
  • Memory usage
  • File system status
  • Kernel compatibility

When these checks fail, it typically means there’s an issue you need to address. It is similar to accidentally tripping a circuit breaker in your apartment – the infrastructure is fine, but the problem is within your own space.

How to fix instance status check failures:

  1. Restart your instance (like resetting that tripped circuit breaker).
  2. Review and modify your instance configuration.
  3. Make sure your instance has enough memory.
  4. Check for corrupted file systems and repair them if needed.

3. EBS status checks as your storage guardian

EBS status checks are like monitoring your external storage unit. They monitor the health of your attached storage volumes and can detect issues like:

  • Hardware problems with the storage system
  • Connectivity problems between your instance and its storage
  • Physical host issues affecting storage access

What to do if EBS checks fail:

  • Restart your instance to try to restore connectivity.
  • Replace problematic EBS volumes.
  • Check and fix any connectivity issues.

How to monitor these checks

Monitoring status checks is straightforward, and you have several options:

  1. Using the AWS management console
    • Open the EC2 console.
    • Select your instance.
    • Look at the “Status Checks” tab.

It’s that simple! You’ll see either a green check (passing) or a red X (failing) for each type of check.
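  2. Using the AWS CLI

If you prefer the terminal, the same information is one command away. A minimal sketch; the instance ID is a placeholder:

aws ec2 describe-instance-status \
  --instance-ids i-1234567890abcdef0 \
  --query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' \
  --output text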

Setting up automated monitoring

Now, here’s where things get interesting. You can set up Amazon CloudWatch to alert you if something goes wrong. It is like having a security system that notifies you if there is an issue.

Here’s a simple example:

aws cloudwatch put-metric-alarm \
  --alarm-name "Instance-Health-Check" \
  --namespace "AWS/EC2" \
  --metric-name "StatusCheckFailed" \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Maximum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:region:account-id:topic-name

Each parameter here has its purpose:

  • --alarm-name: The name of your alarm.
  • --namespace and --metric-name: These identify the CloudWatch metric you are interested in.
  • --dimensions: Specifies the instance ID being monitored.
  • --statistic: How the metric data points are aggregated over each period (Maximum here, so any failed check during the period counts).
  • --period and --evaluation-periods: Define how often to check and for how long.
  • --threshold and --comparison-operator: Set the condition for triggering an alarm.
  • --alarm-actions: The action to take if the alarm state is triggered, like notifying you via SNS.

You could also set up these alarms through the AWS Management Console, which offers an intuitive UI for configuring CloudWatch.

Best practices for status checks

1. Don’t wait for problems
  • Set up CloudWatch alarms for all critical instances.
  • Monitor trends in status check results.
  • Document common issues and their solutions to improve response times.
2. Automate recovery
  • Configure automatic recovery actions for system status check failures (see the example after this list).
  • Create automated backup systems and recovery procedures.
  • Test recovery processes regularly to ensure they work when needed.
3. Keep records
  • Log all status check failures.
  • Document steps taken to resolve issues.
  • Track recurring problems and implement solutions to prevent future failures.
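For the automatic recovery mentioned above, here’s a hedged sketch of a CloudWatch alarm that triggers EC2’s built-in recover action when the system status check fails; the region and instance ID are placeholders:

aws cloudwatch put-metric-alarm \
  --alarm-name "Auto-Recover-Instance" \
  --namespace "AWS/EC2" \
  --metric-name "StatusCheckFailed_System" \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:automate:us-west-2:ec2:recover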

Cost considerations

The good news? Status checks themselves are free! However, some recovery actions might incur costs, such as:

  • Starting and stopping instances (which might change your public IP).
  • Data transfer costs during recovery.
  • Additional EBS volumes if replacements are needed.

Real-World example

Imagine you receive an alert at 3 AM about a failed system status check. Here is how you might handle it:

  1. Check the AWS status page: See if there is a broader AWS issue.
  2. If it is isolated to your instance:
    • Stop and start the instance (not just reboot).
    • Check if the issue persists once the instance moves to new hardware.
  3. If the problem continues:
    • Review instance logs for more clues (see the command after this list).
    • Contact AWS Support if the issue is beyond your expertise or remains unresolved.
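For step 3, one hedged way to pull clues without logging in is to grab the instance’s console output; the instance ID is a placeholder:

# Fetch the system console output (boot messages, kernel errors, and so on).
aws ec2 get-console-output --instance-id i-1234567890abcdef0 --output text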

Final thoughts

EC2 status checks are your early warning system for potential problems. They are simple to understand but incredibly powerful for keeping your applications running smoothly. By monitoring these checks and setting up appropriate alerts, you can catch and address problems before they impact your users.

Remember: the best problems are the ones you prevent, not the ones you fix. Regular monitoring and proper setup of status checks will help you sleep better at night, knowing your instances are being watched over.

Next time you log into your AWS console, take a moment to check your status checks. They’re like a 24/7 health monitoring system for your cloud infrastructure, ensuring you maintain a healthy, reliable system.