Coding with Jesse

Does your web server scale down?

A laptop computer sleeping in the moonlight

Are you paying for servers sitting idle in the middle of the night?

When we talk about scaling a web server, we often focus on scaling up. Can your server handle a spike in traffic? As your business grows, can your database handle the growth?

There's less focus on scaling down. It makes sense, because most businesses are focused on growth. Not too many are looking to shrink. But if you're not careful, your server costs might go up and never come back down.

No web traffic is completely consistent. It grows during the day when people are awake. It shrinks at night when people sleep. It spikes with a popular marketing campaign. It retracts after a marketing campaign winds down.

A simple approach to scaling is to turn up the dial when a server gets overwhelmed. Upgrade to a server with a more powerful CPU. Increase the memory available. Unfortunately, this approach only moves in one direction.

A better solution is to have a dial that can turn both up and down. The way to achieve this is through a pool of servers and a load balancer. When traffic increases, start up new servers. When traffic decreases, terminate the excess capacity. Keep all your servers as busy as possible.

For lower volume sites, serverless deployments handle this beautifully. When nobody is using the server, you don't pay anything. When there's a spike, it can scale up to handle it.

At some point, it becomes cheaper and faster to run your own servers. If you do, you'll want an autoscaling pool and a load balancer. It might only have a small server in it most of the time. You'll need to define some rules so that it scales up when it gets overwhelmed. When things calm down, make sure it scales back down to one server.
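If you're on AWS, for example, a rough sketch of that setup might look like the following. The group name, launch template and subnet IDs are placeholders, and it assumes you've already created a launch template and a load balancer to put in front of the pool:

# Create an auto scaling group that idles at one instance but can grow to four
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name web-pool \
    --launch-template 'LaunchTemplateName=web-server,Version=$Latest' \
    --min-size 1 --max-size 4 --desired-capacity 1 \
    --vpc-zone-identifier "subnet-aaaa,subnet-bbbb"

# Target tracking scales out when average CPU climbs above 60%,
# and scales back in automatically when traffic calms down
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name web-pool \
    --policy-name keep-cpu-around-60 \
    --policy-type TargetTrackingScaling \
    --target-tracking-configuration '{
        "PredefinedMetricSpecification": { "PredefinedMetricType": "ASGAverageCPUUtilization" },
        "TargetValue": 60.0
    }'

With a target tracking policy like this, scaling down is just as automatic as scaling up, so the pool drifts back toward that single small server whenever it can.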

You'll sleep better at night knowing that your servers and costs are resting too.

Published on June 12th, 2024. © Jesse Skinner

Goldilocks and the Three Developers

Goldilocks was the lead of a software development team. She needed to review pull requests from three of her team members.

The first developer's code was a mess. It relied on some deprecated features of an outdated library. Its few modules were long and complex, trying to do too many different things. There were no tests, so it was impossible to be sure the code was bug-free. The architecture needed to run on a single server, so it could never scale up. There was no way to know whether it did what it was supposed to do.

The second developer's code was also a mess. The system was built using some brand new libraries and coding paradigms. It comprised a dozen different interconnected microservices. There was a very thorough test suite, testing every implementation detail. The system included infrastructure as code, but couldn't run on a single computer. There was no way to know whether it did what it was supposed to do.

The third developer's code was just right. It used the latest versions of libraries the team was familiar with. The system was split up into a dozen simple modules. It was obvious what each module did, and how it fit within the business requirements. There were a few tests for the core functionality, so she knew that it was working. The system was easy to get running, yet simple enough to scale up as far as needed. It was very easy to understand, and to know that it did what it was supposed to do.

Goldilocks then had a meeting with the three developers.

She told the first developer their code was under-engineered. She said they should take some time to simplify it and make it easier for other developers to understand and work with it.

She told the second developer their code was over-engineered. She said they should take some time to simplify it and make it easier for other developers to understand and work with it.

She said well done to the third developer and approved the pull request.

Published on June 5th, 2024. © Jesse Skinner

Unable to locate credentials in AWS

The Problem

If you have servers in AWS doing a high volume of AWS service requests, you may come across rare but frustrating credential errors like these:

"Unable to locate credentials"

or if you're using aws-sdk in Node.js:

"CredentialsProviderError: Could not load credentials from any providers"

I'm not totally sure why these errors happen, but I typically see them across multiple services, accounts and regions around the same time, which leads me to believe there can be some sporadic flakiness in the metadata service used for fetching IAM credentials.

I tried using metadata retries and other configuration parameters to prevent this, but they didn't seem to make any difference.

The Solution

Looking for a solution, I found this buried in the AWS documentation for instance metadata retrieval:

"If you're using the IMDS to retrieve AWS security credentials, avoid querying for credentials during every transaction or concurrently from a high number of threads or processes, as this might lead to throttling. Instead, we recommend that you cache the credentials until they start approaching their expiry time."

Now, I don't think this throttling was the source of all the errors I was seeing, but it may be playing a role. Maybe the metadata service's tolerance for throttling changes over time as demand changes; I don't know.

Either way, this gave me the idea to write a bash script that caches the IAM credentials in ~/.aws/credentials so they can be used by the AWS CLI as well as any Node.js or Python clients accessing AWS services:

#!/bin/bash

IMDS_URL="http://169.254.169.254/latest/meta-data/iam/security-credentials/"
AWS_CREDENTIALS_PATH="$HOME/.aws/credentials"  # use $HOME, since ~ won't expand inside quotes
PROFILE_NAME="default"

# 4.5 minutes, because new credentials appear 5 minutes before expiry
EXPIRY_BUFFER=270

get_aws_credentials() {
    local role_name=$(curl -s $IMDS_URL)
    local credentials_url="${IMDS_URL}${role_name}"
    local response=$(curl -s $credentials_url)

    local access_key_id=$(echo $response | jq -r '.AccessKeyId')
    local secret_access_key=$(echo $response | jq -r '.SecretAccessKey')
    local token=$(echo $response | jq -r '.Token')
    local expiration=$(echo $response | jq -r '.Expiration')
    local expiration_time=$(date -d "$expiration" +%s)

    echo "[$PROFILE_NAME]" > $AWS_CREDENTIALS_PATH
    echo "aws_access_key_id = $access_key_id" >> $AWS_CREDENTIALS_PATH
    echo "aws_secret_access_key = $secret_access_key" >> $AWS_CREDENTIALS_PATH
    echo "aws_session_token = $token" >> $AWS_CREDENTIALS_PATH
    echo "expiration = $expiration_time" >> $AWS_CREDENTIALS_PATH
}

should_fetch_credentials() {
    if [[ ! -f $AWS_CREDENTIALS_PATH ]]; then
        return 0
    fi

    local expiration_time=$(grep 'expiration' $AWS_CREDENTIALS_PATH | cut -d ' ' -f 3)
    local current_time=$(date +%s)

    if (( $current_time + $EXPIRY_BUFFER > $expiration_time )); then
        return 0
    fi

    return 1
}

if should_fetch_credentials; then
    get_aws_credentials
fi

Since the credentials have to be refreshed every few hours, I set the script up to run as a cron job every minute, checking whether the expiration time is approaching:

* * * * * /home/ec2-user/credentials.sh > /dev/null 2>&1
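Once the script has run at least once, a quick sanity check that the cached credentials are valid is:

aws sts get-caller-identity

If that prints your account and role ARN, the CLI (and any SDK clients using the default credential chain) will be picking up the cached file, since the shared credentials file is checked before the instance metadata service.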

Voila! No more credential errors! I hope that helps. Let me know if you've run into the same errors, and if you found this approach useful.

Published on May 30th, 2024. © Jesse Skinner

If an error is logged in the cloud, does it make a sound?

If a user sees an error message on your web server, how do you find out?

Does the user report it directly, if they're friendly enough? Do you read about it on social media, if they're frustrated enough? Or do you receive a notification directly from the server?

I remember one of my first jobs at Strato, a web hosting company in Germany. When we deployed a new version of our content management system, we'd log onto the web server and tail the logs to see if any errors appeared. We wanted to see if we broke anything.

I remember scrolling through this web server error log. I'd see hundreds of different errors from before we deployed. When I asked the team about them, nobody seemed really sure what they were. "Nobody has complained about these things," they said.

Seeing errors and warnings in the logs made me uncomfortable. Something was broken, and we weren't doing anything about it. I made it my mission to fix every error I could, and took a few days to do it. I felt that unhandled errors were unacceptable. I wanted the error log to be empty.

Ever since then, I've always set up error monitoring on any server I manage. Whenever an error is logged, I want it emailed to me immediately. I usually run a cron job every minute that scans the error log and emails me any new entries. It might strip out some common, unavoidable networking errors. But for any unexpected errors, I want to be the first to know. A minimal sketch of that kind of cron script follows below.
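Here's roughly what that script can look like if you roll it yourself. The log path, state file, noise filter and mail command are all assumptions you'd adjust for your own setup:

#!/bin/bash
# Email any new lines added to the error log since the last run.
# Assumes a working local mail command (e.g. mailx) for delivery.

ERROR_LOG="/var/log/nginx/error.log"
STATE_FILE="/var/tmp/error-log.offset"
ALERT_EMAIL="you@example.com"

# How many lines have we already reported?
last_count=0
if [[ -f "$STATE_FILE" ]]; then
    last_count=$(cat "$STATE_FILE")
fi

current_count=$(wc -l < "$ERROR_LOG")

# If the log was rotated, start over from the top
if (( current_count < last_count )); then
    last_count=0
fi

if (( current_count > last_count )); then
    # Grab only the new lines, filtering out known unavoidable noise
    new_lines=$(tail -n +"$((last_count + 1))" "$ERROR_LOG" | grep -v 'client prematurely closed connection')

    if [[ -n "$new_lines" ]]; then
        echo "$new_lines" | mail -s "New errors on $(hostname)" "$ALERT_EMAIL"
    fi
fi

echo "$current_count" > "$STATE_FILE"

Run it from cron every minute, and your inbox becomes the alarm bell.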

Whenever a user trips over an obscure bug, I know before the user has time to tell me. When things break badly, and the error log starts filling up, I know immediately.

Whether you're running Apache, Nginx or serverless functions, have you looked at your error logs lately? If not, go take a look. You might be surprised what you find.

Make it your team's goal to get those errors down to zero. Set up a cron job so those errors get sent to your inbox. Or, sign up for a log monitoring service that makes this easier for you.

Don't wait for your users to complain.

Published on May 29th, 2024. © Jesse Skinner

You don't need permission

Watercolour illustration of a construction worker building a structure

You don't need permission to write the highest quality code you can. You don't need permission to design a reliable server architecture that won't crash. You don't need permission to develop a suite of tests to ensure bugs are caught early. You don't need permission to upgrade your dependencies, to ensure your system stays secure and modern.

Your boss, manager or client will never ask you to take time to refactor your code. They'll never ask you to set up a test suite for the code you wrote. They'll never ask you to upgrade your framework.

It's not that they don't want you to do things to improve the quality of the code. It's because they expect you to write high quality code from the start. They believe you know what it takes to design a reliable system. They trust you to build web apps that last.

They'll never ask you to make sure your code has no bugs. They'll never ask you to make sure the new system doesn't crash. They'll never ask you to make sure your code will be understood by other developers. It all goes without saying.

So, don't put these things off for later. Don't put writing tests on the backlog. Don't put refactoring on your list of nice-to-haves. Don't put improving the reliability of your servers on the wishlist. Don't wait for permission.

Do these things every day. Make them a part of your process. You don't need permission.

Published on May 22nd, 2024. © Jesse Skinner