Coding with Jesse

Unable to locate credentials in AWS

The Problem

If you have servers in AWS making a high volume of AWS service requests, you may run into rare but frustrating credential errors like these:

"Unable to locate credentials"

or if you're using aws-sdk in Node.js:

"CredentialsProviderError: Could not load credentials from any providers"

I'm not totally sure why these errors happen. I typically see them across multiple services, accounts and regions around the same time, which leads me to believe there is some sporadic flakiness in the metadata service used for fetching IAM credentials.

I tried using metadata retries and other configuration parameters to prevent this, but they didn't seem to make any difference.
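For reference, the AWS CLI and boto3 read their metadata retry settings from environment variables like the ones below. The values are illustrative assumptions, and other SDKs may use different names or defaults.

```shell
# Illustrative values only (assumptions, not a recommendation)
export AWS_METADATA_SERVICE_NUM_ATTEMPTS=5   # attempts before giving up on the metadata service
export AWS_METADATA_SERVICE_TIMEOUT=2        # seconds to wait per attempt
```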

The Solution

Looking for a solution, I found this buried in the AWS documentation for instance metadata retrieval:

"If you're using the IMDS to retrieve AWS security credentials, avoid querying for credentials during every transaction or concurrently from a high number of threads or processes, as this might lead to throttling. Instead, we recommend that you cache the credentials until they start approaching their expiry time."

Now, I don't think this throttling was the source of all the errors I was seeing, but it may be playing a role. Maybe the metadata service's tolerance before it starts throttling changes over time as demand changes. I don't know.

Either way, this gave me the idea to write a bash script that caches the IAM credentials in ~/.aws/credentials, so they can be used by both the AWS CLI and any Node.js or Python clients accessing AWS services:

#!/bin/bash

IMDS_URL="http://169.254.169.254/latest/meta-data/iam/security-credentials/"
# Use $HOME here: a quoted "~" is not expanded by bash
AWS_CREDENTIALS_PATH="$HOME/.aws/credentials"
PROFILE_NAME="default"

# 4.5 minutes, because new credentials appear 5 minutes before expiry
EXPIRY_BUFFER=270

get_aws_credentials() {
    # Declare first: "local x=$(cmd)" would hide curl's exit status
    local role_name response
    # -f makes curl fail on HTTP errors, so a bad response is never cached
    role_name=$(curl -sf "$IMDS_URL") || return 1
    response=$(curl -sf "${IMDS_URL}${role_name}") || return 1

    local access_key_id=$(echo "$response" | jq -r '.AccessKeyId')
    local secret_access_key=$(echo "$response" | jq -r '.SecretAccessKey')
    local token=$(echo "$response" | jq -r '.Token')
    local expiration=$(echo "$response" | jq -r '.Expiration')
    local expiration_time=$(date -d "$expiration" +%s)

    # Write to a temporary file and rename it, so readers never see a partial file
    local tmp_path="${AWS_CREDENTIALS_PATH}.tmp"
    {
        echo "[$PROFILE_NAME]"
        echo "aws_access_key_id = $access_key_id"
        echo "aws_secret_access_key = $secret_access_key"
        echo "aws_session_token = $token"
        echo "expiration = $expiration_time"
    } > "$tmp_path"
    chmod 600 "$tmp_path"
    mv "$tmp_path" "$AWS_CREDENTIALS_PATH"
}

should_fetch_credentials() {
    if [[ ! -f $AWS_CREDENTIALS_PATH ]]; then
        return 0
    fi

    # Anchor the match and default to 0, so a missing value forces a refresh
    local expiration_time=$(grep '^expiration' "$AWS_CREDENTIALS_PATH" | cut -d ' ' -f 3)
    local current_time=$(date +%s)

    if (( current_time + EXPIRY_BUFFER > ${expiration_time:-0} )); then
        return 0
    fi

    return 1
}

if should_fetch_credentials; then
    get_aws_credentials
fi
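One caveat, depending on your setup: if an instance is configured to require IMDSv2, plain GET requests like the ones above are rejected, and a session token has to be requested first. Here is a minimal sketch of that flow, assuming curl is available. The function names and the IMDS_BASE variable are hypothetical, and the 60-second token TTL is an arbitrary choice.

```shell
#!/bin/bash
# Sketch only: fetch role credentials via the IMDSv2 session-token flow.
# IMDS_BASE and the function names are illustrative, not part of any AWS tool.
IMDS_BASE="${IMDS_BASE:-http://169.254.169.254/latest}"

imds_token() {
    # Ask for a short-lived session token; IMDSv2 requires the TTL header
    curl -sf -X PUT "$IMDS_BASE/api/token" \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 60"
}

imds_get() {
    # $1 = path under /latest, $2 = session token
    curl -sf -H "X-aws-ec2-metadata-token: $2" "$IMDS_BASE/$1"
}

fetch_role_credentials() {
    local token role
    token=$(imds_token) || return 1
    role=$(imds_get "meta-data/iam/security-credentials/" "$token") || return 1
    # Prints the JSON document with AccessKeyId, SecretAccessKey, Token, Expiration
    imds_get "meta-data/iam/security-credentials/$role" "$token"
}
```

The output of fetch_role_credentials could stand in for the response variable in get_aws_credentials, with the rest of the script unchanged.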

Since the credentials have to be refreshed every few hours, I set it up to run in a cron job every minute, to check if the expiration time has come:

* * * * * /home/ec2-user/credentials.sh > /dev/null 2>&1

Voila! No more credential errors! I hope that helps. Let me know if you've run into the same errors, and if you found this approach useful.

Published on May 30th, 2024. © Jesse Skinner

If an error is logged in the cloud, does it make a sound?

If a user sees an error message on your web server, how do you find out?

Does the user report it directly, if they're friendly enough? Do you read about it on social media, if they're frustrated enough? Or do you receive a notification directly from the server?

I remember one of my first jobs at Strato, a web hosting company in Germany. When we deployed a new version of our content management system, we'd log onto the web server and tail the logs to see if any errors appeared. We wanted to see if we broke anything.

I remember scrolling through this web server error log. I'd see hundreds of different errors from before we deployed. When I asked the team about them, they weren't really sure about them. "Nobody has complained about these things," they said.

Seeing errors and warnings in logs made me uncomfortable. Something was broken, and we weren't doing anything about it. I made it my mission and took a few days to go and fix every error I could. I felt that unhandled errors were unacceptable. I wanted the error log to be empty.

Ever since then, I've always set up error monitoring on any server I manage. Whenever an error is logged, I want it emailed to me immediately. I usually run a cron job every minute that scans the error log and emails out any new entries. It might strip out some common, unavoidable networking errors. But for any unexpected errors, I want to be the first to know.
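As a rough sketch of that kind of cron job, the function below prints only the log lines added since the previous run by remembering a byte offset in a state file. The function name, the file paths and the filter pattern are all hypothetical, and it assumes a mail command is installed.

```shell
#!/bin/bash
# Sketch only: print log lines appended since the last run.
# $1 = log file, $2 = state file holding the byte offset already processed
new_log_entries() {
    local log_file="$1" state_file="$2"
    local last_size=0 current_size
    if [ -f "$state_file" ]; then
        last_size=$(cat "$state_file")
    fi
    current_size=$(wc -c < "$log_file" | tr -d ' ')
    # A smaller file than last time probably means the log was rotated
    if [ "$current_size" -lt "$last_size" ]; then
        last_size=0
    fi
    # Emit exactly the bytes between the old offset and the size measured above
    tail -c "+$((last_size + 1))" "$log_file" | head -c "$((current_size - last_size))"
    echo "$current_size" > "$state_file"
}

# Hypothetical usage from cron (paths, pattern and address are placeholders):
#   entries=$(new_log_entries /var/log/nginx/error.log /var/tmp/error.offset \
#       | grep -v 'reset by peer')
#   [ -n "$entries" ] && echo "$entries" | mail -s "New server errors" you@example.com
```

Tracking an offset rather than timestamps keeps the script independent of the log format, at the cost of a small state file per log.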

Whenever a user trips over an obscure bug, I know before the user has time to tell me. When things break badly, and the error log starts filling up, I know immediately.

Whether you're running Apache, Nginx or serverless functions, have you looked at your error logs lately? If not, go take a look. You might be surprised what you find.

Make it your team's goal to get those errors down to zero. Set up a cron job so those errors get sent to your inbox. Or, sign up for a log monitoring service that makes this easier for you.

Don't wait for your users to complain.

Published on May 29th, 2024. © Jesse Skinner

You don't need permission

You don't need permission to write the highest quality code you can. You don't need permission to design a reliable server architecture that won't crash. You don't need permission to develop a suite of tests to ensure bugs are caught early. You don't need permission to upgrade your dependencies, to ensure your system stays secure and modern.

Your boss, manager or client will never ask you to take time to refactor your code. They'll never ask you to set up a test suite for the code you wrote. They'll never ask you to upgrade your framework.

It's not that they don't want you to do things to improve the quality of the code. It's that they expect you to write high quality code from the start. They believe you know what it takes to design a reliable system. They trust you to build web apps that last.

They'll never ask you to make sure your code has no bugs. They'll never ask you to make sure the new system doesn't crash. They'll never ask you to make sure your code will be understood by other developers. It all goes without saying.

So, don't put these things off for later. Don't put writing tests on the backlog. Don't put refactoring on your list of nice-to-haves. Don't put improving the reliability of your servers on the wishlist. Don't wait for permission.

Do these things every day. Make them a part of your process. You don't need permission.

Published on May 22nd, 2024. © Jesse Skinner