AWS ECR Auto-Login with Autoscaling EC2 Gitlab-Runner and Docker-in-Docker
That's a long title, so here is what we want to achieve:
- Autoscaling AWS EC2-based Gitlab-Runner (Spot) instances using docker+machine executors
- We're planning to use Docker-in-Docker
- We also have private AWS ECR Repositories which we want to auto-login to for each docker pull/push in our CI Pipelines
Why is this so hard?
You probably came here because you already tried yourself and you found a ton of other people suggesting solutions to this problem. And everyone probably solved it for themselves somehow but none of their solutions seem to work for you.
That was me before I got around it.
This blog post is my journey and how I solved it. I also want to share some background on how autoscaling EC2 Gitlab-Runners with the docker+machine executor work, what my solution to the ECR Auto-Login problem looks like, and why it made sense to me.
The basic setup
I don't want to go into the details of my Terraform setup that does the autoscaling on EC2, as that would make this post far longer than it needs to be. I'd rather explain the setup itself so you understand how it works.
The basic setup in general is this: https://docs.gitlab.com/runner/configuration/runner_autoscale_aws/
The concept is that you keep only one very small gitlab-runner manager instance running 24/7, which can even be a free-tier option (e.g. t3.micro, or t4g.micro since I tend to use the ARM architecture here).
You register this machine and tell it to automatically spin up bigger runner instances (e.g. c8g.medium) to do the work, then tear them down again to save money. Since those instances usually run only for a limited time, we use Spot instances, which can cut costs by 25 to 70% compared to on-demand pricing.
The key parts you need on the gitlab-runner manager
The manager instance needs several crucial components for ECR authentication to work properly:
1. Docker Machine for Instance Provisioning
The manager uses Docker Machine with the amazonec2 driver to automatically create and destroy worker instances:
# Install Docker Machine
curl -L https://github.com/docker/machine/releases/download/v0.16.2/docker-machine-$(uname -s)-$(uname -m) >/tmp/docker-machine
chmod +x /tmp/docker-machine
mv /tmp/docker-machine /usr/local/bin/docker-machine
2. SSH Key Management for Worker Communication
The manager needs to generate its own SSH key pair for communicating with worker instances:
# Generate dedicated ED25519 SSH key pair for GitLab runner machine communication
sudo -u admin ssh-keygen -t ed25519 -f /home/admin/.ssh/id_ed25519 -N "" -C "gitlab-runner-manager@${environment}"
# Upload the public key to AWS as a key pair for worker instances
aws ec2 import-key-pair \
--region ${aws_region} \
--key-name "gitlab-runner-manager-${environment}" \
--public-key-material fileb:///home/admin/.ssh/id_ed25519.pub
3. ECR Credential Helper on Manager
Install the ECR credential helper on the manager instance:
# Download and install ECR credential helper (ARM64 example)
curl -Lo /usr/local/bin/docker-credential-ecr-login https://amazon-ecr-credential-helper-releases.s3.us-east-2.amazonaws.com/0.7.1/linux-arm64/docker-credential-ecr-login
chmod +x /usr/local/bin/docker-credential-ecr-login
# Configure Docker to use ECR credential helper
mkdir -p /root/.docker
cat > /root/.docker/config.json << EOF
{
  "credHelpers": {
    "public.ecr.aws": "ecr-login",
    "<aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com": "ecr-login"
  }
}
EOF
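If you would rather not hand-edit the registry hostname, the config can be generated from the account ID and region. This is only a minimal sketch: the function names ecr_registry_host and write_docker_config are my own, and the account ID and region in the example call are placeholders, not real values.

```shell
#!/bin/bash
set -eu

# Build the private ECR registry hostname for an account and region.
ecr_registry_host() {
  local account_id="$1"
  local region="$2"
  printf '%s.dkr.ecr.%s.amazonaws.com' "$account_id" "$region"
}

# Write a Docker client config that delegates ECR auth to the credential helper.
# $1 = target file, $2 = AWS account ID, $3 = AWS region
write_docker_config() {
  local target="$1"
  local registry
  registry="$(ecr_registry_host "$2" "$3")"
  mkdir -p "$(dirname "$target")"
  cat > "$target" <<EOF
{
  "credHelpers": {
    "public.ecr.aws": "ecr-login",
    "${registry}": "ecr-login"
  }
}
EOF
}

# Example with placeholder values (123456789012 and eu-central-1 are made up).
# It writes to a scratch directory; the real target would be /root/.docker/config.json.
DEMO_DIR="$(mktemp -d)"
write_docker_config "$DEMO_DIR/config.json" 123456789012 eu-central-1
cat "$DEMO_DIR/config.json"
```

Because the file is written with a single redirect, running it twice overwrites the config instead of appending to it.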
The ECR Auto-Login Solution
The real challenge is ensuring that every dynamically created worker instance can authenticate with ECR. Here's my multi-layered approach:
Layer 1: Worker Setup Script
Create a script that will run on each freshly spawned worker instance:
cat > /home/admin/worker_setup.sh << 'WORKER_SCRIPT'
#!/bin/bash
# Worker instance setup script for ECR authentication
set -e
# Update system
apt-get update -y
# Download and install ECR credential helper
curl -Lo /usr/local/bin/docker-credential-ecr-login https://amazon-ecr-credential-helper-releases.s3.us-east-2.amazonaws.com/0.7.1/linux-arm64/docker-credential-ecr-login
chmod +x /usr/local/bin/docker-credential-ecr-login
# Add docker config
mkdir -p /root/.docker
echo '{ "credHelpers": { "public.ecr.aws": "ecr-login", "<aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com": "ecr-login" }}' > /root/.docker/config.json
echo "ECR credential helper installation completed"
WORKER_SCRIPT
Layer 2: Volume Mounting in GitLab Runner Configuration
When registering the GitLab runner, mount the ECR credential helper and config into the Docker containers. For Docker-in-Docker, privileged mode is also required:
# Add your other --machine-machine-options (instance type, region, ...) as further flags.
# A comment in the middle of the continued command would cut it off, so it lives up here.
gitlab-runner register \
  --non-interactive \
  --url "${gitlab_url}" \
  --registration-token "${gitlab_token}" \
  --executor "docker+machine" \
  --docker-image "alpine:latest" \
  --docker-privileged \
  --docker-volumes "/usr/local/bin/docker-credential-ecr-login:/usr/local/bin/docker-credential-ecr-login:ro" \
  --docker-volumes "/root/.docker/config.json:/root/.docker/config.json:ro" \
  --machine-machine-options "amazonec2-userdata=/home/admin/worker_setup.sh"
Layer 3: IAM Permissions
Ensure your worker instances have the proper IAM permissions for ECR:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    }
  ]
}
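The policy above only grants the actions needed to pull. Since the goal at the top includes docker push, the worker role probably needs the ECR upload actions as well. Here is a sketch that writes such a policy document to a file. POLICY_FILE is a scratch path picked for this example, and you may want to narrow "Resource" to specific repository ARNs for everything except ecr:GetAuthorizationToken.

```shell
#!/bin/bash
set -eu

# Write an IAM policy document that covers both pull and push.
# POLICY_FILE is a scratch path chosen for this example.
POLICY_FILE="$(mktemp)"
cat > "$POLICY_FILE" <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "*"
    }
  ]
}
EOF

echo "Policy written to $POLICY_FILE"
```

The first four actions are the pull set from above, and the last four are what a docker push uses to upload layers and the manifest.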
Key Insights and Gotchas
1. Docker-in-Docker Requirements
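When using DinD, the credential helper binary and the Docker config need to be available both on the worker host and inside the job containers, because the Docker client running inside the job is what calls the helper. The volume mounting approach ensures this.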
2. ARM64 vs AMD64 Architecture
Make sure to download the correct ECR credential helper binary for your architecture. The example uses ARM64 for Graviton instances, as they tend to give you more performance for the same money.
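To avoid hardcoding the architecture, the download URL can be derived from uname -m. This is a sketch under assumptions: helper_arch and helper_url are names I made up, and I only map the two architectures discussed here (linux-arm64 and linux-amd64), following the URL pattern used earlier in this post.

```shell
#!/bin/bash
set -eu

# Map a `uname -m` value to the directory name used in the helper's release URLs.
helper_arch() {
  case "$1" in
    aarch64|arm64) echo "linux-arm64" ;;
    x86_64|amd64)  echo "linux-amd64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Build the download URL for a helper version ($1) and machine architecture ($2).
helper_url() {
  local version="$1"
  local arch
  arch="$(helper_arch "$2")" || return 1
  echo "https://amazon-ecr-credential-helper-releases.s3.us-east-2.amazonaws.com/${version}/${arch}/docker-credential-ecr-login"
}

# Print the URL for the current machine; pass it to `curl -Lo` to download.
helper_url 0.7.1 "$(uname -m)" || echo "no known helper build for $(uname -m)" >&2
```

With this in worker_setup.sh, the same script should work for both Graviton and x86 workers.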
3. Timing Issues
The worker setup script runs during instance boot. For large images or slow network connections, you might need to add retry logic or delays. Usually the gitlab-runner manager node waits about 30 seconds for a runner to boot.
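The retry logic could look roughly like this. It is a generic helper, not something GitLab Runner or Docker Machine provides: the retry function, its argument order, and the HELPER_URL variable in the usage comments are my own choices for illustration.

```shell
#!/bin/bash
set -eu

# Run a command up to $1 times, sleeping $2 seconds between failed attempts.
# Returns 0 on the first success and 1 if every attempt failed.
retry() {
  local attempts="$1"
  local delay="$2"
  local n=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s: $*" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# In worker_setup.sh this could wrap the network-dependent steps, for example:
#   retry 5 10 apt-get update -y
#   retry 5 10 curl -fLo /usr/local/bin/docker-credential-ecr-login "$HELPER_URL"
```

Note the -f on curl in the usage comment: without it, curl exits successfully on an HTTP error page and the retry would never trigger.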
Testing Your Setup
To verify ECR authentication is working:
- Check manager instance:
docker pull <aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com/your-image:tag
- Monitor the runner logs on the manager (this is where worker creation shows up):
sudo journalctl -u gitlab-runner -f
- Test in a CI job: Create a simple .gitlab-ci.yml that pulls from ECR:
test_ecr:
  image: <aws_account_id>.dkr.ecr.<aws_region>.amazonaws.com/your-image:latest
  script:
    - echo "ECR authentication successful!"
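When a pull fails, it helps to first check the two things the setup depends on: the helper binary is on the PATH, and the Docker config maps your registry to it. Below is a small sketch you could run on the manager, a worker, or inside a job. The function names and the REGISTRY and CONFIG variables are my own, and the grep assumes the key and value are written with a single space after the colon, as in the configs in this post.

```shell
#!/bin/bash
set -eu

# Return 0 if the Docker config at $1 maps registry $2 to the ecr-login helper.
config_uses_ecr_helper() {
  local config="$1"
  local registry="$2"
  [ -f "$config" ] && grep -Fq "\"$registry\": \"ecr-login\"" "$config"
}

# Return 0 if the helper binary can be found on the PATH.
helper_installed() {
  command -v docker-credential-ecr-login >/dev/null 2>&1
}

# Report both checks; override REGISTRY and CONFIG from the environment as needed.
REGISTRY="${REGISTRY:-public.ecr.aws}"
CONFIG="${CONFIG:-/root/.docker/config.json}"

if helper_installed; then
  echo "helper: found"
else
  echo "helper: MISSING"
fi

if config_uses_ecr_helper "$CONFIG" "$REGISTRY"; then
  echo "config: $REGISTRY uses ecr-login"
else
  echo "config: $REGISTRY is NOT mapped to ecr-login in $CONFIG"
fi
```

If both checks pass and pulls still fail, the IAM permissions of the instance role are the next thing I would look at.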
Conclusion
The key to solving ECR auto-login with autoscaling GitLab runners is understanding that you need ECR authentication at multiple levels:
- Manager level: For managing the runner itself
- Worker level: For the actual CI jobs
- Container level: For Docker-in-Docker scenarios
By implementing the worker setup script, proper volume mounting, and ensuring correct IAM permissions, you can achieve seamless ECR authentication across your entire autoscaling CI infrastructure.
This solution has been successfully tested in production environments and handles the dynamic nature of autoscaling runners.