<h1>GPU ML Pipeline on AWS with Terraform</h1><p><em>From infrastructure code to GPU-accelerated ML jobs at scale</em>
<h2 id=the-problem>The Problem</h2><p>Running machine learning jobs on GPUs shouldn’t require clicking through the AWS console or writing the same CloudFormation templates over and over. I wanted something declarative, reproducible, and cheap enough that I wouldn’t think twice about tearing it down between experiments.<p>The requirements were straightforward:<ul><li>GPU compute that scales to zero when idle<li>S3 for data in/out<li>Support for both Python scripts and Jupyter notebooks<li>Spot instances by default (because GPU hours add up)<li>No manual intervention after deployment</ul><h2 id=architecture-overview>Architecture Overview</h2><p>The solution wires together five AWS services (S3, Lambda, Batch, EventBridge, and SNS) under Terraform:<pre><code>S3 Upload → Lambda → AWS Batch (GPU) → S3 Results</code></pre><p><strong>S3 buckets</strong> handle input data and store outputs. When you drop a <code>.py</code> file or <code>.ipynb</code> notebook into the input bucket, a Lambda function catches the event and submits it to AWS Batch.<p><strong>AWS Batch</strong> manages the GPU compute environment. It provisions EC2 instances from the g4dn or g5 families, runs your code in a Docker container with PyTorch and TensorFlow pre-installed, then scales back to zero when done.<p><strong>Lambda functions</strong> handle job orchestration and monitoring. One watches S3 for uploads, another monitors job status via EventBridge and sends SNS notifications.<p>The entire infrastructure lives in Terraform modules that you can configure with a few environment variables and deploy in a single apply.<h2 id=what-i-learned-building-this>What I Learned Building This</h2><h3 id=docker-images-are-heavy-but-worth-getting-right>Docker Images Are Heavy (But Worth Getting Right)</h3><p>The initial Docker image ballooned to 8GB trying to include every ML library I might need. 
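<p>For context, the Batch job definition is where an image like this gets consumed, along with the GPU request. A sketch (the job name, ECR path, and resource sizes are illustrative, not the module’s actual values):<pre><code>resource "aws_batch_job_definition" "gpu_job" {
  name                  = "ml-pipeline-gpu-job"   # illustrative name
  type                  = "container"
  platform_capabilities = ["EC2"]

  container_properties = jsonencode({
    image   = "123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-python:latest" # placeholder
    command = ["python", "train.py"]
    resourceRequirements = [
      { type = "GPU", value = "1" },      # one GPU per container
      { type = "VCPU", value = "4" },
      { type = "MEMORY", value = "16384" } # MiB
    ]
  })
}</code></pre><p>The <code>GPU</code> entry in <code>resourceRequirements</code> is what makes Batch schedule the container onto a GPU-capable instance.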
I eventually split it into two images:<ul><li><code>ml-python</code>: Production jobs (PyTorch 2.1, TensorFlow 2.15, CUDA 12.1)<li><code>ml-notebook</code>: Jupyter execution with Papermill</ul><p>Both use NVIDIA’s CUDA base images. Getting the CUDA/cuDNN versions aligned with PyTorch and TensorFlow requirements took more attempts than I’d like to admit. The key was sticking with <code>nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04</code> as the base and letting the frameworks detect GPU support automatically.<h3 id=aws-batch-has-opinions-about-launch-templates>AWS Batch Has Opinions About Launch Templates</h3><p>AWS Batch’s relationship with EC2 launch templates is… complicated. You can’t specify instance types in both the launch template and the compute environment. I went back and forth on this before settling on defining instance types directly in the compute environment configuration and letting Batch handle the orchestration.<p>The <code>instance_type</code> parameter (singular) actually accepts a list, which wasn’t obvious from the documentation. This lets Batch pick from <code>["g4dn.xlarge", "g4dn.2xlarge", "g5.xlarge"]</code> based on spot availability.<h3 id=spot-instances-change-everything>Spot Instances Change Everything</h3><p>Switching to spot instances dropped compute costs by ~70%. The trick is using <code>SPOT_CAPACITY_OPTIMIZED</code> as the allocation strategy, which picks instance types less likely to be interrupted. In practice, interruptions have been rare, and when they happen, Batch automatically retries on a different instance.<p>For production workloads that can’t tolerate interruptions, you can flip <code>use_spot_instances = false</code> in the variables and pay full price for on-demand.<h3 id=state-management-matters-early>State Management Matters Early</h3><p>I initially used local Terraform state, which was fine until I wanted to work from a different machine. 
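<p>The fix is a remote backend, and the block itself is tiny (a sketch, with an illustrative state key):<pre><code>terraform {
  backend "s3" {
    # bucket, dynamodb_table, and region are supplied at `terraform init`
    key     = "ml-pipeline/terraform.tfstate"  # illustrative path
    encrypt = true
  }
}</code></pre>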
Setting up remote state with S3 and DynamoDB locking from the start would have saved some headaches.<p>The backend configuration can’t use variables, so it’s initialized with CLI flags:<div class=sourceCode id=cb2><pre class="sourceCode bash"><code class="sourceCode bash"><span id=cb2-1><a href=#cb2-1 aria-hidden=true tabindex=-1></a><span class=ex>terraform</span> init <span class=dt>\</span></span>
<span id=cb2-2><a href=#cb2-2 aria-hidden=true tabindex=-1></a> <span class=at>-backend-config</span><span class=op>=</span><span class=st>"bucket=</span><span class=va>${PROJECT_NAME}</span><span class=st>-</span><span class=va>${ENVIRONMENT}</span><span class=st>-terraform-state"</span> <span class=dt>\</span></span>
<span id=cb2-3><a href=#cb2-3 aria-hidden=true tabindex=-1></a> <span class=at>-backend-config</span><span class=op>=</span><span class=st>"dynamodb_table=</span><span class=va>${PROJECT_NAME}</span><span class=st>-</span><span class=va>${ENVIRONMENT}</span><span class=st>-terraform-lock"</span> <span class=dt>\</span></span>
<span id=cb2-4><a href=#cb2-4 aria-hidden=true tabindex=-1></a> <span class=at>-backend-config</span><span class=op>=</span><span class=st>"region=</span><span class=va>$AWS_REGION</span><span class=st>"</span></span></code></pre></div><p>A one-time setup cost that pays off immediately when you need to collaborate or switch contexts.<h2 id=using-it>Using It</h2><p>Once deployed, the workflow is simple:<div class=sourceCode id=cb3><pre class="sourceCode bash"><code class="sourceCode bash"><span id=cb3-1><a href=#cb3-1 aria-hidden=true tabindex=-1></a><span class=co># Upload a training script</span></span>
<span id=cb3-2><a href=#cb3-2 aria-hidden=true tabindex=-1></a><span class=ex>aws</span> s3 cp train.py s3://ml-pipeline-ml-input/jobs/</span>
<span id=cb3-3><a href=#cb3-3 aria-hidden=true tabindex=-1></a></span>
<span id=cb3-4><a href=#cb3-4 aria-hidden=true tabindex=-1></a><span class=co># Check logs</span></span>
<span id=cb3-5><a href=#cb3-5 aria-hidden=true tabindex=-1></a><span class=ex>aws</span> logs tail /aws/batch/job <span class=at>--follow</span></span>
<span id=cb3-6><a href=#cb3-6 aria-hidden=true tabindex=-1></a></span>
<span id=cb3-7><a href=#cb3-7 aria-hidden=true tabindex=-1></a><span class=co># Grab results</span></span>
<span id=cb3-8><a href=#cb3-8 aria-hidden=true tabindex=-1></a><span class=ex>aws</span> s3 sync s3://ml-pipeline-ml-output/results/ ./results/</span></code></pre></div><p>The Lambda function catches the S3 upload, submits the job to Batch, and you get an SNS notification when it completes. No clicking through consoles, no manual job submission.<p>For notebooks, Papermill executes them with parameters you can pass via the filename or metadata. Useful for running the same analysis across different datasets.<h2 id=whats-next>What’s Next</h2><p>A few things I’m considering for v2:<ul><li><strong>Cost tracking:</strong> CloudWatch dashboards showing spend per job<li><strong>Multi-region support:</strong> Failover to different regions when spot capacity is tight<li><strong>Model registry:</strong> Integration with something like MLflow for versioning trained models<li><strong>Auto-cleanup:</strong> Lifecycle policies to delete old jobs and outputs</ul><p>But for now, it does what I need: run GPU jobs without thinking about infrastructure.<h2 id=code>Code</h2><p>The full Terraform module is on GitHub: <a href=https://github.com/hiram-labs/terraform-aws-ml>hiram-labs/terraform-aws-ml</a><p>If you’re running ML experiments on AWS and tired of manual setup, it might save you some time. The README has deployment instructions and examples.<h2 id=final-thoughts>Final Thoughts</h2><p>This started as a weekend project to avoid clicking through the AWS console and turned into something I actually use regularly. The infrastructure cost when idle is effectively zero (just S3 storage), and spinning up GPU instances for training happens automatically.<p>Most of the complexity is in getting Docker images right and understanding AWS Batch’s quirks. Once that’s sorted, Terraform makes the rest reproducible.