GPU ML Pipeline on AWS with Terraform | From infrastructure code to GPU-accelerated ML jobs at scale
I built this Terraform-managed AWS ML pipeline to run GPU and CPU jobs on demand without paying for idle compute. In this post, I walk through the architecture, the key design choices, and the operational pitfalls to watch for before using this pattern in production.
<details id="toc-wrapper" style="border: 1px solid var(--primary-color, #444444); padding: 0.75rem 1rem; margin: 1rem 0; border-radius: 0.25em;">
<summary style="cursor: pointer; font-weight: 600;">Table of Contents</summary>
<nav id="TOC" role="doc-toc" style="margin-top: 0.75rem; text-align: left;">
<ul>
<li><a href="#introduction" id="toc-introduction"><span class="toc-section-number">1</span> Introduction</a></li>
<li><a href="#architecture" id="toc-architecture"><span class="toc-section-number">2</span> Architecture</a></li>
<li><a href="#trigger-flow" id="toc-trigger-flow"><span class="toc-section-number">3</span> Trigger Flow</a></li>
<li><a href="#compute" id="toc-compute"><span class="toc-section-number">4</span> Compute</a></li>
<li><a href="#docker-images" id="toc-docker-images"><span class="toc-section-number">5</span> Docker Images</a></li>
<li><a href="#job-scripts" id="toc-job-scripts"><span class="toc-section-number">6</span> Job Scripts</a></li>
<li><a href="#admin-tools" id="toc-admin-tools"><span class="toc-section-number">7</span> Admin Tools</a></li>
<li><a href="#state-and-backend" id="toc-state-and-backend"><span class="toc-section-number">8</span> State and Backend</a></li>
<li><a href="#gpu-quotas" id="toc-gpu-quotas"><span class="toc-section-number">9</span> GPU Quotas</a></li>
<li><a href="#code" id="toc-code"><span class="toc-section-number">10</span> Code</a></li>
</ul>
</nav>
</details>
<h1 data-number="1" id="introduction"><span class="header-section-number">1</span> Introduction</h1>
<p>This project is Terraform-managed infrastructure for running GPU and CPU ML workloads on AWS. You publish a message to an SNS topic, a Lambda function routes it to AWS Batch, and Batch launches an EC2 instance and runs your code in a Docker container. The container uploads its outputs to S3, and Batch terminates the instance once the queue is empty. When there are no jobs, the compute environments scale to zero vCPUs and incur no compute cost.</p>
<h1 data-number="2" id="architecture"><span class="header-section-number">2</span> Architecture</h1>
<img width="100%" src="data:image/svg+xml;base64,<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
 "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<!-- Generated by graphviz version 2.43.0 (0)
 -->
<!-- Title: AWS_ML_Pipeline Pages: 1 -->
<svg width="1080pt" height="612pt"
 viewBox="0.00 0.00 1079.80 611.80" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(14.4 597.4)">
<title>AWS_ML_Pipeline</title>
<polygon fill="white" stroke="transparent" points="-14.4,14.4 -14.4,-597.4 1065.4,-597.4 1065.4,14.4 -14.4,14.4"/>
<g id="clust1" class="cluster">
<title>cluster_terraform</title>
<path fill="white" stroke="#cba3bc" d="M355,-448C355,-448 1031,-448 1031,-448 1037,-448 1043,-454 1043,-460 1043,-460 1043,-508 1043,-508 1043,-514 1037,-520 1031,-520 1031,-520 355,-520 355,-520 349,-520 343,-514 343,-508 343,-508 343,-460 343,-460 343,-454 349,-448 355,-448"/>
<text text-anchor="middle" x="693" y="-507.2" font-family="Helvetica,sans-Serif" font-size="11.00">Terraform Managed Infrastructure</text>
</g>
<!-- user -->
<g id="node1" class="node">
<title>user</title>
<path fill="#e9f3ff" stroke="#222222" d="M314,-583C314,-583 238,-583 238,-583 232,-583 226,-577 226,-571 226,-571 226,-559 226,-559 226,-553 232,-547 238,-547 238,-547 314,-547 314,-547 320,-547 326,-553 326,-559 326,-559 326,-571 326,-571 326,-577 320,-583 314,-583"/>
<text text-anchor="middle" x="276" y="-562.5" font-family="Helvetica,sans-Serif" font-size="10.00">User or Admin UI</text>
</g>
<!-- trig_topic -->
<g id="node2" class="node">
<title>trig_topic</title>
<path fill="#e9f3ff" stroke="#222222" d="M314,-492C314,-492 238,-492 238,-492 232,-492 226,-486 226,-480 226,-480 226,-468 226,-468 226,-462 232,-456 238,-456 238,-456 314,-456 314,-456 320,-456 326,-462 326,-468 326,-468 326,-480 326,-480 326,-486 320,-492 314,-492"/>
<text text-anchor="middle" x="276" y="-471.5" font-family="Helvetica,sans-Serif" font-size="10.00">SNS Trigger Topic</text>
</g>
<!-- user&#45;&gt;trig_topic -->
<g id="edge1" class="edge">
<title>user&#45;&gt;trig_topic</title>
<path fill="none" stroke="#444444" d="M276,-546.84C276,-533.3 276,-514.24 276,-499.15"/>
<polygon fill="#444444" stroke="#444444" points="278.45,-499.11 276,-492.11 273.55,-499.11 278.45,-499.11"/>
</g>
<!-- dispatcher -->
<g id="node3" class="node">
<title>dispatcher</title>
<path fill="#e9f3ff" stroke="#222222" d="M731.5,-419C731.5,-419 644.5,-419 644.5,-419 638.5,-419 632.5,-413 632.5,-407 632.5,-407 632.5,-395 632.5,-395 632.5,-389 638.5,-383 644.5,-383 644.5,-383 731.5,-383 731.5,-383 737.5,-383 743.5,-389 743.5,-395 743.5,-395 743.5,-407 743.5,-407 743.5,-413 737.5,-419 731.5,-419"/>
<text text-anchor="middle" x="688" y="-398.5" font-family="Helvetica,sans-Serif" font-size="10.00">Lambda Dispatcher</text>
</g>
<!-- trig_topic&#45;&gt;dispatcher -->
<g id="edge2" class="edge">
<title>trig_topic&#45;&gt;dispatcher</title>
<path fill="none" stroke="#444444" d="M313.22,-455.97C320.96,-452.94 329.15,-450.09 337,-448 435.77,-421.76 554.33,-410.2 625.29,-405.32"/>
<polygon fill="#444444" stroke="#444444" points="625.61,-407.76 632.43,-404.84 625.28,-402.87 625.61,-407.76"/>
</g>
<!-- gpu_queue -->
<g id="node4" class="node">
<title>gpu_queue</title>
<path fill="#ecf9ef" stroke="#222222" d="M726.5,-337C726.5,-337 647.5,-337 647.5,-337 641.5,-337 635.5,-331 635.5,-325 635.5,-325 635.5,-313 635.5,-313 635.5,-307 641.5,-301 647.5,-301 647.5,-301 726.5,-301 726.5,-301 732.5,-301 738.5,-307 738.5,-313 738.5,-313 738.5,-325 738.5,-325 738.5,-331 732.5,-337 726.5,-337"/>
<text text-anchor="middle" x="687" y="-316.5" font-family="Helvetica,sans-Serif" font-size="10.00">Batch GPU Queue</text>
</g>
<!-- dispatcher&#45;&gt;gpu_queue -->
<g id="edge3" class="edge">
<title>dispatcher&#45;&gt;gpu_queue</title>
<path fill="none" stroke="#444444" d="M646.36,-382.89C639.41,-378.14 633.13,-372.24 629,-365 623.82,-355.91 627.93,-348.03 635.89,-341.51"/>
<polygon fill="#444444" stroke="#444444" points="637.66,-343.26 641.95,-337.21 634.83,-339.26 637.66,-343.26"/>
<text text-anchor="middle" x="674.5" y="-357.8" font-family="Helvetica,sans-Serif" font-size="9.00">compute_type=gpu</text>
</g>
<!-- cpu_queue -->
<g id="node5" class="node">
<title>cpu_queue</title>
<path fill="#ecf9ef" stroke="#222222" d="M854,-337C854,-337 776,-337 776,-337 770,-337 764,-331 764,-325 764,-325 764,-313 764,-313 764,-307 770,-301 776,-301 776,-301 854,-301 854,-301 860,-301 866,-307 866,-313 866,-313 866,-325 866,-325 866,-331 860,-337 854,-337"/>
<text text-anchor="middle" x="815" y="-316.5" font-family="Helvetica,sans-Serif" font-size="10.00">Batch CPU Queue</text>
</g>
<!-- dispatcher&#45;&gt;cpu_queue -->
<g id="edge4" class="edge">
<title>dispatcher&#45;&gt;cpu_queue</title>
<path fill="none" stroke="#444444" d="M715.24,-382.84C734.75,-370.55 761.11,-353.95 781.84,-340.89"/>
<polygon fill="#444444" stroke="#444444" points="783.18,-342.94 787.8,-337.13 780.57,-338.79 783.18,-342.94"/>
<text text-anchor="middle" x="804.5" y="-357.8" font-family="Helvetica,sans-Serif" font-size="9.00">compute_type=cpu</text>
</g>
<!-- gpu_ce -->
<g id="node6" class="node">
<title>gpu_ce</title>
<path fill="#ecf9ef" stroke="#222222" d="M679,-264C679,-264 577,-264 577,-264 571,-264 565,-258 565,-252 565,-252 565,-240 565,-240 565,-234 571,-228 577,-228 577,-228 679,-228 679,-228 685,-228 691,-234 691,-240 691,-240 691,-252 691,-252 691,-258 685,-264 679,-264"/>
<text text-anchor="middle" x="628" y="-249" font-family="Helvetica,sans-Serif" font-size="10.00">AWS Batch GPU</text>
<text text-anchor="middle" x="628" y="-238" font-family="Helvetica,sans-Serif" font-size="10.00">Compute Environment</text>
</g>
<!-- gpu_queue&#45;&gt;gpu_ce -->
<g id="edge5" class="edge">
<title>gpu_queue&#45;&gt;gpu_ce</title>
<path fill="none" stroke="#444444" d="M672.72,-300.81C664.89,-291.39 655.1,-279.61 646.67,-269.47"/>
<polygon fill="#444444" stroke="#444444" points="648.51,-267.85 642.15,-264.03 644.74,-270.98 648.51,-267.85"/>
</g>
<!-- cpu_ce -->
<g id="node7" class="node">
<title>cpu_ce</title>
<path fill="#ecf9ef" stroke="#222222" d="M830,-264C830,-264 728,-264 728,-264 722,-264 716,-258 716,-252 716,-252 716,-240 716,-240 716,-234 722,-228 728,-228 728,-228 830,-228 830,-228 836,-228 842,-234 842,-240 842,-240 842,-252 842,-252 842,-258 836,-264 830,-264"/>
<text text-anchor="middle" x="779" y="-249" font-family="Helvetica,sans-Serif" font-size="10.00">AWS Batch CPU</text>
<text text-anchor="middle" x="779" y="-238" font-family="Helvetica,sans-Serif" font-size="10.00">Compute Environment</text>
</g>
<!-- cpu_queue&#45;&gt;cpu_ce -->
<g id="edge6" class="edge">
<title>cpu_queue&#45;&gt;cpu_ce</title>
<path fill="none" stroke="#444444" d="M806.29,-300.81C801.64,-291.66 795.86,-280.26 790.82,-270.32"/>
<polygon fill="#444444" stroke="#444444" points="792.99,-269.16 787.63,-264.03 788.61,-271.38 792.99,-269.16"/>
</g>
<!-- gpu_job -->
<g id="node8" class="node">
<title>gpu_job</title>
<path fill="#ecf9ef" stroke="#222222" d="M359.5,-191C359.5,-191 266.5,-191 266.5,-191 260.5,-191 254.5,-185 254.5,-179 254.5,-179 254.5,-167 254.5,-167 254.5,-161 260.5,-155 266.5,-155 266.5,-155 359.5,-155 359.5,-155 365.5,-155 371.5,-161 371.5,-167 371.5,-167 371.5,-179 371.5,-179 371.5,-185 365.5,-191 359.5,-191"/>
<text text-anchor="middle" x="313" y="-176" font-family="Helvetica,sans-Serif" font-size="10.00">Container Entrypoint</text>
<text text-anchor="middle" x="313" y="-165" font-family="Helvetica,sans-Serif" font-size="10.00">+ Job Script</text>
</g>
<!-- gpu_ce&#45;&gt;gpu_job -->
<g id="edge7" class="edge">
<title>gpu_ce&#45;&gt;gpu_job</title>
<path fill="none" stroke="#444444" d="M564.63,-230.72C510.62,-218.54 433.28,-201.11 378.63,-188.79"/>
<polygon fill="#444444" stroke="#444444" points="379.11,-186.39 371.74,-187.24 378.03,-191.17 379.11,-186.39"/>
</g>
<!-- eb -->
<g id="node15" class="node">
<title>eb</title>
<path fill="#e9f3ff" stroke="#222222" d="M761,-191C761,-191 667,-191 667,-191 661,-191 655,-185 655,-179 655,-179 655,-167 655,-167 655,-161 661,-155 667,-155 667,-155 761,-155 761,-155 767,-155 773,-161 773,-167 773,-167 773,-179 773,-179 773,-185 767,-191 761,-191"/>
<text text-anchor="middle" x="714" y="-176" font-family="Helvetica,sans-Serif" font-size="10.00">EventBridge</text>
<text text-anchor="middle" x="714" y="-165" font-family="Helvetica,sans-Serif" font-size="10.00">Batch State Changes</text>
</g>
<!-- gpu_ce&#45;&gt;eb -->
<g id="edge17" class="edge">
<title>gpu_ce&#45;&gt;eb</title>
<path fill="none" stroke="#444444" d="M648.82,-227.81C660.54,-218.13 675.3,-205.95 687.81,-195.63"/>
<polygon fill="#444444" stroke="#444444" points="689.54,-197.37 693.37,-191.03 686.42,-193.6 689.54,-197.37"/>
</g>
<!-- cpu_job -->
<g id="node9" class="node">
<title>cpu_job</title>
<path fill="#ecf9ef" stroke="#222222" d="M566.5,-191C566.5,-191 473.5,-191 473.5,-191 467.5,-191 461.5,-185 461.5,-179 461.5,-179 461.5,-167 461.5,-167 461.5,-161 467.5,-155 473.5,-155 473.5,-155 566.5,-155 566.5,-155 572.5,-155 578.5,-161 578.5,-167 578.5,-167 578.5,-179 578.5,-179 578.5,-185 572.5,-191 566.5,-191"/>
<text text-anchor="middle" x="520" y="-176" font-family="Helvetica,sans-Serif" font-size="10.00">Container Entrypoint</text>
<text text-anchor="middle" x="520" y="-165" font-family="Helvetica,sans-Serif" font-size="10.00">+ Job Script</text>
</g>
<!-- cpu_ce&#45;&gt;cpu_job -->
<g id="edge8" class="edge">
<title>cpu_ce&#45;&gt;cpu_job</title>
<path fill="none" stroke="#444444" d="M716.96,-227.99C677.35,-217.13 626.06,-203.07 585.74,-192.02"/>
<polygon fill="#444444" stroke="#444444" points="586.26,-189.62 578.86,-190.14 584.97,-194.35 586.26,-189.62"/>
</g>
<!-- cpu_ce&#45;&gt;eb -->
<g id="edge18" class="edge">
<title>cpu_ce&#45;&gt;eb</title>
<path fill="none" stroke="#444444" d="M763.27,-227.81C754.64,-218.39 743.85,-206.61 734.57,-196.47"/>
<polygon fill="#444444" stroke="#444444" points="736.12,-194.54 729.59,-191.03 732.51,-197.85 736.12,-194.54"/>
</g>
<!-- s3_input -->
<g id="node10" class="node">
<title>s3_input</title>
<path fill="#fff4dc" stroke="#222222" d="M247,-109C247,-109 179,-109 179,-109 173,-109 167,-103 167,-97 167,-97 167,-85 167,-85 167,-79 173,-73 179,-73 179,-73 247,-73 247,-73 253,-73 259,-79 259,-85 259,-85 259,-97 259,-97 259,-103 253,-109 247,-109"/>
<text text-anchor="middle" x="213" y="-88.5" font-family="Helvetica,sans-Serif" font-size="10.00">S3 Input Bucket</text>
</g>
<!-- gpu_job&#45;&gt;s3_input -->
<g id="edge9" class="edge">
<title>gpu_job&#45;&gt;s3_input</title>
<path fill="none" stroke="#444444" d="M291.3,-154.64C276.26,-142.61 256.12,-126.5 240.04,-113.63"/>
<polygon fill="#444444" stroke="#444444" points="241.31,-111.51 234.31,-109.05 238.25,-115.34 241.31,-111.51"/>
</g>
<!-- s3_output -->
<g id="node11" class="node">
<title>s3_output</title>
<path fill="#fff4dc" stroke="#222222" d="M465.5,-36C465.5,-36 388.5,-36 388.5,-36 382.5,-36 376.5,-30 376.5,-24 376.5,-24 376.5,-12 376.5,-12 376.5,-6 382.5,0 388.5,0 388.5,0 465.5,0 465.5,0 471.5,0 477.5,-6 477.5,-12 477.5,-12 477.5,-24 477.5,-24 477.5,-30 471.5,-36 465.5,-36"/>
<text text-anchor="middle" x="427" y="-15.5" font-family="Helvetica,sans-Serif" font-size="10.00">S3 Output Bucket</text>
</g>
<!-- gpu_job&#45;&gt;s3_output -->
<g id="edge11" class="edge">
<title>gpu_job&#45;&gt;s3_output</title>
<path fill="none" stroke="#444444" d="M310.13,-154.96C307.54,-133.82 306.31,-97.59 323,-73 334.19,-56.51 352.05,-44.73 369.83,-36.47"/>
<polygon fill="#444444" stroke="#444444" points="370.95,-38.66 376.37,-33.59 368.97,-34.17 370.95,-38.66"/>
</g>
<!-- s3_models -->
<g id="node12" class="node">
<title>s3_models</title>
<path fill="#fff4dc" stroke="#222222" d="M129.5,-109C129.5,-109 50.5,-109 50.5,-109 44.5,-109 38.5,-103 38.5,-97 38.5,-97 38.5,-85 38.5,-85 38.5,-79 44.5,-73 50.5,-73 50.5,-73 129.5,-73 129.5,-73 135.5,-73 141.5,-79 141.5,-85 141.5,-85 141.5,-97 141.5,-97 141.5,-103 135.5,-109 129.5,-109"/>
<text text-anchor="middle" x="90" y="-88.5" font-family="Helvetica,sans-Serif" font-size="10.00">S3 Models Bucket</text>
</g>
<!-- gpu_job&#45;&gt;s3_models -->
<g id="edge13" class="edge">
<title>gpu_job&#45;&gt;s3_models</title>
<path fill="none" stroke="#444444" d="M265.43,-154.94C229.91,-142.19 181.31,-124.76 144.27,-111.47"/>
<polygon fill="#444444" stroke="#444444" points="144.91,-109.1 137.49,-109.04 143.25,-113.71 144.91,-109.1"/>
</g>
<!-- s3_vault -->
<g id="node13" class="node">
<title>s3_vault</title>
<path fill="#fff4dc" stroke="#222222" d="M536,-109C536,-109 468,-109 468,-109 462,-109 456,-103 456,-97 456,-97 456,-85 456,-85 456,-79 462,-73 468,-73 468,-73 536,-73 536,-73 542,-73 548,-79 548,-85 548,-85 548,-97 548,-97 548,-103 542,-109 536,-109"/>
<text text-anchor="middle" x="502" y="-88.5" font-family="Helvetica,sans-Serif" font-size="10.00">S3 Vault Bucket</text>
</g>
<!-- gpu_job&#45;&gt;s3_vault -->
<g id="edge14" class="edge">
<title>gpu_job&#45;&gt;s3_vault</title>
<path fill="none" stroke="#444444" d="M368.38,-154.93C383.97,-149.66 400.81,-143.51 416,-137 432.67,-129.86 450.52,-120.77 465.62,-112.63"/>
<polygon fill="#444444" stroke="#444444" points="466.93,-114.7 471.91,-109.21 464.59,-110.4 466.93,-114.7"/>
</g>
<!-- efs_cache -->
<g id="node14" class="node">
<title>efs_cache</title>
<path fill="#fff4dc" stroke="#222222" d="M418.5,-109C418.5,-109 347.5,-109 347.5,-109 341.5,-109 335.5,-103 335.5,-97 335.5,-97 335.5,-85 335.5,-85 335.5,-79 341.5,-73 347.5,-73 347.5,-73 418.5,-73 418.5,-73 424.5,-73 430.5,-79 430.5,-85 430.5,-85 430.5,-97 430.5,-97 430.5,-103 424.5,-109 418.5,-109"/>
<text text-anchor="middle" x="383" y="-88.5" font-family="Helvetica,sans-Serif" font-size="10.00">EFS /opt/models</text>
</g>
<!-- gpu_job&#45;&gt;efs_cache -->
<g id="edge16" class="edge">
<title>gpu_job&#45;&gt;efs_cache</title>
<path fill="none" stroke="#444444" d="M332.82,-149.35C342.36,-138.45 353.76,-125.41 363.29,-114.53"/>
<polygon fill="#444444" stroke="#444444" points="330.95,-147.76 328.19,-154.64 334.64,-150.99 330.95,-147.76"/>
<polygon fill="#444444" stroke="#444444" points="365.32,-115.93 368.08,-109.05 361.63,-112.7 365.32,-115.93"/>
<text text-anchor="middle" x="381" y="-129.8" font-family="Helvetica,sans-Serif" font-size="9.00">model cache</text>
</g>
<!-- cpu_job&#45;&gt;s3_input -->
<g id="edge10" class="edge">
<title>cpu_job&#45;&gt;s3_input</title>
<path fill="none" stroke="#444444" d="M461.22,-162.23C427.05,-156.06 383.3,-147.31 345,-137 318.18,-129.78 288.89,-119.99 264.73,-111.39"/>
<polygon fill="#444444" stroke="#444444" points="265.48,-109.06 258.06,-109 263.82,-113.67 265.48,-109.06"/>
</g>
<!-- cpu_job&#45;&gt;s3_output -->
<g id="edge12" class="edge">
<title>cpu_job&#45;&gt;s3_output</title>
<path fill="none" stroke="#444444" d="M536.76,-154.86C554.81,-134.27 578.7,-99.21 561,-73 543.92,-47.7 512.8,-34.26 484.85,-27.11"/>
<polygon fill="#444444" stroke="#444444" points="485.2,-24.67 477.82,-25.42 484.06,-29.44 485.2,-24.67"/>
</g>
<!-- cpu_job&#45;&gt;s3_vault -->
<g id="edge15" class="edge">
<title>cpu_job&#45;&gt;s3_vault</title>
<path fill="none" stroke="#444444" d="M516.09,-154.64C513.55,-143.35 510.2,-128.46 507.41,-116.04"/>
<polygon fill="#444444" stroke="#444444" points="509.76,-115.34 505.84,-109.05 504.98,-116.42 509.76,-115.34"/>
</g>
<!-- monitor -->
<g id="node16" class="node">
<title>monitor</title>
<path fill="#e9f3ff" stroke="#222222" d="M729,-109C729,-109 657,-109 657,-109 651,-109 645,-103 645,-97 645,-97 645,-85 645,-85 645,-79 651,-73 657,-73 657,-73 729,-73 729,-73 735,-73 741,-79 741,-85 741,-85 741,-97 741,-97 741,-103 735,-109 729,-109"/>
<text text-anchor="middle" x="693" y="-88.5" font-family="Helvetica,sans-Serif" font-size="10.00">Lambda Monitor</text>
</g>
<!-- eb&#45;&gt;monitor -->
<g id="edge19" class="edge">
<title>eb&#45;&gt;monitor</title>
<path fill="none" stroke="#444444" d="M709.44,-154.64C706.48,-143.35 702.57,-128.46 699.31,-116.04"/>
<polygon fill="#444444" stroke="#444444" points="701.62,-115.2 697.48,-109.05 696.88,-116.44 701.62,-115.2"/>
</g>
<!-- monitor&#45;&gt;s3_output -->
<g id="edge21" class="edge">
<title>monitor&#45;&gt;s3_output</title>
<path fill="none" stroke="#444444" d="M655.89,-72.95C640.41,-66.33 622.08,-59.12 605,-54 565.57,-42.18 520,-33.31 484.84,-27.45"/>
<polygon fill="#444444" stroke="#444444" points="484.81,-24.96 477.51,-26.25 484.02,-29.8 484.81,-24.96"/>
</g>
<!-- notify_topic -->
<g id="node17" class="node">
<title>notify_topic</title>
<path fill="#e9f3ff" stroke="#222222" d="M742.5,-36C742.5,-36 643.5,-36 643.5,-36 637.5,-36 631.5,-30 631.5,-24 631.5,-24 631.5,-12 631.5,-12 631.5,-6 637.5,0 643.5,0 643.5,0 742.5,0 742.5,0 748.5,0 754.5,-6 754.5,-12 754.5,-12 754.5,-24 754.5,-24 754.5,-30 748.5,-36 742.5,-36"/>
<text text-anchor="middle" x="693" y="-15.5" font-family="Helvetica,sans-Serif" font-size="10.00">SNS Notification Topic</text>
</g>
<!-- monitor&#45;&gt;notify_topic -->
<g id="edge20" class="edge">
<title>monitor&#45;&gt;notify_topic</title>
<path fill="none" stroke="#444444" d="M693,-72.81C693,-63.92 693,-52.91 693,-43.17"/>
<polygon fill="#444444" stroke="#444444" points="695.45,-43.03 693,-36.03 690.55,-43.03 695.45,-43.03"/>
</g>
<!-- vpc -->
<g id="node18" class="node">
<title>vpc</title>
<path fill="#ffeff6" stroke="#222222" d="M590,-492C590,-492 458,-492 458,-492 452,-492 446,-486 446,-480 446,-480 446,-468 446,-468 446,-462 452,-456 458,-456 458,-456 590,-456 590,-456 596,-456 602,-462 602,-468 602,-468 602,-480 602,-480 602,-486 596,-492 590,-492"/>
<text text-anchor="middle" x="524" y="-471.5" font-family="Helvetica,sans-Serif" font-size="10.00">VPC + Public Subnets + IGW</text>
</g>
<!-- vpc&#45;&gt;gpu_ce -->
<g id="edge22" class="edge">
<title>vpc&#45;&gt;gpu_ce</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M525.04,-455.79C527.95,-420.19 538.78,-337.2 578,-282 582.91,-275.09 589.63,-269.08 596.56,-264.08"/>
</g>
<!-- vpc&#45;&gt;cpu_ce -->
<g id="edge23" class="edge">
<title>vpc&#45;&gt;cpu_ce</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M532.92,-455.98C554.63,-414.77 609.46,-312.2 623,-301 637.69,-288.86 679.22,-274.75 715.74,-264.01"/>
</g>
<!-- iam -->
<g id="node19" class="node">
<title>iam</title>
<path fill="#ffeff6" stroke="#222222" d="M1022.5,-492C1022.5,-492 929.5,-492 929.5,-492 923.5,-492 917.5,-486 917.5,-480 917.5,-480 917.5,-468 917.5,-468 917.5,-462 923.5,-456 929.5,-456 929.5,-456 1022.5,-456 1022.5,-456 1028.5,-456 1034.5,-462 1034.5,-468 1034.5,-468 1034.5,-480 1034.5,-480 1034.5,-486 1028.5,-492 1022.5,-492"/>
<text text-anchor="middle" x="976" y="-471.5" font-family="Helvetica,sans-Serif" font-size="10.00">IAM Roles + Policies</text>
</g>
<!-- iam&#45;&gt;dispatcher -->
<g id="edge24" class="edge">
<title>iam&#45;&gt;dispatcher</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M930.45,-455.98C922.05,-453.12 913.3,-450.33 905,-448 850.87,-432.8 787.9,-419.98 743.78,-411.76"/>
</g>
<!-- iam&#45;&gt;gpu_ce -->
<g id="edge26" class="edge">
<title>iam&#45;&gt;gpu_ce</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M971.52,-455.9C961.4,-420.86 933.32,-340.84 879,-301 814.9,-253.98 781.68,-280.9 704,-264 699.77,-263.08 695.41,-262.12 691.02,-261.16"/>
</g>
<!-- iam&#45;&gt;cpu_ce -->
<g id="edge27" class="edge">
<title>iam&#45;&gt;cpu_ce</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M980.73,-455.87C989.79,-418.94 1005.01,-331.22 961,-282 945.34,-264.49 888.35,-255.57 842.26,-251.13"/>
</g>
<!-- iam&#45;&gt;monitor -->
<g id="edge25" class="edge">
<title>iam&#45;&gt;monitor</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M1007.21,-455.85C1025.09,-443.53 1044,-425.04 1044,-402 1044,-402 1044,-402 1044,-172 1044,-110.59 837.2,-96.32 741.07,-93"/>
</g>
<!-- batch_defs -->
<g id="node20" class="node">
<title>batch_defs</title>
<path fill="#ffeff6" stroke="#222222" d="M880,-492C880,-492 786,-492 786,-492 780,-492 774,-486 774,-480 774,-480 774,-468 774,-468 774,-462 780,-456 786,-456 786,-456 880,-456 880,-456 886,-456 892,-462 892,-468 892,-468 892,-480 892,-480 892,-486 886,-492 880,-492"/>
<text text-anchor="middle" x="833" y="-471.5" font-family="Helvetica,sans-Serif" font-size="10.00">Batch Job Definitions</text>
</g>
<!-- batch_defs&#45;&gt;gpu_queue -->
<g id="edge28" class="edge">
<title>batch_defs&#45;&gt;gpu_queue</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M841.15,-455.95C851.84,-430.84 866.62,-383.75 843,-355 816.83,-323.14 792.18,-346.22 752,-337 747.68,-336.01 743.21,-334.94 738.73,-333.83"/>
</g>
<!-- batch_defs&#45;&gt;cpu_queue -->
<g id="edge29" class="edge">
<title>batch_defs&#45;&gt;cpu_queue</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M847.73,-455.75C866.12,-432 893,-388.36 874,-355 869.87,-347.76 863.59,-341.86 856.64,-337.11"/>
</g>
<!-- buckets -->
<g id="node21" class="node">
<title>buckets</title>
<path fill="#ffeff6" stroke="#222222" d="M409,-492C409,-492 363,-492 363,-492 357,-492 351,-486 351,-480 351,-480 351,-468 351,-468 351,-462 357,-456 363,-456 363,-456 409,-456 409,-456 415,-456 421,-462 421,-468 421,-468 421,-480 421,-480 421,-486 415,-492 409,-492"/>
<text text-anchor="middle" x="386" y="-471.5" font-family="Helvetica,sans-Serif" font-size="10.00">S3 Buckets</text>
</g>
<!-- buckets&#45;&gt;s3_input -->
<g id="edge30" class="edge">
<title>buckets&#45;&gt;s3_input</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M355.5,-455.87C350.08,-453.09 344.43,-450.35 339,-448 280.83,-422.81 204,-465.39 204,-402 204,-402 204,-402 204,-172 204,-150.2 207.34,-125.26 209.95,-109.1"/>
</g>
<!-- buckets&#45;&gt;s3_output -->
<g id="edge31" class="edge">
<title>buckets&#45;&gt;s3_output</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M358.48,-455.92C352.27,-452.77 345.57,-449.89 339,-448 265.92,-427.04 0,-478.02 0,-402 0,-402 0,-402 0,-90 0,-52.24 263.23,-30 376.34,-22.19"/>
</g>
<!-- buckets&#45;&gt;s3_models -->
<g id="edge32" class="edge">
<title>buckets&#45;&gt;s3_models</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M358.2,-455.87C352.07,-452.77 345.47,-449.92 339,-448 291.73,-434.01 122,-451.29 122,-402 122,-402 122,-402 122,-172 122,-148.96 110.42,-124.69 101.19,-109.01"/>
</g>
<!-- buckets&#45;&gt;s3_vault -->
<g id="edge33" class="edge">
<title>buckets&#45;&gt;s3_vault</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M400.84,-455.9C411.13,-442.26 423,-422.15 423,-402 423,-402 423,-402 423,-209 423,-167.58 457.43,-129.86 480.83,-109"/>
</g>
<!-- lambdas -->
<g id="node22" class="node">
<title>lambdas</title>
<path fill="#ffeff6" stroke="#222222" d="M736.5,-492C736.5,-492 639.5,-492 639.5,-492 633.5,-492 627.5,-486 627.5,-480 627.5,-480 627.5,-468 627.5,-468 627.5,-462 633.5,-456 639.5,-456 639.5,-456 736.5,-456 736.5,-456 742.5,-456 748.5,-462 748.5,-468 748.5,-468 748.5,-480 748.5,-480 748.5,-486 742.5,-492 736.5,-492"/>
<text text-anchor="middle" x="688" y="-471.5" font-family="Helvetica,sans-Serif" font-size="10.00">Dispatcher + Monitor</text>
</g>
<!-- lambdas&#45;&gt;dispatcher -->
<g id="edge34" class="edge">
<title>lambdas&#45;&gt;dispatcher</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M688,-455.81C688,-444.65 688,-430.16 688,-419.03"/>
</g>
<!-- lambdas&#45;&gt;monitor -->
<g id="edge35" class="edge">
<title>lambdas&#45;&gt;monitor</title>
<path fill="none" stroke="#7a7a7a" stroke-dasharray="5,2" d="M660.47,-455.87C646.89,-446.33 631.09,-433.48 620,-419 565.28,-347.59 517.97,-310.86 553,-228 553.1,-227.76 637.64,-145.71 675.28,-109.19"/>
</g>
</g>
</svg>
" />
<p>Four S3 buckets:</p>
<ul>
<li><strong>input</strong>: scripts and payloads</li>
<li><strong>output</strong>: results</li>
<li><strong>models</strong>: cached HuggingFace weights</li>
<li><strong>vault</strong>: cookies and API keys</li>
</ul>
<p>An EFS volume is mounted at <code>/opt/models</code> across Batch instances so models downloaded on one job are cached for the next.</p>
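<p>As a rough sketch of how a job might use this shared cache, a script can check the EFS mount before downloading anything. The helper name <code>cached_model_dir</code>, the <code>download</code> callback, and the <code>.complete</code> marker file are hypothetical and not taken from the project:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">from pathlib import Path

MODEL_CACHE = Path("/opt/models")  # EFS mount point described above


def cached_model_dir(name, download, cache_root=MODEL_CACHE):
    """Return the cache directory for a model, downloading only on a miss.

    `download` is a caller-supplied function that takes the target Path and
    fills it with model files (for example, a HuggingFace download).
    """
    target = Path(cache_root) / name
    # Hypothetical marker file: written only after a download finishes, so a
    # job interrupted mid-download does not leave a cache that looks complete.
    marker = target / ".complete"
    if not marker.exists():
        target.mkdir(parents=True, exist_ok=True)
        download(target)
        marker.touch()
    return target</code></pre></div>
<p>The marker file matters with spot instances: a job can be reclaimed halfway through a download, and the next job should not trust a partially written directory.</p>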
<p>The VPC uses public subnets with an internet gateway rather than private subnets with a NAT gateway. Batch instances need to reach ECR to pull images and S3 for data. A NAT gateway costs roughly $32/month in hourly charges even with no traffic. Public subnets give the instances the same outbound access at no standing cost, as long as the security groups do not open inbound ports.</p>
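<p>The monthly figure can be reproduced with simple arithmetic. The hourly rate below is an assumption (the us-east-1 list price for one NAT gateway at the time of writing); check current pricing for your region:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Assumed hourly price for one NAT gateway; varies by region and over time.
NAT_HOURLY_USD = 0.045
HOURS_PER_MONTH = 730  # average month: 365 * 24 / 12


def nat_monthly_cost(gateways=1):
    """Standing monthly cost of idle NAT gateways, excluding per-GB data charges."""
    return round(gateways * NAT_HOURLY_USD * HOURS_PER_MONTH, 2)


print(nat_monthly_cost())  # one gateway
print(nat_monthly_cost(2)) # one gateway per AZ across two AZs</code></pre></div>
<p>With the assumed rate, one gateway comes to about $32.85 per month before any data processing charges.</p>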
<h1 data-number="3" id="trigger-flow"><span class="header-section-number">3</span> Trigger Flow</h1>
<p>A job submission is a JSON message published to the SNS trigger topic:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode json"><code class="sourceCode json"><span id="cb2-1"><a href="#cb2-1" aria-hidden="true" tabindex="-1"></a><span class="fu">{</span></span>
<span id="cb2-2"><a href="#cb2-2" aria-hidden="true" tabindex="-1"></a> <span class="dt">"trigger_type"</span><span class="fu">:</span> <span class="st">"batch_job"</span><span class="fu">,</span></span>
<span id="cb2-3"><a href="#cb2-3" aria-hidden="true" tabindex="-1"></a> <span class="dt">"data"</span><span class="fu">:</span> <span class="fu">{</span></span>
<span id="cb2-4"><a href="#cb2-4" aria-hidden="true" tabindex="-1"></a> <span class="dt">"script_key"</span><span class="fu">:</span> <span class="st">"jobs/transcribe_processor.py"</span><span class="fu">,</span></span>
<span id="cb2-5"><a href="#cb2-5" aria-hidden="true" tabindex="-1"></a> <span class="dt">"compute_type"</span><span class="fu">:</span> <span class="st">"gpu"</span><span class="fu">,</span></span>
<span id="cb2-6"><a href="#cb2-6" aria-hidden="true" tabindex="-1"></a> <span class="dt">"input_key"</span><span class="fu">:</span> <span class="st">"audio/interview.wav"</span><span class="fu">,</span></span>
<span id="cb2-7"><a href="#cb2-7" aria-hidden="true" tabindex="-1"></a> <span class="dt">"output_key"</span><span class="fu">:</span> <span class="st">"transcripts/interview.json"</span></span>
<span id="cb2-8"><a href="#cb2-8" aria-hidden="true" tabindex="-1"></a> <span class="fu">},</span></span>
<span id="cb2-9"><a href="#cb2-9" aria-hidden="true" tabindex="-1"></a> <span class="dt">"metadata"</span><span class="fu">:</span> <span class="fu">{</span></span>
<span id="cb2-10"><a href="#cb2-10" aria-hidden="true" tabindex="-1"></a> <span class="dt">"user"</span><span class="fu">:</span> <span class="st">"nana"</span><span class="fu">,</span></span>
<span id="cb2-11"><a href="#cb2-11" aria-hidden="true" tabindex="-1"></a> <span class="dt">"project"</span><span class="fu">:</span> <span class="st">"transcription"</span></span>
<span id="cb2-12"><a href="#cb2-12" aria-hidden="true" tabindex="-1"></a> <span class="fu">}</span></span>
<span id="cb2-13"><a href="#cb2-13" aria-hidden="true" tabindex="-1"></a><span class="fu">}</span></span></code></pre></div>
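<p>A message like this could be published from Python with boto3. This is a minimal sketch: the helper names <code>build_trigger</code> and <code>publish_trigger</code> are hypothetical, and the topic ARN is whatever your Terraform outputs provide:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import json


def build_trigger(script_key, compute_type, input_key, output_key, user, project):
    """Assemble a trigger message in the shape shown above."""
    if compute_type not in ("gpu", "cpu"):
        raise ValueError("compute_type must be 'gpu' or 'cpu'")
    return {
        "trigger_type": "batch_job",
        "data": {
            "script_key": script_key,
            "compute_type": compute_type,
            "input_key": input_key,
            "output_key": output_key,
        },
        "metadata": {"user": user, "project": project},
    }


def publish_trigger(topic_arn, message, sns=None):
    """Publish the message to SNS and return the SNS MessageId."""
    if sns is None:
        import boto3  # imported lazily so build_trigger works without boto3
        sns = boto3.client("sns")
    response = sns.publish(TopicArn=topic_arn, Message=json.dumps(message))
    return response["MessageId"]</code></pre></div>
<p>Passing the client as an argument keeps the function easy to exercise with a stub before pointing it at a real topic.</p>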
<p>The Lambda dispatcher validates the message, resolves resource defaults (vCPUs, memory, GPUs) from Terraform variables if the caller didn’t specify them, generates a unique job name, and submits it to the appropriate Batch queue. The full SNS payload is passed into the container via <code>SNS_MESSAGE</code> so the Python script sees everything the caller sent.</p>
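<p>The dispatcher logic can be sketched roughly as follows. This is not the project's actual Lambda: the environment variable names, the fallback queue and job definition names, and the job name format are all assumptions, and resource overrides are left out to keep the sketch short:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import json
import os
import time
import uuid

# Hypothetical variable names; Terraform would set these on the Lambda.
QUEUES = {
    "gpu": os.environ.get("GPU_JOB_QUEUE", "ml-gpu-queue"),
    "cpu": os.environ.get("CPU_JOB_QUEUE", "ml-cpu-queue"),
}
JOB_DEFS = {
    "gpu": os.environ.get("GPU_JOB_DEFINITION", "ml-gpu-job"),
    "cpu": os.environ.get("CPU_JOB_DEFINITION", "ml-cpu-job"),
}


def make_job_name(message):
    """Build a unique name using only characters Batch accepts in job names."""
    project = message.get("metadata", {}).get("project", "job")
    safe = "".join(
        c if (c.isascii() and c.isalnum()) or c in "-_" else "-" for c in project
    )
    return f"ml-{safe[:64]}-{int(time.time())}-{uuid.uuid4().hex[:8]}"


def dispatch(message, batch):
    """Validate a trigger message and submit it to the matching Batch queue."""
    if message.get("trigger_type") != "batch_job":
        raise ValueError("unsupported trigger_type")
    data = message.get("data") or {}
    compute_type = data.get("compute_type")
    if compute_type not in QUEUES:
        raise ValueError("compute_type must be 'gpu' or 'cpu'")
    if not data.get("script_key"):
        raise ValueError("script_key is required")
    return batch.submit_job(
        jobName=make_job_name(message),
        jobQueue=QUEUES[compute_type],
        jobDefinition=JOB_DEFS[compute_type],
        containerOverrides={
            "environment": [
                {"name": "SCRIPT_KEY", "value": data["script_key"]},
                # Full payload, so the job script sees everything the caller sent.
                {"name": "SNS_MESSAGE", "value": json.dumps(message)},
            ]
        },
    )


def handler(event, context, batch=None):
    """Lambda entry point for SNS events; `batch` is injectable for testing."""
    if batch is None:
        import boto3
        batch = boto3.client("batch")
    job_ids = []
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        job_ids.append(dispatch(message, batch)["jobId"])
    return {"submitted": job_ids}</code></pre></div>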
<p>The entrypoint script inside the container downloads the <code>.py</code> file from S3 using <code>SCRIPT_KEY</code>, runs it, then uploads everything in <code>/workspace/output/</code> and <code>/workspace/logs/</code> back to S3.</p>
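<p>The same download, run, upload sequence can be sketched in Python. The project's real entrypoint may well be a shell script; the local file names <code>job.py</code> and <code>job.log</code>, the default prefix, and the choice to upload even when the script fails are assumptions of this sketch:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import os
import subprocess
import sys
from pathlib import Path


def upload_tree(s3, local_dir, bucket, prefix):
    """Upload every file under local_dir to s3://bucket/prefix/ and return the keys."""
    keys = []
    for path in sorted(Path(local_dir).rglob("*")):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(local_dir).as_posix()}"
            s3.upload_file(str(path), bucket, key)
            keys.append(key)
    return keys


def run_job(s3, workspace, env=os.environ):
    """Fetch the job script from S3, run it, then sync output/ and logs/ to S3."""
    workspace = Path(workspace)
    for sub in ("output", "logs"):
        (workspace / sub).mkdir(parents=True, exist_ok=True)

    script = workspace / "job.py"  # hypothetical local file name
    s3.download_file(env["ML_INPUT_BUCKET"], env["SCRIPT_KEY"], str(script))

    # Capture stdout and stderr in a log file that is uploaded with the results.
    with open(workspace / "logs" / "job.log", "w") as log:
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=str(workspace),
            env=dict(env),
            stdout=log,
            stderr=subprocess.STDOUT,
        )

    # Upload even on failure so the logs are available for debugging.
    prefix = env.get("OUTPUT_PREFIX", "jobs").strip("/")
    for sub in ("output", "logs"):
        upload_tree(s3, workspace / sub, env["ML_OUTPUT_BUCKET"], f"{prefix}/{sub}")
    return result.returncode</code></pre></div>
<p>In the container this would be called as <code>run_job(boto3.client("s3"), "/workspace")</code>, with the return value used as the process exit code so Batch marks the job as failed when the script fails.</p>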
<p>A separate Lambda monitors Batch job state changes via EventBridge. When a job finishes, it sends an SNS notification and writes a summary JSON to S3.</p>
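<p>A monitor along these lines might look like the sketch below. It assumes the standard EventBridge "Batch Job State Change" event, whose <code>detail</code> field carries the job name, ID, queue, status, and millisecond timestamps. The summary fields and the <code>summaries/</code> key layout are hypothetical:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">import json

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def summarize(event):
    """Return a summary dict for finished jobs, or None for intermediate states."""
    detail = event.get("detail", {})
    status = detail.get("status")
    if status not in TERMINAL_STATES:
        return None
    started, stopped = detail.get("startedAt"), detail.get("stoppedAt")
    # Batch reports timestamps in milliseconds since the epoch.
    duration = (stopped - started) / 1000 if started and stopped else None
    return {
        "job_id": detail.get("jobId"),
        "job_name": detail.get("jobName"),
        "queue": detail.get("jobQueue"),
        "status": status,
        "reason": detail.get("statusReason"),
        "duration_seconds": duration,
    }


def handle(event, s3, sns, bucket, topic_arn):
    """Write the summary JSON to S3 and notify via SNS; return the S3 key."""
    summary = summarize(event)
    if summary is None:
        return None
    body = json.dumps(summary, indent=2)
    key = f"summaries/{summary['job_id']}.json"  # hypothetical key layout
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body.encode("utf-8"),
        ContentType="application/json",
    )
    subject = f"Batch job {summary['job_name']} {summary['status']}"
    # SNS subjects are limited to 100 characters.
    sns.publish(TopicArn=topic_arn, Subject=subject[:100], Message=body)
    return key</code></pre></div>
<p>Filtering on terminal states inside the function is a safety net. The EventBridge rule itself can also match only <code>SUCCEEDED</code> and <code>FAILED</code>, which avoids invoking the Lambda for every intermediate transition.</p>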
<h1 data-number="4" id="compute"><span class="header-section-number">4</span> Compute</h1>
<p>Two Batch compute environments:</p>
<ul>
<li><strong>GPU</strong>: g4dn.xlarge, g4dn.2xlarge, g5.xlarge. Spot by default. Allocation strategy is <code>SPOT_CAPACITY_OPTIMIZED</code>. Min vCPUs: 0, max: 256.</li>
<li><strong>CPU</strong>: m5.large, c6a.large, t3 variants. Same spot setup. Min: 0, max: 128.</li>
</ul>
<p>Default job resources are set in Terraform and can be overridden per job submission. GPU jobs default to 4 vCPU, 16GB RAM, 1 GPU. CPU jobs default to 2 vCPU, 4GB RAM.</p>
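<p>The default-plus-override behaviour can be illustrated with a small merge function. The default values mirror the numbers above, while the override key names (<code>vcpus</code>, <code>memory_mb</code>, <code>gpus</code>) are hypothetical. The output uses the <code>resourceRequirements</code> shape that the Batch <code>SubmitJob</code> API accepts, where memory is expressed in MiB and every value is a string:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python"># Defaults from the text: GPU jobs 4 vCPU / 16 GB / 1 GPU, CPU jobs 2 vCPU / 4 GB.
DEFAULTS = {
    "gpu": {"vcpus": 4, "memory_mb": 16384, "gpus": 1},
    "cpu": {"vcpus": 2, "memory_mb": 4096, "gpus": 0},
}


def resource_requirements(compute_type, overrides=None):
    """Merge per-job overrides onto the defaults and format them for Batch."""
    # Ignore keys the caller left as None so they fall back to the default.
    supplied = {k: v for k, v in (overrides or {}).items() if v is not None}
    merged = {**DEFAULTS[compute_type], **supplied}
    requirements = [
        {"type": "VCPU", "value": str(merged["vcpus"])},
        {"type": "MEMORY", "value": str(merged["memory_mb"])},
    ]
    if merged["gpus"]:
        requirements.append({"type": "GPU", "value": str(merged["gpus"])})
    return requirements</code></pre></div>
<p>The resulting list would go under <code>containerOverrides</code> in the <code>submit_job</code> call.</p>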
<h1 data-number="5" id="docker-images"><span class="header-section-number">5</span> Docker Images</h1>
<p>Two images built locally and pushed to ECR:</p>
<ul>
<li><strong>cpu-slim</strong>: Based on <code>python:3.11-slim</code>. Installs FFmpeg, Node.js, yt-dlp. Around 500MB. Builds in about 2 minutes.</li>
<li><strong>gpu-slim</strong>: Based on <code>nvidia/cuda:12.4.1-cudnn-runtime-ubuntu22.04</code>. Installs Python 3.11, PyTorch 2.1, faster-whisper, pyannote.audio, transformers. Around 5GB. Takes 5-10 minutes to build.</li>
</ul>
<p>Getting CUDA, cuDNN, PyTorch, and pyannote versions to agree took several iterations. The current combination locks to CUDA 12.4.1 with cuDNN runtime and lets PyTorch detect GPU support at runtime.</p>
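<p>Runtime detection in a job script can be as small as the sketch below. The function name <code>pick_device</code> is hypothetical. It falls back to CPU when PyTorch is not installed, which lets the same script run in the cpu-slim image:</p>
<div class="sourceCode"><pre class="sourceCode python"><code class="sourceCode python">def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is absent (for example, in a CPU-only image).
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"</code></pre></div>
<p>Logging the result of <code>pick_device()</code> at the start of a job makes a silent CPU fallback easy to spot, which is a common symptom of a driver or CUDA version mismatch.</p>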
<h1 data-number="6" id="job-scripts"><span class="header-section-number">6</span> Job Scripts</h1>
<p>The pipeline does not require a fixed set of scripts. You can bring your own Python jobs as long as each one is uploaded to the input S3 bucket and referenced by <code>script_key</code> in the trigger payload.</p>
<p>The scripts below are examples from this project, not required components:</p>
<ul>
<li><strong>transcribe_processor.py</strong>: Runs faster-whisper for speech-to-text and pyannote for speaker diarization. Outputs JSON with segments, speaker labels, and timestamps. Downloads models from HuggingFace on first run and caches them on EFS.</li>
<li><strong>video_processor.py</strong>: Extracts audio from video using FFmpeg. Outputs 16kHz mono WAV normalized for Whisper input.</li>
<li><strong>download_processor.py</strong>: Downloads video from YouTube via yt-dlp. Reads cookies from the vault bucket for authenticated requests.</li>
<li><strong>scoring_processor.py</strong>: Scores transcript segments using an LLM (Bedrock, OpenAI, or Anthropic). Takes segments JSON as input, returns scored segments. Provider is pluggable.</li>
<li><strong>cleanup_processor.py</strong>: Deletes cached models from EFS. Useful when you want to free EFS storage between projects.</li>
</ul>
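<p>For a sense of what the audio extraction step involves, the sketch below builds an FFmpeg command that drops the video stream and writes 16 kHz mono 16-bit PCM WAV. The flags are standard FFmpeg options, but the exact command in <code>video_processor.py</code> may differ (for example, it may also apply loudness normalization), and the function name and file paths are hypothetical.</p>

```python
def build_extract_audio_cmd(video_path, wav_path):
    """Return an ffmpeg argv list that writes 16 kHz mono PCM WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it exists
        "-i", video_path,     # input video
        "-vn",                # drop the video stream
        "-ac", "1",           # one audio channel (mono)
        "-ar", "16000",       # 16 kHz sample rate
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM
        wav_path,
    ]


cmd = build_extract_audio_cmd("input.mp4", "audio.wav")
print(" ".join(cmd))
```

<p>Passing the list to <code>subprocess.run(cmd, check=True)</code> would execute it, provided FFmpeg is on the <code>PATH</code> as it is in the cpu-slim image.</p>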
<p>In practice, the contract is simple: your script receives context via environment variables (<code>ML_INPUT_BUCKET</code>, <code>ML_OUTPUT_BUCKET</code>, <code>OUTPUT_PREFIX</code>, etc.), runs whatever workload you need, and writes results to <code>/workspace/output/</code>. After the script exits, the container entrypoint syncs outputs (and logs) back to S3.</p>
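<p>A bring-your-own script that follows this contract might look like the skeleton below. The environment variable names and the output directory come from the contract above; the helper names, the result file name, and the example bucket name are assumptions made for illustration. The dry run at the end writes to a temporary directory so it can be tried outside the container.</p>

```python
import json
import os
import tempfile
from pathlib import Path


def load_context(env=os.environ):
    """Collect the job context that the pipeline passes via environment variables."""
    return {
        "input_bucket": env.get("ML_INPUT_BUCKET"),
        "output_bucket": env.get("ML_OUTPUT_BUCKET"),
        "output_prefix": env.get("OUTPUT_PREFIX", ""),
    }


def write_result(result, output_dir="/workspace/output", filename="result.json"):
    """Write a JSON result where the entrypoint will pick it up for the S3 sync."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / filename
    path.write_text(json.dumps(result, indent=2))
    return path


# Local dry run: a fake environment and a temporary output directory
# stand in for the container's real ones.
fake_env = {"ML_INPUT_BUCKET": "example-input", "ML_OUTPUT_BUCKET": "example-output"}
with tempfile.TemporaryDirectory() as tmp:
    written = write_result({"context": load_context(fake_env)}, output_dir=tmp)
    saved = json.loads(written.read_text())

print(saved)
```

<p>Inside the container you would call <code>write_result</code> with its default <code>output_dir</code> and let the entrypoint handle the upload.</p>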
<h1 data-number="7" id="admin-tools"><span class="header-section-number">7</span> Admin Tools</h1>
<p>A small FastAPI app (<code>admin/ui/app.py</code>), run locally on port 8000, provides a UI for building and submitting job payloads. It has presets for common job types, a JSON override editor, and an S3 bucket selector, and it publishes directly to SNS.</p>
<p>CLI scripts in <code>admin/scripts/</code> cover the same ground plus a few utilities: uploading all job scripts to S3, downloading models to the vault bucket before the first run, and exporting Firefox cookies to the Netscape format yt-dlp expects.</p>
<h1 data-number="8" id="state-and-backend"><span class="header-section-number">8</span> State and Backend</h1>
<p>This project stores Terraform state remotely in S3 with DynamoDB locking, so state is shared and protected from concurrent writes. One practical detail is that backend configuration cannot use regular Terraform variables, so these values are passed as CLI flags during <code>terraform init</code>:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode bash"><code class="sourceCode bash"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="ex">terraform</span> init <span class="dt">\</span></span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a> <span class="at">-backend-config</span><span class="op">=</span><span class="st">"bucket=</span><span class="va">${PROJECT_NAME}</span><span class="st">-terraform-state"</span> <span class="dt">\</span></span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a> <span class="at">-backend-config</span><span class="op">=</span><span class="st">"dynamodb_table=</span><span class="va">${PROJECT_NAME}</span><span class="st">-terraform-lock"</span> <span class="dt">\</span></span>
<span id="cb3-4"><a href="#cb3-4" aria-hidden="true" tabindex="-1"></a> <span class="at">-backend-config</span><span class="op">=</span><span class="st">"region=</span><span class="va">$AWS_REGION</span><span class="st">"</span></span></code></pre></div>
<h1 data-number="9" id="gpu-quotas"><span class="header-section-number">9</span> GPU Quotas</h1>
<p>AWS Batch can’t launch GPU instances if the account’s EC2 vCPU quota for G and VT instances is still at its default, which is often 0 in new accounts. Terraform can’t raise this limit on its own: the increase has to be requested through AWS Service Quotas and approved by AWS.</p>
<p>Quota codes that are consistent across all accounts:</p>
<ul>
<li><code>L-DB2E81BA</code>: Running On-Demand G and VT instances</li>
<li><code>L-3819A6DF</code>: All G and VT Spot Instance Requests</li>
</ul>
<p>Jobs will sit in <code>RUNNABLE</code> status indefinitely without sufficient quota.</p>
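<p>Because these quotas are counted in vCPUs rather than instances, it helps to check how many instances a given quota actually allows before submitting jobs. The sketch below does that arithmetic for the GPU instance types used here. The helper names are hypothetical. <code>fetch_quota_vcpus</code> uses the boto3 Service Quotas client and assumes boto3 is installed and AWS credentials are configured; it is defined but not called in this example.</p>

```python
# vCPU counts for the GPU instance types in the compute environment.
VCPUS_PER_INSTANCE = {"g4dn.xlarge": 4, "g4dn.2xlarge": 8, "g5.xlarge": 4}

SPOT_G_VT_QUOTA_CODE = "L-3819A6DF"
ON_DEMAND_G_VT_QUOTA_CODE = "L-DB2E81BA"


def max_instances(quota_vcpus, instance_type):
    """Return how many instances of one type fit within a vCPU quota."""
    return int(quota_vcpus) // VCPUS_PER_INSTANCE[instance_type]


def fetch_quota_vcpus(quota_code, region):
    """Look up the current quota value (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported lazily so the arithmetic works without it

    client = boto3.client("service-quotas", region_name=region)
    response = client.get_service_quota(ServiceCode="ec2", QuotaCode=quota_code)
    return response["Quota"]["Value"]


# With a quota of 0, nothing can launch and jobs stay in RUNNABLE.
print(max_instances(0, "g4dn.xlarge"))
# A quota of 8 vCPUs allows two g5.xlarge instances or one g4dn.2xlarge.
print(max_instances(8, "g5.xlarge"), max_instances(8, "g4dn.2xlarge"))
```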
<h1 data-number="10" id="code"><span class="header-section-number">10</span> Code</h1>
<p><a href="https://github.com/hiram-labs/terraform-aws-ml">hiram-labs/terraform-aws-ml</a></p>