AI / Infrastructure · Working experiment

GPU Orchestration Lab

I split model serving, image generation, translation, and embedding work into GPU pools that can scale and fail independently.

The job

One monolithic GPU image makes unrelated workloads slow to ship and expensive to run. Orchestration state, asset handoff, and GPU-heavy execution need different lifecycles.

The hard part

Worker pools must scale independently, tolerate interruption, and pass large artifacts without forcing the controller to become a data plane.
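One way to keep the controller out of the data path is to pass only a small pointer between stages and leave the bytes in object storage. The sketch below is illustrative, not the project's actual schema: ArtifactRef, its fields, and the bucket and key names are all assumptions, and the upload and download calls are deliberately left out.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ArtifactRef:
    """Pointer to an object in S3-compatible storage (hypothetical schema)."""
    bucket: str
    key: str
    sha256: str
    size_bytes: int


def make_ref(bucket: str, key: str, payload: bytes) -> ArtifactRef:
    """Describe a payload the producing worker has already uploaded."""
    digest = hashlib.sha256(payload).hexdigest()
    return ArtifactRef(bucket=bucket, key=key, sha256=digest, size_bytes=len(payload))


def to_message(ref: ArtifactRef) -> str:
    """Serialize the pointer; this small string is all the controller carries."""
    return json.dumps(asdict(ref), sort_keys=True)


def from_message(message: str) -> ArtifactRef:
    """Rebuild the pointer on the consuming worker."""
    return ArtifactRef(**json.loads(message))


def verify(ref: ArtifactRef, payload: bytes) -> bool:
    """Check downloaded bytes against the size and hash recorded in the pointer."""
    return (
        len(payload) == ref.size_bytes
        and hashlib.sha256(payload).hexdigest() == ref.sha256
    )
```

The message stays a few hundred bytes regardless of artifact size, and the recorded hash lets a worker that was interrupted mid-download detect a partial file before using it.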

How I built it

Keep a small Prefect control layer and isolate workload-specific Docker images.
Use separate pools for embeddings, vLLM serving, tagging, generation, and translation.
Move artifacts through S3-compatible storage rather than the orchestration database.
Tie RunPod worker scaling to queue demand and the hardware profile of each stage.
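The scaling rule in the last step can be sketched as a pure function from queue depth to a worker count, bounded per pool. Everything concrete here is an assumption made for illustration: the PoolProfile fields, the GPU labels, and the numeric limits are invented, and no RunPod or Prefect API is called. The result would be handed to whatever scaling control the provider exposes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolProfile:
    """Scaling limits for one workload pool (hypothetical fields and values)."""
    name: str
    gpu: str              # placeholder hardware label, not a real SKU
    jobs_per_worker: int  # queued jobs one worker is expected to absorb
    min_workers: int
    max_workers: int


# Illustrative numbers only; real limits depend on each stage's hardware profile.
POOLS = {
    "embeddings": PoolProfile("embeddings", "small-gpu", 8, 0, 4),
    "vllm": PoolProfile("vllm", "large-gpu", 2, 1, 3),
    "generation": PoolProfile("generation", "large-gpu", 1, 0, 2),
}


def desired_workers(profile: PoolProfile, queued_jobs: int) -> int:
    """Return the worker count for a pool, clamped to its min and max."""
    if queued_jobs < 0:
        raise ValueError("queued_jobs must be non-negative")
    needed = math.ceil(queued_jobs / profile.jobs_per_worker)
    return max(profile.min_workers, min(profile.max_workers, needed))
```

Because each pool carries its own limits, a backlog in generation cannot pull workers away from serving, and a pool with min_workers set to 0 scales to zero when its queue is empty.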

What I verified

Workload-specific worker images
Independent GPU pools
Object-storage handoff
Queue-driven scaling controls

Current state: The controller, worker boundaries, and artifact handoff are implemented. No public uptime or cost-saving claim is attached.