Training the full 70B model, not a laptop-sized compromise
A 4-person team trains a full 70B fine-tune per run for $118 instead of a standing $21K/mo GPU box, and ships +18 accuracy points.
The challenge
A 70B-class model needs ~220 GB of accelerator memory, impossible on a Mac. The team's only options were to quantize down to an 8B proxy (and lose quality) or pay for a standing multi-GPU box around the clock.
Workload understanding
One profiled the job at ~220 GB accelerator memory, GPU, 8-way parallel, far past the Mac's headroom, so it routed to a cloud burst.
Best hardware for the job
Matched 8× NVIDIA A100 80GB (640 GB), the best performance-per-dollar that fits. A T4 pick wouldn't fit (would need 14 chips); an 8× H100 box would finish faster but cost more for the same result.
Benchmark
| Pick | Hardware | Time | Cost | Verdict |
|---|---|---|---|---|
| undersized | T4 | , | , | won't fit (needs 14× T4) |
| One's match | 8× A100 80GB | ~242 min | $118.42 | best perf-per-dollar that fits |
| oversized | 8× H100 80GB | ~100 min | $130.90 | faster, but more $ for the same result |
Completion
One provisioned the cluster in the team's own GCP project, streamed progress, returned the checkpoint, and tore the cluster down, no idle spend.
Outcome
“We were about to put a GPU server on the company card. Instead we pay per run, and we ship the real model, not the one that fit.”
Representative composite story. Placement, hardware matching, and completion are real system behavior; figures are transparent, editable model inputs.