Training the full 70B model, not a laptop-sized compromise

A 4-person team trains a full 70B fine-tune per run for $118 instead of a standing $21K/mo GPU box, and ships +18 accuracy points.

The challenge

A 70B-class model needs ~220 GB of accelerator memory, impossible on a Mac. The team's only options were to quantize down to an 8B proxy (and lose quality) or pay for a standing multi-GPU box around the clock.

Workload understanding

One profiled the job at ~220 GB accelerator memory, GPU, 8-way parallel, far past the Mac's headroom, so it routed to a cloud burst.

Best hardware for the job

Matched 8× NVIDIA A100 80GB (640 GB), the best performance-per-dollar that fits. A T4 pick wouldn't fit (would need 14 chips); an 8× H100 box would finish faster but cost more for the same result.

Benchmark

Pick	Hardware	Time	Cost	Verdict
undersized	T4	,	,	won't fit (needs 14× T4)
One's match	8× A100 80GB	~242 min	$118.42	best perf-per-dollar that fits
oversized	8× H100 80GB	~100 min	$130.90	faster, but more $ for the same result

Completion

One provisioned the cluster in the team's own GCP project, streamed progress, returned the checkpoint, and tore the cluster down, no idle spend.

Outcome

+18 accuracy pts (91% vs 73% proxy)

Output quality

$118.42 (pay-per-use)

Cost per run

~$940/mo vs ~$21,455/mo

Vs standing box

~242 min, fully managed

Time-to-result

“We were about to put a GPU server on the company card. Instead we pay per run, and we ship the real model, not the one that fit.”

Run your workload in your own cloud.

Meet the Burst Agent All stories

Representative composite story. Placement, hardware matching, and completion are real system behavior; figures are transparent, editable model inputs.