
Published Date: Jan 16, 2025
Page Length: 8
This article breaks down what it actually takes to build AI instead of calling someone else's API. You'll see why compute costs replace per-token billing, how GPU architecture choices (A6000 vs A100 vs H100) change your development strategy, and what multi-service orchestration looks like when Whisper, Llama, F5-TTS, and a 14B avatar model all run on shared hardware.
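To give a flavor of what that co-location involves, here is a minimal sketch of pinning each service to a GPU and capping its memory share with PyTorch; the service names come from the article, but the device assignments and memory fractions are illustrative assumptions, not the article's actual layout.

```python
import torch

# Illustrative placement (assumed, not the article's exact mapping):
# each service gets a GPU index and a cap on how much of that card's
# memory its process may allocate, to limit contention with neighbors.
SERVICE_PLACEMENT = {
    "whisper": {"device": 0, "mem_fraction": 0.25},  # speech-to-text
    "llama":   {"device": 0, "mem_fraction": 0.60},  # LLM inference
    "f5_tts":  {"device": 1, "mem_fraction": 0.30},  # text-to-speech
    "avatar":  {"device": 1, "mem_fraction": 0.60},  # 14B avatar model
}

def configure_service(name: str) -> torch.device:
    """Pin the current process to its assigned GPU and cap its memory share."""
    placement = SERVICE_PLACEMENT[name]
    # Hard cap: allocations beyond this fraction fail with OOM instead of
    # silently starving the other services sharing the same card.
    torch.cuda.set_per_process_memory_fraction(
        placement["mem_fraction"], device=placement["device"]
    )
    return torch.device(f"cuda:{placement['device']}")

if __name__ == "__main__":
    dev = configure_service("llama")
    print(f"llama service running on {dev}")
```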
I walk through the real engineering decisions: CUDA memory contention, tensor parallelism across multiple GPUs, NVSwitch/Fabric Manager pitfalls, and the cost model that emerges when you own the models and the infrastructure. If you care about AI systems, GPUs, and the tradeoffs behind production-grade inference, this is the deep dive you want.
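As a preview of the tensor parallelism piece, here is a minimal sketch of sharding a Llama model across two GPUs with vLLM; the framework choice, model name, parallel degree, and memory utilization value are my assumptions for illustration, not necessarily the article's exact stack.

```python
from vllm import LLM, SamplingParams

# Assumed setup: shard one Llama model across 2 GPUs with tensor parallelism,
# so each GPU holds a slice of every weight matrix and the cards exchange
# activations over NVLink/NVSwitch on every forward pass.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model choice
    tensor_parallel_size=2,                    # number of GPUs to shard across
    gpu_memory_utilization=0.85,               # leave headroom for co-located services
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```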