AMD Threadripper Pro · 4-GPU rack server · Proxmox
128 PCIe 4.0 lanes: four AI GPUs at x8 each use only 32 lanes, leaving 96 lanes for NVMe SSDs, NIC, HBA, and future expansion.
Clean IOMMU topology: each PCIe slot gets its own IOMMU group, so each GPU can be passed through individually without ACS override hacks.
ECC RAM support: critical for a 24/7 system with persistent memory stores. Silent memory corruption that hits the wrong key causes subtle misbehaviour that is very hard to trace.
High core count: enough cores to run all LXC containers without CPU contention anywhere in the stack.
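The IOMMU claim above can be checked on the host before any passthrough is configured. The following is a minimal sketch, assuming a Linux host that exposes groups under `/sys/kernel/iommu_groups`; the function names `iommu_groups` and `is_isolated` are illustrative and not part of any Proxmox tooling. A group counts as isolated here when all of its devices are functions of the same PCI slot (for example a GPU and its own audio function).

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in that group."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled in firmware/kernel, or not a Linux host.
        return groups
    for group_dir in base.iterdir():
        devices_dir = group_dir / "devices"
        if group_dir.name.isdigit() and devices_dir.is_dir():
            groups[int(group_dir.name)] = sorted(d.name for d in devices_dir.iterdir())
    return groups


def is_isolated(devices):
    """True if every device in the group is a function of the same PCI slot.

    PCI addresses look like 0000:41:00.0; dropping the function suffix (.0)
    leaves domain:bus:device, which identifies the slot.
    """
    return len({addr.rsplit(".", 1)[0] for addr in devices}) == 1


if __name__ == "__main__":
    for number, devices in sorted(iommu_groups().items()):
        flag = "isolated" if is_isolated(devices) else "shared"
        print(f"group {number:3d} [{flag}]: {', '.join(devices)}")
```

If a GPU's group is reported as shared, it contains devices from another slot, which is the situation ACS override patches are usually used to work around.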
Dual 1000W rack PSUs in active load-sharing: one PSU failure does not take down the system, provided total draw stays within a single unit's 1000W rating. In normal operation both units run at ~50% load, their most efficient and longest-lasting operating point.
All AI inference tasks run exclusively on GPUs. Each GPU is passed through to a dedicated LXC container, except GPU 3, which the audio and vision containers share. Inference workloads are isolated from the orchestration layer and, apart from that shared card, from each other.
PCIe 4.0 x8 provides ~16 GB/s per direction per slot, more than sufficient for inference workloads where each model loads independently per card with no multi-GPU weight synchronisation.
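The lane, bandwidth, and PSU figures quoted above follow from simple arithmetic, sketched below. The PCIe 4.0 constants (16 GT/s per lane, 128b/130b encoding) are standard; the helper names are illustrative, and the PSU calculation assumes both units share the load equally, which is a simplification.

```python
PCIE4_GT_PER_LANE = 16.0   # PCIe 4.0 raw signalling rate per lane, in GT/s
ENCODING = 128 / 130       # 128b/130b line encoding efficiency


def pcie4_gbytes_per_s(lanes):
    """Approximate one-direction PCIe 4.0 throughput in GB/s, before protocol overhead."""
    return PCIE4_GT_PER_LANE * ENCODING * lanes / 8


def lanes_remaining(total_lanes, gpus, lanes_per_gpu):
    """Lanes left for NVMe, NIC, HBA, and expansion after the GPUs are populated."""
    return total_lanes - gpus * lanes_per_gpu


def surviving_psu_load(psu_watts, count, load_fraction):
    """Load fraction on each remaining PSU after one unit fails (equal sharing assumed)."""
    if count < 2:
        raise ValueError("redundancy needs at least two PSUs")
    total_draw = psu_watts * count * load_fraction
    return total_draw / (psu_watts * (count - 1))


if __name__ == "__main__":
    print(f"x8 slot: {pcie4_gbytes_per_s(8):.2f} GB/s per direction")
    print(f"lanes left: {lanes_remaining(128, 4, 8)}")
    print(f"surviving PSU load: {surviving_psu_load(1000, 2, 0.5):.0%}")
```

With the document's numbers, a x8 slot works out to roughly 15.75 GB/s per direction, 96 lanes remain free, and a system drawing 50% from each of two 1000W units would put the surviving unit at its full rating after a failure.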
4U Rack Server — AMD Threadripper Pro (Proxmox)
|
+-- LXC: N8N (orchestration)                 [CPU]
+-- LXC: Python reasoning loop               [CPU]
+-- LXC: Redis + PostgreSQL + Qdrant         [CPU + NVMe]
|
+-- LXC: llama.cpp main reasoning model      [GPU 1 - datacenter, high VRAM]
|       Magistral Small 1.2 / Mistral Large 3
+-- LXC: llama.cpp specialist LLM pool       [GPU 2 - datacenter, high VRAM]
|       Codestral, Devstral, Pixtral, Mistral Small 3.x
+-- LXC: vLLM audio processing               [GPU 3 - pro-sumer AI]
|       Voxtral Realtime 4B + Voxtral Small 24B + SpeechBrain
+-- LXC: Vision sensor pipelines             [GPU 3 - shared]
|       YOLOv8, MediaPipe, InsightFace (2-5 fps, 24/7)
+-- LXC: Safety review model                 [GPU 4 - pro-sumer AI]
|       Mistral Moderation always warm
|
+-- VM: AI monitoring & audit system
        (isolated, encrypted NAS block, no network path to Jarvis)
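The container-to-GPU placement in the diagram above can also be written down as data, which makes it easy to see which cards are shared. This is a minimal sketch: the `PLACEMENT` table is transcribed from the diagram, while the helper names are illustrative and not part of any existing tooling.

```python
from collections import defaultdict

# (container, GPU number) pairs transcribed from the diagram.
# None marks containers that run on CPU only.
PLACEMENT = [
    ("N8N (orchestration)", None),
    ("Python reasoning loop", None),
    ("Redis + PostgreSQL + Qdrant", None),
    ("llama.cpp main reasoning model", 1),
    ("llama.cpp specialist LLM pool", 2),
    ("vLLM audio processing", 3),
    ("Vision sensor pipelines", 3),
    ("Safety review model", 4),
]


def containers_by_gpu(placement):
    """Group container names by the GPU they are assigned to, skipping CPU-only ones."""
    by_gpu = defaultdict(list)
    for name, gpu in placement:
        if gpu is not None:
            by_gpu[gpu].append(name)
    return dict(by_gpu)


def shared_gpus(placement):
    """Return only the GPUs that serve more than one container."""
    return {
        gpu: names
        for gpu, names in containers_by_gpu(placement).items()
        if len(names) > 1
    }


if __name__ == "__main__":
    for gpu, names in sorted(containers_by_gpu(PLACEMENT).items()):
        print(f"GPU {gpu}: {', '.join(names)}")
    print("shared:", shared_gpus(PLACEMENT))
```

For the layout shown, only GPU 3 is reported as shared, between the audio and vision containers.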
The architecture is identical at every phase — only available compute changes
Phase 8 target — zero contention between layers
+-- GPU 1 -- Datacenter (high VRAM)     -> Main reasoning model
|   NVIDIA A100 / A40 / L40S               Magistral Small 1.2 / Mistral Large 3
|
+-- GPU 2 -- Datacenter (high VRAM)     -> Specialist LLM pool
|   Same family as GPU 1                   Codestral, Devstral, Pixtral, Mistral Small 3.x
|                                          Multiple models loaded simultaneously
|
+-- GPU 3 -- Pro-sumer AI               -> Sensor pipelines (24/7)
|   AMD Radeon AI Pro R9700/R9500          Vision (YOLOv8, MediaPipe, InsightFace)
|                                          Audio (Voxtral Realtime + Voxtral Small)
|
+-- GPU 4 -- Pro-sumer                  -> Safety review model + overflow
    RTX / equivalent                       Mistral Moderation always warm