Hardware Architecture

AMD Threadripper Pro · 4-GPU rack server · Proxmox

Why Threadripper Pro

128 PCIe 4.0 lanes — four AI GPUs at x8 each uses only 32 lanes, leaving ample bandwidth for NVMe SSDs, NIC, HBA, and future expansion. Clean IOMMU topology — each PCIe slot gets its own IOMMU group, enabling clean per-GPU passthrough without ACS override hacks. ECC RAM support — critical for a 24/7 system with persistent memory stores. Silent memory corruption in the wrong key causes subtle misbehaviour that is very hard to trace. High core count — enough to run all LXC containers without CPU contention anywhere in the stack.

Power & Isolation

Dual 1000W rack PSUs in active load-sharing — one PSU failure does not take down the system. Both units run at ~50% load, their most efficient and longest-lasting operating point. All AI inference tasks run exclusively on GPUs — each GPU is passed through to a dedicated LXC instance. Inference workloads are fully isolated from each other and from the orchestration layer. PCIe 4.0 x8 provides ~16 GB/s per slot — more than sufficient for inference workloads where each model loads independently per card with no multi-GPU weight synchronisation.

Server Layout

4U Rack Server — Proxmox LXC allocation

4U Rack Server — AMD Threadripper Pro (Proxmox)
|
+-- LXC: N8N (orchestration)                   [CPU]
+-- LXC: Python reasoning loop                  [CPU]
+-- LXC: Redis + PostgreSQL + Qdrant            [CPU + NVMe]
|
+-- LXC: llama.cpp main reasoning model         [GPU 1 - datacenter, high VRAM]
|         Magistral Small 1.2 / Mistral Large 3
+-- LXC: llama.cpp specialist LLM pool          [GPU 2 - datacenter, high VRAM]
|         Codestral, Devstral, Pixtral, Mistral Small 3.x
+-- LXC: vLLM audio processing                  [GPU 3 - pro-sumer AI]
|         Voxtral Realtime 4B + Voxtral Small 24B + SpeechBrain
+-- LXC: Vision sensor pipelines                [GPU 3 - shared]
|         YOLOv8, MediaPipe, InsightFace (2-5fps, 24/7)
+-- LXC: Safety review model                    [GPU 4 - pro-sumer AI]
|         Mistral Moderation always warm
|
+-- VM:  AI monitoring & audit system
          (isolated, encrypted NAS block, no network path to Jarvis)

Acquisition Phases

The architecture is identical at every phase — only available compute changes

Phase 0 — Base System
Added: AMD Ryzen (consumer), RTX 3070 8GB, consumer DDR4/DDR5, consumer NVMe SSD.
Unlocks: Full architecture validation — N8N, reasoning loop, HomeAssistant specialist, memory layers, audit system. Intelligence limited but entire plumbing is real and testable.
Phase 1 — 16GB GPU
Added: RTX 5060 Ti 16GB or RTX 4060 Ti 16GB added as GPU 2.
Unlocks: First real parallelism. 13B class models at Q5/Q8 or small 32B at Q4. Main model quality jumps meaningfully. Specialist responses no longer block the reasoning loop.
Phase 2 — 32GB Pro-sumer GPU
Added: High-VRAM pro-sumer GPU (32GB, e.g. RTX 6000 Ada) added as GPU 3.
Unlocks: 70B class models at Q4 or 32B at Q8. Three-way GPU split — reasoning / specialists / sensors. No GPU contention between layers. Significant intelligence improvement.
Phase 3 — Threadripper Migration
Added: AMD Threadripper Pro CPU + motherboard. Edge node hardware built and tested locally.
Unlocks: Full PCIe lane budget (128 lanes), clean IOMMU topology, rack-ready form factor. Edge nodes validated and ready to deploy — Phase 4 becomes a software integration task.
Phase 4 — Edge Nodes Deployed
Added: Raspberry Pi microphone nodes (per room) + camera nodes (per room) deployed.
Unlocks: Ambient intelligence — Jarvis gains real-time awareness of the home environment. Presence detection, voice recognition, speaker identification, gesture recognition (basic), sensor-based triggers.
Phase 5 — ECC RAM + DB Migration
Added: ECC registered DDR5 RAM. Redis, PostgreSQL, Qdrant move from NAS to local NVMe SSD.
Unlocks: Faster memory reads/writes — noticeably reduces latency on memory-heavy operations like context assembly and episodic retrieval. NAS freed from database duties.
Phase 6 — First Datacenter GPU
Added: Datacenter GPU (NVIDIA A40 48GB, A100 40/80GB, or L40S 48GB) added.
Unlocks: Magistral or Mistral Large at Q6/Q8 or near-unquantised. Reasoning quality approaches the target ceiling. Specialist pool now has two GPUs — multiple specialists loaded simultaneously.
Phase 7 — Pro-sumer AI GPU 2
Added: AMD Radeon AI Pro R9700/R9500 (or equivalent). RTX 3070 decommissioned.
Unlocks: Sensor pipelines (audio + video, 24/7) have dedicated hardware. Safety review model gets a clean resource slice. Stack simplified to 3 active GPUs before the final datacenter addition.
Phase 8 — Full Target Architecture
Added: Second datacenter GPU (same family as Phase 6).
Unlocks: All four GPU roles have dedicated hardware. Zero contention between layers. All primary models resident in VRAM simultaneously — no weight swapping in steady state. Maximum intelligence, maximum parallelism.

Final GPU Layout

Phase 8 target — zero contention between layers

+-- GPU 1 -- Datacenter (high VRAM)       -> Main reasoning model
|            NVIDIA A100 / A40 / L40S         Magistral Small 1.2 / Mistral Large 3
|
+-- GPU 2 -- Datacenter (high VRAM)       -> Specialist LLM pool
|            Same family as GPU 1             Codestral, Devstral, Pixtral, Mistral Small 3.x
|                                             Multiple models loaded simultaneously
|
+-- GPU 3 -- Pro-sumer AI                 -> Sensor pipelines (24/7)
|            AMD Radeon AI Pro R9700/R9500    Vision (YOLOv8, MediaPipe, InsightFace)
|                                             Audio (Voxtral Realtime + Voxtral Small)
|
+-- GPU 4 -- Pro-sumer                    -> Safety review model + overflow
             RTX / equivalent                 Mistral Moderation always warm
Explore Features →