Coordinator + Specialists
Jarvis uses a coordinator + specialists pattern. Rather than one monolithic model trying to do everything, Jarvis is the single intelligent orchestrator — it understands intent and delegates work to purpose-built specialist agents.
Specialists are smaller, faster models with highly restrictive pre-prompts and tightly scoped capabilities. The restrictive design is intentional: it makes them fast, predictable, and safe. A HomeAssistant specialist knows only how to query and control the smart home — nothing else.
A dedicated specialist-creator agent is planned for the long term to allow Jarvis to propose and draft new capabilities autonomously. Human validation will always be required before any new specialist goes live.
Each llama.cpp instance is containerised independently. A specialist model can be processing an act request while the main reasoning model is already preparing its next iteration — no blocking, no waiting.
                 User Input / Automatic Trigger
                               |
                       N8N Main Workflow
                (input normalisation + routing)
                               |
                    External Reasoning Loop
        (think -> ask/act/delegate -> observe -> think)
               /               |                \
            ASK            DELEGATE          ACT request
        Specialists        Sub-Agent              |
        (read-only,     (Mistral Large)    Safety Review Model
         returns         unload GPU #2      (paranoia-scored)
         context)        load Mistral             |
             |           Large              approve/block/
             |           run full           escalate
             |           internal loop            |
             |           return result      ACT Specialist
             |                 |            (execute,
              \                |             returns result)
                     Results back to loop
              [ iterates until decision reached ]
                               |
               Response to User / Action Executed
Five strictly typed outputs — no free-form reasoning
Request information — read-only, no side effects. The reasoning model can call ask freely and safely. Multiple asks can be batched in a single output and fire in parallel.
Request execution — always passes through the safety review model before reaching the specialist. Irreversible actions require explicit approval from the secondary model.
Delegate a complex multi-step task to the Mistral Large sub-agent. This is the highest-cost output: it displaces the full specialist pool for 30–90 seconds, so it is used only when standard iteration is clearly insufficient.
Conversation complete — return the answer to the user. The loop terminates cleanly.
Insufficient context — ask the user before proceeding rather than guessing. Prevents confident wrong actions.
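The five output types above can be treated as a closed set that the loop validates before acting on anything. The following is a minimal sketch, assuming the reasoning model emits JSON. The names `done` and `clarify` for the last two types, the `type`/`payload` fields, and the helper functions are hypothetical and not taken from Jarvis itself.

```python
import json
from dataclasses import dataclass
from enum import Enum


class OutputType(str, Enum):
    ASK = "ask"            # read-only information request
    ACT = "act"            # execution request, must pass safety review
    DELEGATE = "delegate"  # hand off to the Mistral Large sub-agent
    DONE = "done"          # hypothetical name: conversation complete
    CLARIFY = "clarify"    # hypothetical name: ask the user for context


@dataclass(frozen=True)
class LoopOutput:
    type: OutputType
    payload: dict


def parse_output(raw: str) -> LoopOutput:
    """Reject anything that is not one of the five typed outputs."""
    data = json.loads(raw)
    if not isinstance(data, dict) or "type" not in data:
        raise ValueError("output must be a JSON object with a 'type' field")
    try:
        kind = OutputType(data["type"])
    except ValueError:
        raise ValueError(f"unknown output type: {data['type']!r}") from None
    payload = data.get("payload", {})
    if not isinstance(payload, dict):
        raise ValueError("'payload' must be an object")
    return LoopOutput(kind, payload)


def needs_safety_review(output: LoopOutput) -> bool:
    # Only act requests are routed through the safety review model.
    return output.type is OutputType.ACT
```

Parsing into an enum means free-form text fails loudly at the boundary instead of being interpreted downstream.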
A hard iteration ceiling prevents infinite loops. The current iteration and ceiling are passed in context — the model becomes more decisive as the ceiling approaches.
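A sketch of how such a ceiling might look in the loop driver is shown below. The `step` callable, the `terminal` flag, the context keys, and the default of 8 iterations are all assumptions made for illustration.

```python
from typing import Callable


def run_loop(step: Callable[[dict], dict], max_iterations: int = 8) -> dict:
    """Run `step` until it returns a terminal output or the ceiling is hit."""
    observations: list[dict] = []
    for iteration in range(1, max_iterations + 1):
        # Both counters are passed in so the model can see how close it is
        # to the ceiling.
        context = {
            "iteration": iteration,
            "max_iterations": max_iterations,
            "observations": observations,
        }
        output = step(context)
        if output.get("terminal"):
            return output
        observations.append(output)
    # Ceiling reached: stop instead of looping forever.
    return {
        "terminal": True,
        "reason": "iteration_ceiling",
        "iterations": max_iterations,
    }
```

Because the ceiling is enforced by the driver rather than by the model, a model that never decides still cannot loop indefinitely.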
Main reasoning model: llama.cpp on a dedicated high-VRAM GPU.
Specialist models: llama.cpp instances on shared or dedicated GPUs.
Audio processing: vLLM on a dedicated GPU (the only exception to the unified llama.cpp stack).
Safety review model: llama.cpp on a low-VRAM instance, always warm.
Model family: Mistral AI throughout. Magistral Small 1.2 (24B, primary), Mistral Large 3 (675B MoE, high-VRAM phases), Voxtral (audio), Codestral (code), Devstral (agentic).
Orchestration: N8N workflow engine for trigger handling and routing.
Reasoning loop: Python + llama-cpp-python, runs externally from N8N.
Short-term memory: Redis (3-slot circular conversation buffer).
Notepad memory: Redis + Qdrant async (persistent working memory).
Conversation history: Qdrant vector DB with Mistral Embed.
Episodic memory: PostgreSQL (user preferences, long-term history).
Semantic memory: Qdrant (NAS files, archives, docs; search by meaning).
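The behaviour of a 3-slot circular conversation buffer can be illustrated in memory with a bounded deque. This is only a stand-in for the Redis-backed buffer, and the class and method names are hypothetical. In Redis a comparable effect is commonly obtained by pushing to a list and trimming it to a fixed length.

```python
from collections import deque


class ConversationBuffer:
    """Keeps only the most recent `slots` conversation turns."""

    def __init__(self, slots: int = 3) -> None:
        self._turns = deque(maxlen=slots)

    def add(self, turn: str) -> None:
        # When the deque is full, appending drops the oldest turn.
        self._turns.append(turn)

    def snapshot(self) -> list[str]:
        # Oldest turn first, newest last.
        return list(self._turns)
```

For example, adding four turns to a fresh buffer leaves only the last three in `snapshot()`.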
Software scaling is free. Hardware scaling costs money.
The system prompt and injected manifests are identical across requests. llama.cpp caches the KV state for this static prefix — prefill latency drops from seconds to milliseconds on repeated calls.
The reasoning model can request multiple ask calls in a single output. All fire in parallel. Complex multi-source requests compress from 4–5 iterations to 2–3.
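One way to fire a batch of asks concurrently is `asyncio.gather`, sketched below. `ask_specialist` is a hypothetical stand-in that simulates latency with a sleep; a real call would reach the specialist's llama.cpp instance.

```python
import asyncio


async def ask_specialist(name: str, query: str) -> dict:
    """Hypothetical stand-in for a read-only specialist call."""
    await asyncio.sleep(0.05)  # simulated network + inference latency
    return {"specialist": name, "query": query, "result": f"{name} answered"}


async def run_asks(asks: list[tuple[str, str]]) -> list[dict]:
    # gather() runs all calls concurrently and preserves input order.
    return await asyncio.gather(*(ask_specialist(n, q) for n, q in asks))


def run_batch(asks: list[tuple[str, str]]) -> list[dict]:
    """Synchronous entry point for a batch of asks from one model output."""
    return asyncio.run(run_asks(asks))
```

With this shape, the wall-clock cost of a batch is roughly that of its slowest ask rather than the sum of all of them, which is where the saved iterations come from.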
TTS audio streams as the reasoning model generates text: the user hears the first sentence while the model is still generating the rest. This is the most impactful improvement to perceived latency.
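A simple way to feed a TTS engine from a token stream is to cut the stream at sentence boundaries and hand over each sentence as soon as it is complete. The sketch below assumes tokens arrive as plain strings; the function name and the punctuation rule are illustrative only.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment without punctuation
```

This naive rule also splits on abbreviations and decimal points, so a production version would likely need a more careful boundary check.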
A cheap rule-based classifier pre-selects plausible specialists before the loop starts. Only relevant manifests are injected — a lighting request gets the HomeAssistant manifest, not the full registry.
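A keyword-based pre-selector could look like the sketch below. The rule table, the specialist names other than HomeAssistant, and the fallback to the full registry when nothing matches are assumptions, not the actual Jarvis rules.

```python
import re

# Hypothetical keyword rules; a real registry would define its own.
RULES = {
    "homeassistant": {"light", "lights", "lamp", "thermostat", "heating"},
    "calendar": {"meeting", "calendar", "appointment", "schedule"},
    "files": {"file", "files", "document", "archive", "nas"},
}


def preselect(request: str) -> list[str]:
    """Return specialists whose keywords appear in the request."""
    words = set(re.findall(r"[a-z0-9]+", request.lower()))
    return [name for name, keywords in RULES.items() if words & keywords]


def manifests_for(request: str, registry: dict[str, str]) -> list[str]:
    """Pick the manifests to inject for this request."""
    selected = preselect(request)
    # Assumed fallback: inject everything when no rule matches.
    names = selected or list(registry)
    return [registry[n] for n in names if n in registry]
```

A set intersection per specialist keeps the check cheap enough to run before every loop start.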
Quantising the KV cache from 16-bit to 8-bit roughly halves the cache's VRAM footprint with minimal quality impact. For a 32K context this frees several GB of VRAM for larger batch sizes or longer effective context.
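The size of the saving can be estimated with simple arithmetic. The model shape below (40 layers, 8 KV heads, head dimension 128) is a hypothetical example chosen for illustration, not a confirmed configuration of any model named here.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_element: int) -> int:
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_element


GIB = 1024 ** 3

# Hypothetical model shape at a 32K context.
shape = dict(n_layers=40, n_kv_heads=8, head_dim=128, context_len=32 * 1024)

fp16 = kv_cache_bytes(**shape, bytes_per_element=2)  # 16-bit cache
int8 = kv_cache_bytes(**shape, bytes_per_element=1)  # idealised 8-bit cache

print(f"16-bit: {fp16 / GIB:.2f} GiB, 8-bit: {int8 / GIB:.2f} GiB, "
      f"freed: {(fp16 - int8) / GIB:.2f} GiB")
```

Under these assumptions the cache shrinks from 5 GiB to 2.5 GiB. Block-wise 8-bit formats also store scale factors, so the real saving is slightly under half.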
Fast path first — Redis at under 1ms, then PostgreSQL at ~10ms, then Qdrant ANN at 50–200ms. Qdrant is only queried when the task genuinely involves files, archives, or cross-session continuity.
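The fast-path ordering can be sketched as a chain of lookups that stops at the first hit. The three callables are hypothetical stand-ins for the Redis, PostgreSQL, and Qdrant clients, and the `allow_slow` flag is an assumed way of expressing "only when the task genuinely involves files, archives, or cross-session continuity".

```python
from typing import Callable, Optional

Lookup = Callable[[str], Optional[str]]


def tiered_lookup(key: str, fast: Lookup, medium: Lookup, slow: Lookup,
                  allow_slow: bool) -> tuple[Optional[str], str]:
    """Try tiers in order of latency; return (value, tier_name)."""
    tiers = [("redis", fast), ("postgres", medium)]
    if allow_slow:
        # The vector search tier is only consulted when the task needs it.
        tiers.append(("qdrant", slow))
    for name, lookup in tiers:
        value = lookup(key)
        if value is not None:
            return value, name
    return None, "miss"
```

Returning the tier name alongside the value makes it easy to log how often each tier actually serves a request.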