Jarvis — Architecture

Jarvis uses a coordinator + specialists pattern. Rather than one monolithic model trying to do everything, Jarvis is the single intelligent orchestrator — it understands intent and delegates work to purpose-built specialist agents.

Specialists are smaller, faster models with highly restrictive pre-prompts and tightly scoped capabilities. The restrictive design is intentional: it makes them fast, predictable, and safe. A HomeAssistant specialist knows only how to query and control the smart home — nothing else.

A dedicated specialist-creator agent is planned for the long term to allow Jarvis to propose and draft new capabilities autonomously. Human validation will always be required before any new specialist goes live.

Each llama.cpp instance is containerised independently. A specialist model can be processing an act request while the main reasoning model is already preparing its next iteration — no blocking, no waiting.

               User Input / Automatic Trigger
                              |
                      N8N Main Workflow
               (input normalisation + routing)
                              |
                   External Reasoning Loop
       (think -> ask/act/delegate -> observe -> think)
              /               |               \
         ASK              DELEGATE              ACT request
     Specialists          Sub-Agent                  |
     (read-only,       (Mistral Large)      Safety Review Model
  returns context)      unload GPU #2        (paranoia-scored)
          |          load Mistral Large              |
          |        run full internal loop approve/block/escalate
          |             return result                |
          |                   |               ACT Specialist
          |                   |          (execute, returns result)
              \               |               /
                    Results back to loop
             [ iterates until decision reached ]
                              |
             Response to User / Action Executed

💬

ask(specialist, query)

Request information — read-only, no side effects. The reasoning model can call ask freely and safely. Multiple asks can be batched in a single output and fire in parallel.

⚡

act(specialist, instruction)

Request execution — always passes through the safety review model before reaching the specialist. Irreversible actions require explicit approval from the secondary model.
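The gate described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `Verdict` enum, the `review` function, and both thresholds are hypothetical, and the paranoia score is passed in directly where the real system would obtain it from the safety review model.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Verdict(Enum):
    APPROVE = "approve"
    BLOCK = "block"
    ESCALATE = "escalate"


def review(paranoia_score: float,
           block_at: float = 0.8,
           escalate_at: float = 0.4) -> Verdict:
    """Map a score in [0, 1] to a verdict. Thresholds are assumptions."""
    if paranoia_score >= block_at:
        return Verdict.BLOCK
    if paranoia_score >= escalate_at:
        return Verdict.ESCALATE
    return Verdict.APPROVE


def act(specialist: Callable[[str], str],
        instruction: str,
        paranoia_score: float) -> Tuple[Verdict, Optional[str]]:
    """Run the specialist only if the review approves the instruction."""
    verdict = review(paranoia_score)
    if verdict is Verdict.APPROVE:
        return verdict, specialist(instruction)
    return verdict, None  # blocked or escalated: the specialist is never called


calls: List[str] = []


def home_assistant(instruction: str) -> str:
    # Stand-in for an ACT specialist; records what it was asked to execute.
    calls.append(instruction)
    return f"done: {instruction}"


approved = act(home_assistant, "turn on hallway light", paranoia_score=0.1)
blocked = act(home_assistant, "unlock front door", paranoia_score=0.9)
```

The point of the shape is that the specialist is unreachable except through `act`, so a non-approving verdict cannot be bypassed by the reasoning model.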

🤖

delegate(sub_agent, task)

Delegate a complex multi-step task to the Mistral Large sub-agent. This is the highest-cost output: it displaces the full specialist pool for 30–90 seconds, so it is used only when standard iteration is clearly insufficient.

✅

respond(text)

Conversation complete — return the answer to the user. The loop terminates cleanly.

❓

clarify(question)

Insufficient context — ask the user before proceeding rather than guessing. Prevents confident wrong actions.

⏱️

Iteration Budget

A hard iteration ceiling prevents infinite loops. The current iteration and ceiling are passed in context — the model becomes more decisive as the ceiling approaches.
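A rough sketch of how the five output types and the iteration budget might fit together in one loop. The `Output` dataclass, the `run_loop` signature, the default ceiling of 6, and the fallback message returned when the ceiling is hit are all assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Output:
    kind: str      # "ask", "act", "delegate", "respond" or "clarify"
    payload: str


def run_loop(model: Callable[[List[str], int, int], Output],
             handle: Callable[[Output], str],
             user_input: str,
             ceiling: int = 6) -> str:
    """Think -> ask/act/delegate -> observe, until a terminal output or the ceiling."""
    context = [f"user: {user_input}"]
    for iteration in range(1, ceiling + 1):
        # The current iteration and the ceiling are passed to the model on every call.
        out = model(context, iteration, ceiling)
        if out.kind in ("respond", "clarify"):
            return out.payload  # terminal outputs end the loop
        # Non-terminal outputs are executed and their result is observed.
        context.append(f"{out.kind} -> {handle(out)}")
    return "Iteration ceiling reached without a decision."


def toy_model(context: List[str], iteration: int, ceiling: int) -> Output:
    # Asks once, then answers as soon as an observation is in context.
    if len(context) == 1:
        return Output("ask", "homeassistant: living room temperature?")
    return Output("respond", f"Done after {iteration} iterations.")


result = run_loop(toy_model, lambda out: "21.5 C", "How warm is it?")
```

A model that never emits `respond` or `clarify` simply exhausts the budget, which is the behaviour the hard ceiling is there to guarantee.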

Inference Layer

Main reasoning model: llama.cpp on a dedicated high-VRAM GPU.
Specialist models: llama.cpp instances on shared or dedicated GPUs.
Audio processing: vLLM on a dedicated GPU (the only exception to the unified llama.cpp stack).
Safety review model: llama.cpp on a low-VRAM instance, always warm.
Model family: Mistral AI throughout. Magistral Small 1.2 (24B, primary), Mistral Large 3 (675B MoE, high-VRAM phases), Voxtral (audio), Codestral (code), Devstral (agentic).

Memory & Orchestration

Orchestration: N8N workflow engine for trigger handling and routing.
Reasoning loop: Python + llama-cpp-python, running outside N8N.
Short-term memory: Redis (3-slot circular conversation buffer).
Notepad memory: Redis + Qdrant async (persistent working memory).
Conversation history: Qdrant vector DB with Mistral Embed.
Episodic memory: PostgreSQL (user preferences, long-term history).
Semantic memory: Qdrant (NAS files, archives, docs; search by meaning).
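To make the 3-slot circular buffer concrete, here is an in-memory sketch using `collections.deque`. It only illustrates the eviction behaviour; the `ConversationBuffer` class is hypothetical, and it is an assumption that one slot holds one conversation turn. In Redis a similar effect is commonly achieved with LPUSH followed by LTRIM.

```python
from collections import deque
from typing import Deque, List


class ConversationBuffer:
    """Fixed-size buffer: pushing into a full buffer evicts the oldest turn."""

    def __init__(self, slots: int = 3) -> None:
        self._turns: Deque[str] = deque(maxlen=slots)

    def push(self, turn: str) -> None:
        self._turns.append(turn)

    def snapshot(self) -> List[str]:
        # Oldest first, newest last.
        return list(self._turns)


buf = ConversationBuffer()
for turn in ["turn 1", "turn 2", "turn 3", "turn 4"]:
    buf.push(turn)

recent = buf.snapshot()  # "turn 1" has been evicted
```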

⚡

Prefix Caching

When the system prompt and injected manifests are identical across requests, llama.cpp can reuse the cached KV state for this static prefix, so prefill latency on repeated calls drops from seconds to milliseconds.
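Prefix caching only pays off if the static material comes first in the prompt. The sketch below shows that ordering and measures how much of two prompts is shared. The prompt strings and the `build_prompt` helper are placeholders, and the comparison is done on characters for simplicity, whereas an inference server would compare tokens.

```python
import os

SYSTEM_PROMPT = "You are Jarvis, a home orchestrator.\n"  # placeholder text
MANIFESTS = "[homeassistant] ask, act\n"                  # placeholder text
STATIC_PREFIX = SYSTEM_PROMPT + MANIFESTS


def build_prompt(dynamic_context: str) -> str:
    # Static material first, per-request material last, so consecutive
    # prompts share the longest possible prefix.
    return STATIC_PREFIX + dynamic_context


a = build_prompt("user: turn on the lights")
b = build_prompt("user: what is the temperature?")

# os.path.commonprefix works character by character on any strings.
shared = os.path.commonprefix([a, b])
reusable_fraction = len(shared) / len(a)
```

If the per-request text were placed before the manifests instead, the shared prefix would end at the first differing character and almost nothing could be reused.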

🔀

Multi-Specialist Batching

The reasoning model can request multiple ask calls in a single output. All fire in parallel. Complex multi-source requests compress from 4–5 iterations to 2–3.
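One way to fire a batch of asks in parallel is `asyncio.gather`. This is a sketch under stated assumptions: the `ask` coroutine uses a short sleep as a stand-in for a real specialist call, and the `calendar` and `weather` specialists are hypothetical names used only to fill the batch.

```python
import asyncio
from typing import List, Tuple


async def ask(specialist: str, query: str) -> str:
    await asyncio.sleep(0.05)  # stand-in for the specialist's inference call
    return f"{specialist}: answer to {query!r}"


async def run_batch(asks: List[Tuple[str, str]]) -> List[str]:
    # gather runs all asks concurrently and returns results in input order.
    return await asyncio.gather(*(ask(s, q) for s, q in asks))


batch = [
    ("homeassistant", "is anyone home?"),
    ("calendar", "next meeting?"),
    ("weather", "rain today?"),
]
results = asyncio.run(run_batch(batch))
```

Because the asks overlap, the batch takes roughly as long as the slowest single ask rather than the sum of all three, which is where the saved iterations come from.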

🎙️

Streaming TTS

TTS audio streams as the reasoning model generates text — the user hears the first sentence while the model is still generating the rest. The most impactful perceived latency improvement.
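A simple way to get this effect is to cut the token stream into sentences and hand each one to the TTS engine as soon as it is complete. The `sentences` generator below is a hypothetical sketch with a deliberately naive boundary rule; it would split incorrectly on abbreviations or on a token such as "21." inside a decimal number.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as its closing punctuation arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when the stream ends


stream = ["The", " lights", " are", " on.", " The", " heating",
          " is", " set", " to", " 21", " degrees"]
chunks = list(sentences(stream))
```

Since `sentences` is a generator, a consumer can start synthesising the first chunk while later tokens are still being produced.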

📋

Lazy Manifest Injection

A cheap rule-based classifier pre-selects plausible specialists before the loop starts. Only relevant manifests are injected — a lighting request gets the HomeAssistant manifest, not the full registry.
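A keyword lookup is about the cheapest classifier that fits this description. The sketch below is illustrative only: the registry contents, the keyword sets, the `calendar` specialist, and the `select_manifests` helper are all assumptions, not the project's actual rules.

```python
import re
from typing import Dict, List, Set

# Hypothetical registry: specialist name -> manifest text.
REGISTRY: Dict[str, str] = {
    "homeassistant": "HomeAssistant: query and control the smart home.",
    "calendar": "Calendar: read upcoming events.",
}

# Hypothetical trigger words per specialist.
KEYWORDS: Dict[str, Set[str]] = {
    "homeassistant": {"light", "lights", "lamp", "thermostat", "heating"},
    "calendar": {"meeting", "appointment", "schedule"},
}


def select_manifests(request: str) -> List[str]:
    """Return only the manifests whose keywords appear in the request."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    return [REGISTRY[name] for name, kws in KEYWORDS.items() if words & kws]


manifests = select_manifests("Turn off the lights in the kitchen")
```

A request that matches nothing returns an empty list here; how the real system handles that case (for example by falling back to the full registry) is not stated in the source.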

💾

KV Cache Quantisation

Quantising the KV cache from 16-bit to 8-bit roughly halves the VRAM the cache itself consumes, with minimal quality impact. At a 32K context this can free several GB of VRAM for larger batch sizes or a longer effective context.
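The size of the saving can be estimated with simple arithmetic: the cache holds one key and one value vector per layer, per KV head, per token. The model dimensions below (40 layers, 8 KV heads, head dimension 128) are illustrative assumptions for a model in the 24B class, not figures taken from this document.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_element: float) -> float:
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element


GIB = 1024 ** 3

# Illustrative dimensions (assumed), with a 32K-token context.
dims = dict(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=32 * 1024)

fp16_gib = kv_cache_bytes(**dims, bytes_per_element=2) / GIB  # 16-bit cache
int8_gib = kv_cache_bytes(**dims, bytes_per_element=1) / GIB  # 8-bit cache
saved_gib = fp16_gib - int8_gib
```

Under these assumptions the cache shrinks from 5 GiB to 2.5 GiB. Block-quantised 8-bit formats also store per-block scale factors, so the real saving is slightly under half.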

🔍

Tiered Memory Reads

Fast path first — Redis at under 1ms, then PostgreSQL at ~10ms, then Qdrant ANN at 50–200ms. Qdrant is only queried when the task genuinely involves files, archives, or cross-session continuity.
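The read order can be sketched as a fall-through lookup that stops at the first tier with an answer. Plain dictionaries stand in for Redis, PostgreSQL, and Qdrant here, and the `tiered_read` helper and its `needs_semantic` flag are hypothetical; the flag models the rule that Qdrant is only consulted when the task calls for it.

```python
from typing import Callable, List, Optional, Tuple

Lookup = Callable[[str], Optional[str]]

# In-memory stand-ins for the three stores (assumed data).
redis_like = {"last_room": "kitchen"}
postgres_like = {"preferred_temp": "21"}
qdrant_like = {"boiler manual": "see page 12"}

FAST_TIERS: List[Tuple[str, Lookup]] = [
    ("redis", redis_like.get),
    ("postgres", postgres_like.get),
]
SLOW_TIER: Tuple[str, Lookup] = ("qdrant", qdrant_like.get)


def tiered_read(key: str, needs_semantic: bool = False
                ) -> Tuple[Optional[str], List[str]]:
    """Try tiers fastest first; return the value and the tiers visited."""
    tiers = FAST_TIERS + ([SLOW_TIER] if needs_semantic else [])
    visited: List[str] = []
    for name, lookup in tiers:
        visited.append(name)
        value = lookup(key)
        if value is not None:
            return value, visited  # stop at the first hit
    return None, visited


value, visited = tiered_read("preferred_temp")
doc, doc_path = tiered_read("boiler manual", needs_semantic=True)
```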
