Jarvis acts proactively, without waiting for user input
Fires at a scheduled time or interval. Morning briefing at 07:00 (weather, emails, server alerts, RSS), delivered as a voice summary or text notification.
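A minimal sketch of how a daily time trigger such as the 07:00 briefing could compute its next firing time. The function name `next_fire` is hypothetical, and naive local datetimes are assumed; a real scheduler would also need to handle time zones and daylight saving changes.

```python
from datetime import datetime, time, timedelta


def next_fire(now: datetime, at: time) -> datetime:
    """Return the next moment a daily trigger scheduled for `at` should fire."""
    candidate = datetime.combine(now.date(), at)
    # If today's slot has already passed (or is exactly now), use tomorrow's.
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


# Hypothetical usage: at 06:30 the briefing is due the same day at 07:00.
briefing = next_fire(datetime(2024, 1, 1, 6, 30), time(7, 0))
```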
Fires when a HomeAssistant sensor crosses a threshold. Door opened, motion detected, temperature out of range, power consumption spike.
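One way to read "crosses a threshold" is edge-triggered: the trigger fires on the reading that first leaves the allowed range, not on every out-of-range reading after it. The sketch below assumes that interpretation. The class `ThresholdTrigger` and its fields are hypothetical and do not use any HomeAssistant API.

```python
from dataclasses import dataclass


@dataclass
class ThresholdTrigger:
    low: float
    high: float
    _was_out: bool = False  # whether the previous reading was out of range

    def update(self, value: float) -> bool:
        """Return True only on the reading that first leaves [low, high]."""
        is_out = value < self.low or value > self.high
        fired = is_out and not self._was_out
        self._was_out = is_out
        return fired


# Hypothetical usage: room temperature allowed between 18 and 26 degrees.
trigger = ThresholdTrigger(low=18.0, high=26.0)
events = [trigger.update(v) for v in (20.0, 27.0, 28.0, 22.0, 17.0)]
```

Firing once per crossing keeps a sustained out-of-range condition from repeating the same routine on every sensor update.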
Fires based on inferred user location. Phone connects to home Wi-Fi → run arrival routine. Phone disconnects → run departure routine. Checks presence before acting to avoid false positives.
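The text does not say how presence is checked before acting. One common approach, assumed here, is to debounce the Wi-Fi signal: a change only counts once it has held for a minimum time, so a brief disconnect does not start the departure routine. `PresenceDebouncer`, `hold_seconds`, and the routine names are hypothetical.

```python
from typing import Optional


class PresenceDebouncer:
    def __init__(self, hold_seconds: float = 120.0, present: bool = True):
        self.hold_seconds = hold_seconds
        self.confirmed = present  # last confirmed presence state
        self._pending_since: Optional[float] = None

    def observe(self, connected: bool, now: float) -> Optional[str]:
        """Feed one Wi-Fi observation; return a routine name once a change is confirmed."""
        if connected == self.confirmed:
            # Signal matches the confirmed state: discard any pending change.
            self._pending_since = None
            return None
        if self._pending_since is None:
            # First observation of a possible change: start the hold timer.
            self._pending_since = now
            return None
        if now - self._pending_since >= self.hold_seconds:
            self.confirmed = connected
            self._pending_since = None
            return "arrival" if connected else "departure"
        return None
```

With `hold_seconds=60`, a ten-second dropout yields no routine, while a disconnect that persists past the hold window returns `"departure"`.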
Raspberry Pi nodes: all AI processing happens server-side
Each room has a Raspberry Pi with a microphone, hardware kill switch, and multi-LED indicator.
Audio pipeline, two parallel streams on the server:
- Voxtral Realtime (4B): continuous live transcription, speaker labels, sub-200 ms latency
- Voxtral Small (24B): speaker identification, intent classification, direct function calling
- SpeechBrain: tone and affect analysis (urgency, distress, confidence)
Fast path: simple voice commands bypass the full reasoning loop entirely. Voxtral Small produces a structured action call directly from audio.
Hardware switch OFF cuts microphone power; this cannot be overridden by software.
Each room has a Raspberry Pi with a camera module, hardware switch, and LED indicator. Video streams via NDI at 2-5 fps, sufficient for all processing tasks.
Video pipeline, all tasks run in parallel:
- YOLOv8/YOLOv9: object recognition (positions, confidence scores)
- MediaPipe/MMPose: full skeleton keypoints per person
- InsightFace/DeepFace: identity match against household members
- Derived layer: position in room from pose + camera geometry
Gesture recognition uses a rule layer on top of skeleton coordinates: a personal gesture vocabulary mapped to structured intent signals.
Face + voice recognition serve as independent trust signals. Both are required for high-confidence identity on high-risk requests.
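The gesture rule layer described above can be illustrated with a small sketch. It assumes keypoints arrive as a mapping from joint name to normalised image coordinates with y growing downward. The joint names, gesture names, and the `GESTURE_VOCABULARY` entries are all hypothetical, not the project's actual vocabulary.

```python
from typing import Dict, Optional, Tuple

# Joint name -> (x, y) in image coordinates; y grows downward (assumption).
Keypoints = Dict[str, Tuple[float, float]]

# Hypothetical personal vocabulary: gesture name -> intent signal.
GESTURE_VOCABULARY: Dict[str, str] = {
    "both_hands_raised": "lights_on",
    "right_hand_raised": "volume_up",
    "left_hand_raised": "volume_down",
}


def classify_gesture(kp: Keypoints) -> Optional[str]:
    """Plain geometric rules on keypoints; no trained gesture model involved."""
    nose_y = kp["nose"][1]
    left_up = kp["left_wrist"][1] < nose_y    # wrist is above the nose
    right_up = kp["right_wrist"][1] < nose_y
    if left_up and right_up:
        return "both_hands_raised"
    if right_up:
        return "right_hand_raised"
    if left_up:
        return "left_hand_raised"
    return None


def gesture_to_intent(kp: Keypoints) -> Optional[str]:
    """Map a recognised gesture to its intent signal, or None if nothing matches."""
    gesture = classify_gesture(kp)
    return GESTURE_VOCABULARY.get(gesture) if gesture else None
```

Because each rule is a plain comparison, adding a gesture means adding one rule and one vocabulary entry, with no retraining.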
Every input type is normalised into a structured intent before reaching the reasoning loop
Continuous audio analysis layer runs at all times: transcribing, identifying speakers, detecting intent signals, and enriching context before the reasoning loop even sees the request.
Phone app (LAN / VPN) or dedicated home terminal. The primary interface for complex multi-step requests where precision matters more than speed.
In-house cameras produce skeleton keypoints. A personal gesture vocabulary maps arm and hand positions to structured intent signals, so no trained gesture model is required.
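The normalisation step this section describes, where every input type becomes a structured intent before the reasoning loop, might look like the sketch below. The `Intent` type and the `from_audio` / `from_video` helpers are hypothetical. The payload field names follow the example pipeline outputs given elsewhere in this document, and the `"gesture"` intent type is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Intent:
    source: str                      # "voice", "gesture", "text", ...
    intent_type: str
    payload: Dict[str, Any] = field(default_factory=dict)
    directed_at_jarvis: bool = True


def from_audio(event: Dict[str, Any]) -> Intent:
    """Normalise an audio-pipeline event into an Intent."""
    return Intent(
        source="voice",
        intent_type=event["intent_type"],
        payload={"transcript": event["transcript"]},
        directed_at_jarvis=event.get("directed_at_jarvis", True),
    )


def from_video(event: Dict[str, Any]) -> Intent:
    """Normalise a video-pipeline event into an Intent."""
    return Intent(
        source="gesture",
        intent_type="gesture",  # assumed label for gesture-derived intents
        payload={
            "identity": event["identity"],
            "gesture": event["gesture"],
            "position": event["position"],
        },
    )
```

With a single shape for all inputs, the reasoning loop can treat voice, text, and gesture uniformly and inspect `source` only when it matters.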
Acting on inferred intent, without explicit commands
The audio and video pipelines together enable a class of behaviour where Jarvis acts on inferred intent without explicit commands. This requires the reasoning model to distinguish between direct requests and contextual signals, and act accordingly.
Example: the audio pipeline outputs { transcript: "isn't it cold today", intent_type: "indirect_preference", directed_at_jarvis: false }. The reasoning model infers a preference signal: not a direct command, but actionable context. Jarvis silently raises the heater temperature (low-paranoia act call, no user interruption) and stores the action in short-term memory for later retrieval.
Example: the video pipeline outputs { identity: "user", gesture: "pointing_at_tv", position: "couch" }. Jarvis cross-references with audio: if the user says "turn that off", it validates and acts. If a defined gesture maps directly to an action (thumb-down near speaker = mute), it acts immediately.
Jarvis never acts silently on high-paranoia domains, regardless of confidence. Low confidence on any signal means do nothing or clarify. Every silent action is recorded in short-term memory so the user can verify it at any time.
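These rules can be sketched as a small decision function. Several details are assumptions not stated in the text: the 0.8 confidence cut-off, using `directed_at_jarvis` to choose between clarifying and doing nothing, and returning "ask" for every high-paranoia case. The function and decision names are hypothetical.

```python
from typing import List


def decide(action: str, paranoia: str, confidence: float,
           directed_at_jarvis: bool, memory: List[str],
           min_confidence: float = 0.8) -> str:
    """Return one of: "do_nothing", "clarify", "ask", "act", "act_silently"."""
    # Low confidence on the signal: never act.
    if confidence < min_confidence:
        return "clarify" if directed_at_jarvis else "do_nothing"
    # High-paranoia domains are never handled silently, whatever the confidence.
    if paranoia == "high":
        return "ask"
    if directed_at_jarvis:
        return "act"
    # Inferred intent on a low-paranoia domain: act silently, but record it
    # so the user can review the action later.
    memory.append(action)
    return "act_silently"


# Hypothetical usage mirroring the heater example.
short_term_memory: List[str] = []
result = decide("raise_heater", "low", 0.9, False, short_term_memory)
```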
Physical kill switches cut sensor power on every microphone and camera node; this cannot be overridden by software. A lens cap holder provides a secondary physical fallback for cameras. Both states are reflected in the node's status API in real time.
All processing happens locally on the AI server. No audio, video, transcripts, or inferences leave the homelab. The user's personal data, context, and actions are never sent to any external service.
A dedicated monitoring VM receives a passthrough of an encrypted NAS storage block and logs every reasoning loop iteration, every ask/act call, and every safety review decision. No component of the main Jarvis stack has network access to this VM, including Jarvis itself.
Tone, affect, and presence data is treated as sensitive and stored with restricted access. Voice and face recognition are security primitives: an unknown or low-confidence speaker defaults to anonymous access, restricted to low-paranoia requests only.
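A minimal sketch of this access rule, combined with the requirement that face and voice must both match on high-risk requests. It assumes "high-risk" and "high-paranoia" mean the same thing, models only two paranoia levels, and uses a 0.8 confidence threshold. The `Match` type and `authorize` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    name: Optional[str]   # recognised household member, or None if unknown
    confidence: float


def authorize(face: Match, voice: Match, paranoia: str,
              threshold: float = 0.8) -> bool:
    """Decide whether a request at the given paranoia level may proceed."""
    # Anonymous access is enough for low-paranoia requests.
    if paranoia == "low":
        return True
    # Otherwise both independent signals must be confident and must agree.
    face_ok = face.name is not None and face.confidence >= threshold
    voice_ok = voice.name is not None and voice.confidence >= threshold
    return face_ok and voice_ok and face.name == voice.name
```

Requiring both signals to agree means a spoofed voice or a look-alike face alone cannot unlock a high-paranoia action.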