ARCHITECTURE · OPEN

Where every
byte lives.

Last updated · 21 May 2026 · v1.0 · backend repo public soon

No black box. Here’s the full path of a whisper, end-to-end — from your ear back to your ear. Median round trip: ≤300ms.

Mic (AVAudio) → STT (Whisper) → Edge (Haiku 4.5) → TTS (ElevenLabs) → Ear (AirPods)

Device → Edge → Ear · ≤300ms median round trip

The stack

Speech-to-text: Whisper-base · CoreML · int4 · on-device
Trigger detector: Phi-3-mini Q4 (fallback Llama-3.2-1B Q4) · on-device
Cue model: Anthropic Claude Haiku 4.5 · primary
Cue failover: OpenAI GPT-4o-mini · automatic on 5xx or >500ms
TTS: ElevenLabs Flash v2 · fallback AVSpeechSynthesizer
Backend: Cloudflare Workers (Hono) · Durable Objects per session · Neon Postgres for users/billing only
Auth: Clerk · passkey-first · JWT 15min · refresh silent
Billing: StoreKit 2 (iOS) · Stripe (web)
Enrichment: Brave Search · Apollo.io · request-scrubbed
Audit retention: 90 days · metadata only · no quoted speech

1. On the device

The iOS app captures 16kHz PCM via AVAudioEngine, streams it through Whisper-base, and pipes transcript chunks into a local trigger detector. The detector classifies the moment (silence, name, question, memory hit, or tone shift) and emits a context blob of extracted facts. Both models run on-device. Transcripts and audio buffers are held in RAM only and discarded after each chunk is processed.

2. The cue request

When a trigger fires, the app sends a small JSON payload to our edge, typically about 400 bytes. The payload contains the trigger type, the extracted facts, the calendar event title (if opted in), and a session ID. It does not contain the other party's verbatim speech.
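As a rough illustration of that payload, here is a minimal TypeScript sketch. The field names, the `buildCueRequest` helper, and the `payloadBytes` helper are all assumptions made for this example; the page lists what the payload contains but not its schema.

```typescript
// Hypothetical schema: the field names are illustrative, not the real wire format.
type TriggerType = "silence" | "name" | "question" | "memory_hit" | "tone_shift";

interface CueRequest {
  trigger: TriggerType;
  facts: string[];             // extracted facts, never verbatim speech
  calendarEventTitle?: string; // present only when the user has opted in
  sessionId: string;
}

// Assemble a request, attaching the calendar title only under opt-in.
function buildCueRequest(
  trigger: TriggerType,
  facts: string[],
  sessionId: string,
  calendar: { optedIn: boolean; eventTitle?: string },
): CueRequest {
  const req: CueRequest = { trigger, facts, sessionId };
  if (calendar.optedIn && calendar.eventTitle) {
    req.calendarEventTitle = calendar.eventTitle;
  }
  return req;
}

// Size of the serialized payload in UTF-8 bytes.
function payloadBytes(req: CueRequest): number {
  return new TextEncoder().encode(JSON.stringify(req)).length;
}

const example = buildCueRequest(
  "name",
  ["met at the 2024 offsite", "works on payments"],
  "sess_123",
  { optedIn: false },
);
console.log(payloadBytes(example), "bytes");
```

Because the type has no transcript field, a request built this way cannot carry verbatim speech by construction. That is one possible way to back the claim with code, not necessarily how the app does it.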

3. The edge

A Cloudflare Worker receives the request, hydrates a Durable Object for the session (which makes 24h of conversation memory available), and routes the request to the primary cue model. The whisper composer enforces the ≤7-word cap with a hard clamp before sending the text to TTS. We stream the TTS audio back to the device while the next trigger is already being decoded.
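The hard clamp could look something like the sketch below. The function name `clampWords` and the choice to count whitespace-separated tokens as words are assumptions; the real composer may tokenize differently.

```typescript
// Hypothetical helper: keep at most `maxWords` whitespace-separated words.
function clampWords(text: string, maxWords = 7): string {
  const words = text
    .trim()
    .split(/\s+/)
    .filter((w) => w.length > 0); // an empty input yields no words
  return words.slice(0, maxWords).join(" ");
}

// A nine-word model reply is cut to its first seven words.
console.log(clampWords("Ask about her trip to Lisbon last spring please"));
```

Clamping after the model responds, rather than trusting the prompt alone, means the cap holds even when the model ignores its instructions.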

4. The privacy boundary

This boundary is the product. It’s defended by code, not by promises.

Other-party audio: never leaves the device
Other-party transcripts: never leave the device
Extracted facts: leave the device, scoped to the cue request
Whisper text: returned by the backend, used to bill, never stored beyond audit
Audit metadata: 90 days, then deleted
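The 90-day retention rule can be sketched as a simple expiry check. The `AuditRecord` fields below are hypothetical and only illustrate the "metadata only" idea; the actual audit schema is not described on this page.

```typescript
// Hypothetical audit record: metadata only, no speech and no cue text.
interface AuditRecord {
  sessionId: string;
  triggerType: string;
  latencyMs: number;
  createdAt: Date;
}

const RETENTION_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// A record becomes eligible for deletion once it is 90 days old.
function isExpired(record: AuditRecord, now: Date): boolean {
  const ageMs = now.getTime() - record.createdAt.getTime();
  return ageMs >= RETENTION_DAYS * MS_PER_DAY;
}

// Keep only the records still inside the retention window.
function retained(records: AuditRecord[], now: Date): AuditRecord[] {
  return records.filter((r) => !isExpired(r, now));
}
```

A scheduled job that replaces the stored set with `retained(records, new Date())` would be one way to enforce the window.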

5. Geofencing

The GeofenceManager checks your coarse location against the canonical two-party-consent state list (CA, FL, IL, MD, MA, MT, NV, NH, PA, WA) and forces a mode selection at session start. Mid-session state changes are handled with a soft re-prompt. The list is hardcoded — it doesn’t change unless we cite a statute.
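A minimal sketch of that check follows. It is written in TypeScript to match the Workers backend, while the on-device GeofenceManager is presumably Swift. The function names, the `Prompt` type, and the rule that any mid-session state change triggers a re-prompt are assumptions; only the state list comes from the text above.

```typescript
// State list copied from the section above; everything else is illustrative.
const TWO_PARTY_CONSENT_STATES: ReadonlySet<string> = new Set([
  "CA", "FL", "IL", "MD", "MA", "MT", "NV", "NH", "PA", "WA",
]);

function requiresAllPartyConsent(stateCode: string): boolean {
  return TWO_PARTY_CONSENT_STATES.has(stateCode.trim().toUpperCase());
}

type Prompt = "none" | "mode_selection" | "soft_reprompt";

// previousState is null at session start, before any location is known.
function promptFor(previousState: string | null, currentState: string): Prompt {
  if (previousState === null) {
    return "mode_selection"; // forced at session start
  }
  if (previousState.toUpperCase() !== currentState.toUpperCase()) {
    return "soft_reprompt"; // the user crossed a state line mid-session
  }
  return "none";
}
```

Keeping the list in a constant rather than fetching it remotely matches the "hardcoded" claim: changing it requires a code change that shows up in review.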

6. Failure modes

Network out · STT + trigger detector still run · cue request fails silently · no charge
Cue model out · automatic failover to secondary · target <500ms additional latency
TTS out · AVSpeechSynthesizer fallback · same cue text, different voice
Auth expired · silent refresh · zero user-visible interruption
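The cue-model failover can be sketched as a timeout wrapped around the primary call. This is an illustration under stated assumptions, not the backend's code: `CueModel`, `withTimeout`, and `cueWithFailover` are hypothetical names, and any thrown error stands in for a 5xx response.

```typescript
// Hypothetical model interface: takes a prompt, resolves to cue text.
type CueModel = (prompt: string) => Promise<string>;

class CueTimeoutError extends Error {}

// Reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new CueTimeoutError(`no reply in ${ms}ms`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Try the primary model; on error or timeout, fall back to the secondary.
async function cueWithFailover(
  prompt: string,
  primary: CueModel,
  secondary: CueModel,
  timeoutMs = 500,
): Promise<{ text: string; servedBy: "primary" | "secondary" }> {
  try {
    const text = await withTimeout(primary(prompt), timeoutMs);
    return { text, servedBy: "primary" };
  } catch {
    const text = await secondary(prompt);
    return { text, servedBy: "secondary" };
  }
}
```

Note that this sketch does not cancel the slow primary request; a production version would likely pass an abort signal so the abandoned call stops consuming resources.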

7. What we publish

The backend code goes open-source the day public TestFlight launches. Audit it. Verify the privacy claims yourself.

Technical questions: engineering@hearby.co
