ARCHITECTURE · OPEN
Where every byte lives.
No black box. Here’s the full path of a whisper, end-to-end — from your ear back to your ear. Median round trip: ≤300ms.
Device → Edge → Ear · ≤300ms median round trip
The stack
- Speech-to-text: Whisper-base · CoreML · int4 · on-device
- Trigger detector: Phi-3-mini Q4 (fallback Llama-3.2-1B Q4) · on-device
- Cue model: Anthropic Claude Haiku 4.5 · primary
- Cue failover: OpenAI GPT-4o-mini · automatic on 5xx or >500ms
- TTS: ElevenLabs Flash v2 · fallback AVSpeechSynthesizer
- Backend: Cloudflare Workers (Hono) · Durable Objects per session · Neon Postgres for users/billing only
- Auth: Clerk · passkey-first · JWT 15min · refresh silent
- Billing: StoreKit 2 (iOS) · Stripe (web)
- Enrichment: Brave Search · Apollo.io · request-scrubbed
- Audit retention: 90 days · metadata only · no quoted speech
1. On the device
The iOS app captures 16 kHz PCM audio via AVAudioEngine, streams it through Whisper-base, and pipes transcript chunks into a local trigger detector. The detector classifies the moment (silence, name, question, memory hit, or tone shift) and emits a context blob with extracted facts. Both models run on-device. Transcripts and audio buffers stay in RAM and are discarded after each chunk is processed.
2. The cue request
When a trigger fires, the app sends a small JSON payload to our edge, typically about 400 bytes. The payload contains the trigger type, the extracted facts, the calendar event title (if opted in), and a session ID. It does not contain verbatim other-party speech.
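To make the payload shape concrete, here is a minimal sketch of how a cue request could be assembled from the detector's output. The field names, the `string[]` shape for facts, and the `buildCueRequest` helper are assumptions for illustration, not the shipped schema. The point it shows is that the request is built field by field, so an on-device transcript has no path into the serialized JSON.

```typescript
// The five trigger types named in section 1. Identifier spellings are hypothetical.
type TriggerType = "silence" | "name" | "question" | "memory_hit" | "tone_shift";

// What the on-device detector holds in RAM (hypothetical shape).
interface TriggerContext {
  triggerType: TriggerType;
  facts: string[];      // extracted facts, not verbatim speech
  transcript?: string;  // on-device only; must never be serialized
}

// What actually leaves the device (hypothetical field names).
interface CueRequest {
  triggerType: TriggerType;
  facts: string[];
  calendarEventTitle?: string;
  sessionId: string;
}

// Copy only the allowed fields; anything else on the context is left behind.
function buildCueRequest(
  ctx: TriggerContext,
  sessionId: string,
  calendar: { optedIn: boolean; eventTitle?: string },
): CueRequest {
  const request: CueRequest = {
    triggerType: ctx.triggerType,
    facts: ctx.facts.slice(),
    sessionId,
  };
  // The calendar title is included only when the user has opted in.
  if (calendar.optedIn && calendar.eventTitle !== undefined) {
    request.calendarEventTitle = calendar.eventTitle;
  }
  return request;
}
```

Constructing the payload explicitly, rather than spreading the context object and deleting fields, means a new on-device field stays on the device unless someone deliberately adds it to the request type.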
3. The edge
A Cloudflare Worker receives the request, hydrates a Durable Object for the session (so we have 24h conversation memory available), and routes to the primary cue model. The whisper composer enforces the ≤7-word cap with a hard clamp before sending to TTS. We stream the TTS audio back to the device while the next trigger is already being decoded.
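The ≤7-word cap can be illustrated with a small helper. This is a sketch only: `clampCue` and `MAX_CUE_WORDS` are hypothetical names, and it assumes that a "word" is a whitespace-separated token, which the source does not specify.

```typescript
// Cap taken from the text above; the constant name is an assumption.
const MAX_CUE_WORDS = 7;

// Truncate model output to at most `maxWords` whitespace-separated words.
// Runs of whitespace are collapsed to single spaces in the result.
function clampCue(text: string, maxWords: number = MAX_CUE_WORDS): string {
  const words = text
    .trim()
    .split(/\s+/)
    .filter((word) => word.length > 0);
  return words.slice(0, maxWords).join(" ");
}
```

Because the clamp is applied to the composed text rather than requested in the prompt, an over-long model response is cut before it reaches TTS instead of relying on the model to comply.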
4. The privacy boundary
This boundary is the product. It’s defended by code, not by promises.
- Other-party audio: never leaves the device
- Other-party transcripts: never leave the device
- Extracted facts: leave the device, scoped to the cue request
- Whisper text: returned by the backend, used to bill, never stored beyond audit
- Audit metadata: 90 days, then deleted
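As one example of a rule that can live in code, the 90-day audit window could be enforced with a purge step like the sketch below. The `AuditRecord` fields and function names are hypothetical; the source states only that audit data is metadata, contains no quoted speech, and is deleted after 90 days.

```typescript
const AUDIT_RETENTION_DAYS = 90;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Metadata only: no transcript or quoted-speech field exists on the type.
// Field names are assumptions for illustration.
interface AuditRecord {
  sessionId: string;
  triggerType: string;
  createdAt: Date;
}

// A record is expired once it is at least 90 days old.
function isExpired(record: AuditRecord, now: Date): boolean {
  const ageMs = now.getTime() - record.createdAt.getTime();
  return ageMs >= AUDIT_RETENTION_DAYS * MS_PER_DAY;
}

// Return only the records still inside the retention window.
function purgeExpired(records: AuditRecord[], now: Date): AuditRecord[] {
  return records.filter((record) => !isExpired(record, now));
}
```

A job running `purgeExpired` on a schedule is one way to do this; where the records live and how often the purge runs are not described in the source.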
5. Geofencing
The GeofenceManager checks your coarse location against the canonical two-party-consent state list (CA, FL, IL, MD, MA, MT, NV, NH, PA, WA) and forces a mode selection at session start. Mid-session state changes are handled with a soft re-prompt. The list is hardcoded — it doesn’t change unless we cite a statute.
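A rough sketch of the list check follows. The state codes are the ones listed above; everything else is an assumption for illustration: the function names, the input being a two-letter state code already derived from coarse location, and the choice to re-prompt only when the consent classification changes between the old and new state.

```typescript
// Hardcoded list from the text above.
const TWO_PARTY_CONSENT_STATES: ReadonlySet<string> = new Set([
  "CA", "FL", "IL", "MD", "MA", "MT", "NV", "NH", "PA", "WA",
]);

function isTwoPartyConsentState(stateCode: string): boolean {
  return TWO_PARTY_CONSENT_STATES.has(stateCode.trim().toUpperCase());
}

interface StartDecision {
  action: "force_mode_selection";
  twoPartyConsent: boolean;
}

// At session start a mode selection is always forced; the flag tells the UI
// whether the current state is on the list.
function onSessionStart(stateCode: string): StartDecision {
  return {
    action: "force_mode_selection",
    twoPartyConsent: isTwoPartyConsentState(stateCode),
  };
}

// Mid-session: soft re-prompt only when crossing between a listed and an
// unlisted state (an assumption; the source does not define the condition).
function onStateChange(previous: string, current: string): "soft_reprompt" | "none" {
  const changed = isTwoPartyConsentState(previous) !== isTwoPartyConsentState(current);
  return changed ? "soft_reprompt" : "none";
}
```

Mapping a coarse location to a state code is outside this sketch.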
6. Failure modes
- Network out · STT + trigger detector still run · cue request fails silently · no charge
- Cue model out · automatic failover to secondary · target <500ms additional latency
- TTS out · AVSpeechSynthesizer fallback · same cue text, different voice
- Auth expired · silent refresh · zero user-visible interruption
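The cue-model failover rule (5xx or slower than 500ms) could look roughly like the sketch below. The provider calls are injected as plain functions, so nothing here depends on a real SDK; `CueResult`, `cueWithFailover`, and the helper names are hypothetical. Treating a thrown network error the same as a timeout is an assumption beyond what the text states.

```typescript
// Hypothetical result shape; the real provider clients are not described in the source.
type CueResult = { status: number; text: string };
type CueCall = () => Promise<CueResult>;

const FAILOVER_TIMEOUT_MS = 500; // from "automatic on 5xx or >500ms"

function isServerError(status: number): boolean {
  return status >= 500 && status <= 599;
}

// Reject if `call` has not settled within `ms` milliseconds.
function withTimeout<T>(call: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    call.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Try the primary; fall back to the secondary on 5xx, timeout, or thrown error.
async function cueWithFailover(
  primary: CueCall,
  secondary: CueCall,
  timeoutMs: number = FAILOVER_TIMEOUT_MS,
): Promise<CueResult & { provider: "primary" | "secondary" }> {
  try {
    const result = await withTimeout(primary(), timeoutMs);
    if (!isServerError(result.status)) {
      return { ...result, provider: "primary" };
    }
  } catch {
    // Timeout or network error (assumption): fall through to the secondary.
  }
  const fallback = await secondary();
  return { ...fallback, provider: "secondary" };
}
```

In this sketch a 4xx from the primary is returned as-is rather than retried, since the stated rule names only 5xx responses and slow responses as failover conditions.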
7. What we publish
The backend code goes open-source the day public TestFlight launches. Audit it. Verify the privacy claims yourself.
Technical questions: engineering@hearby.co
Read security