> ## Documentation Index
> Fetch the complete documentation index at: https://docs.masker.dev/llms.txt
> Use this file to discover all available pages before exploring further.

# How Masker intercepts and masks a live voice call

> Step-by-step walkthrough of the STT → detect → mask → LLM → rehydrate → TTS flow, latency budget, streaming behavior, and audit artifacts.

Voice is what Masker was built for. The hard part isn't masking text — it's masking **streaming, low-latency, partial transcripts** without adding noticeable delay to the conversation. This page walks through exactly what happens on a typical call, from first word to final audio.

## End-to-end flow

Your voice platform handles speech-to-text and text-to-speech. Masker sits in the middle, between the transcript and your LLM:

```
caller ──▶ voice platform (Vapi / Bolna / Retell)
                │
                ├── ASR transcript ──▶ Masker proxy ──▶ LLM ──▶ Masker rehydrate ──▶ TTS ──▶ caller
                │
                └── audit webhook  ──▶ Masker session writer
```

Two integration points, both webhook-style:

<CardGroup cols={2}>
  <Card title="Custom LLM endpoint" icon="arrow-right-arrow-left">
    `POST /proxy/{agent_id}/v1/chat/completions`

    Your voice platform calls this instead of OpenAI directly. Masker masks the request, forwards to your upstream model, rehydrates the response, and streams it back.
  </Card>

  <Card title="Assistant request webhook" icon="webhook">
    `POST /vapi/webhook/{agent_id}`

    Vapi-specific. Masker returns the configuration the platform should use for this call — system prompt, model, function definitions — and attaches audit metadata to the session.
  </Card>
</CardGroup>

## Latency budget

Masker adds **45–95 ms** of overhead on a typical chat completion request:

| Stage                       | Cost                    |
| --------------------------- | ----------------------- |
| Receive request, parse JSON | \~2 ms                  |
| Pass 1 — regex catalogue    | \~1 ms                  |
| Pass 2 — Gemma-4 NER        | 30–80 ms                |
| Tokenize and write events   | \~5 ms                  |
| Forward to upstream model   | network only            |
| Rehydrate response stream   | \~5 ms                  |
| Write session record        | async, off the hot path |

Pass 2 (NER) is the dominant cost. Masker runs Gemma-4 quantized on a GPU pool; for stub-mode testing, the NER pass falls through.

For context: typical voice-agent end-to-end latency targets sit at 800–1,200 ms (ASR + LLM + TTS combined). Masker's 45–95 ms is **5–10% of that budget**. In pilots, no listener has been able to detect it on a blind A/B comparison.

## Streaming

Masker fully supports streaming chat completions (`stream: true`). Here is what happens:

<Steps>
  <Step title="Mask the request">
    The full request body arrives and Masker runs detection and tokenization before forwarding anything. The request leg is not streamed — it waits for a complete, masked payload.
  </Step>

  <Step title="Stream the response">
    The upstream LLM streams chunks back to Masker. Masker buffers each chunk just long enough to scan for tokens, then rehydrates inline and forwards.
  </Step>

  <Step title="Flush with a bounded buffer">
    Masker will not hold more than `MASKER_STREAM_BUFFER_MS` (default: 50 ms) of upstream output before flushing. In practice, the rehydration scan runs faster than the upstream's chunk cadence, so streaming feels native to your caller.
  </Step>
</Steps>

## Partial transcripts

When voice platforms send partial transcripts — "the user said: 'my number is five five five'..." — Masker treats them like any other input. Detection runs, partial spans get masked, and rehydration handles the response.

If the ASR corrects a partial in the next update ("...my number is five five five **one two**"), the updated partial is a fresh, independent request to Masker. Masker does not reconcile partials across messages — each request stands alone.

<Tip>
  For Retell deployments, set `MASKER_RETELL_PARTIAL_DEDUP=true` to skip detection work on partials that are byte-identical to the previous one. This reduces NER cost on noisy microphones.
</Tip>

## Rehydration failures

If a token cannot be rehydrated on the response leg — because a key was rotated out, a vault row is missing, or the token is malformed — Masker:

1. Emits a `rehydration_failed` event to the audit log
2. Replaces the token inline with `[REDACTED:KIND]` — for example, `[REDACTED:PHONE]`
3. Continues processing the rest of the response

Your TTS engine then speaks the fallback string. The caller hears "redacted phone" rather than an unresolved token or silence. The audit log records the failure with the affected turn and session ID.

<Warning>
  A `[REDACTED:KIND]` in a TTS response indicates a rehydration failure. Check the audit log for `rehydration_failed` events after any key rotation to confirm no live sessions were affected.
</Warning>

## The three session artifacts

Every call produces three artifacts, all derived from the same event stream:

<AccordionGroup>
  <Accordion title="Live firewall view">
    A side-by-side view of the call, split by a vertical compliance firewall:

    * **Left of the firewall (regulated):** the patient-to-voice-vendor channel. Real SSNs, phones, names, and addresses live here and only here.
    * **Right of the firewall (public):** the Masker-to-LLM channel. Every PHI span is replaced with its token.
    * **Across the firewall:** animated chips that visualize each redaction going out and each rehydration coming back.

    This is what you show an auditor when they ask "prove no PHI left the regulated boundary."
  </Accordion>

  <Accordion title="Audit chain (real-time)">
    Every detection becomes a tamper-evident event in a hash-chained journal:

    ```jsonl theme={null}
    {"seq":0,"kind":"detection","detector":"ssn_v1","placeholder":"[SSN_01]","prev_hash":"0000…","curr_hash":"a3f2…","ts":"2026-05-01T18:33:01Z"}
    {"seq":1,"kind":"detection","detector":"usphone_v2","placeholder":"[USPHONE_01]","prev_hash":"a3f2…","curr_hash":"7c9e…","ts":"2026-05-01T18:33:02Z"}
    {"seq":2,"kind":"redaction_applied","span":[12,23],"placeholder":"[SSN_01]","prev_hash":"7c9e…","curr_hash":"e1b4…","ts":"2026-05-01T18:33:02Z"}
    ```

    Each event carries a `prev_hash` linking it to the previous event and a `curr_hash` covering its own contents. A single mutated byte breaks every downstream hash. You can verify the chain offline, or call `POST /audit/verify` to get `{"ok": true, "event_count": N, "message": "chain ok"}`.

    <Note>
      If the durable journal append fails, Masker returns `AuditUnavailable` and does not process the call. There are no quiet drops.
    </Note>
  </Accordion>

  <Accordion title="Session compliance report (signed)">
    At call end, Masker mints a HIPAA Safe Harbor compliance report as two consistent artifacts from the same event chain:

    * **Masker Audit Schema v1 JSON** — machine-checkable, shareable with automated compliance tooling
    * **Auditor-ready HIPAA PDF** — human-readable, suitable for review by a compliance officer

    Both artifacts share the same `merkle_root_hex`, so you can prove the PDF and JSON describe identical chains. The report includes HIPAA Safe Harbor coverage, PCI-DSS scope, leak detection results, retention attestation, and BAA chain status.

    Download both from the Reports tab in one click.
  </Accordion>
</AccordionGroup>

## Platform-specific notes

<Tabs>
  <Tab title="Vapi">
    * Set Masker's proxy URL as the **Custom LLM** field in your Vapi assistant.
    * Set Masker's webhook URL as the **Server URL** field.
    * Set the **Server URL Secret** to a value you also configure as `MASKER_VAPI_WEBHOOK_SECRET`. Masker validates HMAC signatures on every webhook.
    * Vapi's own credit warnings pass through Masker unmodified — Masker does not filter platform-level messages.
  </Tab>

  <Tab title="Bolna">
    * Bolna's custom LLM endpoint follows the OpenAI chat-completions shape, so the proxy works as-is.
    * Bolna does not currently use the assistant-request webhook — only the proxy endpoint.
    * If you use a slower upstream model, increase `MASKER_UPSTREAM_TIMEOUT_MS` (default: 30,000 ms).
  </Tab>

  <Tab title="Retell">
    * Retell's LLM webhook follows the chat-completions shape — the proxy works as-is.
    * Retell sends continuous partial transcripts. Masker handles each as an independent request.
    * Set `MASKER_RETELL_PARTIAL_DEDUP=true` to deduplicate byte-identical partials and reduce NER cost on noisy microphones.
  </Tab>
</Tabs>

## What Masker does not do for voice

<Note>
  Masker only sees text. It works on transcripts produced by your voice platform's ASR engine, and produces text that your platform's TTS engine speaks back.

  * **Masker does not run ASR.** Your voice platform handles speech-to-text.
  * **Masker does not run TTS.** Your voice platform speaks the rehydrated response to the caller.
  * **Masker does not record audio.** Configure call recordings in Vapi, Bolna, or Retell directly, and apply a retention policy there.
</Note>
