

Voice is what Masker was built for. The hard part isn’t masking text — it’s masking streaming, low-latency, partial transcripts without adding noticeable delay to the conversation. This page walks through exactly what happens on a typical call, from first word to final audio.

End-to-end flow

Your voice platform handles speech-to-text and text-to-speech. Masker sits in the middle, between the transcript and your LLM:
caller ──▶ voice platform (Vapi / Bolna / Retell)
                ├── ASR transcript ──▶ Masker proxy ──▶ LLM ──▶ Masker rehydrate ──▶ TTS ──▶ caller
                └── audit webhook  ──▶ Masker session writer
Two integration points, both webhook-style:

Custom LLM endpoint

POST /proxy/{agent_id}/v1/chat/completions
Your voice platform calls this instead of OpenAI directly. Masker masks the request, forwards to your upstream model, rehydrates the response, and streams it back.
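A minimal sketch of pointing client code at the proxy endpoint. The base URL (`api.masker.dev`), the model name, and the payload contents here are illustrative assumptions, not documented values — only the `/proxy/{agent_id}/v1/chat/completions` path comes from this page:

```python
import json

MASKER_BASE = "https://api.masker.dev"  # hypothetical base URL

def proxy_chat_url(agent_id: str) -> str:
    """Build the custom-LLM endpoint Masker exposes for an agent."""
    return f"{MASKER_BASE}/proxy/{agent_id}/v1/chat/completions"

# An OpenAI-style chat payload. Masker masks it, forwards it to your
# upstream model, and rehydrates the streamed response on the way back.
payload = {
    "model": "gpt-4o",  # illustrative model name
    "stream": True,
    "messages": [{"role": "user", "content": "My number is 555-0123."}],
}

print(proxy_chat_url("agent_42"))
```

In your voice platform you would configure this URL as the assistant's custom LLM endpoint rather than calling it yourself.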

Assistant request webhook

POST /vapi/webhook/{agent_id}
Vapi-specific. Masker returns the configuration the platform should use for this call — system prompt, model, function definitions — and attaches audit metadata to the session.

Latency budget

Masker adds 45–95 ms of overhead on a typical chat completion request:
Stage                          Cost
Receive request, parse JSON    ~2 ms
Pass 1 — regex catalogue       ~1 ms
Pass 2 — Gemma-4 NER           30–80 ms
Tokenize and write events      ~5 ms
Forward to upstream model      network only
Rehydrate response stream      ~5 ms
Write session record           async, off the hot path
Pass 2 (NER) is the dominant cost. Masker runs Gemma-4 quantized on a GPU pool; in stub-mode testing, the NER pass is skipped. For context: typical voice-agent end-to-end latency targets sit at 800–1,200 ms (ASR + LLM + TTS combined), so Masker’s 45–95 ms is 5–10% of that budget. In pilots, no listener has been able to detect it in a blind A/B comparison.

Streaming

Masker fully supports streaming chat completions (stream: true). Here is what happens:
  1. Mask the request. The full request body arrives and Masker runs detection and tokenization before forwarding anything. The request leg is not streamed — it waits for a complete, masked payload.
  2. Stream the response. The upstream LLM streams chunks back to Masker. Masker buffers each chunk just long enough to scan for tokens, then rehydrates inline and forwards.
  3. Flush with a bounded buffer. Masker will not hold more than MASKER_STREAM_BUFFER_MS (default: 50 ms) of upstream output before flushing. In practice, the rehydration scan runs faster than the upstream’s chunk cadence, so streaming feels native to your caller.
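The response-leg behavior can be sketched as a bounded-buffer loop. This is an illustrative reconstruction, not Masker’s implementation: the token pattern, the `lookup` callback, and the partial-token hold-back logic are all assumptions; only the `MASKER_STREAM_BUFFER_MS` semantics come from this page.

```python
import re
import time

TOKEN_RE = re.compile(r"\[[A-Z]+_\d+\]")  # e.g. [SSN_01], [USPHONE_01]

def rehydrate_stream(chunks, lookup, buffer_ms=50):
    """Rehydrate tokens in a chunked upstream response, flushing at
    least every buffer_ms (mirrors MASKER_STREAM_BUFFER_MS)."""
    buf = ""
    last_flush = time.monotonic()
    for chunk in chunks:
        buf += chunk
        # Rehydrate every complete token seen so far.
        buf = TOKEN_RE.sub(lambda m: lookup(m.group(0)), buf)
        # Hold back only a trailing partial token (an unclosed "[...").
        cut = buf.rfind("[")
        if cut != -1 and "]" not in buf[cut:]:
            out, buf = buf[:cut], buf[cut:]
        else:
            out, buf = buf, ""
        if out:
            yield out
            last_flush = time.monotonic()
        elif (time.monotonic() - last_flush) * 1000 >= buffer_ms:
            yield buf  # bounded buffer: never hold output longer than buffer_ms
            buf = ""
            last_flush = time.monotonic()
    if buf:
        yield buf
```

A token split across two upstream chunks (`"[SSN"` + `"_01]"`) is held back just long enough to be reassembled and rehydrated, which is why the caller never hears a half-token.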

Partial transcripts

When voice platforms send partial transcripts — “the user said: ‘my number is five five five’…” — Masker treats them like any other input. Detection runs, partial spans get masked, and rehydration handles the response. If the ASR corrects a partial in the next update (“…my number is five five five one two”), the updated partial is a fresh, independent request to Masker. Masker does not reconcile partials across messages — each request stands alone.
For Retell deployments, set MASKER_RETELL_PARTIAL_DEDUP=true to skip detection work on partials that are byte-identical to the previous one. This reduces NER cost on noisy microphones.
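The dedup check amounts to remembering a digest of the previous partial. A minimal sketch of what `MASKER_RETELL_PARTIAL_DEDUP=true` could do internally (the class and method names are hypothetical):

```python
import hashlib

class PartialDedup:
    """Skip detection on partials byte-identical to the previous one."""

    def __init__(self):
        self._last = None

    def should_detect(self, partial: str) -> bool:
        digest = hashlib.sha256(partial.encode("utf-8")).digest()
        if digest == self._last:
            return False  # byte-identical repeat: reuse the prior result
        self._last = digest
        return True
```

Only exact byte-for-byte repeats are skipped; any ASR correction, however small, triggers a fresh detection pass, consistent with each request standing alone.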

Rehydration failures

If a token cannot be rehydrated on the response leg — because a key was rotated out, a vault row is missing, or the token is malformed — Masker:
  1. Emits a rehydration_failed event to the audit log
  2. Replaces the token inline with [REDACTED:KIND] — for example, [REDACTED:PHONE]
  3. Continues processing the rest of the response
Your TTS engine then speaks the fallback string. The caller hears “redacted phone” rather than an unresolved token or silence. The audit log records the failure with the affected turn and session ID.
A [REDACTED:KIND] in a TTS response indicates a rehydration failure. Check the audit log for rehydration_failed events after any key rotation to confirm no live sessions were affected.
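The fallback path can be sketched as a single substitution pass. This is illustrative only: the token regex, the vault/audit interfaces, and deriving the `REDACTED` kind from the token’s own prefix are assumptions (the real mapping, e.g. `USPHONE` to `PHONE` in the example above, may differ).

```python
import re

TOKEN_RE = re.compile(r"\[([A-Z]+)_\d+\]")  # e.g. [USPHONE_01] -> kind USPHONE

def rehydrate_or_redact(text, vault, audit):
    """Rehydrate known tokens; replace unresolvable ones with
    [REDACTED:KIND] and record a rehydration_failed audit event."""
    def sub(m):
        token = m.group(0)
        if token in vault:
            return vault[token]
        audit.append({"event": "rehydration_failed", "token": token})
        return f"[REDACTED:{m.group(1)}]"
    return TOKEN_RE.sub(sub, text)
```

Processing continues past the failed token, matching step 3 above: one unresolvable token never aborts the rest of the response.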

The three session artifacts

Every call produces three artifacts, all derived from the same event stream:
The first artifact is a side-by-side view of the call, split by a vertical compliance firewall:
  • Left of the firewall (regulated): the patient-to-voice-vendor channel. Real SSNs, phones, names, and addresses live here and only here.
  • Right of the firewall (public): the Masker-to-LLM channel. Every PHI span is replaced with its token.
  • Across the firewall: animated chips that visualize each redaction going out and each rehydration coming back.
This is what you show an auditor when they ask “prove no PHI left the regulated boundary.”
The second artifact is the audit journal: every detection becomes a tamper-evident event in a hash-chained log:
{"seq":0,"kind":"detection","detector":"ssn_v1","placeholder":"[SSN_01]","prev_hash":"0000…","curr_hash":"a3f2…","ts":"2026-05-01T18:33:01Z"}
{"seq":1,"kind":"detection","detector":"usphone_v2","placeholder":"[USPHONE_01]","prev_hash":"a3f2…","curr_hash":"7c9e…","ts":"2026-05-01T18:33:02Z"}
{"seq":2,"kind":"redaction_applied","span":[12,23],"placeholder":"[SSN_01]","prev_hash":"7c9e…","curr_hash":"e1b4…","ts":"2026-05-01T18:33:02Z"}
Each event carries a prev_hash linking it to the previous event and a curr_hash covering its own contents. A single mutated byte breaks every downstream hash. You can verify the chain offline, or call POST /audit/verify to get {"ok": true, "event_count": N, "message": "chain ok"}.
If the durable journal append fails, Masker returns AuditUnavailable and does not process the call. There are no quiet drops.
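An offline chain check could look like the sketch below. The exact hash construction is an assumption — here, sha256 over `prev_hash` plus the event’s canonical JSON minus its hash fields, with an all-zero genesis hash; the real scheme may differ, and `POST /audit/verify` remains the authoritative check.

```python
import hashlib
import json

def verify_chain(events):
    """Walk the journal, recomputing each curr_hash and checking links.
    Returns True iff the whole chain is intact."""
    prev = "0" * 64  # assumed all-zero genesis hash
    for e in events:
        if e["prev_hash"] != prev:
            return False  # broken link to the previous event
        body = {k: v for k, v in e.items() if k not in ("prev_hash", "curr_hash")}
        digest = hashlib.sha256(
            (e["prev_hash"] + json.dumps(body, sort_keys=True)).encode()
        ).hexdigest()
        if digest != e["curr_hash"]:
            return False  # event contents were mutated
        prev = e["curr_hash"]
    return True
```

Because each `curr_hash` covers the previous one, a single mutated byte in any event makes every downstream hash fail this check, which is the tamper-evidence property the journal relies on.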
The third artifact is the compliance report: at call end, Masker mints a HIPAA Safe Harbor report as two consistent files derived from the same event chain:
  • Masker Audit Schema v1 JSON — machine-checkable, shareable with automated compliance tooling
  • Auditor-ready HIPAA PDF — human-readable, suitable for review by a compliance officer
Both share the same merkle_root_hex, so you can prove the PDF and JSON describe identical chains. The report includes HIPAA Safe Harbor coverage, PCI-DSS scope, leak detection results, retention attestation, and BAA chain status. Download both from the Reports tab in one click.

Platform-specific notes

  • Set Masker’s proxy URL as the Custom LLM field in your Vapi assistant.
  • Set Masker’s webhook URL as the Server URL field.
  • Set the Server URL Secret to a value you also configure as MASKER_VAPI_WEBHOOK_SECRET. Masker validates HMAC signatures on every webhook.
  • Vapi’s own credit warnings pass through Masker unmodified — Masker does not filter platform-level messages.
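Server-side, the secret check is a constant-time HMAC comparison. A sketch of what Masker’s validation could look like — the signature header name and the HMAC-SHA256-over-raw-body scheme are assumptions; only the shared secret (`MASKER_VAPI_WEBHOOK_SECRET`) comes from this page:

```python
import hashlib
import hmac

def verify_webhook(raw_body: bytes, signature_hex: str, secret: str) -> bool:
    """Validate a webhook by recomputing HMAC-SHA256 over the raw body
    and comparing in constant time (avoids timing side channels)."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

Comparing with hmac.compare_digest rather than == matters here: a plain string comparison returns early on the first differing byte, which leaks timing information to an attacker probing signatures.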

What Masker does not do for voice

Masker only sees text. It works on transcripts produced by your voice platform’s ASR engine, and produces text that your platform’s TTS engine speaks back.
  • Masker does not run ASR. Your voice platform handles speech-to-text.
  • Masker does not run TTS. Your voice platform speaks the rehydrated response to the caller.
  • Masker does not record audio. Configure call recordings in Vapi, Bolna, or Retell directly, and apply a retention policy there.