When you’re building or debugging a voice AI integration, you need realistic transcripts. But loading real patient data into a development environment creates PHI exposure — even if the environment is secured. Masker’s synthetic mode solves this by replacing detected PHI with plausible but entirely fictional surrogates, so your development and CI environments never need to see real identifiers.
Synthetic data generation is in private preview as of the May 30 production launch. Beta customers get access on request. The CLI commands and API surface below may change before GA. Email hello@masker.dev to enable it for your account.

What synthetic mode does

Synthetic mode is a third tokenization scheme alongside vault-deterministic and reversible AEAD. Instead of replacing PHI with an opaque token, Masker generates a plausible surrogate — a value that looks real and has the right shape, but refers to nobody:
| Original | Synthetic surrogate |
| --- | --- |
| Sarah Chen | Maria Lopez |
| (415) 555-2671 | (415) 555-8423 |
| 1990-04-12 | 1991-08-30 |
| 123 Main St, Seattle WA 98101 | 47 Cedar Ave, Tacoma WA 98402 |
| MRN 4827193 | MRN 8362041 |
Surrogates are:
  • Locale-aware — phone numbers stay in the same country code, ZIP codes stay in the same metro area
  • Type-stable — the LLM still sees a phone where a phone was, a name where a name was
  • Cohort-consistent — within a single session, “Sarah” maps to “Maria” everywhere, so the LLM’s reasoning remains coherent across turns
  • Statistically faithful — name, age, and geographic distributions match public datasets (US Census, SSA name tables) so aggregate statistics on synthetic exports remain meaningful
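The cohort-consistency property can be illustrated with a small session-scoped lookup. This is only a sketch of the idea, not Masker's implementation: the `SessionSurrogates` class, the `NAME_POOL` list, and the collision handling are all hypothetical.

```python
import random

# Hypothetical surrogate pool; a real system would draw from far larger tables.
NAME_POOL = ["Maria Lopez", "Elena Novak", "Priya Nair", "Hannah Berg", "Grace Okafor"]


class SessionSurrogates:
    """Assigns each original name one surrogate and reuses it for the whole session."""

    def __init__(self, rng: random.Random):
        self._rng = rng
        self._assigned = {}  # case-folded original -> surrogate

    def name(self, original: str) -> str:
        key = original.casefold()
        if key not in self._assigned:
            # Skip surrogates already handed out, and never return the original itself.
            unused = [
                candidate
                for candidate in NAME_POOL
                if candidate not in self._assigned.values()
                and candidate.casefold() != key
            ]
            if not unused:
                raise RuntimeError("surrogate pool exhausted")
            self._assigned[key] = self._rng.choice(unused)
        return self._assigned[key]


session = SessionSurrogates(random.Random())
first_turn = session.name("Sarah Chen")
later_turn = session.name("sarah chen")  # same caller mentioned again, different casing
print(first_turn == later_turn)
```

Because the mapping lives only in the session object, a new session starts with an empty table and may assign a different surrogate to the same original.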

When to use synthetic mode

Local development

Developers need realistic call transcripts to debug edge cases. Synthetic mode gives them production-shaped data without ever touching real PHI.

CI / automated testing

Run your detection and integration tests against synthetic fixtures. No BAA required for your CI environment.

Demos and screenshots

Show what calls look like in public-facing demos or conference talks without exposing real customers.

Model fine-tuning

Fine-tune downstream models on call transcripts that statistically resemble production, without PHI risk.

What synthetic mode is not for

Do not use synthetic mode for:
  • Customer support — agents need the real PHI to help real patients. Use vault-deterministic or reversible AEAD for live calls.
  • Compliance reporting — auditors want real audit trails, not synthesized ones.
  • Re-identifiable analytics — by design, you cannot get back to originals from synthetic output.

Setting up synthetic mode

Configure synthetic mode per agent in your policy file:
agents:
  - name: dev-replay
    upstream: stub
    tokenization: synthetic
    synthetic:
      seed: 0xC0FFEE
      preserve_locale: true
With preserve_locale: true, Masker keeps geographic context consistent — a Seattle caller stays in the Pacific Northwest, not in an unrelated metro.
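To make the locale-preservation idea concrete, here is a minimal sketch for one identifier type. It assumes US numbers written as `(415) 555-2671` and keeps the area code and exchange while replacing the last four digits. The `surrogate_us_phone` function is hypothetical and says nothing about how Masker actually implements `preserve_locale`.

```python
import random
import re

# Matches only the "(NNN) NNN-NNNN" layout used in the examples on this page.
US_PHONE = re.compile(r"\((\d{3})\) (\d{3})-(\d{4})")


def surrogate_us_phone(original: str, rng: random.Random) -> str:
    """Keep area code, exchange, and formatting; replace the last four digits."""
    match = US_PHONE.fullmatch(original)
    if match is None:
        raise ValueError(f"unsupported phone format: {original!r}")
    area, exchange, line = match.groups()
    new_line = line
    while new_line == line:  # never hand back the original number
        new_line = f"{rng.randrange(10000):04d}"
    return f"({area}) {exchange}-{new_line}"


rng = random.Random(0xC0FFEE)  # same seed value as the policy example
print(surrogate_us_phone("(415) 555-2671", rng))
```

The output still parses as a phone number in the same area code, so downstream code that validates or routes on the number's shape keeps working.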

Using the CLI for local testing

The masker detect command lets you validate your detection configuration against a sample call fixture without sending real PHI anywhere:
masker detect tests/fixtures/sample-call.json \
  --policy configs/mask_policy.yaml \
  --json
To run the full mask pipeline — including synthetic replacement — use masker mask:
masker mask --policy configs/mask_policy.yaml
For voice call replay with synthetic output, use masker-voice:
masker-voice replay \
  --input ./recording.wav \
  --policy configs/mask_policy.yaml \
  --asr-provider deepgram
This replays the recording through ASR, runs detection and synthetic replacement, and outputs a masked transcript you can inspect without any real PHI present.
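If you want to drive the CLI from a test harness, a thin wrapper is usually enough. The sketch below uses only the `masker detect` arguments shown above. It assumes that `--json` writes a single JSON document to stdout and that the command exits non-zero on failure; neither is confirmed here, and the helper names are hypothetical.

```python
import json
import subprocess


def build_detect_command(fixture_path: str, policy_path: str) -> list[str]:
    """Build the argv for `masker detect`, using only flags shown in this guide."""
    return ["masker", "detect", fixture_path, "--policy", policy_path, "--json"]


def run_detect(fixture_path: str, policy_path: str):
    """Run detection and parse stdout as JSON (assumed output format)."""
    completed = subprocess.run(
        build_detect_command(fixture_path, policy_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)


if __name__ == "__main__":
    result = run_detect("tests/fixtures/sample-call.json", "configs/mask_policy.yaml")
    print(json.dumps(result, indent=2))
```

Keeping the argv construction in its own function makes it easy to unit test without the CLI installed.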

Determinism and the seed parameter

The seed parameter controls whether synthesis is reproducible:
With a fixed seed, the same input always produces the same surrogate:
synthetic:
  seed: 0xC0FFEE
This is useful for replay: a developer can re-run a problematic call and get the same fake names every time, making debugging consistent.

A fixed seed is acceptable in dev environments because those environments see no real PHI in the first place. It is safe to check a dev seed into your dev config.
With a known seed and a known surrogate, it is theoretically possible to recover the original value. Never use a fixed seed in a production or analytics environment that handles real PHI.
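The recovery risk is easier to see with a toy model. The sketch below assumes surrogates are chosen by a keyed hash of the original value, which is an illustrative assumption rather than Masker's actual algorithm. The `surrogate_for` and `candidate_originals` functions and the name pool are hypothetical.

```python
import hashlib
import hmac

# Hypothetical surrogate pool, for illustration only.
SURROGATE_NAMES = ["Maria Lopez", "Elena Novak", "Priya Nair", "Hannah Berg", "Grace Okafor"]


def surrogate_for(original: str, seed: int) -> str:
    """Deterministic toy mapping: the same seed and input always give the same surrogate."""
    key = seed.to_bytes(8, "big")
    digest = hmac.new(key, original.encode("utf-8"), hashlib.sha256).digest()
    return SURROGATE_NAMES[int.from_bytes(digest[:8], "big") % len(SURROGATE_NAMES)]


def candidate_originals(surrogate: str, seed: int, guesses: list[str]) -> list[str]:
    """What someone holding the seed can do: test guesses against an observed surrogate."""
    return [guess for guess in guesses if surrogate_for(guess, seed) == surrogate]


seed = 0xC0FFEE
observed = surrogate_for("Sarah Chen", seed)

# Reproducible: re-running with the same seed yields the same surrogate.
print(observed == surrogate_for("Sarah Chen", seed))

# With the seed, a list of plausible patients can be narrowed to those consistent
# with the observed surrogate; the true original always survives the filter.
print(candidate_originals(observed, seed, ["Sarah Chen", "John Park", "Ana Ruiz"]))
```

In this toy model the filter only narrows the candidate list, but combining several surrogate fields from the same record (name, phone, date of birth) shrinks it quickly, which is why a fixed seed belongs only in environments that never see real PHI.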

Differential privacy budget

Synthetic mode honors the same epsilon (ε) parameter as vault-deterministic tokenization. ε controls how aggressively rare values are perturbed before synthesis:
| ε value | Privacy level | Recommended use |
| --- | --- | --- |
| 0.5 (default for healthcare) | Strong privacy, some loss of statistical detail | PHI-adjacent analytics |
| 1.0 | Moderate privacy, more statistically useful | General healthcare analytics |
| 2.0 | Weak privacy | Non-PHI-adjacent data only |
Values appearing fewer than k times in a session window are re-sampled from the full distribution rather than the empirical one. This prevents a rare identifier — “the only Latvian-named caller” — from being trivially re-identified in your synthetic export.
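One way to read the rare-value rule is sketched below. It is an interpretation, not Masker's implementation: values seen fewer than `k` times in the window draw their replacement from a reference ("full") distribution, while common values draw from the values observed at least `k` times in the window. The `resample` function and its arguments are hypothetical, and the sketch does not model ε.

```python
import random
from collections import Counter


def resample(values, full_distribution, k, rng):
    """Replace each value, choosing the sampling source by how often it occurs."""
    counts = Counter(values)
    # Empirical pool: only values that occur at least k times in this window.
    empirical = [value for value in values if counts[value] >= k]
    output = []
    for value in values:
        if counts[value] < k:
            # Rare value: sample from the full reference distribution instead.
            output.append(rng.choice(full_distribution))
        else:
            output.append(rng.choice(empirical))
    return output


window = ["Smith", "Smith", "Smith", "Ozols"]       # "Ozols" appears once
reference = ["Garcia", "Nguyen", "Johnson", "Lee"]  # stand-in for a public name table
print(resample(window, reference, k=2, rng=random.Random()))
```

Under this reading, the single rare surname never influences the output, so its presence in the session cannot be inferred from the synthetic export.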

Programmatic synthesis

You can also call synthesis as a one-shot transform outside the proxy path:
curl -X POST https://masker-voice.fly.dev/api/v1/synthesize \
  -H "Cookie: masker_session=..." \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hi, this is Sarah Chen at 415-555-2671...",
    "policy": "healthcare-default",
    "seed": 12345
  }'
This endpoint is gated behind the same private preview as agent-level synthetic mode. Contact hello@masker.dev to enable it.
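For callers who prefer Python over curl, the same request can be sketched with the standard library. The URL, headers, and body fields are copied from the curl example above. The helper names, the assumption that the response body is JSON, and the 30-second timeout are illustrative, and the session cookie value must be supplied by the caller.

```python
import json
import urllib.request

SYNTHESIZE_URL = "https://masker-voice.fly.dev/api/v1/synthesize"


def build_synthesize_request(text, policy, seed, session_cookie):
    """Build the POST request with the same headers and body as the curl example."""
    body = json.dumps({"text": text, "policy": policy, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        SYNTHESIZE_URL,
        data=body,
        headers={
            "Cookie": f"masker_session={session_cookie}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text, policy, seed, session_cookie):
    """Send the request and decode the response, assuming it is JSON."""
    request = build_synthesize_request(text, policy, seed, session_cookie)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending lets you assert on the payload in tests without making a network call.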