When you’re building or debugging a voice AI integration, you need realistic transcripts. But loading real patient data into a development environment creates PHI exposure, even if the environment is secured. Masker’s synthetic mode solves this by replacing detected PHI with plausible but entirely fictional surrogates, so your development and CI environments never need to see real identifiers.
Synthetic data generation is in private preview as of the May 30 production launch. Beta customers get access on request. The CLI commands and API surface below may change before GA. Email hello@masker.dev to enable it for your account.
What synthetic mode does
Synthetic mode is a third tokenization scheme alongside vault-deterministic and reversible AEAD. Instead of replacing PHI with an opaque token, Masker generates a plausible surrogate, a value that looks real and has the right shape but refers to nobody:
| Original | Synthetic surrogate |
|---|---|
| Sarah Chen | Maria Lopez |
| (415) 555-2671 | (415) 555-8423 |
| 1990-04-12 | 1991-08-30 |
| 123 Main St, Seattle WA 98101 | 47 Cedar Ave, Tacoma WA 98402 |
| MRN 4827193 | MRN 8362041 |
- Locale-aware — phone numbers stay in the same country code, ZIP codes stay in the same metro area
- Type-stable — the LLM still sees a phone where a phone was, a name where a name was
- Cohort-consistent — within a single session, “Sarah” maps to “Maria” everywhere, so the LLM’s reasoning remains coherent across turns
- Statistically faithful — name, age, and geographic distributions match public datasets (US Census, SSA name tables) so aggregate statistics on synthetic exports remain meaningful
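The type-stable and cohort-consistent properties can be illustrated with a small sketch. This is not Masker's implementation: the `SessionSurrogates` class, the name pool, and the assumption that phone numbers arrive in the `(415) 555-2671` format are all hypothetical, chosen only to show how a per-session cache keeps one original mapped to one surrogate.

```python
import random
from typing import Optional

# Hypothetical surrogate pool; a real system would draw from far larger tables.
FAKE_FIRST_NAMES = ["Maria", "Elena", "Priya", "Grace", "Noor"]


class SessionSurrogates:
    """Illustrative per-session cache: one original value maps to one surrogate."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self._rng = rng or random.Random()
        self._cache = {}  # (kind, original) -> surrogate

    def surrogate(self, kind: str, original: str) -> str:
        key = (kind, original)
        if key not in self._cache:
            # First sighting in this session: generate and remember the surrogate.
            self._cache[key] = self._generate(kind, original)
        return self._cache[key]

    def _generate(self, kind: str, original: str) -> str:
        if kind == "name":
            return self._rng.choice(FAKE_FIRST_NAMES)
        if kind == "phone":
            # Assumes "(AAA) 555-NNNN": keep the first 10 characters
            # (area code and exchange) and randomize the last four digits.
            return f"{original[:10]}{self._rng.randrange(10000):04d}"
        raise ValueError(f"unsupported kind: {kind}")


session = SessionSurrogates()
first = session.surrogate("name", "Sarah")
again = session.surrogate("name", "Sarah")
print(first == again)  # the same surrogate is reused within the session
```

A new `SessionSurrogates` instance starts with an empty cache, so consistency in this sketch holds within a session only.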
When to use synthetic mode
Local development
Developers need realistic call transcripts to debug edge cases. Synthetic mode gives them production-shaped data without ever touching real PHI.
CI / automated testing
Run your detection and integration tests against synthetic fixtures. No BAA required for your CI environment.
Demos and screenshots
Show what calls look like in public-facing demos or conference talks without exposing real customers.
Model fine-tuning
Fine-tune downstream models on call transcripts that statistically resemble production, without PHI risk.
What synthetic mode is not for
Setting up synthetic mode
Configure synthetic mode per agent in your policy file.
With `preserve_locale: true`, Masker keeps geographic context consistent: a Seattle caller stays in the Pacific Northwest, not in an unrelated metro.
Using the CLI for local testing
The `masker detect` command lets you validate your detection configuration against a sample call fixture without sending real PHI anywhere:
masker mask:
masker-voice:
Determinism and the seed parameter
The `seed` parameter controls whether synthesis is reproducible:
- Fixed seed (dev environments)
- No seed (analytics exports)
With a fixed seed, the same input always produces the same surrogate.
This is useful for replay: a developer can re-run a problematic call and get the same fake names every time, making debugging consistent.
A fixed seed is acceptable in dev environments because those environments see no real PHI in the first place. It is safe to check a dev seed into your dev config.
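One common way to get seed-controlled reproducibility is to key a hash of the input with the seed and use the result to index a surrogate table. The sketch below assumes that approach. The `surrogate_name` function, the name pool, and the choice of HMAC-SHA256 are illustrative assumptions, not Masker's documented behavior.

```python
import hashlib
import hmac
import secrets
from typing import Optional

# Hypothetical surrogate pool for illustration only.
FAKE_NAMES = ["Maria Lopez", "Elena Park", "Priya Nair", "Grace Okafor"]


def surrogate_name(original: str, seed: Optional[bytes] = None) -> str:
    """Pick a fake name; the result is reproducible only when a seed is given."""
    # Assumption of this sketch: with no seed, a fresh random key is drawn
    # on every call, so results are not reproducible across runs.
    key = seed if seed is not None else secrets.token_bytes(32)
    digest = hmac.new(key, original.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


dev_seed = b"dev-seed"  # hypothetical value, as might live in a dev config
replay_1 = surrogate_name("Sarah Chen", seed=dev_seed)
replay_2 = surrogate_name("Sarah Chen", seed=dev_seed)
print(replay_1 == replay_2)  # identical on every replay with the same seed
```

Keying the hash with the seed, rather than hashing the input alone, means that someone who knows the surrogate table but not the seed cannot recompute the mapping.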
Differential privacy budget
Synthetic mode honors the same `epsilon` (ε) parameter as vault-deterministic tokenization. ε controls how aggressively rare values are perturbed before synthesis:
| ε value | Privacy level | Recommended use |
|---|---|---|
| 0.5 (default for healthcare) | Strong privacy, some loss of statistical detail | PHI-adjacent analytics |
| 1.0 | Moderate privacy, more statistically useful | General healthcare analytics |
| 2.0 | Weak privacy | Non-PHI-adjacent data only |
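To make the ε values in the table concrete, here is a sketch of how ε scales noise under the classic Laplace mechanism. This is a textbook differential-privacy construction, not a description of how Masker perturbs values internally. The function names and the sensitivity of 1 are assumptions.

```python
import random


def laplace_scale(epsilon: float, sensitivity: float = 1.0) -> float:
    """Noise scale b = sensitivity / epsilon for the Laplace mechanism."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return sensitivity / epsilon


def laplace_noise(epsilon: float, rng: random.Random, sensitivity: float = 1.0) -> float:
    """Draw one Laplace(0, b) sample as the difference of two exponentials."""
    rate = 1.0 / laplace_scale(epsilon, sensitivity)
    return rng.expovariate(rate) - rng.expovariate(rate)


# Smaller epsilon means a larger noise scale, hence stronger perturbation.
for eps in (0.5, 1.0, 2.0):
    print(f"epsilon={eps}: noise scale={laplace_scale(eps)}")
```

Under this construction the noise scale at ε = 0.5 is four times the scale at ε = 2.0, which matches the table's ordering from strong to weak privacy.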
Values that appear fewer than `k` times in a session window are re-sampled from the full distribution rather than the empirical one. This prevents a rare identifier, such as “the only Latvian-named caller”, from being trivially re-identified in your synthetic export.
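The rare-value rule can be sketched as a threshold check on per-window counts. Everything here is hypothetical: the `draw_category` function, the category labels, and the choice to restrict the empirical pool to values that themselves meet the threshold are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Sequence


def draw_category(
    value: str,
    window: Sequence[str],
    full_pool: Sequence[str],
    k: int,
    rng: random.Random,
) -> str:
    """Choose a surrogate category for `value`, given the session window."""
    counts = Counter(window)
    if counts[value] >= k:
        # Common value: sample from the observed (empirical) distribution,
        # limited to values that also meet the threshold.
        common = [v for v in window if counts[v] >= k]
        return rng.choice(common)
    # Rare value: ignore what was observed and sample from the full pool,
    # so the rare category does not carry over into the export.
    return rng.choice(list(full_pool))


window = ["english"] * 5 + ["latvian"]
full_pool = ["english", "spanish", "german"]
rng = random.Random()
print(draw_category("latvian", window, full_pool, k=3, rng=rng))
```

With `k=3`, the single "latvian" entry falls below the threshold, so its surrogate category comes from `full_pool` and no longer marks the caller as the only one of its kind.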