# Generate synthetic data for testing without real PHI

> Replace detected PHI with realistic but fictional surrogates for local development, CI pipelines, and demos — without touching real patient data.

When you're building or debugging a voice AI integration, you need realistic transcripts. But loading real patient data into a development environment creates PHI exposure — even if the environment is secured. Masker's **synthetic mode** solves this by replacing detected PHI with plausible but entirely fictional surrogates, so your development and CI environments never need to see real identifiers.

<Note>
  Synthetic data generation is in **private preview** as of the May 30 production launch. Beta customers get access on request. The CLI commands and API surface below may change before GA. Email [hello@masker.dev](mailto:hello@masker.dev) to enable it for your account.
</Note>

## What synthetic mode does

Synthetic mode is a third tokenization scheme alongside vault-deterministic and reversible AEAD. Instead of replacing PHI with an opaque token, Masker generates a **plausible surrogate** — a value that looks real and has the right shape, but refers to nobody:

| Original                      | Synthetic surrogate           |
| ----------------------------- | ----------------------------- |
| Sarah Chen                    | Maria Lopez                   |
| (415) 555-2671                | (415) 555-8423                |
| 1990-04-12                    | 1991-08-30                    |
| 123 Main St, Seattle WA 98101 | 47 Cedar Ave, Tacoma WA 98402 |
| MRN 4827193                   | MRN 8362041                   |

Surrogates are:

* **Locale-aware** — phone numbers stay in the same country code, ZIP codes stay in the same metro area
* **Type-stable** — the LLM still sees a phone where a phone was, a name where a name was
* **Cohort-consistent** — within a single session, "Sarah" maps to "Maria" everywhere, so the LLM's reasoning remains coherent across turns
* **Statistically faithful** — name, age, and geographic distributions match public datasets (US Census, SSA name tables) so aggregate statistics on synthetic exports remain meaningful
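
To make the type-stable and cohort-consistent properties concrete, here is a minimal Python sketch. It is not Masker's implementation: the name pools, the `session_key` argument, and the function names are hypothetical, and the phone helper assumes a 10-digit US number. It only illustrates how a keyed hash can give the same surrogate for the same input within a session while keeping the value's shape.

```python
import hashlib
import hmac

# Hypothetical surrogate pools; a real system would draw from far larger tables.
FIRST_NAMES = ["Maria", "James", "Priya", "Daniel", "Aisha", "Tomas"]
LAST_NAMES = ["Lopez", "Nguyen", "Baker", "Okafor", "Silva", "Novak"]


def _digest(session_key: bytes, kind: str, value: str) -> bytes:
    """Keyed hash of (kind, value): identical inputs give identical bytes."""
    msg = f"{kind}:{value}".encode("utf-8")
    return hmac.new(session_key, msg, hashlib.sha256).digest()


def surrogate_name(session_key: bytes, original: str) -> str:
    """Type-stable: a name is always replaced by another name."""
    d = _digest(session_key, "name", original)
    first = FIRST_NAMES[d[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[d[1] % len(LAST_NAMES)]
    return f"{first} {last}"


def surrogate_phone(session_key: bytes, original: str) -> str:
    """Keep the area code, force the fictional 555 exchange, replace the last four digits.

    Assumes `original` is a 10-digit US number in any punctuation style.
    """
    digits = "".join(c for c in original if c.isdigit())
    area = digits[:3]
    d = _digest(session_key, "phone", original)
    last4 = int.from_bytes(d[:4], "big") % 10000
    return f"({area}) 555-{last4:04d}"
```

Calling `surrogate_name(key, "Sarah Chen")` twice with the same `key` returns the same fictional name, which is the property that keeps multi-turn reasoning coherent. The sketch does not guard against a surrogate colliding with the original value, and it makes no attempt at the statistical fidelity described above.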

## When to use synthetic mode

<CardGroup cols={2}>
  <Card title="Local development" icon="laptop">
    Developers need realistic call transcripts to debug edge cases. Synthetic mode gives them production-shaped data without ever touching real PHI.
  </Card>

  <Card title="CI / automated testing" icon="flask">
    Run your detection and integration tests against synthetic fixtures. No BAA required for your CI environment.
  </Card>

  <Card title="Demos and screenshots" icon="presentation">
    Show what calls look like in public-facing demos or conference talks without exposing real customers.
  </Card>

  <Card title="Model fine-tuning" icon="cpu">
    Fine-tune downstream models on call transcripts that statistically resemble production, without PHI risk.
  </Card>
</CardGroup>

## What synthetic mode is not for

<Warning>
  Do not use synthetic mode for:

  * **Customer support** — agents need the real PHI to help real patients. Use vault-deterministic or reversible AEAD for live calls.
  * **Compliance reporting** — auditors want real audit trails, not synthesized ones.
  * **Re-identifiable analytics** — by design, you cannot get back to originals from synthetic output.
</Warning>

## Setting up synthetic mode

Configure synthetic mode per agent in your policy file:

```yaml theme={null}
agents:
  - name: dev-replay
    upstream: stub
    tokenization: synthetic
    synthetic:
      seed: 0xC0FFEE
      preserve_locale: true
```

With `preserve_locale: true`, Masker keeps geographic context consistent — a Seattle caller stays in the Pacific Northwest, not in an unrelated metro.

## Using the CLI for local testing

The `masker detect` command lets you validate your detection configuration against a sample call fixture without sending real PHI anywhere:

```bash theme={null}
masker detect tests/fixtures/sample-call.json \
  --policy configs/mask_policy.yaml \
  --json
```

To run the full mask pipeline — including synthetic replacement — use `masker mask`:

```bash theme={null}
masker mask --policy configs/mask_policy.yaml
```

For voice call replay with synthetic output, use `masker-voice`:

```bash theme={null}
masker-voice replay \
  --input ./recording.wav \
  --policy configs/mask_policy.yaml \
  --asr-provider deepgram
```

This replays the recording through ASR, runs detection and synthetic replacement, and outputs a masked transcript you can inspect without any real PHI present.
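
If you want a CI step to confirm that none of a fixture's original identifiers survive in the masked transcript, a small check like the following can help. This is a sketch under assumptions: `find_leaks` is a hypothetical helper, not part of the Masker CLI, and it expects you to supply the masked text and the list of identifiers you know appear in the fixture.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits,
    so '415-555-2671' and '(415) 555.2671' compare equal."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def find_leaks(masked_text: str, originals: list[str]) -> list[str]:
    """Return the original identifiers that still appear in the masked text."""
    haystack = _normalize(masked_text)
    leaks = []
    for original in originals:
        needle = _normalize(original)
        # Skip identifiers that normalize to nothing (e.g. pure punctuation).
        if needle and needle in haystack:
            leaks.append(original)
    return leaks
```

A test can then assert `find_leaks(masked_transcript, known_identifiers) == []`. Because punctuation and spaces are stripped before comparison, very short identifiers may produce false positives across word boundaries; treat a hit as a prompt to inspect the transcript, not as proof of a leak.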

## Determinism and the seed parameter

The `seed` parameter controls whether synthesis is reproducible:

<Tabs>
  <Tab title="Fixed seed (dev environments)">
    With a fixed seed, the same input always produces the same surrogate:

    ```yaml theme={null}
    synthetic:
      seed: 0xC0FFEE
    ```

    This is useful for **replay**: a developer can re-run a problematic call and get the same fake names every time, so debugging sessions are reproducible.

    A fixed seed is acceptable in dev environments because those environments see no real PHI in the first place. It is safe to check a dev seed into your dev config.

    <Warning>
      With a known seed and a known surrogate, it is theoretically possible to recover the original value. Never use a fixed seed in a production or analytics environment that handles real PHI.
    </Warning>
  </Tab>

  <Tab title="No seed (analytics exports)">
    Without a seed, each synthesis run produces different surrogates:

    ```yaml theme={null}
    synthetic:
      # no seed field
      preserve_locale: true
    ```

    With no stored seed, synthesis is **non-reversible**: there is no path from "Maria Lopez" back to "Sarah Chen," and the vault is not consulted. For production analytics pipelines, leave the seed unset or, if your pipeline requires one, generate a fresh seed per export and never store it.
  </Tab>
</Tabs>
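
The trade-off between the two tabs can be illustrated with a toy model. The sketch below is not how Masker derives surrogates; `surrogate`, `consistent_guesses`, `fresh_seed`, and the name pool are hypothetical. It shows two things under that assumption: a fixed seed makes the mapping repeatable, and anyone who knows the seed can test guesses about the original against an observed surrogate.

```python
import hashlib
import hmac
import secrets

# Hypothetical surrogate pool for illustration only.
POOL = ["Maria Lopez", "James Baker", "Priya Nair", "Daniel Novak",
        "Aisha Okafor", "Tomas Silva", "Elena Petrova", "Kenji Sato"]

# The dev seed from the config example, encoded as bytes for use as an HMAC key.
FIXED_SEED = (0xC0FFEE).to_bytes(3, "big")


def surrogate(seed: bytes, original: str) -> str:
    """Deterministic for a given seed: same seed and input give the same output."""
    d = hmac.new(seed, original.encode("utf-8"), hashlib.sha256).digest()
    return POOL[int.from_bytes(d[:4], "big") % len(POOL)]


def consistent_guesses(seed: bytes, observed: str, guesses: list[str]) -> list[str]:
    """With the seed known, keep only the guesses that map to the observed surrogate."""
    return [g for g in guesses if surrogate(seed, g) == observed]


def fresh_seed() -> bytes:
    """A random per-export seed that is used once and never written anywhere."""
    return secrets.token_bytes(32)
```

In this model, `consistent_guesses` narrows a list of candidate originals to those compatible with what was observed, which is why a fixed seed belongs only in environments that never see real PHI. With a seed from `fresh_seed()` that is discarded after the export, the same test cannot be run.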

## Differential privacy budget

Synthetic mode honors the same `epsilon` (ε) parameter as vault-deterministic tokenization. ε controls how aggressively rare values are perturbed before synthesis; lower values perturb more and give stronger privacy:

| ε value                        | Privacy level                                   | Recommended use              |
| ------------------------------ | ----------------------------------------------- | ---------------------------- |
| `0.5` (default for healthcare) | Strong privacy, some loss of statistical detail | PHI-adjacent analytics       |
| `1.0`                          | Moderate privacy, more statistically useful     | General healthcare analytics |
| `2.0`                          | Weak privacy                                    | Non-PHI-adjacent data only   |

Values appearing fewer than `k` times in a session window are re-sampled from the full distribution rather than the empirical one. This prevents a rare identifier — "the only Latvian-named caller" — from being trivially re-identified in your synthetic export.
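
The rare-value rule can be sketched as a simple count threshold. This is an illustration only: `choose_distribution` is a hypothetical name, the labels `"full"` and `"empirical"` are placeholders, and the value of `k` used in the usage note is made up, since this page does not state a default.

```python
from collections import Counter


def choose_distribution(window: list[str], k: int) -> dict[str, str]:
    """For each distinct value in a session window, decide which distribution
    its surrogate should be sampled from.

    Values seen fewer than k times are marked "full" (sample from the full
    reference distribution); everything else is marked "empirical".
    """
    counts = Counter(window)
    return {
        value: ("full" if count < k else "empirical")
        for value, count in counts.items()
    }
```

For a window of five callers tagged `"Smith"` and one tagged `"Ozols"` with a hypothetical `k = 3`, the single `"Ozols"` entry is routed to the full distribution, so its surrogate carries no signal that the original was unusual.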

## Programmatic synthesis

You can also call synthesis as a one-shot transform outside the proxy path:

```bash theme={null}
curl -X POST https://masker-voice.fly.dev/api/v1/synthesize \
  -H "Cookie: masker_session=..." \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Hi, this is Sarah Chen at 415-555-2671...",
    "policy": "healthcare-default",
    "seed": 12345
  }'
```

This endpoint is gated behind the same private preview as agent-level synthetic mode. Contact [hello@masker.dev](mailto:hello@masker.dev) to enable it.
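
The same request can be built from Python with only the standard library. This sketch mirrors the `curl` call above field for field; `build_synthesize_request` is a hypothetical helper name, and the response format is not documented on this page, so the sketch stops at constructing the request.

```python
import json
import urllib.request

SYNTHESIZE_URL = "https://masker-voice.fly.dev/api/v1/synthesize"


def build_synthesize_request(
    text: str, policy: str, seed: int, session_cookie: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request matching the curl example."""
    body = json.dumps({"text": text, "policy": policy, "seed": seed}).encode("utf-8")
    return urllib.request.Request(
        SYNTHESIZE_URL,
        data=body,
        headers={
            "Cookie": f"masker_session={session_cookie}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and read the response body. Keep the session cookie out of source control, for example by reading it from an environment variable.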
