# Configuring Masker's mask_policy.yaml detection rules

> Learn how mask_policy.yaml controls which PHI entities are detected, how each span is tokenized or redacted, and which key ring applies.

The masking policy is the YAML file that drives every decision Masker makes at runtime: which entity types to scan for, which detection passes to run, whether to tokenize or redact each entity, and which key to use when minting tokens. Masker ships a default policy named `healthcare-default` at `configs/mask_policy.yaml`. It targets the HIPAA Safe Harbor identifiers, covering 9 of the 18 categories fully and 3 partially. You can tune that file, create per-agent policies, or switch tokenization schemes, all without touching code.

## Sample mask_policy.yaml

The annotated example below matches the structure Masker expects. Fields commented `# required` must be present, as must `passes` and `entities` (see the field reference); every other field is optional.

```yaml mask_policy.yaml
name: healthcare-default       # required — unique name, referenced by agents
version: 1                     # required — schema version, currently 1
description: |
  HIPAA Safe Harbor coverage for voice AI agents in healthcare.
  Covers 9 of 18 categories fully, 3 partially.

# Default key ID for tokenization.
# Must match an env var MASKER_KEY_<kid> on the running server.
kid: K_HEALTHCARE              # required

# Differential privacy budget for surrogate / synthetic generation.
epsilon: 0.5                   # optional, default 0.5

# Tokenization scheme applied to every entity unless overridden.
# vault-deterministic — HMAC lookup in a SQLite vault; same input = same token
# reversible-aead     — stateless AES-256-GCM-SIV; no vault state needed
# synthetic           — generate a realistic-looking but fake value
tokenization: vault-deterministic   # required

# Detection passes — order matters.
# regex   — fast pattern matching, runs first
# gemma   — on-device NER model, catches names and context-dependent spans
# diarize — speaker attribution for audio; auto-enabled for audio webhooks
passes:
  - regex
  - gemma           # comment out to skip NER and run regex-only

# Per-entity detection and action rules
entities:

  PHONE:
    enabled: true
    regex: true
    ner: true
    confidence_threshold: 0.6   # NER hits below this score are dropped
    action: tokenize

  SSN:
    enabled: true
    regex: true
    ner: false                  # regex covers SSN fully; NER not needed
    confidence_threshold: 0.0
    action: tokenize

  NAME:
    enabled: true
    regex: false                # names don't match regex patterns reliably
    ner: true
    confidence_threshold: 0.7
    action: tokenize

  EMAIL:
    enabled: true
    regex: true
    ner: false
    confidence_threshold: 0.0
    action: tokenize

  DOB:
    enabled: true
    regex: true
    ner: true
    confidence_threshold: 0.5
    action: tokenize

  ADDRESS:
    enabled: true
    regex: true                 # ZIP codes and street patterns
    ner: true                   # full address recognition
    confidence_threshold: 0.6
    action: tokenize

  MRN:
    enabled: true
    regex: true
    ner: true
    confidence_threshold: 0.6
    action: tokenize

  ACCOUNT:
    enabled: true
    regex: true
    ner: true
    confidence_threshold: 0.7
    action: tokenize

  IP_ADDRESS:
    enabled: true
    regex: true
    ner: false
    confidence_threshold: 0.0
    action: redact              # IPs aren't useful to the LLM; just remove them

# Audit log behavior
audit:
  log_events: true
  log_payloads: false           # encrypted payload retention; off by default
  retention_days: 2555          # 7 years; HIPAA's documentation rule requires 6
```

## Field reference

### Top-level fields

| Field          | Type   | Required | Description                                                              |
| -------------- | ------ | -------- | ------------------------------------------------------------------------ |
| `name`         | string | yes      | Unique policy name. Referenced by agents and shown in the portal.        |
| `version`      | int    | yes      | Schema version. Currently `1`.                                           |
| `description`  | string | no       | Free-form description shown in the portal.                               |
| `kid`          | string | yes      | Default key ID. Must match `MASKER_KEY_<kid>` in your environment.       |
| `epsilon`      | float  | no       | Differential privacy budget for synthetic surrogates. Defaults to `0.5`. |
| `tokenization` | enum   | yes      | One of `vault-deterministic`, `reversible-aead`, or `synthetic`.         |
| `passes`       | list   | yes      | Ordered list of detection passes: `regex`, `gemma`, `diarize`.           |
| `entities`     | map    | yes      | Per-entity rules. See below.                                             |
| `audit`        | map    | no       | Audit log behavior.                                                      |

### Per-entity fields

| Field                  | Type  | Default    | Description                                                             |
| ---------------------- | ----- | ---------- | ----------------------------------------------------------------------- |
| `enabled`              | bool  | `true`     | Master switch for this entity. Set to `false` to skip it entirely.      |
| `regex`                | bool  | `true`     | Run the regex pass for this entity.                                     |
| `ner`                  | bool  | `true`     | Run the NER pass (Gemma model) for this entity.                         |
| `confidence_threshold` | float | `0.6`      | Minimum NER confidence score. Hits below this are discarded.            |
| `action`               | enum  | `tokenize` | What to do with detected spans: `tokenize`, `redact`, or `passthrough`. |

### Audit fields

| Field            | Type | Default | Description                                                                |
| ---------------- | ---- | ------- | -------------------------------------------------------------------------- |
| `log_events`     | bool | `true`  | Write a per-redaction event to the audit log.                              |
| `log_payloads`   | bool | `false` | Retain encrypted payloads alongside events. Off by default.                |
| `retention_days` | int  | `2555`  | How long audit records are kept. 2555 days is 7 years; HIPAA requires 6.   |

## Tokenization schemes

<Tabs>
  <Tab title="vault-deterministic">
    Masker stores a mapping of `(plaintext, entity_kind)` → token in a local SQLite vault. The same input always produces the same token, so LLM responses referring to `MSKV1.PHONE.K_HEALTHCARE.abc123` can be correctly rehydrated even across turns.

    Best for: single-node deployments where vault state is easy to persist.

    Drawback: requires a shared vault in multi-replica setups. Use a Postgres database via `MASKER_DATABASE_URL` or switch to `reversible-aead` instead.
  </Tab>

  <Tab title="reversible-aead">
    Masker encrypts each plaintext span using AES-256-GCM-SIV and encodes the ciphertext into the token. No vault state is required — the key alone is sufficient to rehydrate.

    Best for: multi-replica Kubernetes deployments, Fly.io multi-region, or anywhere you want to avoid coordinating shared state.

    Drawback: tokens are longer and not stable across key rotations (the old key is required to rehydrate old tokens).
  </Tab>

  <Tab title="synthetic">
    Masker replaces each detected span with a realistic-looking but entirely synthetic value. Names become different names, phone numbers become different phone numbers.

    Best for: generating safe test fixtures or demo recordings where human-readable output matters more than reversibility.

    Drawback: synthetic values cannot be rehydrated, so any surrogate the LLM repeats reaches the caller as-is and will not match the value the caller originally gave.
  </Tab>
</Tabs>
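
The determinism that `vault-deterministic` relies on can be illustrated with a keyed HMAC. The sketch below is not Masker's actual derivation, which this page does not specify: the `demo_token` function, the `kind:plaintext` message layout, and the 12-character suffix are assumptions chosen for illustration. Only the `MSKV1.<kind>.<kid>.` prefix shape is taken from the example token above. It assumes `openssl` is installed.

```bash
# Illustration only: shows why a keyed HMAC gives "same input = same token".
# demo_token, the message layout, and the suffix length are hypothetical.
demo_token() {
  local kind="$1" plaintext="$2" kid="$3" key="$4"
  local digest
  # HMAC-SHA256 over "kind:plaintext"; awk keeps only the hex digest, since
  # the label openssl prints before it varies between versions.
  digest="$(printf '%s:%s' "$kind" "$plaintext" |
    openssl dgst -sha256 -hmac "$key" | awk '{print $NF}')"
  printf 'MSKV1.%s.%s.%s\n' "$kind" "$kid" "$(printf '%s' "$digest" | cut -c1-12)"
}
```

Calling `demo_token PHONE "555-0100" K_HEALTHCARE "$key"` twice prints the same token, while a different plaintext or key produces a different suffix. A real deployment should not pass key material on the command line, where other processes can read it.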

## Tuning detection sensitivity

Each entity's `confidence_threshold` sets the minimum score an NER hit must reach to be kept. Lower values catch more spans but admit more false positives; higher values are more precise but can miss edge cases.

<Tip>
  Start with the defaults, run `masker detect` against representative transcripts (with real PHI replaced by realistic stand-ins), and raise or lower thresholds based on what you observe.
</Tip>

To disable NER for a specific entity and rely only on regex, set `ner: false`. SSN and EMAIL are good candidates — their formats are regular enough that NER adds noise rather than coverage.

To disable an entity type entirely, set `enabled: false`. This prevents Masker from running any detection pass for that kind.
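
Before adjusting thresholds, it can help to see the current values side by side. The helper below is a rough sketch rather than a Masker command: `list_thresholds` is a hypothetical name, and the `awk` patterns assume the indentation used in the sample policy (entity names indented two spaces, their fields four). A policy formatted differently would need a real YAML parser.

```bash
# Print "<ENTITY> <confidence_threshold>" for each entity in a policy file.
# Assumes the sample's layout: "  ENTITY:" followed by "    field: value".
list_thresholds() {
  awk '
    # An entity header: two spaces, an upper-case name, a colon, nothing else.
    /^  [A-Z0-9_]+:[ \t]*$/ { entity = $1; sub(/:$/, "", entity); next }
    # The threshold line under the most recent entity header.
    /^    confidence_threshold:/ && entity != "" { print entity, $2 }
  ' "$1"
}
```

Running `list_thresholds configs/mask_policy.yaml` against the sample policy lists one line per entity, which makes it easy to spot thresholds that differ from the `0.6` default.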

## Applying a policy

### Global policy

Set `MASKER_POLICY_PATH` to point to your policy file before starting Masker. The default is `configs/mask_policy.yaml`.
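
A small pre-start check can confirm which file will be picked up. The sketch below mirrors the documented default; `resolve_policy_path` is a hypothetical helper, and treating an empty `MASKER_POLICY_PATH` the same as an unset one is an assumption about Masker's behavior.

```bash
# Print the policy path that would be used and fail if it is not readable.
# Falls back to the documented default when MASKER_POLICY_PATH is unset
# (or empty, which is an assumption here).
resolve_policy_path() {
  local path="${MASKER_POLICY_PATH:-configs/mask_policy.yaml}"
  if [ ! -r "$path" ]; then
    echo "policy file not readable: $path" >&2
    return 1
  fi
  printf '%s\n' "$path"
}
```

For example, after `export MASKER_POLICY_PATH=/etc/masker/mask_policy.yaml` (an illustrative path), `resolve_policy_path` prints that path or reports that the file is missing.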

To reload a running server without restarting it:

```bash
curl -X POST https://masker-voice.fly.dev/api/v1/admin/policy/reload \
  -H "Cookie: masker_session=$MASKER_SESSION"
```

The reload is atomic — in-flight requests complete on the old policy; new requests immediately pick up the updated one.

### Per-agent policy overrides

Each agent inherits the global policy by default. To assign a custom policy to one agent, pass `policy_yaml` when creating or updating the agent:

```bash
curl -X POST https://masker-voice.fly.dev/api/v1/agents \
  -H "Cookie: masker_session=$MASKER_SESSION" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "billing-bot",
    "upstream": "openai:gpt-4o-mini",
    "policy_yaml": "<contents of custom-policy.yaml>"
  }'
```

The custom YAML is stored alongside the agent record and loaded only for that agent's requests.
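
Because `policy_yaml` is a JSON string, the file contents have to be escaped (newlines, quotes) before they go into the request body. One way to do that, assuming `jq` is installed, is sketched below. The helper names `build_agent_payload` and `create_agent` are hypothetical; the endpoint, headers, and field names are the ones shown in the request above.

```bash
# Build the request body with jq so the YAML becomes one escaped JSON string.
build_agent_payload() {
  local name="$1" upstream="$2" policy_file="$3"
  # -R reads raw text, -s slurps the whole file into a single string (.),
  # -c prints compact JSON.
  jq -Rsc --arg name "$name" --arg upstream "$upstream" \
    '{name: $name, upstream: $upstream, policy_yaml: .}' "$policy_file"
}

# Send the payload to the agents endpoint; --data-binary @- reads the body
# from stdin without altering it.
create_agent() {
  build_agent_payload "$@" |
    curl -X POST https://masker-voice.fly.dev/api/v1/agents \
      -H "Cookie: masker_session=$MASKER_SESSION" \
      -H "Content-Type: application/json" \
      --data-binary @-
}
```

With these defined, `create_agent billing-bot openai:gpt-4o-mini custom-policy.yaml` sends the same request as above with the file's contents embedded safely.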

## CLI: validate and diff

Use the `masker policy` subcommands to validate and compare policies before deploying them.

### Validate before deploying

```bash
masker policy validate configs/mask_policy.yaml
```

Validation catches the three most common errors:

* `unknown_kid` — the policy references a `kid` with no matching `MASKER_KEY_<kid>` environment variable
* `invalid_pass` — the `passes` list contains a name Masker doesn't recognize
* `missing_entity` — an entity referenced in `passes` is not declared under `entities`

<Warning>
  Validation errors prevent boot in production mode. In development mode (`MASKER_DEV=1`), Masker logs the error and falls back to defaults. Never rely on this behavior in production.
</Warning>
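
The `unknown_kid` condition can also be checked by hand on the target host, for example when debugging why a server refuses to boot. The sketch below is not part of the Masker CLI: `policy_kid` and `check_policy_key` are hypothetical helpers, and the `sed` pattern assumes an unquoted, top-level `kid:` line like the one in the sample policy.

```bash
# Print the value of the top-level `kid:` field, ignoring trailing comments.
# Assumes an unquoted value on its own line, as in the sample policy.
policy_kid() {
  sed -n 's/^kid:[[:space:]]*\([A-Za-z0-9_]*\).*/\1/p' "$1" | head -n 1
}

# Succeed only if MASKER_KEY_<kid> is present in the environment.
check_policy_key() {
  local kid var_name
  kid="$(policy_kid "$1")"
  if [ -z "$kid" ]; then
    echo "no kid found in $1" >&2
    return 2
  fi
  var_name="MASKER_KEY_${kid}"
  # printenv only sees exported variables, which is what the server inherits.
  if ! printenv "$var_name" >/dev/null; then
    echo "missing environment variable: $var_name" >&2
    return 1
  fi
  echo "found $var_name"
}
```

Running `check_policy_key configs/mask_policy.yaml` against the sample policy looks for `MASKER_KEY_K_HEALTHCARE` and reports whether it is set.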

### Diff two policy versions

```bash
masker policy diff configs/mask_policy.yaml configs/mask_policy_v2.yaml
```

The diff shows which entities were added or removed, which thresholds changed, and which actions changed. Run this before replacing a live policy to understand the impact on detection coverage.
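
Taken together, the validate, diff, and reload steps on this page suggest a rollout sequence like the one sketched below. Several things here are assumptions rather than documented behavior: that `masker policy validate` exits non-zero on errors, that the script runs on the host whose policy file the server loads, and that the reload endpoint re-reads that file. `deploy_policy` and `MASKER_URL` are hypothetical names.

```bash
# Sketch: validate the candidate, show the diff, swap the file, reload.
deploy_policy() {
  local live="$1"        # policy file the server currently loads
  local candidate="$2"   # edited policy to roll out

  # Stop early if the candidate does not validate (assumes a non-zero exit
  # status on validation errors).
  if ! masker policy validate "$candidate"; then
    echo "candidate failed validation; live policy left untouched" >&2
    return 1
  fi

  # Show the change for review. The exit status of `policy diff` is not
  # specified on this page, so it is deliberately ignored.
  masker policy diff "$live" "$candidate" || true

  # Replace the live file, then ask the running server to reload it.
  cp "$candidate" "$live" || return 1
  curl --fail --silent --show-error -X POST \
    "${MASKER_URL:-https://masker-voice.fly.dev}/api/v1/admin/policy/reload" \
    -H "Cookie: masker_session=$MASKER_SESSION"
}
```

A call such as `deploy_policy configs/mask_policy.yaml configs/mask_policy_v2.yaml` leaves the live file unchanged when validation fails, so a bad edit never reaches the running server.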
