# PHI and PII identifiers Masker detects in your calls

> Full coverage matrix of HIPAA Safe Harbor identifiers, how each detector works, and which categories are fully covered versus on the roadmap.

Masker inspects every payload heading to your LLM and marks every span that looks like a regulated identifier before it crosses the compliance firewall. Detection runs in passes — each pass uses a different technique with different latency and recall characteristics, and the output of one pass feeds the next.

## Detection passes

<Steps>
  <Step title="Pass 1 — Regex catalogue">
    The first pass runs a curated set of regular expressions against the full payload. This covers the **deterministic** PHI and PII shapes — identifiers with a fixed structure that regex can match reliably:

    * Phone numbers (E.164, North American, international forms)
    * Social Security Numbers (with and without dashes)
    * Email addresses (RFC 5322 simplified)
    * Medical record numbers (configurable per-tenant pattern)
    * Dates (MM/DD/YYYY, ISO 8601, written forms)
    * ZIP codes (US 5-digit and ZIP+4)
    * IP addresses (v4 and v6)
    * URLs containing identifying paths
    * Account, license, vehicle, and device identifiers

    **Latency:** sub-millisecond per request. This pass runs on every payload without exception.

    **Recall:** very high for the shapes it knows. It misses identifiers spoken as prose — for example, "my number is five-five-five-one-two-three-four-five-six-seven" won't match a phone regex.
  </Step>

  <Step title="Pass 2 — Gemma-4 NER">
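    After regex, Masker runs the partially redacted payload through a fine-tuned **Gemma-4** named-entity recognition (NER) model. NER targets what regex can't match; the [coverage matrix](#coverage-matrix) shows which of these categories are fully covered today: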

    * Names ("Dr. Sarah Chen", "Mr. Johnson")
    * Geographic subdivisions smaller than a state
    * Health plan beneficiary numbers in unusual formats
    * Account numbers Masker hasn't seen a pattern for yet
    * Diagnoses, procedures, and clinical phrases that map to identifying entities
    * **Numbers spoken as words** — see [spoken-form detection](#spoken-form-detection) below

    The model was fine-tuned on three corpora:

    <AccordionGroup>
      <Accordion title="i2b2 / n2c2 — de-identified clinical notes">
        The gold standard for clinical NER. These are real de-identified clinical notes from major research competitions. Fine-tuning on this corpus gives Masker strong recall on the kind of PHI that appears in medical documentation.
      </Accordion>

      <Accordion title="MedDialog — patient-doctor dialogue">
        Conversational transcripts of patient-provider interactions. This corpus is what makes Masker effective on voice calls, where PHI appears in natural speech rather than structured records.
      </Accordion>

      <Accordion title="Switchboard — conversational phone speech">
        A large corpus of telephone conversations. Combined with MedDialog, this helps the model handle the informal, fragmented utterances typical of real calls.
      </Accordion>
    </AccordionGroup>

    **Latency:** 30–80 ms on Masker's default GPU pool. The model runs in-region (US-West today; US-East and EU are on the roadmap).

    **Recall:** materially higher than regex on conversational input. Also catches misspelled or transliterated names.

    <Note>
      The Gemma-4 model ships inside the Rust container. No external API calls are made during detection — the model loads once at boot and stays in memory.
    </Note>
  </Step>

  <Step title="Pass 3 — Diarization (audio path only)">
    When Masker sits in the audio path via a voice platform webhook, it adds a diarization pass that runs **before** the regex pass, despite being listed third here. This pass:

    * Separates speakers (caller versus agent)
    * Tags each turn with speaker ID and timing
    * Lets your policy mask the patient's voice content while leaving agent prompts untouched

    This is what makes compliance reports actionable: "Across 1,247 calls, the patient spoke 8,930 turns. We redacted PHI in 4,118 of them."
  </Step>
</Steps>
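
To make the recall trade-off of Pass 1 concrete, here is a minimal sketch of a regex pass. The three patterns and the `regex_pass` helper are hypothetical simplifications written for this page, not Masker's actual catalogue; the event shape loosely follows the detection output described later on this page.

```python
import re

# Hypothetical, simplified patterns for three of the shapes listed under Pass 1.
PATTERNS = {
    "SSN": re.compile(r"\b(?:\d{3}-\d{2}-\d{4}|\d{9})\b"),
    "US_PHONE": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"),
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def regex_pass(payload: str) -> list[dict]:
    """Return one event per match, ordered by start offset."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(payload):
            hits.append({
                "pass": "regex",
                "kind": kind,
                "span": [match.start(), match.end()],
                "text": match.group(),
            })
    return sorted(hits, key=lambda hit: hit["span"][0])


structured = "SSN 123-45-6789, call (510) 555-1212, ZIP 94110-1234."
spoken = "my number is five-five-five-one-two-three-four-five-six-seven"

print([hit["kind"] for hit in regex_pass(structured)])  # ['SSN', 'US_PHONE', 'ZIP']
print(regex_pass(spoken))                               # []
```

The structured payload yields three hits, while the spoken phone number yields none, which is the gap the NER pass is meant to close.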

## Spoken-form detection

Callers rarely recite structured identifiers. They say:

> "My Social is **one two three forty-five sixty-seven eighty-nine**."

> "Call me back at **five ten, five five five, one two one two**."

The NER pass handles this. The Gemma-4 model was fine-tuned on conversational speech and recognizes digit-word sequences as the identifier kinds they represent — even when the spoken form doesn't match any regex pattern.

| Spoken form                                        | Detected as   |
| -------------------------------------------------- | ------------- |
| "one two three forty-five sixty-seven eighty-nine" | SSN           |
| "five ten five five five one two one two"          | US phone      |
| "January fifth, nineteen eighty"                   | Date of birth |
| "nine four one one zero"                           | ZIP code      |
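
As a rough illustration of why these phrases map to the kinds in the table, the sketch below converts digit words to a digit string with plain rules. This is not how the Gemma-4 model works; `spoken_to_digits` and the length heuristic in `KIND_BY_LENGTH` are hypothetical helpers, and spoken dates such as "January fifth, nineteen eighty" are out of scope here.

```python
UNITS = {"zero": 0, "oh": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

# Hypothetical length heuristic; a real model also uses the surrounding context.
KIND_BY_LENGTH = {5: "ZIP code", 9: "SSN", 10: "US phone"}


def spoken_to_digits(phrase: str) -> str:
    """Turn a run of digit words into a digit string."""
    words = phrase.lower().replace("-", " ").replace(",", " ").split()
    digits = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in UNITS:
            digits.append(str(UNITS[word]))
        elif word in TEENS:
            digits.append(str(TEENS[word]))
        elif word in TENS:
            value = TENS[word]
            # "forty five" is read as 45; a bare "forty" stays 40.
            if i + 1 < len(words) and UNITS.get(words[i + 1], 0) != 0:
                value += UNITS[words[i + 1]]
                i += 1
            digits.append(str(value))
        else:
            raise ValueError(f"not a digit word: {word!r}")
        i += 1
    return "".join(digits)


for phrase in ("one two three forty-five sixty-seven eighty-nine",
               "five ten, five five five, one two one two",
               "nine four one one zero"):
    digits = spoken_to_digits(phrase)
    print(digits, KIND_BY_LENGTH.get(len(digits)))
```

Rules like these are brittle: "forty five" could also mean the two groups 40 and 5, and length alone cannot tell a nine-digit account number from an SSN. That ambiguity is one reason to hand spoken forms to a model that sees the surrounding context.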

## Coverage matrix

<Note>
  Masker covers **9 of 18** HIPAA Safe Harbor identifier categories fully today. Three are partial. The remaining six are on the May 30 production roadmap. Every compliance report shows the actual coverage at generation time.
</Note>

| Identifier                      | Detector            | Format examples                                                   | Coverage |
| ------------------------------- | ------------------- | ----------------------------------------------------------------- | -------- |
| SSN                             | Regex + spoken-form | `123-45-6789`, "one two three forty-five sixty-seven eighty-nine" | Full     |
| US phone / fax                  | Regex + spoken-form | `(510) 555-1212`, "five ten five five five…"                      | Full     |
| Email                           | RFC 5322 subset     | `name@example.com`                                                | Full     |
| URL                             | URL parser          | `https://example.com/path`                                        | Full     |
| IPv4 / IPv6                     | Regex               | `192.0.2.1`, `2001:db8::1`                                        | Full     |
| Names                           | NER (Gemma-4)       | "Dr. Sarah Chen", "Mr. Johnson"                                   | Full     |
| Addresses                       | Geo + regex         | Street numbers, unit, city/state                                  | Full     |
| Credit card                     | Luhn algorithm      | `4111 1111 1111 1111`                                             | Full     |
| VIN                             | Regex + checksum    | `1HGCM82633A004352`                                               | Full     |
| ZIP code                        | Regex               | `94110`, `94110-1234`                                             | Partial  |
| Dates                           | Dateparser          | `Jan 5, 1980`, "January fifth, nineteen eighty"                   | Partial  |
| Medical record numbers          | Configurable regex  | Per-tenant format                                                 | Partial  |
| Geographic subdivisions         | NER                 | Cities, counties                                                  | Roadmap  |
| Health plan beneficiary numbers | NER                 | Various formats                                                   | Roadmap  |
| Account numbers                 | NER                 | Various formats                                                   | Roadmap  |
| Device identifiers              | Regex               | Serial numbers, MAC addresses                                     | Roadmap  |
| Biometric identifiers           | —                   | —                                                                 | Roadmap  |
| Full-face photographs           | —                   | —                                                                 | Roadmap  |

For the authoritative Safe Harbor mapping, see [HIPAA Safe Harbor](/compliance/hipaa-safe-harbor).
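
The credit card and VIN rows rely on checksums rather than shape alone. The sketch below shows the standard Luhn check and the North American VIN check-digit calculation, run against the two format examples from the matrix. The function names are illustrative, and whether Masker's detectors follow exactly these steps is an assumption.

```python
def luhn_valid(number: str) -> bool:
    """Standard Luhn check: double every second digit, counting from the right."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# VIN transliteration table (the letters I, O, and Q are not allowed in a VIN).
VIN_VALUES = {
    **{str(d): d for d in range(10)},
    **dict(zip("ABCDEFGH", range(1, 9))),
    **dict(zip("JKLMN", range(1, 6))),
    "P": 7, "R": 9,
    **dict(zip("STUVWXYZ", range(2, 10))),
}
VIN_WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def vin_check_digit_valid(vin: str) -> bool:
    """Compare position 9 with the weighted sum of all positions modulo 11."""
    vin = vin.upper()
    if len(vin) != 17 or any(ch not in VIN_VALUES for ch in vin):
        return False
    remainder = sum(VIN_VALUES[ch] * w for ch, w in zip(vin, VIN_WEIGHTS)) % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected


print(luhn_valid("4111 1111 1111 1111"))           # True
print(vin_check_digit_valid("1HGCM82633A004352"))  # True
```

A checksum filter like this cuts false positives from arbitrary 16-digit or 17-character strings, at the cost of missing identifiers that were mistyped or misheard.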

## Detection output

Each detected span becomes an in-memory event before being handed to the tokenizer:

```json
{
  "pass": "gemma",
  "kind": "NAME",
  "span": [142, 152],
  "text": "Sarah Chen",
  "confidence": 0.94
}
```

The `text` field is held in memory only long enough to call the tokenizer. It is never written to disk.
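
One way to picture this separation is a small sketch in which the in-memory event carries the raw text but the record derived for persistence does not. The `Detection` class, the `to_audit_record` helper, and the token format are hypothetical names chosen for illustration, not Masker's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    pass_name: str          # "pass" is a reserved word in Python
    kind: str
    span: tuple[int, int]
    text: str               # raw PHI, meant to live in memory only
    confidence: float


def to_audit_record(detection: Detection, token: str) -> dict:
    """Keep only the span, entity kind, and token; the raw text is dropped."""
    return {"kind": detection.kind, "span": list(detection.span), "token": token}


payload = "Patient Sarah Chen called about a refill."
start = payload.index("Sarah Chen")
detection = Detection("gemma", "NAME", (start, start + len("Sarah Chen")), "Sarah Chen", 0.94)

# The token value below is made up for the example.
record = to_audit_record(detection, token="tok_NAME_0001")
print(record)  # {'kind': 'NAME', 'span': [8, 18], 'token': 'tok_NAME_0001'}
```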

## Tuning detection

Each entity kind is tuned through a small set of fields in `mask_policy.yaml`:

```yaml
entities:
  PHONE:
    enabled: true
    regex: true
    ner: true
    confidence_threshold: 0.6
    action: tokenize     # tokenize | redact | passthrough
```

| Field                  | Effect                                                   |
| ---------------------- | -------------------------------------------------------- |
| `enabled`              | Turn detection on or off for this entity kind entirely   |
| `regex`                | Enable or disable the regex pass for this entity kind    |
| `ner`                  | Enable or disable the NER pass for this entity kind      |
| `confidence_threshold` | Drop NER hits below this score (0.0–1.0)                 |
| `action: tokenize`     | Replace with a reversible token (default)                |
| `action: redact`       | Replace with `[REDACTED:KIND]`, no rehydration possible  |
| `action: passthrough`  | Log the detection but don't replace — useful for testing |

See [Mask policy](/configuration/mask-policy) for the full schema.
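
The sketch below shows how these fields could combine when a policy is applied to a list of detections. It assumes the YAML has already been loaded into a plain dictionary, that `enabled: false` skips the entity entirely, and that only non-regex hits are thresholded. The `keep` and `apply_policy` helpers and the `<KIND_n>` token format are hypothetical; only the `[REDACTED:KIND]` form comes from the table above. Overlapping spans are not handled.

```python
POLICY = {
    "PHONE": {
        "enabled": True,
        "regex": True,
        "ner": True,
        "confidence_threshold": 0.6,
        "action": "tokenize",
    }
}


def keep(detection: dict, policy: dict) -> bool:
    """Decide whether a detection survives the per-entity switches."""
    rule = policy.get(detection["kind"])
    if rule is None or not rule["enabled"]:
        return False
    if detection["pass"] == "regex":
        return rule["regex"]
    # Any non-regex pass is treated as NER here; only NER hits are thresholded.
    return rule["ner"] and detection["confidence"] >= rule["confidence_threshold"]


def apply_policy(payload: str, detections: list[dict], policy: dict) -> str:
    kept = sorted((d for d in detections if keep(d, policy)), key=lambda d: d["span"][0])
    numbered = list(enumerate(kept, start=1))
    # Replace right to left so earlier offsets stay valid.
    for number, detection in reversed(numbered):
        kind = detection["kind"]
        action = policy[kind]["action"]
        if action == "passthrough":
            continue
        replacement = f"<{kind}_{number}>" if action == "tokenize" else f"[REDACTED:{kind}]"
        start, end = detection["span"]
        payload = payload[:start] + replacement + payload[end:]
    return payload


payload = "Call (510) 555-1212 or maybe five ten."
detections = [
    {"pass": "regex", "kind": "PHONE", "span": [5, 19]},
    {"pass": "gemma", "kind": "PHONE", "span": [29, 37], "confidence": 0.41},
]
redact_policy = {"PHONE": {**POLICY["PHONE"], "action": "redact"}}

print(apply_policy(payload, detections, POLICY))         # Call <PHONE_1> or maybe five ten.
print(apply_policy(payload, detections, redact_policy))  # Call [REDACTED:PHONE] or maybe five ten.
```

The low-confidence NER hit on "five ten" falls below the 0.6 threshold and is left in place, which shows the trade-off the threshold controls: raising it reduces false positives but lets more borderline spans through.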

## Privacy properties

<CardGroup cols={2}>
  <Card title="PHI never logged" icon="shield-check">
    Only spans, entity kinds, and resulting tokens are written to the audit chain. The original PHI value is never persisted.
  </Card>

  <Card title="In-region inference" icon="lock">
    Detection runs inside your VPC, or in Masker's HIPAA-eligible region for hosted deployments.
  </Card>

  <Card title="No external calls" icon="plug-zap">
    The Gemma-4 model ships inside the Rust container. No data leaves for an external inference API during detection.
  </Card>

  <Card title="Auditable source" icon="code">
    You can audit detection by reading the source. The repository is at [github.com/maskerdev/masker-core](https://github.com/maskerdev/masker-core).
  </Card>
</CardGroup>
