Masker inspects every payload heading to your LLM and marks every span that looks like a regulated identifier before it crosses the compliance firewall. Detection runs in passes: each pass uses a different technique with different latency and recall characteristics, and the output of one pass feeds the next.
Detection passes
Pass 1 — Regex catalogue
The first pass runs a curated set of regular expressions against the full payload. This covers the deterministic PHI and PII shapes — identifiers with a fixed structure that regex can match reliably:
- Phone numbers (E.164, North American, international forms)
- Social Security Numbers (with and without dashes)
- Email addresses (RFC 5322 simplified)
- Medical record numbers (configurable per-tenant pattern)
- Dates (MM/DD/YYYY, ISO 8601, written forms)
- ZIP codes (US 5-digit and ZIP+4)
- IP addresses (v4 and v6)
- URLs containing identifying paths
- Account, license, vehicle, and device identifiers
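As a rough illustration of how a regex catalogue can turn fixed-shape identifiers into spans, the sketch below matches four of the shapes listed above. The patterns, the `Span` type, and the `regex_pass` function are assumptions made for this example and are deliberately narrower than a production catalogue; they are not Masker's actual patterns.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    kind: str
    start: int
    end: int


# Illustrative patterns only (hypothetical, simplified for the example).
PATTERNS = {
    # Dashed SSN form only; the undashed form needs extra context to avoid false hits.
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    # North American forms such as (510) 555-1212 or 510-555-1212.
    "US_PHONE": re.compile(r"(?:\(\d{3}\)\s?|\b\d{3}[.-])\d{3}[.-]\d{4}\b"),
    # US 5-digit ZIP with optional +4 suffix.
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def regex_pass(text: str) -> list[Span]:
    """Run every pattern over the full payload and return spans sorted by position."""
    spans = [
        Span(kind, match.start(), match.end())
        for kind, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(spans, key=lambda span: span.start)
```

Returning character offsets rather than the matched strings keeps the raw value out of the result, so a later stage can decide how each span is replaced.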
Pass 2 — Gemma-4 NER
After regex, Masker runs the partially-redacted payload through a fine-tuned Gemma-4 named-entity recognition model. NER catches what regex can’t:
- Names (“Dr. Sarah Chen”, “Mr. Johnson”)
- Geographic subdivisions smaller than a state
- Health plan beneficiary numbers in unusual formats
- Account numbers Masker hasn’t seen a pattern for yet
- Diagnoses, procedures, and clinical phrases that map to identifying entities
- Numbers spoken as words (see Spoken-form detection below)
Latency: 30–80 ms on Masker’s default GPU pool. The model runs in-region (US-West today; US-East and EU are on the roadmap).
Recall: materially higher than regex on conversational input. Also catches misspelled or transliterated names.
The model is fine-tuned on three corpora:
i2b2 / n2c2 — de-identified clinical notes
The gold standard for clinical NER. These are real de-identified clinical notes from major research competitions. Fine-tuning on this corpus gives Masker strong recall on the kind of PHI that appears in medical documentation.
MedDialog — patient-doctor dialogue
Conversational transcripts of patient-provider interactions. This corpus is what makes Masker effective on voice calls, where PHI appears in natural speech rather than structured records.
Switchboard — conversational phone speech
A large corpus of telephone conversations. Combined with MedDialog, this helps the model handle the informal, fragmented utterances typical of real calls.
The Gemma-4 model ships inside the Rust container. No external API calls are made during detection — the model loads once at boot and stays in memory.
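One way to picture how the regex pass feeds the NER pass is to replace each regex span with a placeholder and hand the result to the second pass. The sketch below assumes a `[KIND]` placeholder format and a pluggable `ner_pass` callable; both are hypothetical, and the real model interface is not described here.

```python
from typing import Callable, NamedTuple


class Span(NamedTuple):
    kind: str
    start: int
    end: int


def redact_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with a [KIND] placeholder.

    Spans are applied right to left so earlier offsets stay valid
    while the string changes length.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.kind}]" + text[span.end:]
    return text


def run_passes(
    text: str,
    regex_pass: Callable[[str], list[Span]],
    ner_pass: Callable[[str], list[Span]],
) -> tuple[str, list[Span], list[Span]]:
    """Run regex first, then run NER on the partially-redacted payload."""
    regex_spans = regex_pass(text)
    partially_redacted = redact_spans(text, regex_spans)
    # NER offsets refer to the partially-redacted text, not the original.
    ner_spans = ner_pass(partially_redacted)
    return partially_redacted, regex_spans, ner_spans
```

Because the second pass only sees placeholders where regex already matched, it cannot re-detect or leak those values; the trade-off is that its offsets must be mapped back if the original positions are needed.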
Pass 3 — Diarization (audio path only)
When Masker sits in the audio path via a voice platform webhook, it adds a diarization pass before regex. This pass:
- Separates speakers (caller versus agent)
- Tags each turn with speaker ID and timing
- Lets your policy mask the patient’s voice content while leaving agent prompts untouched
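The effect of speaker-aware masking can be sketched as follows. The `Turn` fields, the speaker labels, and the `mask_turns` helper are assumptions for illustration; the diarization model itself is out of scope, and `mask` stands in for whatever detection and replacement runs on each turn.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Turn:
    speaker: str    # hypothetical labels: "caller" or "agent"
    start_s: float  # turn start, seconds from call start
    end_s: float    # turn end, seconds from call start
    text: str


def mask_turns(
    turns: list[Turn],
    mask: Callable[[str], str],
    speakers_to_mask: frozenset = frozenset({"caller"}),
) -> list[Turn]:
    """Apply `mask` only to turns from the selected speakers; leave the rest untouched."""
    return [
        replace(turn, text=mask(turn.text)) if turn.speaker in speakers_to_mask else turn
        for turn in turns
    ]
```

Timing and speaker ID pass through unchanged, so downstream consumers can still align masked text with the audio.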
Spoken-form detection
Callers rarely recite structured identifiers. They say:
“My Social is one two three forty-five sixty-seven eighty-nine.”
“Call me back at five ten, five five five, one two one two.”
The NER pass handles this. The Gemma-4 model was fine-tuned on conversational speech and recognizes digit-word sequences as the identifier kinds they represent, even when the spoken form doesn’t match any regex pattern.
| Spoken form | Detected as |
|---|---|
| “one two three forty-five sixty-seven eighty-nine” | SSN |
| “five ten five five five one two one two” | US phone |
| “January fifth, nineteen eighty” | Date of birth |
| “nine four one one zero” | ZIP code |
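To make the mapping in the table concrete, here is a small rule-based sketch that collapses digit words into a digit string and guesses a kind from its length. This is not how the NER model works; it is a hand-written stand-in, and `spoken_to_digits`, `guess_kind`, and the length heuristic are all assumptions for illustration. Dates in written form are not handled.

```python
UNITS = {
    "zero": 0, "oh": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}
TEENS = {
    "ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def spoken_to_digits(phrase: str) -> str:
    """Collapse a sequence of number words into a digit string."""
    words = phrase.lower().replace("-", " ").replace(",", " ").split()
    out = []
    i = 0
    while i < len(words):
        word = words[i]
        if word in TENS:
            nxt = words[i + 1] if i + 1 < len(words) else None
            # "forty five" reads as 45; a tens word on its own reads as 40.
            if nxt in UNITS and UNITS[nxt] != 0:
                out.append(str(TENS[word] + UNITS[nxt]))
                i += 2
                continue
            out.append(str(TENS[word]))
        elif word in TEENS:
            out.append(str(TEENS[word]))
        elif word in UNITS:
            out.append(str(UNITS[word]))
        else:
            raise ValueError(f"not a number word: {word!r}")
        i += 1
    return "".join(out)


def guess_kind(digits: str):
    """Naive length heuristic; returns None when the length is not recognised."""
    return {9: "SSN", 10: "US_PHONE", 5: "ZIP"}.get(len(digits))
```

A rule like this is ambiguous in places ("twenty one" could mean 21 or 2-0-1) and has no notion of context, which is part of why a model trained on conversational speech is a better fit for this job.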
Coverage matrix
Masker covers 9 of 18 HIPAA Safe Harbor identifier categories fully today. Three are partial. The remaining six are on the May 30 production roadmap. Every compliance report shows the actual coverage at generation time.
| Identifier | Detector | Format examples | Coverage |
|---|---|---|---|
| SSN | Regex + spoken-form | 123-45-6789, “one two three forty-five sixty-seven eighty-nine” | Full |
| US phone / fax | Regex + spoken-form | (510) 555-1212, “five ten five five five…” | Full |
| Email | RFC 5322 subset | name@example.com | Full |
| URL | URL parser | https://example.com/path | Full |
| IPv4 / IPv6 | Regex | 192.0.2.1, 2001:db8::1 | Full |
| Names | NER (Gemma-4) | “Dr. Sarah Chen”, “Mr. Johnson” | Full |
| Addresses | Geo + regex | Street numbers, unit, city/state | Full |
| Credit card | Luhn algorithm | 4111 1111 1111 1111 | Full |
| VIN | Regex + checksum | 1HGCM82633A004352 | Full |
| ZIP code | Regex | 94110, 94110-1234 | Partial |
| Dates | Dateparser | Jan 5, 1980, “January fifth, nineteen eighty” | Partial |
| Medical record numbers | Configurable regex | Per-tenant format | Partial |
| Geographic subdivisions | NER | Cities, counties | Roadmap |
| Health plan beneficiary numbers | NER | Various formats | Roadmap |
| Account numbers | NER | Various formats | Roadmap |
| Device identifiers | Regex | Serial numbers, MAC addresses | Roadmap |
| Biometric identifiers | — | — | Roadmap |
| Full-face photographs | — | — | Roadmap |
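Two rows in the matrix rely on checksums rather than shape alone. The sketch below shows the standard Luhn check for card numbers and the North American VIN check-digit calculation; the function names are chosen for this example, and how Masker combines these checks with its regex pass is not specified here.

```python
def luhn_valid(number: str) -> bool:
    """Luhn check: double every second digit from the right and sum."""
    digits = [int(c) for c in number if c.isdigit()]
    if not digits:
        return False
    total = 0
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# VIN transliteration table; the letters I, O, and Q are not allowed in a VIN.
VIN_VALUES = {
    **{str(d): d for d in range(10)},
    **dict(zip("ABCDEFGH", range(1, 9))),
    **dict(zip("JKLMN", range(1, 6))),
    "P": 7,
    "R": 9,
    **dict(zip("STUVWXYZ", range(2, 10))),
}
# Position weights; position 9 (weight 0) holds the check digit itself.
VIN_WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def vin_valid(vin: str) -> bool:
    """Validate the check digit in position 9 of a 17-character VIN."""
    vin = vin.upper()
    if len(vin) != 17 or any(c not in VIN_VALUES for c in vin):
        return False
    remainder = sum(VIN_VALUES[c] * w for c, w in zip(vin, VIN_WEIGHTS)) % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected
```

Both examples from the table pass: `luhn_valid("4111 1111 1111 1111")` and `vin_valid("1HGCM82633A004352")` return `True`. A checksum step like this cuts false positives from arbitrary 16-digit or 17-character strings.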
Detection output
Each detected span becomes an in-memory event before being handed to the tokenizer. The event’s text field is held in memory only long enough to call the tokenizer. It is never written to disk.
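The shape of such an event might look like the sketch below. Only the `text` field is named in this page; every other field name, the `DetectionEvent` class, and the `to_audit_record` helper are hypothetical and shown purely to illustrate keeping the raw value out of anything that gets persisted.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DetectionEvent:
    kind: str          # entity kind, e.g. "SSN" (hypothetical naming)
    start: int         # span start offset in the payload
    end: int           # span end offset in the payload
    detector: str      # which pass produced the hit, e.g. "regex" or "ner"
    confidence: float  # detector score between 0.0 and 1.0
    # Raw matched value; excluded from repr so it cannot leak through debug output.
    text: str = field(repr=False)


def to_audit_record(event: DetectionEvent, tokenizer: Callable[[str, str], str]) -> dict:
    """Call the tokenizer once, then keep only span, kind, and token."""
    token = tokenizer(event.kind, event.text)
    return {"kind": event.kind, "start": event.start, "end": event.end, "token": token}
```

The record returned here has no `text` key, which mirrors the property that only spans, entity kinds, and tokens are persisted.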
Tuning detection
Each entity kind has three knobs in mask_policy.yaml, plus an action that controls how a detected span is replaced:
| Field | Effect |
|---|---|
| regex | Enable or disable the regex pass for this entity kind |
| ner | Enable or disable the NER pass for this entity kind |
| confidence_threshold | Drop NER hits below this score (0.0–1.0) |
| action: tokenize | Replace with a reversible token (default) |
| action: redact | Replace with [REDACTED:KIND], no rehydration possible |
| action: passthrough | Log the detection but don’t replace; useful for testing |
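How these fields interact can be sketched in a few lines. The dictionary below mirrors the fields from the table as a Python structure; the entity kind names, the threshold values, and the `replacement_for` function are assumptions for illustration, not the actual policy schema or engine.

```python
from typing import Callable

# Hypothetical policy, expressed as a dict with the same fields as the table above.
POLICY = {
    "SSN": {"regex": True, "ner": True, "confidence_threshold": 0.5, "action": "tokenize"},
    "NAME": {"regex": False, "ner": True, "confidence_threshold": 0.8, "action": "redact"},
    "DATE": {"regex": True, "ner": True, "confidence_threshold": 0.6, "action": "passthrough"},
}


def replacement_for(
    kind: str,
    detector: str,  # "regex" or "ner"
    score: float,
    value: str,
    policy: dict,
    make_token: Callable[[str, str], str],
) -> str:
    """Return the text that should stand in for `value` under the policy."""
    rules = policy[kind]
    if not rules.get(detector, False):
        return value  # this pass is disabled for the entity kind
    if detector == "ner" and score < rules["confidence_threshold"]:
        return value  # NER hit dropped as too uncertain
    action = rules.get("action", "tokenize")
    if action == "tokenize":
        return make_token(kind, value)
    if action == "redact":
        return f"[REDACTED:{kind}]"
    if action == "passthrough":
        return value  # detection would still be logged, but the text is unchanged
    raise ValueError(f"unknown action: {action!r}")
```

In this sketch the threshold applies only to NER hits, matching the table's description; regex hits are treated as deterministic.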
Privacy properties
PHI never logged
Only spans, entity kinds, and resulting tokens are written to the audit chain. The original PHI value is never persisted.
In-region inference
Detection runs inside your VPC, or in Masker’s HIPAA-eligible region for hosted deployments.
No external calls
The Gemma-4 model ships inside the Rust container. No data leaves for an external inference API during detection.
Auditable source
You can audit detection by reading the source. The repository is at github.com/maskerdev/masker-core.