Masker inspects every payload heading to your LLM and marks every span that looks like a regulated identifier before it crosses the compliance firewall. Detection runs in passes — each pass uses a different technique with different latency and recall characteristics, and the output of one pass feeds the next.

Detection passes

Pass 1 — Regex catalogue

The first pass runs a curated set of regular expressions against the full payload. This covers the deterministic PHI and PII shapes — identifiers with a fixed structure that regex can match reliably:
  • Phone numbers (E.164, North American, international forms)
  • Social Security Numbers (with and without dashes)
  • Email addresses (RFC 5322 simplified)
  • Medical record numbers (configurable per-tenant pattern)
  • Dates (MM/DD/YYYY, ISO 8601, written forms)
  • ZIP codes (US 5-digit and ZIP+4)
  • IP addresses (v4 and v6)
  • URLs containing identifying paths
  • Account, license, vehicle, and device identifiers
Latency: sub-millisecond per request. This pass runs on every payload without exception.
Recall: very high for the shapes it knows. It misses identifiers spoken as prose. For example, “my number is five-five-five-one-two-three-four-five-six-seven” won’t match a phone regex.
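To make the regex pass concrete, here is a minimal sketch in Python. The three patterns, the `Hit` type, and the `regex_pass` function are illustrative assumptions, not Masker's actual catalogue, which is larger and stricter. It shows structured identifiers matching and the spoken-as-prose example producing no hits.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    kind: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "PHONE": re.compile(r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[ .-]?)\d{3}[ .-]?\d{4}(?!\d)"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b|\b\d{9}\b"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def regex_pass(payload: str) -> list[Hit]:
    """Run every pattern over the full payload and return hits in document order."""
    hits = [
        Hit(kind, m.start(), m.end())
        for kind, pattern in PATTERNS.items()
        for m in pattern.finditer(payload)
    ]
    return sorted(hits, key=lambda h: h.start)


payload = "Call (510) 555-1212 or mail sarah@example.com, SSN 123-45-6789."
for hit in regex_pass(payload):
    print(hit.kind, payload[hit.start:hit.end])

# Digit words have no fixed shape, so none of the patterns fire.
print(regex_pass("my number is five-five-five-one-two-three-four-five-six-seven"))
```

The empty result on the last line is the gap the NER pass is meant to close.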

Pass 2 — Gemma-4 NER

After regex, Masker runs the partially-redacted payload through a fine-tuned Gemma-4 named-entity recognition model. NER catches what regex can’t:
  • Names (“Dr. Sarah Chen”, “Mr. Johnson”)
  • Geographic subdivisions smaller than a state
  • Health plan beneficiary numbers in unusual formats
  • Account numbers Masker hasn’t seen a pattern for yet
  • Diagnoses, procedures, and clinical phrases that map to identifying entities
  • Numbers spoken as words — see spoken-form detection below
The model was fine-tuned on three corpora:
  • The gold standard for clinical NER. These are real de-identified clinical notes from major research competitions. Fine-tuning on this corpus gives Masker strong recall on the kind of PHI that appears in medical documentation.
  • Conversational transcripts of patient-provider interactions. This corpus is what makes Masker effective on voice calls, where PHI appears in natural speech rather than structured records.
  • A large corpus of telephone conversations. Combined with MedDialog, this helps the model handle the informal, fragmented utterances typical of real calls.
Latency: 30–80 ms on Masker’s default GPU pool. The model runs in-region (US-West today; US-East and EU are on the roadmap).
Recall: materially higher than regex on conversational input. Also catches misspelled or transliterated names.
The Gemma-4 model ships inside the Rust container. No external API calls are made during detection — the model loads once at boot and stays in memory.
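The hand-off between passes can be sketched as follows. This is an illustration under stated assumptions, not Masker's implementation: the same-length `#` masking (chosen so that character offsets stay aligned with the original payload), the `run_passes` helper, and both toy detectors are hypothetical. The toy "NER" function is a string lookup standing in for the Gemma-4 model.

```python
import re
from typing import Callable, NamedTuple


class Hit(NamedTuple):
    kind: str
    start: int
    end: int


def mask_spans(payload: str, hits: list[Hit], fill: str = "#") -> str:
    """Overwrite each hit with same-length filler so later offsets still line up."""
    chars = list(payload)
    for h in hits:
        chars[h.start:h.end] = fill * (h.end - h.start)
    return "".join(chars)


def run_passes(payload: str, passes: list[Callable[[str], list[Hit]]]):
    """Feed each pass the payload as redacted by all earlier passes."""
    all_hits: list[Hit] = []
    current = payload
    for detect in passes:
        hits = detect(current)
        all_hits.extend(hits)
        current = mask_spans(current, hits)
    return all_hits, current


def toy_regex_pass(text: str) -> list[Hit]:
    return [Hit("SSN", m.start(), m.end())
            for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", text)]


def toy_ner_pass(text: str) -> list[Hit]:
    # Stand-in for a real NER model: finds one hard-coded name.
    name = "Sarah Chen"
    i = text.find(name)
    return [Hit("NAME", i, i + len(name))] if i >= 0 else []


hits, masked = run_passes("Dr. Sarah Chen, SSN 123-45-6789",
                          [toy_regex_pass, toy_ner_pass])
print(hits)
print(masked)
```

In this sketch the second pass never sees the SSN digits, only the filler left behind by the first pass.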

Pass 3 — Diarization (audio path only)

When Masker sits in the audio path via a voice platform webhook, it adds a diarization pass. Although it is listed third, this pass runs before the regex pass. It:
  • Separates speakers (caller versus agent)
  • Tags each turn with speaker ID and timing
  • Lets your policy mask the patient’s voice content while leaving agent prompts untouched
This is what makes compliance reports actionable: “Across 1,247 calls, the patient spoke 8,930 turns. We redacted PHI in 4,118 of them.”
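A per-speaker summary like the one quoted above could be computed from diarized turns roughly as follows. The `Turn` record, its field names, and the `"caller"` / `"agent"` labels are assumptions made for this sketch; the actual diarization output format is not described here.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str    # hypothetical labels: "caller" or "agent"
    start_s: float  # turn start time, in seconds
    end_s: float    # turn end time, in seconds
    text: str
    phi_hits: int   # number of PHI spans detected in this turn


def should_mask(turn: Turn, masked_speaker: str = "caller") -> bool:
    """Policy sketch: only the patient's turns are candidates for masking."""
    return turn.speaker == masked_speaker


def summarize(turns: list[Turn], masked_speaker: str = "caller") -> dict:
    """Count the masked speaker's turns and how many of them contained PHI."""
    spoken = [t for t in turns if should_mask(t, masked_speaker)]
    redacted = [t for t in spoken if t.phi_hits > 0]
    return {"turns": len(spoken), "redacted_turns": len(redacted)}


turns = [
    Turn("agent", 0.0, 2.1, "How can I help you today?", 0),
    Turn("caller", 2.1, 6.4, "My Social is one two three ...", 1),
    Turn("caller", 6.4, 7.0, "Thanks.", 0),
]
print(summarize(turns))
```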

Spoken-form detection

Callers rarely recite structured identifiers. They say:
“My Social is one two three forty-five sixty-seven eighty-nine.”
“Call me back at five ten, five five five, one two one two.”
The NER pass handles this. The Gemma-4 model was fine-tuned on conversational speech and recognizes digit-word sequences as the identifier kinds they represent — even when the spoken form doesn’t match any regex pattern.
| Spoken form | Detected as |
| --- | --- |
| “one two three forty-five sixty-seven eighty-nine” | SSN |
| “five ten five five five one two one two” | US phone |
| “January fifth, nineteen eighty” | Date of birth |
| “nine four one one zero” | ZIP code |
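To see why these utterances correspond to the identifier kinds listed, it helps to fold the digit words into digits by hand. The sketch below is a rule-based illustration only. Masker performs this recognition inside the NER model, not with rules like these, and the `spoken_to_digits` and `guess_kind` helpers are hypothetical. It also does not attempt dates.

```python
import re
from typing import Optional

UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine".split())}
TEENS = {w: i + 10 for i, w in enumerate(
    "ten eleven twelve thirteen fourteen fifteen sixteen "
    "seventeen eighteen nineteen".split())}
TENS = {w: (i + 2) * 10 for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}


def spoken_to_digits(utterance: str) -> str:
    """Fold digit words into a digit string; non-number words are skipped."""
    words = re.findall(r"[a-z]+", utterance.lower())
    out = []
    i = 0
    while i < len(words):
        w = words[i]
        if w in UNITS:
            out.append(str(UNITS[w]))
        elif w in TEENS:
            out.append(str(TEENS[w]))
        elif w in TENS:
            nxt = words[i + 1] if i + 1 < len(words) else None
            if nxt in UNITS and UNITS[nxt] > 0:
                out.append(str(TENS[w] + UNITS[nxt]))  # "forty-five" -> "45"
                i += 1
            else:
                out.append(str(TENS[w]))
        i += 1
    return "".join(out)


def guess_kind(digits: str) -> Optional[str]:
    """Naive length-based guess; a real detector would also use context."""
    return {9: "SSN", 10: "US_PHONE", 5: "ZIP"}.get(len(digits))


for said in ("My Social is one two three forty-five sixty-seven eighty-nine.",
             "Call me back at five ten, five five five, one two one two.",
             "nine four one one zero"):
    digits = spoken_to_digits(said)
    print(digits, guess_kind(digits))
```

A length-only guess is ambiguous in practice, which is one reason a context-aware model suits this step better than rules.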

Coverage matrix

Masker covers 9 of 18 HIPAA Safe Harbor identifier categories fully today. Three are partial. The remaining six are on the May 30 production roadmap. Every compliance report shows the actual coverage at generation time.
| Identifier | Detector | Format examples | Coverage |
| --- | --- | --- | --- |
| SSN | Regex + spoken-form | 123-45-6789, “one two three forty-five sixty-seven eighty-nine” | Full |
| US phone / fax | Regex + spoken-form | (510) 555-1212, “five ten five five five…” | Full |
| Email | RFC 5322 subset | name@example.com | Full |
| URL | URL parser | https://example.com/path | Full |
| IPv4 / IPv6 | Regex | 192.0.2.1, 2001:db8::1 | Full |
| Names | NER (Gemma-4) | “Dr. Sarah Chen”, “Mr. Johnson” | Full |
| Addresses | Geo + regex | Street numbers, unit, city/state | Full |
| Credit card | Luhn algorithm | 4111 1111 1111 1111 | Full |
| VIN | Regex + checksum | 1HGCM82633A004352 | Full |
| ZIP code | Regex | 94110, 94110-1234 | Partial |
| Dates | Dateparser | Jan 5, 1980, “January fifth, nineteen eighty” | Partial |
| Medical record numbers | Configurable regex | Per-tenant format | Partial |
| Geographic subdivisions | NER | Cities, counties | Roadmap |
| Health plan beneficiary numbers | NER | Various formats | Roadmap |
| Account numbers | NER | Various formats | Roadmap |
| Device identifiers | Regex | Serial numbers, MAC addresses | Roadmap |
| Biometric identifiers | | | Roadmap |
| Full-face photographs | | | Roadmap |
For the authoritative Safe Harbor mapping, see HIPAA Safe Harbor.
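The credit card row names the Luhn algorithm as its detector. For readers unfamiliar with it, here is a standard Luhn checksum in Python. The `luhn_valid` name and the 13 to 19 digit length bound are choices made for this sketch, not details taken from Masker's source.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if not 13 <= len(digits) <= 19:  # assumed plausible card-number lengths
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True
print(luhn_valid("4111 1111 1111 1112"))  # False
```

Pairing a digit-shape match with a checksum like this filters out most random 16-digit strings, since only about one in ten passes by chance.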

Detection output

Each detected span becomes an in-memory event before being handed to the tokenizer:
{
  "pass": "gemma",
  "kind": "NAME",
  "span": [142, 156],
  "text": "Sarah Chen",
  "confidence": 0.94
}
The text field is held in memory only long enough to call the tokenizer. It is never written to disk.
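One way to model this event in code is a small immutable record whose audit form simply has no place for the raw text. The sketch below is hypothetical: the `Detection` class, the `audit_record` method, and the `tok_abc123` token value are assumptions for illustration, and only the five JSON fields come from the example above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    pass_name: str  # JSON key is "pass", which is a reserved word in Python
    kind: str
    span: tuple[int, int]
    text: str       # raw PHI: keep in memory only
    confidence: float

    @classmethod
    def from_json(cls, raw: str) -> "Detection":
        d = json.loads(raw)
        return cls(d["pass"], d["kind"], tuple(d["span"]), d["text"], d["confidence"])

    def audit_record(self, token: str) -> dict:
        """Audit entry with span, kind, and token, and deliberately no text."""
        return {"kind": self.kind, "span": list(self.span), "token": token}


event = Detection.from_json(
    '{"pass": "gemma", "kind": "NAME", "span": [142, 156], '
    '"text": "Sarah Chen", "confidence": 0.94}'
)
# "tok_abc123" is a made-up token standing in for the tokenizer's output.
print(event.audit_record(token="tok_abc123"))
```

Keeping the persisted shape separate from the in-memory shape makes it harder to log the `text` field by accident.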

Tuning detection

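Each entity kind has its own block in mask_policy.yaml. Three fields tune detection (regex, ner, and confidence_threshold), and action sets what happens to a detected span: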
entities:
  PHONE:
    enabled: true
    regex: true
    ner: true
    confidence_threshold: 0.6
    action: tokenize     # tokenize | redact | passthrough
| Field | Effect |
| --- | --- |
| regex | Enable or disable the regex pass for this entity kind |
| ner | Enable or disable the NER pass for this entity kind |
| confidence_threshold | Drop NER hits below this score (0.0–1.0) |
| action: tokenize | Replace with a reversible token (default) |
| action: redact | Replace with [REDACTED:KIND], no rehydration possible |
| action: passthrough | Log the detection but don’t replace — useful for testing |
See Mask policy for the full schema.
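The following sketch shows how these fields might combine when deciding what to do with a single hit. It is an interpretation, not the shipped logic. In particular it assumes that `enabled: false` switches off the whole entity kind, that the pass labels are `"regex"` and `"gemma"`, and that kinds missing from the policy are left alone. The policy is written as a plain dict to keep the example free of a YAML dependency.

```python
from typing import Optional

POLICY = {
    "PHONE": {
        "enabled": True,
        "regex": True,
        "ner": True,
        "confidence_threshold": 0.6,
        "action": "tokenize",
    },
}


def decide(kind: str, source_pass: str, confidence: float,
           policy: dict) -> Optional[str]:
    """Return the action for a hit, or None if the policy drops it."""
    rule = policy.get(kind)
    if rule is None or not rule.get("enabled", True):
        return None
    if source_pass == "regex" and not rule.get("regex", True):
        return None
    if source_pass == "gemma":
        if not rule.get("ner", True):
            return None
        # The threshold applies to NER hits only.
        if confidence < rule.get("confidence_threshold", 0.0):
            return None
    return rule.get("action", "tokenize")


def replacement(kind: str, text: str, action: str, token: str) -> str:
    """Render the span according to the chosen action."""
    if action == "redact":
        return f"[REDACTED:{kind}]"
    if action == "tokenize":
        return token
    return text  # passthrough: detection is logged, text is unchanged


print(decide("PHONE", "gemma", 0.55, POLICY))  # None: below the 0.6 threshold
print(decide("PHONE", "gemma", 0.94, POLICY))  # tokenize
print(replacement("PHONE", "(510) 555-1212", "redact", token="tok_abc123"))
```

Raising `confidence_threshold` trades recall for precision on NER hits, while regex hits are unaffected.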

Privacy properties

PHI never logged

Only spans, entity kinds, and resulting tokens are written to the audit chain. The original PHI value is never persisted.

In-region inference

Detection runs inside your VPC, or in Masker’s HIPAA-eligible region for hosted deployments.

No external calls

The Gemma-4 model ships inside the Rust container. No data leaves for an external inference API during detection.

Auditable source

You can audit detection by reading the source. The repository is at github.com/maskerdev/masker-core.