
# POST /proxy/{agent_id}/v1/chat/completions: Masker LLM proxy

> OpenAI-compatible proxy that redacts PHI from incoming messages, forwards to the upstream LLM, and rehydrates tokens in the response. Supports streaming.

This is the integration endpoint — the URL you paste into your voice platform's "custom LLM" field. It speaks the OpenAI [chat completions](https://platform.openai.com/docs/api-reference/chat) API exactly, so any client that can talk to OpenAI can talk to Masker without code changes. Your voice platform calls this endpoint; you do not call it directly from your application code.

When a request arrives, Masker redacts PHI from all message content before forwarding the sanitized payload to the upstream LLM. The LLM response is then scanned for replacement tokens, which are rehydrated back to their original values before the response is returned to the caller. The caller sees a clean response; the upstream LLM never sees raw PHI.

<Note>
  This endpoint sits outside the `/api/v1` namespace because it's called by external systems that follow the OpenAI URL convention.
</Note>

## Endpoint

```
POST /proxy/{agent_id}/v1/chat/completions
```

## Path parameters

<ParamField path="agent_id" type="string" required>
  Masker agent ID in `agt_*` ULID format. Treat this like an API key — keep the proxy URL confidential.
</ParamField>

## Authentication

This endpoint is **not** authenticated by the `masker_session` cookie, because the voice platforms that call it have no session. Authentication instead relies on two mechanisms:

* The `agent_id` in the URL acts as a shared secret. Do not expose the proxy URL publicly.
* When configured, HMAC signature verification validates the `X-Vapi-Signature` header against `MASKER_VAPI_WEBHOOK_SECRET`. Requests that fail verification receive `401 bad_signature`.

For high-security deployments, run Masker inside your VPC and add mTLS or IP allowlisting in front of the proxy.
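
To make the HMAC check concrete, here is a minimal sketch of signature verification. The signing scheme is not specified on this page, so the sketch assumes a hex-encoded HMAC-SHA256 digest of the raw request body; the helper names `sign` and `verify_signature` are hypothetical and are not part of Masker's API.

```python
import hashlib
import hmac


def sign(body: bytes, secret: str) -> str:
    """Compute a hex HMAC-SHA256 digest of the raw body (assumed scheme)."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_signature(body: bytes, header_value: str, secret: str) -> bool:
    """Compare the received header value against the expected digest.

    hmac.compare_digest runs in constant time, which avoids leaking how
    many leading characters of the signature matched.
    """
    expected = sign(body, secret)
    return hmac.compare_digest(expected.encode("utf-8"), header_value.encode("utf-8"))
```

In this sketch, a request whose `X-Vapi-Signature` value does not match would be rejected with `401 bad_signature`.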

## Request body

The request body follows the standard OpenAI chat completions schema. Masker accepts every field OpenAI accepts and passes through unrecognized fields.

<ParamField body="model" type="string" required>
  The model to use. Must be compatible with the agent's configured `upstream`. If the request specifies a model the agent is not allowed to use, Masker returns `422 model_not_allowed`.
</ParamField>

<ParamField body="messages" type="object[]" required>
  Array of message objects (`role` + `content`). PHI is redacted from all `content` fields before forwarding.
</ParamField>

<ParamField body="stream" type="boolean" default="false">
  If `true`, the response is streamed as Server-Sent Events (`text/event-stream`). Streaming is fully supported — response chunks are scanned for tokens and rehydrated inline.
</ParamField>

<ParamField body="temperature" type="number">
  Sampling temperature, passed through to the upstream LLM unchanged.
</ParamField>

<ParamField body="max_tokens" type="number">
  Maximum tokens in the response, passed through unchanged.
</ParamField>

<ParamField body="tools" type="object[]">
  Tool definitions. Tool descriptions and function names that contain PHI are also redacted.
</ParamField>

<ParamField body="tool_choice" type="string | object">
  Tool selection mode, passed through unchanged.
</ParamField>

## Processing pipeline

1. Receive the request body.
2. Detect and redact PHI in `messages[*].content`, tool descriptions, and function names.
3. Forward the sanitized body to the upstream LLM provider.
4. Buffer or stream the response from the upstream LLM.
5. Scan the response for Masker replacement tokens and rehydrate them to original values.
6. Return the rehydrated response to the caller.
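
The steps above can be illustrated with a toy round trip. This is a sketch under stated assumptions, not Masker's implementation: the `[[PHI_n]]` token format, the phone-number-only detector, and the `redact`, `rehydrate`, and `proxy` helpers are all hypothetical, and the upstream LLM is replaced by a plain callable.

```python
import re

# Hypothetical detector and token format, used only for illustration.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
TOKEN_RE = re.compile(r"\[\[PHI_\d+\]\]")


def redact(text, vault):
    """Replace each phone number with a token and record the original."""
    def swap(match):
        token = f"[[PHI_{len(vault)}]]"
        vault[token] = match.group(0)
        return token
    return PHONE_RE.sub(swap, text)


def rehydrate(text, vault):
    """Replace every known token with its original value."""
    return TOKEN_RE.sub(lambda m: vault.get(m.group(0), m.group(0)), text)


def proxy(messages, upstream):
    """Redact, forward to `upstream`, then rehydrate the reply."""
    vault = {}
    sanitized = [{**m, "content": redact(m["content"], vault)} for m in messages]
    reply = upstream(sanitized)  # the upstream only ever sees tokens
    return rehydrate(reply, vault), sanitized
```

Here `vault` lives only for the duration of one request, so the token-to-value mapping never leaves the proxy.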

## Response

The response is identical in shape to an OpenAI chat completions response. PHI tokens in the LLM output are rehydrated before the response reaches the caller. For streaming requests, the response uses `text/event-stream` with standard OpenAI SSE chunks.
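
For streaming responses, a caller reads the event stream line by line. The sketch below assumes the usual OpenAI chunk layout (`data: {...}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`); `iter_content` is a hypothetical helper name.

```python
import json


def iter_content(lines):
    """Yield text deltas from OpenAI-style SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

Because rehydration happens inside the proxy, the deltas yielded here already contain the original values rather than replacement tokens.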

## Latency

Masker adds approximately **45–95 ms** of end-to-end latency on top of the upstream LLM's response time.

## Rate limit

The proxy allows a sustained 100 requests per second, with bursts of up to 200. Rate-limited requests receive `429 rate_limited` with a `Retry-After` header.
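
A client that honors `Retry-After` needs to turn the header into a wait time. The sketch below handles both forms HTTP allows (a number of seconds or an HTTP date); which form Masker sends is not stated here, and the `retry_delay` helper and its 1-second fallback are assumptions.

```python
import email.utils
import time


def retry_delay(retry_after, now=None, default=1.0):
    """Return the number of seconds to wait for a Retry-After value."""
    if retry_after is None:
        return default
    value = retry_after.strip()
    if value.isdigit():
        return float(value)  # delta-seconds form, e.g. "2"
    try:
        when = email.utils.parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back
    now = time.time() if now is None else now
    return max(0.0, when.timestamp() - now)
```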

## Configuration in Vapi

Set the proxy URL as your Vapi assistant's **Custom LLM URL**:

```
https://masker-voice.fly.dev/proxy/agt_01HYZ.../v1/chat/completions
```

Set the model field to match the agent's configured upstream (e.g. `gpt-4o-mini`). No other code changes are required.

## Example

<CodeGroup>
  ```bash curl
  curl -X POST \
    -H "Content-Type: application/json" \
    https://masker-voice.fly.dev/proxy/agt_01HYZ.../v1/chat/completions \
    -d '{
      "model": "gpt-4o-mini",
      "messages": [
        {"role": "system", "content": "You are a helpful healthcare assistant."},
        {"role": "user", "content": "Hi, this is Sarah Chen, my number is 415-555-2671."}
      ],
      "stream": false,
      "temperature": 0.4,
      "max_tokens": 512
    }'
  ```
</CodeGroup>
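
For a scripted smoke test, the same request can be built with the Python standard library. This is a sketch: `build_request` is a hypothetical helper, and the caller supplies the full proxy URL, including the real agent ID.

```python
import json
import urllib.request


def build_request(proxy_url, user_text):
    """Build (but do not send) a chat completions request for the proxy."""
    body = {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "system", "content": "You are a helpful healthcare assistant."},
            {"role": "user", "content": user_text},
        ],
        "stream": False,
        "temperature": 0.4,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        proxy_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To send it:
#   with urllib.request.urlopen(build_request(url, text)) as resp:
#       reply = json.load(resp)
```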

## Errors

Errors are returned in OpenAI-compatible shape so existing clients handle them naturally.

| Status | Code                | Meaning                                                              |
| ------ | ------------------- | -------------------------------------------------------------------- |
| `401`  | `bad_signature`     | HMAC signature verification failed                                   |
| `404`  | `agent_not_found`   | The `agent_id` in the URL does not match any active agent            |
| `422`  | `model_not_allowed` | The requested `model` does not match the agent's configured upstream |
| `429`  | `rate_limited`      | Account or agent quota exceeded; respect `Retry-After`               |
| `502`  | `upstream_error`    | The upstream LLM returned an error; the error is passed through      |
| `504`  | `upstream_timeout`  | The upstream LLM did not respond within the configured timeout       |
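
A caller can use the table above to decide whether to retry. The sketch below assumes the OpenAI error envelope (`{"error": {"code": ..., "message": ...}}`); treating `429`, `502`, and `504` as retryable is a judgment call of this example rather than documented Masker guidance, and `parse_error` is a hypothetical helper.

```python
import json

# Assumption: transient upstream and rate-limit failures are worth retrying.
RETRYABLE_STATUSES = {429, 502, 504}


def parse_error(status, body_text):
    """Return (code, message, retryable) for an error response."""
    try:
        error = json.loads(body_text).get("error", {})
    except (ValueError, AttributeError):
        error = {}  # body was not JSON, or not a JSON object
    if not isinstance(error, dict):
        error = {}
    return error.get("code"), error.get("message"), status in RETRYABLE_STATUSES
```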
