Media Gateway
SIP termination, RTP audio handling, and codec management for voicetyped.
The Media Gateway is the telephony boundary of voicetyped. It handles SIP signaling, RTP audio streams, and codec transcoding so that downstream services receive clean, normalized audio. If this component drops calls, the entire system fails — so telephony reliability is the top priority.
Responsibilities
- SIP endpoint — Receives and processes SIP signaling (INVITE, BYE, CANCEL, re-INVITE)
- RTP audio handling — Receives, buffers, and transmits RTP audio packets
- Codec transcoding — Negotiates and converts between G.711 μ-law, G.711 A-law, and Opus
- Call lifecycle — Manages the full call state: ringing, connected, on-hold, transferred, terminated
- DTMF detection — Supports both RFC 2833 (out-of-band) and in-band DTMF
- Jitter buffer — Smooths out network-induced audio packet timing variations
- Packet loss handling — Implements PLC (Packet Loss Concealment) for degraded networks
Configuration
```yaml
# /etc/voice-gateway/config.yaml (media section)
media:
  # SIP configuration
  sip_port: 5060                  # Port for SIP signaling
  sip_transport: udp              # udp, tcp, or tls
  sip_tls_cert: ""                # Path to TLS certificate (for SIP-TLS)
  sip_tls_key: ""                 # Path to TLS private key

  # RTP configuration
  rtp_port_range: "10000-20000"   # Port range for RTP media
  rtp_symmetric: true             # Use symmetric RTP (NAT traversal)

  # Codec configuration, in order of preference
  codecs:
    - g711-ulaw                   # G.711 μ-law (default for North America)
    - g711-alaw                   # G.711 A-law (default for Europe)
    - opus                        # Opus (WebRTC, high quality)

  # Audio output
  output_format: pcm16            # Output format to Speech Gateway
  output_sample_rate: 16000       # 16 kHz
  output_channels: 1              # Mono

  # Reliability
  jitter_buffer_ms: 60            # Jitter buffer size in milliseconds
  packet_loss_concealment: true   # Enable PLC
  reconnection_timeout: 30s       # Time to wait before declaring the call dead
  keepalive_interval: 30s         # SIP keepalive interval
```
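Some of these settings are strings that must be parsed before use: durations such as 30s and the rtp_port_range. The sketch below shows one way to do that. It assumes durations are an integer followed by ms, s, m, or h, and that each call uses an even RTP port plus the next odd port for RTCP (the RFC 3550 convention). The helper names are hypothetical and are not part of voicetyped's config loader.

```python
import re

# Hypothetical helpers; voicetyped's real config loader may differ.
_DURATION = re.compile(r"^(\d+)(ms|s|m|h)$")
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000}


def parse_duration(value: str) -> float:
    """Convert a duration such as '30s' or '500ms' to seconds."""
    match = _DURATION.match(value.strip())
    if not match:
        raise ValueError(f"invalid duration: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_MS[unit] / 1000


def parse_port_range(value: str) -> tuple[int, int]:
    """Split 'low-high' into two validated UDP port numbers."""
    low_s, high_s = value.split("-", 1)
    low, high = int(low_s), int(high_s)
    if not (1024 <= low < high <= 65535):
        raise ValueError(f"invalid port range: {value!r}")
    return low, high


def max_concurrent_streams(port_range: str) -> int:
    """Assumes one even RTP port plus the next odd port (RTCP) per stream."""
    low, high = parse_port_range(port_range)
    first_even = low + (low % 2)
    # Count even ports p in the range whose RTCP port p + 1 also fits.
    return max(0, (high - 1 - first_even) // 2 + 1)
```

Under that port-pairing assumption, the default range yields 5000 streams. That figure is a rough ceiling on concurrent calls, and the whole range still has to be open in the firewall.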
SIP Signaling
Supported SIP Methods
| Method | Direction | Description |
|---|---|---|
| INVITE | Inbound | New call setup |
| ACK | Both | Acknowledge call setup |
| BYE | Both | Terminate call |
| CANCEL | Inbound | Cancel pending call |
| re-INVITE | Both | Modify active call (hold/resume, codec change) |
| OPTIONS | Both | Keepalive / capability query |
| REFER | Outbound | Call transfer |
SIP Headers
voicetyped extracts and exposes these SIP headers to downstream services:
```
From: "John Doe" <sip:+15551234567@carrier.example.com>
To: <sip:greeting@voicegateway.local:5060>
Call-ID: a84b4c76e66710@192.168.1.100
X-Custom-Header: custom-value
```
Custom headers prefixed with X- are passed through to the CallSession and available in dialog flows.
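The X- pass-through rule can be sketched as follows, assuming the raw SIP message is available as text. The function name and the returned dict are hypothetical; they are not voicetyped's CallSession API. SIP header names are case-insensitive (RFC 3261), so the prefix check ignores case. Obsolete header line folding is not handled.

```python
def extract_custom_headers(raw_message: str) -> dict[str, str]:
    """Return the X- headers from a SIP message's header section (hypothetical helper)."""
    headers: dict[str, str] = {}
    # The header section ends at the first blank line; the body (e.g. SDP) follows.
    head = raw_message.replace("\r\n", "\n").split("\n\n", 1)[0]
    for line in head.split("\n")[1:]:  # skip the request/status line
        name, sep, value = line.partition(":")  # split on the first colon only
        if sep and name.strip().lower().startswith("x-"):
            headers[name.strip()] = value.strip()
    return headers
```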
SIP Registration
For deployments behind a SIP trunk or PBX, the Media Gateway can register as a SIP client:
```yaml
media:
  registrations:
    - uri: sip:voicegateway@pbx.internal
      username: voicegateway
      password: "${SIP_PASSWORD}"   # Environment variable
      registrar: sip:pbx.internal:5060
      expiry: 3600                  # Requested registration lifetime (seconds)
```
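The ${SIP_PASSWORD} placeholder keeps the secret out of the file. The sketch below shows one way to resolve such placeholders. It assumes only the ${NAME} syntax, and it treats a missing variable as a hard error instead of producing an empty password. The helper is hypothetical and is not voicetyped's loader.

```python
import os
import re

# Matches ${NAME} where NAME is a valid environment variable identifier.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env=None) -> str:
    """Replace each ${NAME} with the value of environment variable NAME."""
    env = os.environ if env is None else env

    def lookup(match):
        name = match.group(1)
        if name not in env:
            # Fail fast rather than register with an empty password.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _PLACEHOLDER.sub(lookup, value)
```

This differs from Python's `os.path.expandvars`, which leaves unknown variables in place. With that behavior, a gateway started without the variable would send the literal placeholder to the registrar.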
RTP Audio Handling
Audio Pipeline
RTP packets → Jitter Buffer → Codec Decode → Resampler → PCM 16kHz mono
- Jitter Buffer absorbs network timing variations (configurable, default 60ms)
- Codec Decode converts from the negotiated codec to raw PCM
- Resampler converts the decoded audio to 16 kHz mono PCM, the input format whisper.cpp requires (G.711 audio is sampled at 8 kHz, so it is upsampled)
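For a G.711 μ-law call, the decode and resample stages can be sketched as below. The μ-law expansion follows the standard G.711 algorithm. The 2× linear-interpolation upsampler is a deliberately simple stand-in for whatever resampler voicetyped actually uses; a production resampler would normally apply a proper low-pass filter. Little-endian PCM16 output is also an assumption.

```python
import struct


def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF                      # mu-law bytes are stored inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84  # remove bias
    return -magnitude if sign else magnitude


def upsample_2x(samples: list[int]) -> list[int]:
    """8 kHz -> 16 kHz by inserting the integer midpoint between neighbours."""
    out: list[int] = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.extend((s, (s + nxt) // 2))
    return out


def decode_ulaw_frame(payload: bytes) -> bytes:
    """RTP mu-law payload -> 16 kHz mono PCM16 (little-endian, assumed)."""
    pcm_16k = upsample_2x([ulaw_to_linear(b) for b in payload])
    return struct.pack(f"<{len(pcm_16k)}h", *pcm_16k)
```

A 20 ms μ-law packet carries 160 bytes. After decoding and upsampling it becomes 320 samples, or 640 bytes of PCM16.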
Codec Support
| Codec | Bandwidth | Quality | Use Case |
|---|---|---|---|
| G.711 μ-law | 64 kbps | Good | PSTN, North American carriers |
| G.711 A-law | 64 kbps | Good | PSTN, European carriers |
| Opus | 6-510 kbps | Excellent | WebRTC, modern VoIP |
Codec negotiation happens through the SDP (Session Description Protocol) offer/answer exchange, typically carried in the SIP INVITE and its 200 OK. The Media Gateway prefers codecs in the order they are listed under codecs in the configuration.
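The preference rule can be applied to an SDP offer as sketched below. The sketch assumes a single audio m= line. It maps the configuration names to the RTP encoding names PCMU and PCMA (static payload types 0 and 8, RFC 3551) and opus (a dynamic payload type declared with a=rtpmap, RFC 7587). The function and mapping names are hypothetical.

```python
# Config names -> RTP encoding names (compared case-insensitively).
ENCODINGS = {"g711-ulaw": "pcmu", "g711-alaw": "pcma", "opus": "opus"}
STATIC_PAYLOAD_TYPES = {0: "pcmu", 8: "pcma"}


def offered_codecs(sdp: str) -> dict[str, int]:
    """Map encoding name -> payload type for the offer's audio formats."""
    payload_types: list[int] = []
    rtpmap = dict(STATIC_PAYLOAD_TYPES)
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("m=audio "):
            # m=audio <port> <proto> <fmt> <fmt> ...
            payload_types = [int(pt) for pt in line.split()[3:]]
        elif line.startswith("a=rtpmap:"):
            pt, encoding = line[len("a=rtpmap:"):].split(" ", 1)
            rtpmap[int(pt)] = encoding.split("/")[0].lower()
    offered: dict[str, int] = {}
    for pt in payload_types:
        if pt in rtpmap:
            offered.setdefault(rtpmap[pt], pt)
    return offered


def negotiate(sdp: str, preferences: list[str]):
    """Return (config codec name, payload type), or None if nothing matches."""
    offered = offered_codecs(sdp)
    for name in preferences:  # walk the configured order, not the offer's
        encoding = ENCODINGS[name]
        if encoding in offered:
            return name, offered[encoding]
    return None  # no common codec; presumably what the error metric counts
```

The same parse also picks up telephone-event (RFC 2833 DTMF) when the offer includes it.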
DTMF Detection
Two DTMF methods are supported:
- RFC 2833 (recommended; since superseded by RFC 4733): DTMF digits sent as named telephone events in RTP
- In-band: DTMF tones detected in the audio stream using the Goertzel algorithm
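On the RFC 2833 path, each telephone-event RTP payload is a fixed 4-byte structure. It holds the event code, an end-of-event bit, a 6-bit volume, and a 16-bit duration in RTP timestamp units. A minimal decoder follows; the function name is hypothetical.

```python
import struct

# Event codes 0-15 per RFC 2833/4733: digits 0-9, then *, #, A-D.
DTMF_EVENTS = "0123456789*#ABCD"


def decode_telephone_event(payload: bytes) -> tuple[str, bool, int, int]:
    """Return (digit, end_of_event, volume, duration) from an event payload."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_EVENTS):
        raise ValueError(f"not a DTMF event: {event}")
    end = bool(flags & 0x80)   # E bit; the R bit (0x40) is reserved
    volume = flags & 0x3F      # power level with the sign dropped: 10 means -10 dBm0
    return DTMF_EVENTS[event], end, volume, duration
```

Senders repeat packets for one event with a growing duration, and all of them share the same RTP timestamp. The final packet, with the end bit set, is usually resent several times. A receiver should therefore report each digit once, for example by keying on the RTP timestamp.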
```yaml
media:
  dtmf:
    method: rfc2833          # rfc2833, inband, or both
    inband_sensitivity: 0.8  # Sensitivity for in-band detection (0.0–1.0)
```
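In-band detection runs a Goertzel filter at each of the eight DTMF frequencies and picks the strongest row tone and column tone. The sketch below processes one block of float samples in [-1, 1] at 8 kHz. A block of 205 samples (about 25.6 ms) is a common choice. The fixed min_power threshold is a hypothetical stand-in for inband_sensitivity, whose exact mapping is not documented here. A real detector would also check twist and tone duration.

```python
import math

ROW_FREQS = (697, 770, 852, 941)
COL_FREQS = (1209, 1336, 1477, 1633)
KEYPAD = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples, freq, sample_rate):
    """Squared magnitude of the block's spectrum at `freq` (Goertzel)."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def detect_digit(samples, sample_rate=8000, min_power=100.0):
    """Return the DTMF digit present in this block, or None."""
    rows = [goertzel_power(samples, f, sample_rate) for f in ROW_FREQS]
    cols = [goertzel_power(samples, f, sample_rate) for f in COL_FREQS]
    r = max(range(4), key=rows.__getitem__)
    c = max(range(4), key=cols.__getitem__)
    if rows[r] < min_power or cols[c] < min_power:
        return None  # hypothetical fixed threshold, not inband_sensitivity
    return KEYPAD[r][c]
```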
Call Lifecycle
State Machine
```
      INVITE
         │
         ▼
      RINGING ──── CANCEL ──→ TERMINATED
         │
         ▼  (200 OK + ACK)
     CONNECTED
    /    │    \
HOLD  TRANSFER  BYE
  │      │       │
  ▼      ▼       ▼
RESUMED  ...  TERMINATED
```
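The diagram can be encoded as a transition table. The sketch below covers the drawn transitions, plus a few choices the diagram leaves open. These are hypothetical: RESUMED returns to CONNECTED, BYE is also accepted while on hold, and TRANSFERRED is treated as terminal. The trigger labels are illustrative and are not voicetyped identifiers.

```python
from enum import Enum


class CallState(Enum):
    RINGING = "ringing"
    CONNECTED = "connected"
    ON_HOLD = "on_hold"
    TRANSFERRED = "transferred"
    TERMINATED = "terminated"


# (current state, trigger) -> next state. Triggers are illustrative labels.
TRANSITIONS = {
    (CallState.RINGING, "CANCEL"): CallState.TERMINATED,
    (CallState.RINGING, "ACK"): CallState.CONNECTED,      # 200 OK sent, ACK received
    (CallState.CONNECTED, "HOLD"): CallState.ON_HOLD,     # via re-INVITE
    (CallState.ON_HOLD, "RESUME"): CallState.CONNECTED,   # assumption: resume -> connected
    (CallState.CONNECTED, "REFER"): CallState.TRANSFERRED,
    (CallState.CONNECTED, "BYE"): CallState.TERMINATED,
    (CallState.ON_HOLD, "BYE"): CallState.TERMINATED,     # assumption, not drawn
}


class Call:
    def __init__(self):
        self.state = CallState.RINGING  # an inbound INVITE creates the call
        self.history = [self.state]

    def handle(self, trigger: str) -> CallState:
        key = (self.state, trigger)
        if key not in TRANSITIONS:
            raise ValueError(f"{trigger} not allowed in {self.state.name}")
        self.state = TRANSITIONS[key]
        self.history.append(self.state)
        return self.state
```

Rejecting unknown (state, trigger) pairs, such as a BYE while still ringing, keeps protocol violations visible instead of silently corrupting call state.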
Call Events
The Media Gateway emits these events to downstream services:
```protobuf
enum CallEventType {
  CALL_STARTED = 0;
  CALL_RINGING = 1;
  CALL_CONNECTED = 2;
  CALL_ON_HOLD = 3;
  CALL_RESUMED = 4;
  CALL_TRANSFERRED = 5;
  CALL_TERMINATED = 6;
  DTMF_RECEIVED = 7;
  AUDIO_STARTED = 8;
  AUDIO_STOPPED = 9;
}
```
Reliability
Jitter Buffer
The adaptive jitter buffer smooths network-induced timing variations:
```yaml
media:
  jitter_buffer_ms: 60            # Initial buffer size (ms)
  jitter_buffer_max_ms: 200       # Maximum buffer size (ms)
  jitter_buffer_adaptive: true    # Auto-adjust based on network conditions
```
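One common way to drive jitter_buffer_adaptive is to track the RFC 3550 interarrival jitter estimate and size the buffer as a multiple of it. The result is clamped between jitter_buffer_ms and jitter_buffer_max_ms. The sketch below assumes that approach and a multiplier of 3; voicetyped's actual adaptation policy is not specified here.

```python
class AdaptiveJitterTarget:
    """Tracks RFC 3550 interarrival jitter and derives a buffer target."""

    def __init__(self, clock_rate=8000, min_ms=60, max_ms=200, multiplier=3.0):
        self.clock_rate = clock_rate    # RTP clock rate (8000 for G.711)
        self.min_ms = min_ms            # jitter_buffer_ms
        self.max_ms = max_ms            # jitter_buffer_max_ms
        self.multiplier = multiplier    # hypothetical tuning constant
        self.jitter_ms = 0.0
        self._prev = None               # (arrival_ms, rtp_timestamp)

    def on_packet(self, arrival_ms: float, rtp_timestamp: int) -> None:
        # RTP timestamp wraparound is ignored in this sketch.
        if self._prev is not None:
            prev_arrival, prev_ts = self._prev
            sent_delta_ms = (rtp_timestamp - prev_ts) * 1000.0 / self.clock_rate
            d = (arrival_ms - prev_arrival) - sent_delta_ms
            self.jitter_ms += (abs(d) - self.jitter_ms) / 16.0  # RFC 3550 gain
        self._prev = (arrival_ms, rtp_timestamp)

    def target_ms(self) -> float:
        wanted = self.multiplier * self.jitter_ms
        return min(self.max_ms, max(self.min_ms, wanted))
```

A multiple of the smoothed estimate leaves headroom for the bursts that the average hides. The same estimate would be a natural source for vg_media_rtp_jitter_ms, though that linkage is an assumption.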
Packet Loss Concealment
When packets are lost, PLC fills the gap using the previous audio frame:
- Loss < 5%: PLC maintains acceptable quality
- Loss 5–15%: Noticeable degradation, PLC helps significantly
- Loss > 15%: Consider network remediation
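Both ideas can be sketched as follows, assuming frames are lists of PCM16 samples. Loss is counted from gaps in RTP sequence numbers, with 16-bit wraparound and in-order arrival assumed. A lost frame is replaced by the previous frame, attenuated further for each consecutive loss. Repeat-and-fade is a simple stand-in; voicetyped's concealment may be more sophisticated, and the 0.5 fade factor is hypothetical. The band labels mirror the thresholds above.

```python
def seq_gap(prev_seq: int, seq: int) -> int:
    """Packets missing between two in-order RTP sequence numbers."""
    return (seq - prev_seq - 1) % 65536


def conceal(frames, fade=0.5):
    """Replace None (lost) frames with the previous frame, faded each time."""
    out, last, gain = [], None, 1.0
    for frame in frames:
        if frame is not None:
            out.append(frame)
            last, gain = frame, 1.0
        elif last is None:
            out.append(None)  # nothing to repeat yet; caller may insert silence
        else:
            gain *= fade
            out.append([int(s * gain) for s in last])
    return out


def loss_band(received: int, lost: int) -> str:
    """Classify packet loss using the thresholds listed above."""
    total = received + lost
    pct = 100.0 * lost / total if total else 0.0
    if pct < 5:
        return "acceptable"
    if pct <= 15:
        return "degraded"
    return "remediate network"
```

The inputs to loss_band could plausibly come from the vg_media_rtp_packets_received and vg_media_rtp_packets_lost counters.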
Reconnection
If the RTP stream stops (network issue), the Media Gateway waits before terminating:
```yaml
media:
  reconnection_timeout: 30s   # Wait this long before declaring the call dead
  rtp_timeout: 10s            # Time with no RTP before triggering reconnection
```
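The two timers combine as sketched below. The sketch assumes that the reconnection window starts once rtp_timeout expires, so a silent call is declared dead after rtp_timeout + reconnection_timeout (40 s with these values). The configuration does not state this, so treat it as an assumption. The names are hypothetical.

```python
from enum import Enum


class MediaState(Enum):
    ACTIVE = "active"
    RECONNECTING = "reconnecting"
    DEAD = "dead"


def media_state(seconds_since_last_rtp: float,
                rtp_timeout: float = 10.0,
                reconnection_timeout: float = 30.0) -> MediaState:
    """Classify an RTP stream by how long it has been silent.

    Assumes the reconnection window starts when rtp_timeout expires.
    """
    if seconds_since_last_rtp < rtp_timeout:
        return MediaState.ACTIVE
    if seconds_since_last_rtp < rtp_timeout + reconnection_timeout:
        return MediaState.RECONNECTING
    return MediaState.DEAD
```

A watchdog would evaluate this against `time.monotonic()` on each tick and reset the clock on every received packet. A held call may legitimately stop sending RTP, so an implementation would likely suspend the check in the on-hold state.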
Metrics
The Media Gateway exposes these Prometheus metrics:
| Metric | Type | Description |
|---|---|---|
| vg_media_active_calls | Gauge | Currently active calls |
| vg_media_total_calls | Counter | Total calls since startup |
| vg_media_call_duration_seconds | Histogram | Call duration distribution |
| vg_media_rtp_packets_received | Counter | RTP packets received |
| vg_media_rtp_packets_lost | Counter | RTP packets lost |
| vg_media_rtp_jitter_ms | Histogram | RTP jitter distribution |
| vg_media_codec_negotiation_errors | Counter | Failed codec negotiations |
| vg_media_dtmf_events | Counter | DTMF events detected |
Troubleshooting
No audio after call connects
- Check that the RTP port range is open in your firewall
- Verify that rtp_symmetric: true is set if the gateway is behind NAT
- Check codec negotiation: ensure at least one common codec between caller and gateway
Calls drop after 30 seconds
This usually indicates a SIP NAT traversal issue. Ensure:
- SIP keepalive is enabled (keepalive_interval: 30s)
- The SIP proxy or SBC is configured to relay RTP
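One well-known cause of a drop close to 30 seconds is an ACK that never arrives. This often happens when the Contact address in the gateway's 200 OK is a private address the caller cannot reach. In that case RFC 3261 (section 13.3.1.4) has the UAS retransmit its 200 OK at intervals starting at T1 and doubling up to T2. After 64·T1 without an ACK, the UAS gives up and ends the call with a BYE; with the default T1 of 500 ms, that is 32 seconds. The schedule can be computed as follows:

```python
def ack_wait_schedule(t1: float = 0.5, t2: float = 4.0):
    """200 OK retransmission times (s) and the give-up deadline per RFC 3261."""
    deadline = 64 * t1
    times = []
    t, interval = 0.0, t1
    while t + interval < deadline:
        t += interval
        times.append(t)
        interval = min(interval * 2, t2)  # double, capped at T2
    return times, deadline
```

If packet captures show the 200 OK being resent on this schedule with no ACK in between, the problem is in signaling rather than in the media path.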
High jitter or packet loss
- Increase the jitter buffer size (for example, jitter_buffer_ms: 100)
- Enable the adaptive jitter buffer (jitter_buffer_adaptive: true)
- Check network path for congestion
- Consider dedicated network paths for RTP traffic
Next Steps
- Speech Gateway — configure ASR on the PCM output
- Getting Started — full setup walkthrough
- Observability — monitor Media Gateway metrics