Speech Gateway

Local ASR and TTS engine powered by whisper.cpp — no cloud, no data leakage.

The Speech Gateway handles all speech processing — recognition (ASR) and synthesis (TTS). It runs entirely on your infrastructure, ensuring that audio data never leaves your network. This is the component that makes voicetyped viable for regulated industries, classified environments, and privacy-sensitive deployments.

Responsibilities

  • Automatic Speech Recognition (ASR) — Real-time transcription using whisper.cpp
  • Voice Activity Detection (VAD) — Detects when the caller is speaking
  • Audio Segmentation — Splits continuous audio into utterances for transcription
  • Partial Transcripts — Streams interim results for responsive UX
  • Final Transcripts — Delivers complete, punctuated transcripts
  • Text-to-Speech (TTS) — Renders text responses as audio
  • Worker Pool — Manages per-call ASR workers for concurrent processing

Configuration

# /etc/voice-gateway/config.yaml — speech section

speech:
  # ASR Engine
  engine: whisper                # whisper (default), faster-whisper
  model: whisper-medium          # Model name (see model table)
  model_dir: /var/lib/voice-gateway/models/
  language: en                   # ISO 639-1 language code

  # GPU Configuration
  gpu: auto                      # auto, true, false
  gpu_device: 0                  # GPU device index
  gpu_layers: -1                 # -1 = all layers on GPU

  # VAD Configuration
  vad:
    enabled: true
    threshold: 0.5               # Speech probability threshold (0.0–1.0)
    min_speech_ms: 250           # Minimum speech duration to trigger
    min_silence_ms: 500          # Silence duration to end utterance
    padding_ms: 200              # Padding added around speech segments

  # Transcription
  partial_results: true          # Stream interim/partial transcripts
  partial_interval_ms: 300       # How often to emit partial results
  beam_size: 5                   # Beam search width (higher = more accurate, slower)
  temperature: 0.0               # Sampling temperature (0 = greedy)

  # Worker Pool
  max_workers: 4                 # Maximum concurrent ASR workers
  worker_timeout: 30s            # Worker idle timeout
  queue_depth: 10                # Maximum queued audio segments

  # TTS Configuration
  tts:
    engine: piper                # piper (default), espeak
    voice: en_US-amy-medium      # Voice model name
    sample_rate: 22050           # Output sample rate
    speed: 1.0                   # Speaking speed multiplier

ASR Engine

whisper.cpp

The default ASR engine is whisper.cpp, a C++ port of OpenAI’s Whisper model. It runs on CPU or GPU and provides excellent accuracy for most languages.

How it works:

  1. Audio arrives as 16kHz mono PCM chunks from the Media Gateway
  2. VAD detects speech segments and buffers them
  3. Complete utterances are sent to whisper.cpp for transcription
  4. Partial results are emitted at configurable intervals during long utterances
  5. Final results include the complete transcript with timing information
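
During a long utterance, a typical way to produce partials (and roughly what steps 2–4 imply) is to re-run recognition on the audio buffered so far at each interval. The sketch below illustrates that loop; it is not the gateway's actual source, and transcribe() and emit() are hypothetical stand-ins for the whisper.cpp call and the event stream.

# Illustrative partial-result loop (not the gateway's actual source).
# transcribe() and emit() are hypothetical stand-ins.

PARTIAL_INTERVAL_MS = 300   # speech.partial_interval_ms

def process_utterance(frames, frame_ms, transcribe, emit):
    """Emit partials while an utterance is in progress, then the final."""
    buffer, since_partial = [], 0
    for frame in frames:                  # audio frames within one utterance
        buffer.append(frame)
        since_partial += frame_ms
        if since_partial >= PARTIAL_INTERVAL_MS:
            # Re-decode everything buffered so far; later partials revise
            # earlier ones, so consumers should overwrite, not append.
            emit({"type": "partial", "text": transcribe(buffer)})
            since_partial = 0
    emit({"type": "final", "text": transcribe(buffer)})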

Model Selection

Model              Parameters  Size     Speed (CPU)       Speed (GPU)  Quality
whisper-tiny       39M         75 MB    ~10x real-time    ~32x         Fair
whisper-base       74M         142 MB   ~7x real-time     ~25x         Good
whisper-small      244M        466 MB   ~4x real-time     ~15x         Better
whisper-medium     769M        1.5 GB   ~2x real-time     ~10x         High
whisper-large-v3   1550M       3.1 GB   ~0.5x real-time   ~5x          Highest

Recommendation: Use whisper-medium for production. It provides the best balance of accuracy and latency: at ~2x real-time on CPU, a 3-second utterance transcribes in roughly 1.5 seconds, whereas whisper-large-v3 (~0.5x real-time) cannot keep pace with live audio without a GPU. Use whisper-base for development and testing.

Model Management

# List available models
voice-gateway model list

# Download a model
voice-gateway model download whisper-medium

# Check loaded model
voice-gateway model info

# Switch model at runtime (requires restart)
voice-gateway config set speech.model whisper-large-v3

faster-whisper Backend

For GPU-equipped deployments, you can use faster-whisper as an alternative backend. It uses CTranslate2 for optimized inference.

speech:
  engine: faster-whisper
  model: large-v3
  gpu: true
  compute_type: float16          # float16, int8_float16, int8

faster-whisper provides:

  • ~4x speed improvement over whisper.cpp on GPU
  • Lower memory usage via quantization
  • Batch inference support
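
For reference, the backend's inference path corresponds roughly to the following use of the faster-whisper Python API. This is a sketch under the config above, not the gateway's actual source; the audio path and beam size are illustrative.

# Sketch of the faster-whisper inference path (illustrative).
# Requires the faster-whisper package.
from faster_whisper import WhisperModel

# Mirrors engine/model/gpu/compute_type from the config above.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# transcribe() returns a lazy segment generator plus language info.
segments, info = model.transcribe("utterance.wav", beam_size=5, language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")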

Voice Activity Detection (VAD)

VAD is critical: it decides when the caller is speaking and when they have finished. Poorly tuned VAD yields either cut-off speech (utterances ended too early) or long, awkward pauses before the system responds (utterances ended too late).

How VAD Works

Audio Stream → Energy Detection → Speech Probability → Segmentation
                                                          │
                                               ┌──────────┴──────────┐
                                         Speech Start           Speech End
                                         (> threshold)        (silence > min_silence_ms)
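
In code terms, segmentation is a small state machine over fixed-size frames. A minimal sketch, assuming a vad_probability() stand-in that scores each 20 ms frame; padding_ms handling is omitted for brevity.

# Illustrative VAD segmentation state machine (not the gateway's source).
# vad_probability(frame) is a stand-in returning a speech probability
# in [0.0, 1.0] for each frame.

FRAME_MS = 20          # assume 20 ms frames of 16 kHz mono PCM
THRESHOLD = 0.5        # vad.threshold
MIN_SPEECH_MS = 250    # vad.min_speech_ms
MIN_SILENCE_MS = 500   # vad.min_silence_ms

def segment(frames, vad_probability):
    """Yield one list of frames per detected utterance."""
    buffer, speech_ms, silence_ms, in_speech = [], 0, 0, False
    for frame in frames:
        is_speech = vad_probability(frame) > THRESHOLD
        if not in_speech:
            if is_speech:
                buffer.append(frame)
                speech_ms += FRAME_MS
                in_speech = speech_ms >= MIN_SPEECH_MS   # utterance confirmed
            else:
                buffer, speech_ms = [], 0                # drop short blips
        else:
            buffer.append(frame)
            silence_ms = 0 if is_speech else silence_ms + FRAME_MS
            if silence_ms >= MIN_SILENCE_MS:             # end of utterance
                yield buffer
                buffer, speech_ms, silence_ms, in_speech = [], 0, 0, False
    if in_speech and buffer:
        yield buffer                                     # stream ended mid-speech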

Tuning VAD

Parameter        Effect of Increase                     Effect of Decrease
threshold        Requires louder speech to trigger      Triggers on quieter speech, more false positives
min_speech_ms    Ignores short sounds (clicks, pops)    Captures very short utterances
min_silence_ms   Waits longer before ending utterance   Ends utterance faster, may split long pauses
padding_ms       More context around speech             Less context, may clip edges

Recommended settings by environment:

# Quiet office environment
vad:
  threshold: 0.4
  min_speech_ms: 200
  min_silence_ms: 400

# Noisy call center
vad:
  threshold: 0.7
  min_speech_ms: 300
  min_silence_ms: 600

# IVR with short commands
vad:
  threshold: 0.5
  min_speech_ms: 150
  min_silence_ms: 300

Partial vs Final Transcripts

Partial Transcripts

Emitted during active speech at partial_interval_ms intervals. These are useful for:

  • Displaying real-time captions
  • Triggering early intent detection
  • Providing visual feedback in admin UIs

{
  "type": "partial",
  "text": "I need help with my",
  "confidence": 0.82,
  "timestamp_ms": 1234
}

Final Transcripts

Emitted after VAD detects end-of-utterance. These are the canonical transcription result:

{
  "type": "final",
  "text": "I need help with my password reset.",
  "confidence": 0.94,
  "language": "en",
  "duration_ms": 2340,
  "segments": [
    {
      "text": "I need help with my password reset.",
      "start_ms": 0,
      "end_ms": 2340,
      "confidence": 0.94
    }
  ]
}
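
Consumers should key off the type field: partials supersede one another, while finals are canonical. A minimal sketch of a handler, assuming events arrive as JSON strings (the transport is out of scope here):

import json

def handle_event(raw: str, state: dict) -> None:
    """Route one transcript event. `state` holds the live caption and history."""
    event = json.loads(raw)
    if event["type"] == "partial":
        # Partials supersede each other: overwrite the caption, never append.
        state["caption"] = event["text"]
    elif event["type"] == "final":
        # Finals are canonical: commit the utterance and clear the caption.
        state["history"].append(event["text"])
        state["caption"] = ""

state = {"caption": "", "history": []}
handle_event('{"type": "partial", "text": "I need help with my", '
             '"confidence": 0.82, "timestamp_ms": 1234}', state)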

Text-to-Speech (TTS)

Piper TTS

The default TTS engine is Piper, a fast, local text-to-speech system.

# Download a voice model
voice-gateway tts download en_US-amy-medium

# List available voices
voice-gateway tts voices

# Test TTS output
voice-gateway tts speak "Hello, this is a test."

TTS Configuration

speech:
  tts:
    engine: piper
    voice: en_US-amy-medium
    sample_rate: 22050
    speed: 1.0                   # 0.5 = half speed, 2.0 = double speed
    sentence_silence: 0.3        # Silence between sentences (seconds)

Streaming TTS

TTS audio is streamed back to the caller as it is generated, reducing perceived latency:

  1. Text is split into sentences
  2. Each sentence is synthesized independently
  3. Audio chunks are streamed to the Media Gateway in real-time
  4. The caller hears the first sentence while later sentences are still being generated
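
A minimal sketch of this pipeline, where synthesize() and send_to_media_gateway() are illustrative stand-ins (one renders a sentence to PCM, the other pushes audio to the caller):

import re

def stream_tts(text, synthesize, send_to_media_gateway):
    """Synthesize sentence-by-sentence so playback starts early."""
    # Step 1: split on sentence-ending punctuation (naive splitter for the sketch).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        if not sentence:
            continue
        # Steps 2-3: synthesize each sentence independently and stream it
        # out immediately, so the caller hears sentence 1 while sentence 2
        # is still being generated.
        pcm = synthesize(sentence)
        send_to_media_gateway(pcm)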

Worker Pool

The Speech Gateway uses a worker pool to process concurrent calls:

┌────────────┐
│   Call 1   │ → Worker 1 (ASR) → Transcript
│   Call 2   │ → Worker 2 (ASR) → Transcript
│   Call 3   │ → Worker 3 (ASR) → Transcript
│   Call 4   │ → [Queued]
└────────────┘
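
In sketch form, the pool is a fixed set of workers draining a bounded queue; max_workers and queue_depth map directly onto the pool size and queue bound. Illustrative Python, not the gateway's actual source:

import queue
import threading

MAX_WORKERS = 4    # speech.max_workers
QUEUE_DEPTH = 10   # speech.queue_depth

segments = queue.Queue(maxsize=QUEUE_DEPTH)  # bounded: backpressure when full

def worker(transcribe):
    while True:
        segment = segments.get()         # block until audio is available
        try:
            transcribe(segment)          # run ASR on one utterance
        finally:
            segments.task_done()

def start_pool(transcribe):
    for _ in range(MAX_WORKERS):
        threading.Thread(target=worker, args=(transcribe,), daemon=True).start()

# Producers call segments.put(segment, timeout=...) and treat a queue.Full
# exception as "queue_depth exceeded" (shed load or signal the caller).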

Sizing the Worker Pool

GPU               Model              Recommended Workers   Max Concurrent Calls
None (CPU only)   whisper-base       2                     5-8
None (CPU only)   whisper-medium     1                     2-3
NVIDIA T4         whisper-medium     4                     15-20
NVIDIA A100       whisper-large-v3   8                     40-60

Metrics

Metric                           Type        Description
vg_speech_asr_latency_seconds    Histogram   Time from audio to transcript
vg_speech_transcriptions_total   Counter     Total transcriptions completed
vg_speech_active_workers         Gauge       Currently active ASR workers
vg_speech_queue_depth            Gauge       Audio segments waiting for processing
vg_speech_tts_latency_seconds    Histogram   TTS generation latency
vg_speech_vad_false_positives    Counter     VAD false positive triggers
vg_speech_gpu_utilization        Gauge       GPU utilization percentage
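
The metric names follow Prometheus conventions. As an illustration of how the ASR latency histogram and transcription counter fit together, here is a sketch using the prometheus_client Python library; it is not the gateway's actual instrumentation, and transcribe() is a stand-in.

import time
from prometheus_client import Counter, Histogram

ASR_LATENCY = Histogram("vg_speech_asr_latency_seconds",
                        "Time from audio to transcript")
TRANSCRIPTIONS = Counter("vg_speech_transcriptions_total",
                         "Total transcriptions completed")

def transcribe_segment(segment, transcribe):
    start = time.monotonic()
    text = transcribe(segment)                      # run ASR on one utterance
    ASR_LATENCY.observe(time.monotonic() - start)   # feeds latency percentiles
    TRANSCRIPTIONS.inc()
    return text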

Troubleshooting

High ASR latency

  1. Check GPU utilization (vg_speech_gpu_utilization); if inference is running on CPU, switch to a GPU
  2. Use a smaller model (whisper-base for testing)
  3. Reduce beam size (beam_size: 3)
  4. Increase worker pool size

Transcripts are cut off

  1. Increase min_silence_ms to wait longer before ending the utterance
  2. Increase padding_ms to capture more audio context
  3. Check that VAD threshold isn’t too high

Poor transcript quality

  1. Use a larger model (whisper-medium or whisper-large-v3)
  2. Set temperature: 0.0 for deterministic output
  3. Set language explicitly rather than auto-detecting
  4. Check audio quality — packet loss degrades ASR accuracy

Next Steps