Speech Gateway

Local ASR and TTS engine powered by whisper.cpp — no cloud, no data leakage.

The Speech Gateway handles all speech processing — recognition (ASR) and synthesis (TTS). It runs entirely on your infrastructure, ensuring that audio data never leaves your network. This is the component that makes voicetyped viable for regulated industries, classified environments, and privacy-sensitive deployments.

Responsibilities

  • Automatic Speech Recognition (ASR) — Real-time transcription using whisper.cpp
  • Voice Activity Detection (VAD) — Detects when the caller is speaking
  • Audio Segmentation — Splits continuous audio into utterances for transcription
  • Partial Transcripts — Streams interim results for responsive UX
  • Final Transcripts — Delivers complete, punctuated transcripts
  • Text-to-Speech (TTS) — Renders text responses as audio
  • Worker Pool — Manages per-call ASR workers for concurrent processing

Configuration

# /etc/voice-gateway/config.yaml — speech section

speech:
  # ASR Engine
  engine: whisper                # whisper (default), faster-whisper
  model: whisper-medium          # Model name (see model table)
  model_dir: /var/lib/voice-gateway/models/
  language: en                   # ISO 639-1 language code

  # GPU Configuration
  gpu: auto                      # auto, true, false
  gpu_device: 0                  # GPU device index
  gpu_layers: -1                 # -1 = all layers on GPU

  # VAD Configuration
  vad:
    enabled: true
    threshold: 0.5               # Speech probability threshold (0.0–1.0)
    min_speech_ms: 250           # Minimum speech duration to trigger
    min_silence_ms: 500          # Silence duration to end utterance
    padding_ms: 200              # Padding added around speech segments

  # Transcription
  partial_results: true          # Stream interim/partial transcripts
  partial_interval_ms: 300       # How often to emit partial results
  beam_size: 5                   # Beam search width (higher = more accurate, slower)
  temperature: 0.0               # Sampling temperature (0 = greedy)

  # Worker Pool
  max_workers: 4                 # Maximum concurrent ASR workers
  worker_timeout: 30s            # Worker idle timeout
  queue_depth: 10                # Maximum queued audio segments

  # TTS Configuration
  tts:
    engine: piper                # piper (default), espeak
    voice: en_US-amy-medium      # Voice model name
    sample_rate: 22050           # Output sample rate
    speed: 1.0                   # Speaking speed multiplier

ASR Engine

whisper.cpp

The default ASR engine is whisper.cpp, a C++ port of OpenAI’s Whisper model. It runs on CPU or GPU and provides excellent accuracy for most languages.

How it works:

  1. Audio arrives as 16kHz mono PCM chunks from the Media Gateway
  2. VAD detects speech segments and buffers them
  3. Complete utterances are sent to whisper.cpp for transcription
  4. Partial results are emitted at configurable intervals during long utterances
  5. Final results include the complete transcript with timing information
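
During a long utterance, a typical way to produce partials (and roughly what steps 2–4 imply) is to re-run recognition on the audio buffered so far at each interval. The sketch below illustrates that loop; it is not the gateway's actual source, and transcribe() and emit() are hypothetical stand-ins for the whisper.cpp call and the event stream.

# Illustrative partial-result loop (not the gateway's actual source).
# transcribe() and emit() are hypothetical stand-ins.

PARTIAL_INTERVAL_MS = 300   # speech.partial_interval_ms

def process_utterance(frames, frame_ms, transcribe, emit):
    """Emit partials while an utterance is in progress, then the final."""
    buffer, since_partial = [], 0
    for frame in frames:                  # audio frames within one utterance
        buffer.append(frame)
        since_partial += frame_ms
        if since_partial >= PARTIAL_INTERVAL_MS:
            # Re-decode everything buffered so far; later partials revise
            # earlier ones, so consumers should overwrite, not append.
            emit({"type": "partial", "text": transcribe(buffer)})
            since_partial = 0
    emit({"type": "final", "text": transcribe(buffer)})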

Model Selection

Model              Parameters  Size     Speed (CPU)       Speed (GPU)  Quality
whisper-tiny       39M         75 MB    ~10x real-time    ~32x         Fair
whisper-base       74M         142 MB   ~7x real-time     ~25x         Good
whisper-small      244M        466 MB   ~4x real-time     ~15x         Better
whisper-medium     769M        1.5 GB   ~2x real-time     ~10x         High
whisper-large-v3   1550M       3.1 GB   ~0.5x real-time   ~5x          Highest

Recommendation: Use whisper-medium for production. It provides the best balance of accuracy and latency: at ~2x real-time on CPU, a 3-second utterance transcribes in roughly 1.5 seconds, whereas whisper-large-v3 (~0.5x real-time) cannot keep pace with live audio without a GPU. Use whisper-base for development and testing.

Model Management

# List available models
voice-gateway model list

# Download a model
voice-gateway model download whisper-medium

# Check loaded model
voice-gateway model info

# Switch model at runtime (requires restart)
voice-gateway config set speech.model whisper-large-v3

faster-whisper Backend

For GPU-equipped deployments, you can use faster-whisper as an alternative backend. It uses CTranslate2 for optimized inference.

speech:
  engine: faster-whisper
  model: large-v3
  gpu: true
  compute_type: float16          # float16, int8_float16, int8

faster-whisper provides:

  • ~4x speed improvement over whisper.cpp on GPU
  • Lower memory usage via quantization
  • Batch inference support
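
For reference, the backend's inference path corresponds roughly to the following use of the faster-whisper Python API. This is a sketch under the config above, not the gateway's actual source; the audio path and beam size are illustrative.

# Sketch of the faster-whisper inference path (illustrative).
# Requires the faster-whisper package.
from faster_whisper import WhisperModel

# Mirrors engine/model/gpu/compute_type from the config above.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# transcribe() returns a lazy segment generator plus language info.
segments, info = model.transcribe("utterance.wav", beam_size=5, language="en")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")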

Voice Activity Detection (VAD)

VAD is critical: it decides when the caller is speaking and when they have finished. Poorly tuned VAD yields either cut-off speech (utterances ended too early) or long, awkward pauses before the system responds (utterances ended too late).

How VAD Works

Audio Stream → Energy Detection → Speech Probability → Segmentation
                                                          │
                                               ┌──────────┴──────────┐
                                         Speech Start           Speech End
                                         (> threshold)        (silence > min_silence_ms)
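
In code terms, segmentation is a small state machine over fixed-size frames. A minimal sketch, assuming a vad_probability() stand-in that scores each 20 ms frame; padding_ms handling is omitted for brevity.

# Illustrative VAD segmentation state machine (not the gateway's source).
# vad_probability(frame) is a stand-in returning a speech probability
# in [0.0, 1.0] for each frame.

FRAME_MS = 20          # assume 20 ms frames of 16 kHz mono PCM
THRESHOLD = 0.5        # vad.threshold
MIN_SPEECH_MS = 250    # vad.min_speech_ms
MIN_SILENCE_MS = 500   # vad.min_silence_ms

def segment(frames, vad_probability):
    """Yield one list of frames per detected utterance."""
    buffer, speech_ms, silence_ms, in_speech = [], 0, 0, False
    for frame in frames:
        is_speech = vad_probability(frame) > THRESHOLD
        if not in_speech:
            if is_speech:
                buffer.append(frame)
                speech_ms += FRAME_MS
                in_speech = speech_ms >= MIN_SPEECH_MS   # utterance confirmed
            else:
                buffer, speech_ms = [], 0                # drop short blips
        else:
            buffer.append(frame)
            silence_ms = 0 if is_speech else silence_ms + FRAME_MS
            if silence_ms >= MIN_SILENCE_MS:             # end of utterance
                yield buffer
                buffer, speech_ms, silence_ms, in_speech = [], 0, 0, False
    if in_speech and buffer:
        yield buffer                                     # stream ended mid-speech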

Tuning VAD

Parameter        Effect of Increase                     Effect of Decrease
threshold        Requires louder speech to trigger      Triggers on quieter speech, more false positives
min_speech_ms    Ignores short sounds (clicks, pops)    Captures very short utterances
min_silence_ms   Waits longer before ending utterance   Ends utterance faster, may split long pauses
padding_ms       More context around speech             Less context, may clip edges

Recommended settings by environment:

# Quiet office environment
vad:
  threshold: 0.4
  min_speech_ms: 200
  min_silence_ms: 400

# Noisy call center
vad:
  threshold: 0.7
  min_speech_ms: 300
  min_silence_ms: 600

# IVR with short commands
vad:
  threshold: 0.5
  min_speech_ms: 150
  min_silence_ms: 300

Partial vs Final Transcripts

Partial Transcripts

Emitted during active speech at partial_interval_ms intervals. These are useful for:

  • Displaying real-time captions
  • Triggering early intent detection
  • Providing visual feedback in admin UIs

{
  "type": "partial",
  "text": "I need help with my",
  "confidence": 0.82,
  "timestamp_ms": 1234
}

Final Transcripts

Emitted after VAD detects end-of-utterance. These are the canonical transcription result:

{
  "type": "final",
  "text": "I need help with my password reset.",
  "confidence": 0.94,
  "language": "en",
  "duration_ms": 2340,
  "segments": [
    {
      "text": "I need help with my password reset.",
      "start_ms": 0,
      "end_ms": 2340,
      "confidence": 0.94
    }
  ]
}
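
Consumers should key off the type field: partials supersede one another, while finals are canonical. A minimal sketch of a handler, assuming events arrive as JSON strings (the transport is out of scope here):

import json

def handle_event(raw: str, state: dict) -> None:
    """Route one transcript event. `state` holds the live caption and history."""
    event = json.loads(raw)
    if event["type"] == "partial":
        # Partials supersede each other: overwrite the caption, never append.
        state["caption"] = event["text"]
    elif event["type"] == "final":
        # Finals are canonical: commit the utterance and clear the caption.
        state["history"].append(event["text"])
        state["caption"] = ""

state = {"caption": "", "history": []}
handle_event('{"type": "partial", "text": "I need help with my", '
             '"confidence": 0.82, "timestamp_ms": 1234}', state)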

Text-to-Speech (TTS)

Piper TTS

The default TTS engine is Piper, a fast, local text-to-speech system.

# Download a voice model
voice-gateway tts download en_US-amy-medium

# List available voices
voice-gateway tts voices

# Test TTS output
voice-gateway tts speak "Hello, this is a test."

TTS Configuration

speech:
  tts:
    engine: piper
    voice: en_US-amy-medium
    sample_rate: 22050
    speed: 1.0                   # 0.5 = half speed, 2.0 = double speed
    sentence_silence: 0.3        # Silence between sentences (seconds)

Streaming TTS

TTS audio is streamed back to the caller as it is generated, reducing perceived latency:

  1. Text is split into sentences
  2. Each sentence is synthesized independently
  3. Audio chunks are streamed to the Media Gateway in real-time
  4. The caller hears the first sentence while later sentences are still being generated
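
A minimal sketch of this pipeline, where synthesize() and send_to_media_gateway() are illustrative stand-ins (one renders a sentence to PCM, the other pushes audio to the caller):

import re

def stream_tts(text, synthesize, send_to_media_gateway):
    """Synthesize sentence-by-sentence so playback starts early."""
    # Step 1: split on sentence-ending punctuation (naive splitter for the sketch).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        if not sentence:
            continue
        # Steps 2-3: synthesize each sentence independently and stream it
        # out immediately, so the caller hears sentence 1 while sentence 2
        # is still being generated.
        pcm = synthesize(sentence)
        send_to_media_gateway(pcm)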

Worker Pool

The Speech Gateway uses a worker pool to process concurrent calls:

┌────────────┐
│   Call 1   │ → Worker 1 (ASR) → Transcript
│   Call 2   │ → Worker 2 (ASR) → Transcript
│   Call 3   │ → Worker 3 (ASR) → Transcript
│   Call 4   │ → [Queued]
└────────────┘
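
In sketch form, the pool is a fixed set of workers draining a bounded queue; max_workers and queue_depth map directly onto the pool size and queue bound. Illustrative Python, not the gateway's actual source:

import queue
import threading

MAX_WORKERS = 4    # speech.max_workers
QUEUE_DEPTH = 10   # speech.queue_depth

segments = queue.Queue(maxsize=QUEUE_DEPTH)  # bounded: backpressure when full

def worker(transcribe):
    while True:
        segment = segments.get()         # block until audio is available
        try:
            transcribe(segment)          # run ASR on one utterance
        finally:
            segments.task_done()

def start_pool(transcribe):
    for _ in range(MAX_WORKERS):
        threading.Thread(target=worker, args=(transcribe,), daemon=True).start()

# Producers call segments.put(segment, timeout=...) and treat a queue.Full
# exception as "queue_depth exceeded" (shed load or signal the caller).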

Sizing the Worker Pool

GPU               Model              Recommended Workers   Max Concurrent Calls
None (CPU only)   whisper-base       2                     5-8
None (CPU only)   whisper-medium     1                     2-3
NVIDIA T4         whisper-medium     4                     15-20
NVIDIA A100       whisper-large-v3   8                     40-60

Metrics

Metric                           Type        Description
vg_speech_asr_latency_seconds    Histogram   Time from audio to transcript
vg_speech_transcriptions_total   Counter     Total transcriptions completed
vg_speech_active_workers         Gauge       Currently active ASR workers
vg_speech_queue_depth            Gauge       Audio segments waiting for processing
vg_speech_tts_latency_seconds    Histogram   TTS generation latency
vg_speech_vad_false_positives    Counter     VAD false positive triggers
vg_speech_gpu_utilization        Gauge       GPU utilization percentage
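
The metric names follow Prometheus conventions. As an illustration of how the ASR latency histogram and transcription counter fit together, here is a sketch using the prometheus_client Python library; it is not the gateway's actual instrumentation, and transcribe() is a stand-in.

import time
from prometheus_client import Counter, Histogram

ASR_LATENCY = Histogram("vg_speech_asr_latency_seconds",
                        "Time from audio to transcript")
TRANSCRIPTIONS = Counter("vg_speech_transcriptions_total",
                         "Total transcriptions completed")

def transcribe_segment(segment, transcribe):
    start = time.monotonic()
    text = transcribe(segment)                      # run ASR on one utterance
    ASR_LATENCY.observe(time.monotonic() - start)   # feeds latency percentiles
    TRANSCRIPTIONS.inc()
    return text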

Troubleshooting

High ASR latency

  1. Check GPU utilization (vg_speech_gpu_utilization); if inference is running on CPU, switch to a GPU
  2. Use a smaller model (whisper-base for testing)
  3. Reduce beam size (beam_size: 3)
  4. Increase worker pool size

Transcripts are cut off

  1. Increase min_silence_ms to wait longer before ending the utterance
  2. Increase padding_ms to capture more audio context
  3. Check that VAD threshold isn’t too high

Poor transcript quality

  1. Use a larger model (whisper-medium or whisper-large-v3)
  2. Set temperature: 0.0 for deterministic output
  3. Set language explicitly rather than auto-detecting
  4. Check audio quality — packet loss degrades ASR accuracy

Next Steps