Observability

Prometheus metrics, OpenTelemetry tracing, and structured logging for voicetyped.

voicetyped exposes telemetry through three pillars: metrics (Prometheus), traces (OpenTelemetry), and structured logs. Every service emits all three, so you can monitor call quality, debug issues, and track system health.

Metrics (Prometheus)

All services expose Prometheus-compatible metrics on the configured metrics port (default :9100).

Configuration

observability:
  metrics:
    port: 9100
    path: /metrics
    enabled: true
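
To check what a service exposes, you can read the endpoint's plain-text output directly. The sketch below parses the Prometheus text exposition format with a simplified regular expression. `parse_metrics` and the sample values are illustrative and not part of voicetyped; a production client would use a full parser.

```python
import re

# Matches sample lines such as: vg_calls_active{dialog="helpdesk"} 3
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples

# Hypothetical scrape body; the values are made up for illustration.
EXAMPLE = """\
# HELP vg_calls_active Currently active calls
# TYPE vg_calls_active gauge
vg_calls_active{dialog="helpdesk"} 3
vg_calls_active{dialog="billing"} 1
vg_asr_queue_depth 0
"""

samples = parse_metrics(EXAMPLE)
active_calls = sum(value for name, _, value in samples if name == "vg_calls_active")
print(active_calls)
```

With the default configuration shown above, the text to parse would come from an HTTP GET of port 9100 at /metrics.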

Key Metrics

Call Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vg_calls_active | Gauge | dialog | Currently active calls |
| vg_calls_total | Counter | dialog, status | Total calls |
| vg_call_duration_seconds | Histogram | dialog | Call duration distribution |
| vg_call_state_transitions_total | Counter | dialog, from, to | State transitions |

ASR Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vg_asr_latency_seconds | Histogram | model | Time from audio to transcript |
| vg_asr_transcriptions_total | Counter | model, type | Transcriptions (partial/final) |
| vg_asr_workers_active | Gauge | - | Active ASR workers |
| vg_asr_queue_depth | Gauge | - | Queued audio segments |
| vg_asr_confidence | Histogram | model | Confidence score distribution |
| vg_asr_gpu_utilization | Gauge | device | GPU utilization % |

Media Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vg_rtp_packets_received_total | Counter | - | RTP packets received |
| vg_rtp_packets_lost_total | Counter | - | RTP packets lost |
| vg_rtp_jitter_ms | Histogram | - | RTP jitter |
| vg_sip_requests_total | Counter | method, status | SIP requests |
| vg_audio_buffer_underruns_total | Counter | - | Audio buffer underruns |

Integration Metrics

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| vg_integration_requests_total | Counter | service, method, status | Backend requests |
| vg_integration_latency_seconds | Histogram | service, method | Backend latency |
| vg_integration_retries_total | Counter | service | Retry attempts |
| vg_integration_circuit_breaker | Gauge | service | Circuit breaker state |

System Metrics

| Metric | Type | Description |
| --- | --- | --- |
| vg_uptime_seconds | Gauge | Time since startup |
| vg_goroutines | Gauge | Active goroutines |
| vg_memory_alloc_bytes | Gauge | Memory allocated |

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'voice-gateway'
    scrape_interval: 15s
    static_configs:
      - targets: ['voice-gateway:9100']

    # For Kubernetes with ServiceMonitor
    # (handled automatically by Helm chart)

Useful PromQL Queries

# Active calls
vg_calls_active

# Call rate (calls per minute, summed across dialogs and statuses)
sum(rate(vg_calls_total[5m])) * 60

# Median (P50) call duration
histogram_quantile(0.5, rate(vg_call_duration_seconds_bucket[5m]))

# P99 ASR latency
histogram_quantile(0.99, rate(vg_asr_latency_seconds_bucket[5m]))

# RTP packet loss rate
rate(vg_rtp_packets_lost_total[5m]) /
  rate(vg_rtp_packets_received_total[5m]) * 100

# Integration error rate
rate(vg_integration_requests_total{status!="OK"}[5m]) /
  rate(vg_integration_requests_total[5m]) * 100

# ASR queue depth (backpressure indicator)
vg_asr_queue_depth > 5
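
The histogram_quantile queries above return estimates, not exact values: Prometheus finds the bucket that contains the requested rank and interpolates linearly inside it. The following is a simplified sketch of that calculation, assuming cumulative buckets that end with a +Inf bucket; the bucket bounds and counts are made up for illustration.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound, contain at least two entries,
    and end with (math.inf, total_count).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    if idx == 0:
        lower, below = 0.0, 0  # assume observations start at zero
    else:
        lower, below = buckets[idx - 1]
    if count == below:
        return upper  # empty bucket, nothing to interpolate
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

# Hypothetical vg_call_duration_seconds buckets: (le, cumulative count).
DURATION_BUCKETS = [(30, 10), (60, 40), (120, 90), (math.inf, 100)]

median = histogram_quantile(0.5, DURATION_BUCKETS)
p99 = histogram_quantile(0.99, DURATION_BUCKETS)
print(median, p99)
```

Note that the 0.5 quantile is the median. A mean would instead divide the histogram's `_sum` series by its `_count` series, and the two can differ noticeably when a few calls run very long.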

Alerting Rules

# prometheus-rules.yml
groups:
  - name: voice-gateway
    rules:
      - alert: HighASRLatency
        expr: histogram_quantile(0.99, rate(vg_asr_latency_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ASR latency P99 exceeds 2 seconds"

      - alert: HighPacketLoss
        expr: |
          rate(vg_rtp_packets_lost_total[5m]) /
          rate(vg_rtp_packets_received_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RTP packet loss exceeds 5%"

      - alert: ASRQueueBacklog
        expr: vg_asr_queue_depth > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "ASR queue depth exceeds 10 segments"

      - alert: IntegrationCircuitOpen
        expr: vg_integration_circuit_breaker == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker is OPEN for {{ $labels.service }}"

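      - alert: HighCallVolume
        # vg_calls_active carries a dialog label, so sum across dialogs.
        # The threshold of 80 corresponds to 80% only if capacity is
        # 100 concurrent calls; adjust it to your deployment.
        expr: sum(vg_calls_active) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active calls exceed 80 (80% of a 100-call capacity)"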
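
To make the `expr` and `for` fields concrete, here is a simplified sketch of how a rule like HighPacketLoss behaves: the ratio is computed from counter increases between scrapes, and the alert fires only after the condition has held for the whole `for` duration. The helper names and sample values are hypothetical, and real `rate()` also handles counter resets and extrapolation, which this sketch omits.

```python
def loss_ratio(prev, curr):
    """Ratio of lost to received RTP packets between two scrapes.

    Each argument is a (lost_total, received_total) pair of counter values.
    """
    lost = curr[0] - prev[0]
    received = curr[1] - prev[1]
    return lost / received if received > 0 else 0.0

def firing_times(samples, threshold, hold_seconds):
    """Yield timestamps at which the alert would be firing.

    `samples` is a list of (timestamp_seconds, value). The alert is pending
    while value > threshold and fires once that has held for hold_seconds.
    """
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= hold_seconds:
                yield ts
        else:
            pending_since = None  # condition cleared, reset the timer

# Made-up loss ratios, one evaluation per minute.
RATIOS = [(0, 0.01), (60, 0.08), (120, 0.09), (180, 0.07), (240, 0.02), (300, 0.06)]

ratio = loss_ratio((100, 10000), (160, 11000))
fired = list(firing_times(RATIOS, threshold=0.05, hold_seconds=120))
print(ratio, fired)
```

In this run the condition becomes true at 60 seconds but the alert only fires at 180 seconds, once the two-minute hold has elapsed; the dip at 240 seconds resets the timer.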

Tracing (OpenTelemetry)

voicetyped supports distributed tracing via OpenTelemetry (OTLP).

Configuration

observability:
  tracing:
    enabled: true
    otlp_endpoint: "otel-collector:4317"
    otlp_protocol: grpc             # grpc or http; OTLP's conventional ports are 4317 (gRPC) and 4318 (HTTP)
    sample_rate: 1.0                 # 1.0 = 100%, 0.1 = 10%
    service_name: voice-gateway
    resource_attributes:
      deployment.environment: production
      service.version: "1.0.0"
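
A `sample_rate` below 1.0 keeps only a fraction of traces. One common way to make that decision, similar in spirit to OpenTelemetry's ratio-based samplers, is to compare the low 64 bits of the trace ID against a bound derived from the rate, so every service reaches the same verdict for the same trace. The sketch below illustrates the idea; whether voicetyped samples exactly this way is an assumption, and `should_sample` is a hypothetical helper.

```python
TRACE_ID_MASK = (1 << 64) - 1  # low 64 bits of the 128-bit trace ID

def should_sample(trace_id_hex, sample_rate):
    """Decide deterministically whether to keep a trace."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    bound = round(sample_rate * (1 << 64))
    trace_id = int(trace_id_hex, 16)
    return (trace_id & TRACE_ID_MASK) < bound

# Example trace ID in the 32-hex-digit W3C format.
TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"

keep_all = should_sample(TRACE_ID, 1.0)    # 100% sampling keeps everything
keep_tenth = should_sample(TRACE_ID, 0.1)  # this ID falls outside the 10% range
print(keep_all, keep_tenth)
```

Because the decision depends only on the trace ID, a trace is either kept in full or dropped in full, rather than losing individual spans.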

Trace Spans

Each call generates a trace with spans for each processing stage:

Trace: call-abc-123
├── media.sip_invite (2ms)
├── media.rtp_setup (15ms)
├── speech.vad_detect (50ms)
├── speech.asr_transcribe (340ms)
│   ├── speech.whisper_inference (280ms)
│   └── speech.post_process (60ms)
├── runtime.state_transition (1ms)
│   └── runtime.evaluate_conditions (0.5ms)
├── integration.call_hook (234ms)
│   ├── integration.serialize (1ms)
│   ├── integration.grpc_call (230ms)
│   └── integration.deserialize (3ms)
├── speech.tts_synthesize (120ms)
└── media.rtp_playback (2100ms)
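
When reading a trace like this, it helps to separate a span's total duration from its self time, meaning the time not accounted for by its direct children. The sketch below encodes the example trace as nested tuples and computes self time per span; the tuple layout is an illustrative choice, not an export format.

```python
# (name, duration_ms, children) tuples copied from the example trace.
TRACE = [
    ("media.sip_invite", 2, []),
    ("media.rtp_setup", 15, []),
    ("speech.vad_detect", 50, []),
    ("speech.asr_transcribe", 340, [
        ("speech.whisper_inference", 280, []),
        ("speech.post_process", 60, []),
    ]),
    ("runtime.state_transition", 1, [
        ("runtime.evaluate_conditions", 0.5, []),
    ]),
    ("integration.call_hook", 234, [
        ("integration.serialize", 1, []),
        ("integration.grpc_call", 230, []),
        ("integration.deserialize", 3, []),
    ]),
    ("speech.tts_synthesize", 120, []),
    ("media.rtp_playback", 2100, []),
]

def self_times(spans):
    """Yield (name, self_ms): a span's duration minus its direct children."""
    for name, duration, children in spans:
        yield name, duration - sum(child[1] for child in children)
        yield from self_times(children)

by_span = dict(self_times(TRACE))
slowest = max(by_span.items(), key=lambda item: item[1])
print(slowest)
```

Here speech.asr_transcribe and integration.call_hook have zero self time, so their latency is fully explained by their children (Whisper inference and the gRPC call respectively).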

Span Attributes

Each span carries attributes specific to its processing stage, for example:

speech.asr_transcribe:
  asr.model: whisper-medium
  asr.language: en
  asr.confidence: 0.94
  asr.duration_ms: 2340
  asr.is_final: true

integration.call_hook:
  rpc.service: ticketing
  rpc.method: CreateTicket
  rpc.status_code: OK
  retry.count: 0

Structured Logging

Configuration

observability:
  logging:
    level: info                      # debug, info, warn, error
    format: json                     # json, text
    output: stdout                   # stdout, file
    file_path: /var/log/voice-gateway/vg.log
    include_caller: true             # Include source file:line

Log Format

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "info",
  "message": "call started",
  "session_id": "abc-123-def",
  "caller_id": "+15551234567",
  "dialog": "helpdesk",
  "component": "media-gateway",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
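
Because each entry is a single JSON object that includes a trace_id, logs can be filtered mechanically and matched to a trace. The sketch below assumes one JSON object per line and the level names from the configuration above; `filter_logs` and the sample lines are hypothetical.

```python
import json

# Severity order for the documented levels.
LEVELS = {"debug": 10, "info": 20, "warn": 30, "error": 40}

def filter_logs(lines, trace_id=None, min_level="debug"):
    """Return parsed entries matching trace_id at or above min_level."""
    threshold = LEVELS[min_level]
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines, e.g. when format is text
        if not isinstance(entry, dict):
            continue
        if trace_id is not None and entry.get("trace_id") != trace_id:
            continue
        if LEVELS.get(entry.get("level"), 0) < threshold:
            continue
        matches.append(entry)
    return matches

# Made-up log lines in the format shown above.
LINES = [
    '{"level": "info", "message": "call started", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}',
    '{"level": "warn", "message": "hook retry", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}',
    '{"level": "error", "message": "codec failure", "trace_id": "00000000000000000000000000000001"}',
    'not json',
]

hits = filter_logs(LINES, trace_id="4bf92f3577b34da6a3ce929d0e0e4736", min_level="warn")
print([entry["message"] for entry in hits])
```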

Log Levels

| Level | Use |
| --- | --- |
| debug | Detailed debugging (audio processing, VAD decisions) |
| info | Normal operations (call start/end, state transitions) |
| warn | Potential issues (high latency, retry attempts) |
| error | Failures (hook errors, codec failures, connection drops) |

Grafana Dashboards

voicetyped provides pre-built Grafana dashboards:

Call Overview Dashboard

Displays:

  • Active calls (real-time)
  • Call volume over time
  • Average call duration
  • Call completion rate
  • Top dialogs by volume

ASR Performance Dashboard

Displays:

  • ASR latency (P50, P95, P99)
  • Transcription throughput
  • ASR confidence score distribution
  • GPU utilization
  • Worker pool usage and queue depth

Media Quality Dashboard

Displays:

  • RTP packet loss rate
  • Jitter distribution
  • Audio buffer underruns
  • SIP error rates
  • Codec distribution

Integration Health Dashboard

Displays:

  • Backend request rate
  • Error rate by service
  • Latency by service
  • Circuit breaker states
  • Retry rates

Health Endpoints

# Liveness probe (is the process running?)
curl http://localhost:9100/healthz
# Returns 200 if alive

# Readiness probe (is the service ready to accept calls?)
curl http://localhost:9100/readyz
# Returns 200 if ready, 503 if not

# Detailed health check
curl http://localhost:9100/health
# Returns JSON with component status
{
  "status": "healthy",
  "components": {
    "media_gateway": {"status": "healthy", "sip_port": 5060},
    "speech_gateway": {"status": "healthy", "model": "whisper-medium", "gpu": true},
    "runtime": {"status": "healthy", "dialogs_loaded": 3},
    "integration": {"status": "healthy", "services": 2}
  },
  "uptime_seconds": 86400,
  "version": "1.0.0"
}
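
A monitoring script can use the detailed response to report which component is at fault, not just that something is wrong. The sketch below assumes the JSON shape shown above; the "degraded" status value and the `unhealthy_components` helper are hypothetical.

```python
import json

def unhealthy_components(health_json):
    """Return sorted names of components whose status is not "healthy"."""
    report = json.loads(health_json)
    components = report.get("components", {})
    return sorted(
        name for name, info in components.items()
        if info.get("status") != "healthy"
    )

# Example payload modelled on the response above; "degraded" is an assumed value.
PAYLOAD = """
{
  "status": "degraded",
  "components": {
    "media_gateway": {"status": "healthy", "sip_port": 5060},
    "speech_gateway": {"status": "degraded", "model": "whisper-medium", "gpu": false},
    "runtime": {"status": "healthy", "dialogs_loaded": 3},
    "integration": {"status": "healthy", "services": 2}
  }
}
"""

failing = unhealthy_components(PAYLOAD)
print(failing)
```

In practice the payload would come from an HTTP GET of the /health endpoint shown above.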

Next Steps