Observability

voicetyped exposes telemetry through three channels: metrics (Prometheus), traces (OpenTelemetry), and structured logs. Every service emits all three, so you can monitor call quality, debug failures, and track system health.

Metrics (Prometheus)

All services expose Prometheus-compatible metrics on the configured metrics port (default :9100).

Configuration

observability:
  metrics:
    port: 9100
    path: /metrics
    enabled: true
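
To sanity-check what a scrape returns, the Prometheus text exposition format can be parsed with a few lines of Python. This is a minimal sketch and not part of voicetyped: `parse_metrics` and `total_active_calls` are hypothetical helpers, and the sample values are invented for illustration.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples.

    Simplified: escaped characters in label values are not unescaped and
    optional trailing timestamps are ignored.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a sample line
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total_active_calls(samples):
    """Sum vg_calls_active across all dialog labels."""
    return sum(value for name, _, value in samples if name == "vg_calls_active")


# Illustrative scrape output; the numbers and the "billing" dialog are made up.
SAMPLE = """\
# HELP vg_calls_active Currently active calls
# TYPE vg_calls_active gauge
vg_calls_active{dialog="helpdesk"} 12
vg_calls_active{dialog="billing"} 3
vg_asr_queue_depth 2
"""

if __name__ == "__main__":
    print(total_active_calls(parse_metrics(SAMPLE)))
```

In practice the text would come from the endpoint itself, for example via `urllib.request.urlopen("http://localhost:9100/metrics")`, assuming the port and path shown in the configuration above.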

Key Metrics

Call Metrics

Metric	Type	Labels	Description
`vg_calls_active`	Gauge	`dialog`	Currently active calls
`vg_calls_total`	Counter	`dialog`, `status`	Total calls
`vg_call_duration_seconds`	Histogram	`dialog`	Call duration distribution
`vg_call_state_transitions_total`	Counter	`dialog`, `from`, `to`	State transitions

ASR Metrics

Metric	Type	Labels	Description
`vg_asr_latency_seconds`	Histogram	`model`	Time from audio to transcript
`vg_asr_transcriptions_total`	Counter	`model`, `type`	Transcriptions (partial/final)
`vg_asr_workers_active`	Gauge		Active ASR workers
`vg_asr_queue_depth`	Gauge		Queued audio segments
`vg_asr_confidence`	Histogram	`model`	Confidence score distribution
`vg_asr_gpu_utilization`	Gauge	`device`	GPU utilization %

Media Metrics

Metric	Type	Labels	Description
`vg_rtp_packets_received_total`	Counter		RTP packets received
`vg_rtp_packets_lost_total`	Counter		RTP packets lost
`vg_rtp_jitter_ms`	Histogram		RTP jitter
`vg_sip_requests_total`	Counter	`method`, `status`	SIP requests
`vg_audio_buffer_underruns_total`	Counter		Audio buffer underruns

Integration Metrics

Metric	Type	Labels	Description
`vg_integration_requests_total`	Counter	`service`, `method`, `status`	Backend requests
`vg_integration_latency_seconds`	Histogram	`service`, `method`	Backend latency
`vg_integration_retries_total`	Counter	`service`	Retry attempts
`vg_integration_circuit_breaker`	Gauge	`service`	Circuit breaker state

System Metrics

Metric	Type	Description
`vg_uptime_seconds`	Gauge	Time since startup
`vg_goroutines`	Gauge	Active goroutines
`vg_memory_alloc_bytes`	Gauge	Memory allocated

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'voice-gateway'
    scrape_interval: 15s
    static_configs:
      - targets: ['voice-gateway:9100']

    # On Kubernetes, the Helm chart creates a ServiceMonitor automatically,
    # so this static target is not needed there.

Useful PromQL Queries

# Active calls
vg_calls_active

# Call rate (calls per minute)
rate(vg_calls_total[5m]) * 60

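# Average call duration
rate(vg_call_duration_seconds_sum[5m]) /
  rate(vg_call_duration_seconds_count[5m])

# Median call duration (P50)
histogram_quantile(0.5, rate(vg_call_duration_seconds_bucket[5m]))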

# P99 ASR latency
histogram_quantile(0.99, rate(vg_asr_latency_seconds_bucket[5m]))

# RTP packet loss (%), relative to packets received
rate(vg_rtp_packets_lost_total[5m]) /
  rate(vg_rtp_packets_received_total[5m]) * 100

# Integration error rate (%)
rate(vg_integration_requests_total{status!="OK"}[5m]) /
  rate(vg_integration_requests_total[5m]) * 100

# ASR queue depth (backpressure indicator)
vg_asr_queue_depth > 5
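
The quantile queries above return estimates, not exact values: `histogram_quantile` interpolates linearly inside the bucket that contains the requested rank. The sketch below reproduces that idea for classic histograms in simplified form (no NaN handling, no special case for non-positive bounds). The function is a stand-in written for illustration, and the bucket counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified version of the linear interpolation that PromQL applies to
    classic histograms. PromQL feeds it per-second bucket rates; raw
    cumulative counts are used here, and the arithmetic is the same.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last with le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev = buckets[idx - 1] if idx > 0 else (0.0, 0)
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Invented cumulative counts for vg_asr_latency_seconds_bucket.
ASR_LATENCY_BUCKETS = [(0.25, 10), (0.5, 60), (1.0, 90), (2.0, 99), (math.inf, 100)]

if __name__ == "__main__":
    print(histogram_quantile(0.5, ASR_LATENCY_BUCKETS))
    print(histogram_quantile(0.99, ASR_LATENCY_BUCKETS))
```

Because the result depends on bucket boundaries, a P99 that sits exactly on a bucket's upper bound usually means the buckets are too coarse to resolve the tail.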

Alerting Rules

# prometheus-rules.yml
groups:
  - name: voice-gateway
    rules:
      - alert: HighASRLatency
        expr: histogram_quantile(0.99, rate(vg_asr_latency_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "ASR latency P99 exceeds 2 seconds"

      - alert: HighPacketLoss
        expr: |
          rate(vg_rtp_packets_lost_total[5m]) /
          rate(vg_rtp_packets_received_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RTP packet loss exceeds 5%"

      - alert: ASRQueueBacklog
        expr: vg_asr_queue_depth > 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "ASR queue depth exceeds 10 segments"

      - alert: IntegrationCircuitOpen
        expr: vg_integration_circuit_breaker == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker is OPEN for {{ $labels.service }}"

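      - alert: HighCallVolume
        # vg_calls_active is labeled per dialog, so sum across dialogs.
        # The threshold is an absolute count: 80 equals 80% only for a
        # capacity of 100 calls. Adjust it to your deployment.
        expr: sum(vg_calls_active) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active calls exceed 80% of capacity"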
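
The `for` clause in these rules delays firing until the expression has stayed true for the whole duration; a single evaluation below the threshold resets the timer. The sketch below models that behavior for one series. `firing_times` is a hypothetical helper, and the 15-second evaluation interval and queue depths are invented.

```python
def firing_times(samples, threshold, for_seconds):
    """Return the evaluation timestamps at which the alert would be firing.

    samples: list of (timestamp_seconds, value) in evaluation order.
    The alert becomes pending when value > threshold and fires once it has
    been pending for at least for_seconds without interruption.
    """
    firing = []
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= for_seconds:
                firing.append(ts)
        else:
            pending_since = None  # condition cleared: reset the timer
    return firing


# Invented vg_asr_queue_depth readings, one every 15 seconds.
EVALS = [(0, 12), (15, 14), (30, 8), (45, 11), (60, 13), (75, 15), (90, 12), (105, 16)]

if __name__ == "__main__":
    # ASRQueueBacklog: threshold 10, for: 1m
    print(firing_times(EVALS, 10, 60))
    # With for: 0m the alert fires on every evaluation above the threshold.
    print(firing_times(EVALS, 10, 0))
```

In this series the dip to 8 at t=30 resets the pending state, so the one-minute rule fires only at t=105, while a `for: 0m` rule such as IntegrationCircuitOpen would fire immediately.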

Tracing (OpenTelemetry)

voicetyped supports distributed tracing via OpenTelemetry, exporting spans over OTLP.

Configuration

observability:
  tracing:
    enabled: true
    otlp_endpoint: "otel-collector:4317"   # 4317 is the standard OTLP/gRPC port
    otlp_protocol: grpc             # grpc or http (OTLP/HTTP collectors usually listen on 4318)
    sample_rate: 1.0                 # 1.0 = 100%, 0.1 = 10%
    service_name: voice-gateway
    resource_attributes:
      deployment.environment: production
      service.version: "1.0.0"
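
A `sample_rate` below 1.0 keeps only a fraction of traces. The source does not say how voicetyped picks them; the sketch below shows one common approach, a deterministic trace-ID ratio decision similar to what OpenTelemetry SDKs implement, so every service reaches the same verdict for a given trace. `should_sample` is a hypothetical name.

```python
TRACE_ID_MASK = (1 << 64) - 1  # keep the low 64 bits of the 128-bit trace ID


def should_sample(trace_id_hex, sample_rate):
    """Decide deterministically whether a trace is kept.

    The low 64 bits of the trace ID are compared against a bound
    proportional to sample_rate, so the same ID always yields the
    same decision.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    bound = round(sample_rate * (1 << 64))
    return (int(trace_id_hex, 16) & TRACE_ID_MASK) < bound


if __name__ == "__main__":
    trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
    for rate in (1.0, 0.7, 0.1):
        print(rate, should_sample(trace_id, rate))
```

Sampling on the trace ID rather than with a random draw per span avoids partial traces, where some services record a call and others drop it.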

Trace Spans

Each call generates a trace with spans for each processing stage:

Trace: call-abc-123
├── media.sip_invite (2ms)
├── media.rtp_setup (15ms)
├── speech.vad_detect (50ms)
├── speech.asr_transcribe (340ms)
│   ├── speech.whisper_inference (280ms)
│   └── speech.post_process (60ms)
├── runtime.state_transition (1ms)
│   └── runtime.evaluate_conditions (0.5ms)
├── integration.call_hook (234ms)
│   ├── integration.serialize (1ms)
│   ├── integration.grpc_call (230ms)
│   └── integration.deserialize (3ms)
├── speech.tts_synthesize (120ms)
└── media.rtp_playback (2100ms)
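
A trace like this answers where the time goes. The sketch below encodes the example tree as nested tuples and computes each span's self time, the slowest leaf span, and the time elapsed before playback starts. The helper names are hypothetical, and the pre-playback figure assumes the top-level spans run one after another, which the example does not state.

```python
# (name, duration_ms, children), copied from the example trace.
SPANS = [
    ("media.sip_invite", 2, []),
    ("media.rtp_setup", 15, []),
    ("speech.vad_detect", 50, []),
    ("speech.asr_transcribe", 340, [
        ("speech.whisper_inference", 280, []),
        ("speech.post_process", 60, []),
    ]),
    ("runtime.state_transition", 1, [
        ("runtime.evaluate_conditions", 0.5, []),
    ]),
    ("integration.call_hook", 234, [
        ("integration.serialize", 1, []),
        ("integration.grpc_call", 230, []),
        ("integration.deserialize", 3, []),
    ]),
    ("speech.tts_synthesize", 120, []),
    ("media.rtp_playback", 2100, []),
]


def self_time(span):
    """Time spent in a span itself, excluding its direct children."""
    _, duration, children = span
    return duration - sum(child[1] for child in children)


def slowest_leaf(spans):
    """Return (name, duration_ms) of the longest span without children."""
    best = None
    for name, duration, children in spans:
        candidate = slowest_leaf(children) if children else (name, duration)
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best


def pre_playback_ms(spans):
    """Sum top-level spans before playback, assuming they run sequentially."""
    return sum(d for name, d, _ in spans if name != "media.rtp_playback")


if __name__ == "__main__":
    print(slowest_leaf(SPANS))
    print(pre_playback_ms(SPANS))
```

Excluding playback, the two largest contributors in this example are `speech.whisper_inference` and `integration.grpc_call`, which is typically where tuning effort pays off first.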

Span Attributes

Each span carries attributes specific to its processing stage, for example:

speech.asr_transcribe:
  asr.model: whisper-medium
  asr.language: en
  asr.confidence: 0.94
  asr.duration_ms: 2340
  asr.is_final: true

integration.call_hook:
  rpc.service: ticketing
  rpc.method: CreateTicket
  rpc.status_code: OK
  retry.count: 0

Structured Logging

Configuration

observability:
  logging:
    level: info                      # debug, info, warn, error
    format: json                     # json, text
    output: stdout                   # stdout, file
    file_path: /var/log/voice-gateway/vg.log
    include_caller: true             # Include source file:line

Log Format

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "info",
  "message": "call started",
  "session_id": "abc-123-def",
  "caller_id": "+15551234567",
  "dialog": "helpdesk",
  "component": "media-gateway",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"
}
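
Because every record carries a `trace_id`, logs can be joined to the trace for the same call. The sketch below filters a log stream by trace ID. It assumes the JSON format writes one object per line (the example above is pretty-printed for readability), and `logs_for_trace` is a hypothetical helper; the second sample record is invented.

```python
import json


def logs_for_trace(lines, trace_id):
    """Yield parsed log records whose trace_id matches.

    Assumes one JSON object per line; lines that are not valid JSON
    are skipped instead of raising.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            yield record


LINES = [
    '{"timestamp": "2024-01-15T10:30:45.123Z", "level": "info", '
    '"message": "call started", "session_id": "abc-123-def", '
    '"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}',
    # Invented record belonging to a different trace.
    '{"timestamp": "2024-01-15T10:30:46.000Z", "level": "warn", '
    '"message": "retrying hook", "session_id": "xyz-789", '
    '"trace_id": "00000000000000000000000000000001"}',
    "not a JSON line",
]

if __name__ == "__main__":
    for record in logs_for_trace(LINES, "4bf92f3577b34da6a3ce929d0e0e4736"):
        print(record["level"], record["message"])
```

The same filter works on `session_id` if you want every log line for a call regardless of which trace it belongs to.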

Log Levels

Level	Use
`debug`	Detailed debugging (audio processing, VAD decisions)
`info`	Normal operations (call start/end, state transitions)
`warn`	Potential issues (high latency, retry attempts)
`error`	Failures (hook errors, codec failures, connection drops)

Grafana Dashboards

voicetyped ships with four pre-built Grafana dashboards:

Call Overview Dashboard

Displays:

Active calls (real-time)
Call volume over time
Average call duration
Call completion rate
Top dialogs by volume

ASR Performance Dashboard

Displays:

ASR latency (P50, P95, P99)
Transcription throughput
Confidence score distribution
GPU utilization
Worker pool usage and queue depth

Media Quality Dashboard

Displays:

RTP packet loss rate
Jitter distribution
Audio buffer underruns
SIP error rates
Codec distribution

Integration Health Dashboard

Displays:

Backend request rate
Error rate by service
Latency by service
Circuit breaker states
Retry rates

Health Endpoints

# Liveness probe (is the process running?)
curl http://localhost:9100/healthz
# Returns 200 if alive

# Readiness probe (is the service ready to accept calls?)
curl http://localhost:9100/readyz
# Returns 200 if ready, 503 if not

# Detailed health check
curl http://localhost:9100/health
# Returns JSON with component status, for example:

{
  "status": "healthy",
  "components": {
    "media_gateway": {"status": "healthy", "sip_port": 5060},
    "speech_gateway": {"status": "healthy", "model": "whisper-medium", "gpu": true},
    "runtime": {"status": "healthy", "dialogs_loaded": 3},
    "integration": {"status": "healthy", "services": 2}
  },
  "uptime_seconds": 86400,
  "version": "1.0.0"
}
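
The detailed response is straightforward to check from a script, for instance in a deployment smoke test. The sketch below lists components whose status is not `healthy`. The helper names are hypothetical, and `degraded` is an assumed status value used only for illustration; the source shows only `healthy`.

```python
import json


def unhealthy_components(health_body):
    """Return the sorted names of components not reporting 'healthy'."""
    report = json.loads(health_body)
    components = report.get("components", {})
    return sorted(
        name for name, info in components.items()
        if info.get("status") != "healthy"
    )


def is_healthy(health_body):
    """True when the overall status and every component are 'healthy'."""
    report = json.loads(health_body)
    return report.get("status") == "healthy" and not unhealthy_components(health_body)


# Shaped like the /health response above; "degraded" is an assumed value.
EXAMPLE = json.dumps({
    "status": "healthy",
    "components": {
        "media_gateway": {"status": "healthy", "sip_port": 5060},
        "speech_gateway": {"status": "degraded", "model": "whisper-medium"},
        "runtime": {"status": "healthy", "dialogs_loaded": 3},
        "integration": {"status": "healthy", "services": 2},
    },
})

if __name__ == "__main__":
    print(unhealthy_components(EXAMPLE))
    print(is_healthy(EXAMPLE))
```

For liveness and readiness probes the status code from `/healthz` and `/readyz` is sufficient; parsing the body is only worthwhile when you need to know which component is at fault.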

Next Steps

Security — audit logging and compliance
Kubernetes Deployment — ServiceMonitor setup
Getting Started — verify your installation