Getting Started

This guide walks you through installing voicetyped, running your first call flow, and verifying that the system is operational. By the end, you will have a working SIP endpoint that transcribes inbound calls and executes a simple dialog.

Prerequisites

Before you begin, ensure you have:

Linux host (Ubuntu 22.04+ or RHEL 8+ recommended)
4 GB RAM minimum (8 GB recommended for GPU-accelerated ASR)
Go 1.21+ (if building from source)
A SIP client for testing (e.g., Opal, Linphone, or another softphone)
Optional: NVIDIA GPU with CUDA 12+ for accelerated speech recognition

Installation

Option 1: Quick Install (recommended)

Download and run the installer script:

curl -sSL https://get.voicetyped.com/install | sh

This installs the voice-gateway binary to /usr/local/bin/ and downloads the default ASR model (whisper-base).

Option 2: Build from Source

git clone https://github.com/voicetyped/voice-gateway.git
cd voice-gateway
make build
sudo make install

Option 3: Docker

docker pull voicetyped/voice-gateway:latest
docker run -d \
  --name voice-gateway \
  -p 5060:5060/udp \
  -p 8080:8080 \
  -p 9100:9100 \
  -v /opt/vg/models:/models \
  voicetyped/voice-gateway:latest

Download ASR Models

voicetyped uses whisper.cpp for local speech recognition. Download the model you need:

# Base model (fastest, least accurate)
voice-gateway model download whisper-base

# Medium model (recommended for production)
voice-gateway model download whisper-medium

# Large model (most accurate, requires GPU)
voice-gateway model download whisper-large-v3

Models are stored in /var/lib/voice-gateway/models/ by default.

Model	Size	Speed	Accuracy	GPU Required
whisper-base	142 MB	Real-time	Good	No
whisper-small	466 MB	Real-time	Better	No
whisper-medium	1.5 GB	Near real-time	High	Recommended
whisper-large-v3	3.1 GB	Slower	Highest	Yes
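
The table can be read as a simple decision rule. The following Python sketch encodes one possible reading of it; the helper name `pick_model` and the `allow_slow_cpu` flag are illustrative assumptions and are not part of voicetyped. It treats the last column of the table as a constraint and returns the most accurate model that the host satisfies.

```python
# Rows transcribed from the model table, ordered from least to most accurate.
# The second field is the value of the table's GPU column.
MODELS = [
    ("whisper-base", "No"),
    ("whisper-small", "No"),
    ("whisper-medium", "Recommended"),
    ("whisper-large-v3", "Yes"),
]


def pick_model(has_gpu: bool, allow_slow_cpu: bool = False) -> str:
    """Return the most accurate model this host can reasonably run.

    Hypothetical helper: on a CPU-only host, "Recommended" rows are skipped
    unless the caller explicitly accepts slower transcription.
    """
    allowed = {"No"}
    if has_gpu:
        allowed |= {"Recommended", "Yes"}
    elif allow_slow_cpu:
        allowed.add("Recommended")
    candidates = [name for name, gpu in MODELS if gpu in allowed]
    return candidates[-1]  # the list is ordered by accuracy, so the last match wins
```

For example, `pick_model(has_gpu=False)` selects whisper-small, while `pick_model(has_gpu=True)` selects whisper-large-v3.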

Configuration

Create a configuration file at /etc/voice-gateway/config.yaml:

# /etc/voice-gateway/config.yaml

media:
  sip_port: 5060
  rtp_port_range: "10000-20000"
  codecs:
    - g711-ulaw
    - g711-alaw
    - opus

speech:
  engine: whisper
  model: whisper-medium
  language: en
  gpu: auto  # auto, true, false

runtime:
  dialog_dir: /etc/voice-gateway/dialogs/
  default_timeout: 10s
  max_concurrent_calls: 10

integration:
  api_port: 8080

observability:
  metrics_port: 9100
  log_level: info
  otlp_endpoint: ""  # Optional OpenTelemetry collector

security:
  mtls: false  # Enable for production
  cert_dir: /etc/voice-gateway/certs/
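
Before starting the service, it can be useful to check that rtp_port_range is large enough for max_concurrent_calls. The Python sketch below does that arithmetic under two assumptions that the configuration above does not state: the range is inclusive at both ends, and each call consumes two UDP ports (one for RTP and one for RTCP, the common convention). The function names are hypothetical.

```python
def parse_port_range(spec: str) -> tuple[int, int]:
    """Parse a "low-high" string such as "10000-20000" into two ints."""
    low_text, high_text = spec.split("-", 1)
    low, high = int(low_text), int(high_text)
    if not (1 <= low <= high <= 65535):
        raise ValueError(f"invalid port range: {spec!r}")
    return low, high


def max_calls_for_range(spec: str, ports_per_call: int = 2) -> int:
    """Upper bound on simultaneous calls the range can serve.

    ports_per_call=2 is an assumption (RTP plus RTCP); adjust it if the
    gateway allocates ports differently.
    """
    low, high = parse_port_range(spec)
    return (high - low + 1) // ports_per_call


# Values copied from the example config.yaml.
config = {
    "media": {"rtp_port_range": "10000-20000"},
    "runtime": {"max_concurrent_calls": 10},
}

capacity = max_calls_for_range(config["media"]["rtp_port_range"])
if capacity < config["runtime"]["max_concurrent_calls"]:
    raise SystemExit("rtp_port_range is too small for max_concurrent_calls")
```

With the example values the range holds 10,001 ports, which is far more than the 10 concurrent calls configured.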

Start voicetyped

# Start with default configuration
voice-gateway start

# Start with a specific config file
voice-gateway start --config /etc/voice-gateway/config.yaml

# Start with inline overrides
voice-gateway start --asr-model whisper-medium --sip-port 5060

You should see output similar to:

INFO  Loading configuration from /etc/voice-gateway/config.yaml
INFO  Media Gateway listening on :5060 (SIP/UDP)
INFO  Speech Gateway ready (whisper-medium, GPU: detected)
INFO  Conversation Runtime loaded 1 dialog(s)
INFO  REST API listening on :8080
INFO  Metrics endpoint on :9100/metrics
✓ voicetyped is running
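
In a provisioning script you may want to wait for the TCP listeners to come up rather than read the log. The following Python sketch polls a port until it accepts a connection; `wait_for_tcp` is a hypothetical helper, and the ports to pass in (8080 for the REST API, 9100 for metrics) are taken from the example configuration. The SIP listener on 5060 uses UDP and cannot be checked this way.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout: float = 30.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)  # not up yet; retry shortly
    return False


def check_services(host: str = "127.0.0.1", timeout: float = 30.0) -> dict:
    """Check the two TCP ports from the example config (assumed defaults)."""
    ports = {"REST API": 8080, "metrics": 9100}
    return {name: wait_for_tcp(host, port, timeout) for name, port in ports.items()}
```

Calling `check_services()` after `voice-gateway start` returns a dictionary such as `{"REST API": True, "metrics": True}` once both listeners accept connections.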

Create Your First Dialog

Create a simple dialog flow at /etc/voice-gateway/dialogs/greeting.yaml, inside the dialog_dir set in the configuration:

# /etc/voice-gateway/dialogs/greeting.yaml
name: greeting
description: Simple greeting dialog

states:
  start:
    on_enter:
      - action: play_tts
        text: "Hello, you have reached the IT helpdesk. How can I help you?"
    transitions:
      - event: speech
        target: process_request
      - event: timeout
        after: 10s
        target: no_input

  process_request:
    on_enter:
      - action: call_hook
        service: dialog_hooks
        method: OnIntent
    transitions:
      - event: hook_result
        target: respond
      - event: timeout
        after: 15s
        target: no_input

  respond:
    on_enter:
      - action: play_tts
        text: "{{ .HookResult.Response }}"
    transitions:
      - event: speech
        target: process_request
      - event: timeout
        after: 10s
        target: goodbye

  no_input:
    on_enter:
      - action: play_tts
        text: "I did not hear anything. Please try again."
    transitions:
      - event: speech
        target: process_request
      - event: timeout
        after: 10s
        target: goodbye

  goodbye:
    on_enter:
      - action: play_tts
        text: "Thank you for calling. Goodbye."
      - action: hangup
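
The dialog is a small state machine, so it can be checked on paper before a real call is placed. The Python sketch below transcribes the transitions from greeting.yaml into a dictionary, verifies that every target names a declared state, and traces a sequence of events. It is an illustration of the flow only and does not reflect how the Conversation Runtime is implemented; `undefined_targets` and `walk` are hypothetical helpers.

```python
# Transition table transcribed from greeting.yaml: state -> {event: target}.
DIALOG = {
    "start": {"speech": "process_request", "timeout": "no_input"},
    "process_request": {"hook_result": "respond", "timeout": "no_input"},
    "respond": {"speech": "process_request", "timeout": "goodbye"},
    "no_input": {"speech": "process_request", "timeout": "goodbye"},
    "goodbye": {},  # terminal state: plays the farewell, then hangs up
}


def undefined_targets(dialog: dict) -> list:
    """Return transition targets that are not declared as states."""
    targets = {t for transitions in dialog.values() for t in transitions.values()}
    return sorted(targets - set(dialog))


def walk(dialog: dict, events: list, state: str = "start") -> list:
    """Follow the events from the given state and return the visited states."""
    path = [state]
    for event in events:
        state = dialog[state][event]  # KeyError: the state does not handle this event
        path.append(state)
    return path


# A caller speaks, the hook answers, then the caller stays silent.
print(walk(DIALOG, ["speech", "hook_result", "timeout"]))
# A caller who never speaks hears the retry prompt once, then the farewell.
print(walk(DIALOG, ["timeout", "timeout"]))
```

The first trace ends in goodbye via respond, and the second reaches goodbye via no_input, which confirms that every path through this dialog terminates in a hangup.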

Test Your Setup

1. Check system status

voice-gateway status

Expected output:

voicetyped Status
  Media Gateway:    ✓ running (SIP :5060)
  Speech Gateway:   ✓ running (whisper-medium)
  Runtime:          ✓ running (1 dialog loaded)
  Integration:      ✓ running (REST :8080)
  Active Calls:     0
  Uptime:           2m 34s

2. Make a test call

Using a SIP softphone, dial sip:greeting@<your-server-ip>:5060. You should hear the greeting prompt from your dialog.
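
If no softphone is at hand, a SIP OPTIONS request is a lightweight way to see whether anything answers on the SIP port. The Python sketch below builds a minimal OPTIONS message as described in RFC 3261 and sends it over UDP. Whether voicetyped replies to OPTIONS is an assumption that this guide does not confirm, and the `probe` user name is made up for illustration. A reply shows only that the listener is reachable; it does not exercise the dialog.

```python
import socket
import uuid
from typing import Optional


def build_options(target_uri: str, local_ip: str, local_port: int) -> bytes:
    """Build a minimal SIP OPTIONS request with CRLF line endings."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 branch values start with this cookie
    lines = [
        f"OPTIONS {target_uri} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:{local_port};branch={branch};rport",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_ip}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <{target_uri}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_ip}",
        "CSeq: 1 OPTIONS",
        f"Contact: <sip:probe@{local_ip}:{local_port}>",
        "Content-Length: 0",
        "",  # the two empty entries produce the blank line that ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def probe(server_ip: str, port: int = 5060, timeout: float = 2.0) -> Optional[str]:
    """Send OPTIONS over UDP; return the response status line, or None if no reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.connect((server_ip, port))  # fixes the local address used in Via/Contact
        local_ip, local_port = sock.getsockname()
        target = f"sip:greeting@{server_ip}:{port}"
        sock.send(build_options(target, local_ip, local_port))
        try:
            data = sock.recv(4096)
        except OSError:  # timeout, or ICMP "port unreachable"
            return None
    return data.split(b"\r\n", 1)[0].decode("ascii", "replace")
```

Calling `probe("<your-server-ip>")` returns the first line of the response, for example a `SIP/2.0 200 OK` status line if the server accepts the request, or `None` if nothing answers within the timeout.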

3. Check metrics

curl http://localhost:9100/metrics | grep voice_gateway

The output should include Prometheus metrics such as:

voice_gateway_active_calls 0
voice_gateway_total_calls 1
voice_gateway_asr_latency_seconds{quantile="0.99"} 0.234
voice_gateway_call_duration_seconds_sum 45.2
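
To assert on these values from a script rather than reading them by eye, the Prometheus text format can be parsed with a few lines of Python. The sketch below is a simplified parser, not a full implementation of the exposition format: it assumes each sample line is `name{labels} value` with no trailing timestamp. The function names are hypothetical, and the URL is the one used in the curl command above.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict[str, float]:
    """Map each sample (metric name plus any {labels}) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        # The value is the last space-separated field (assumes no timestamp).
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


def fetch_metrics(url: str = "http://localhost:9100/metrics", timeout: float = 5.0) -> dict[str, float]:
    """Download the metrics page and parse it."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))


# The sample lines shown above, parsed offline.
SAMPLE = """\
voice_gateway_active_calls 0
voice_gateway_total_calls 1
voice_gateway_asr_latency_seconds{quantile="0.99"} 0.234
voice_gateway_call_duration_seconds_sum 45.2
"""
metrics = parse_metrics(SAMPLE)
```

After a test call, a check such as `fetch_metrics()["voice_gateway_total_calls"] >= 1` confirms that the call was counted.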

Next Steps

Architecture Overview — understand how the components fit together
Media Gateway — configure SIP and RTP handling
Speech Gateway — tune ASR performance
API Reference — integrate with your backend
Kubernetes Deployment — scale to production