Architecture Overview

voicetyped is composed of four main services that form a complete voice processing pipeline. Each service has a single responsibility and communicates with adjacent services via well-defined interfaces. This architecture enables independent scaling, testing, and replacement of any component.

High-Level Architecture

SIP / WebRTC
     │
     ▼
┌─────────────────┐
│  Media Gateway   │ ← SIP endpoint, RTP audio, codec handling
└────────┬────────┘
         │ PCM stream
         ▼
┌─────────────────┐
│ Speech Gateway   │ ← Local ASR (whisper.cpp), TTS
└────────┬────────┘
         │ Transcripts
         ▼
┌─────────────────┐
│  Conversation    │ ← Turn detection, dialog FSM, tool invocation
│  Runtime         │
└────────┬────────┘
         │ Actions
         ▼
┌─────────────────┐
│  Integration     │ ← REST/HTTP to customer backend
│  Gateway         │
└─────────────────┘
         │
         ▼
   Customer Backend

Service Interactions

Call Flow

When an inbound call arrives, the services interact in this sequence:

1. Media Gateway receives the SIP INVITE, negotiates codecs, and begins extracting RTP audio.
2. Speech Gateway receives the PCM audio stream and begins producing transcripts.
3. Conversation Runtime receives transcript events and evaluates them against the active dialog FSM.
4. Integration Gateway executes any actions that require calling external systems.
5. Results flow back through the stack: Integration → Runtime → Speech (TTS) → Media → Caller.
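
The sequence above can be sketched as a chain of stage interfaces. This is a minimal illustration only: the interface names, the `Pipeline` type, and the stub implementations are hypothetical and are not taken from the voicetyped codebase, where the hops are ConnectRPC streams rather than in-process calls.

```go
package main

import "fmt"

// Hypothetical stage interfaces, one per hop in the call flow.
type SpeechRecognizer interface {
	Transcribe(pcm []int16) string
}

type DialogRuntime interface {
	// Decide maps a transcript to the action the dialog FSM selects.
	Decide(transcript string) string
}

type BackendCaller interface {
	Call(action string) string
}

type SpeechSynthesizer interface {
	Synthesize(text string) []int16
}

// Pipeline wires the stages together in call-flow order.
type Pipeline struct {
	ASR     SpeechRecognizer
	Runtime DialogRuntime
	Backend BackendCaller
	TTS     SpeechSynthesizer
}

// HandleTurn processes one caller turn: PCM audio in, PCM audio out.
func (p Pipeline) HandleTurn(pcm []int16) []int16 {
	transcript := p.ASR.Transcribe(pcm)    // Speech Gateway produces a transcript
	action := p.Runtime.Decide(transcript) // Conversation Runtime evaluates the FSM
	result := p.Backend.Call(action)       // Integration Gateway calls the backend
	return p.TTS.Synthesize(result)        // result flows back as synthesized audio
}

// Stub stages so the sketch runs without any real services.
type stubASR struct{}

func (stubASR) Transcribe(pcm []int16) string { return "check balance" }

type stubRuntime struct{}

func (stubRuntime) Decide(t string) string { return "lookup:" + t }

type stubBackend struct{}

func (stubBackend) Call(a string) string { return "done " + a }

type stubTTS struct{}

// Synthesize returns one silent sample per character, purely as a placeholder.
func (stubTTS) Synthesize(text string) []int16 { return make([]int16, len(text)) }

func main() {
	p := Pipeline{ASR: stubASR{}, Runtime: stubRuntime{}, Backend: stubBackend{}, TTS: stubTTS{}}
	out := p.HandleTurn([]int16{0, 1, 2})
	fmt.Println("samples to play back:", len(out))
}
```

Keeping each hop behind a narrow interface is one way to get the independent testing and replacement the overview describes, since any stage can be swapped for a stub.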

Communication Protocols

From	To	Protocol	Data
Media Gateway	Speech Gateway	Internal ConnectRPC stream	PCM audio chunks
Speech Gateway	Conversation Runtime	Internal ConnectRPC stream	Transcript events
Conversation Runtime	Integration Gateway	Internal ConnectRPC	Action requests
Integration Gateway	Customer Backend	REST / HTTP	Business logic calls
Conversation Runtime	Speech Gateway	Internal ConnectRPC	TTS requests
Speech Gateway	Media Gateway	Internal ConnectRPC stream	Audio playback

Call Session Model

Every active call is represented as a CallSession object that flows through the system:

CallSession
  ├── SessionID (unique per call)
  ├── CallerInfo (SIP headers, caller ID)
  ├── State (current FSM state)
  ├── Events[]
  │   ├── SpeechEvent (transcript, confidence, timing)
  │   ├── DTMFEvent (digit, duration)
  │   ├── TimeoutEvent (elapsed time)
  │   └── BackendResultEvent (response from integration)
  └── Actions[]
      ├── PlayTTS (text, voice)
      ├── Transfer (target SIP URI)
      ├── Hangup (reason code)
      └── CallHook (service, method, payload)
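
One possible Go representation of this tree is shown below. The field names follow the outline above, but the concrete types, the `Event` and `Action` interfaces, and the string identifiers returned by `Kind` and `Name` are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Event is anything that can appear in CallSession.Events.
type Event interface{ Kind() string }

type SpeechEvent struct {
	Transcript string
	Confidence float64
	Start, End time.Duration // timing as offsets from call start (assumed)
}

type DTMFEvent struct {
	Digit    rune
	Duration time.Duration
}

type TimeoutEvent struct{ Elapsed time.Duration }

type BackendResultEvent struct{ Response []byte }

func (SpeechEvent) Kind() string        { return "speech" }
func (DTMFEvent) Kind() string          { return "dtmf" }
func (TimeoutEvent) Kind() string       { return "timeout" }
func (BackendResultEvent) Kind() string { return "backend_result" }

// Action is anything that can appear in CallSession.Actions.
type Action interface{ Name() string }

type PlayTTS struct{ Text, Voice string }

type Transfer struct{ TargetURI string }

type Hangup struct{ ReasonCode int }

type CallHook struct {
	Service, Method string
	Payload         []byte
}

func (PlayTTS) Name() string  { return "play_tts" }
func (Transfer) Name() string { return "transfer" }
func (Hangup) Name() string   { return "hangup" }
func (CallHook) Name() string { return "call_hook" }

type CallerInfo struct {
	CallerID   string
	SIPHeaders map[string]string
}

// CallSession mirrors the tree above: identity, caller, FSM state,
// plus the event and action histories.
type CallSession struct {
	SessionID string
	Caller    CallerInfo
	State     string
	Events    []Event
	Actions   []Action
}

func main() {
	s := CallSession{SessionID: "call-1", State: "greeting"}
	s.Events = append(s.Events, DTMFEvent{Digit: '1', Duration: 120 * time.Millisecond})
	s.Actions = append(s.Actions, PlayTTS{Text: "Welcome", Voice: "default"})
	fmt.Println(s.SessionID, s.Events[0].Kind(), s.Actions[0].Name())
}
```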

Service Details

Media Gateway

The Media Gateway is the telephony boundary of the system. It speaks SIP and RTP so the rest of the platform does not have to.

Responsibilities:

SIP endpoint (INVITE, BYE, CANCEL, re-INVITE, hold/resume)
RTP audio reception and transmission
Codec negotiation and transcoding (G.711 μ-law, G.711 A-law, Opus)
Jitter buffer management
DTMF detection (RFC 2833, superseded by RFC 4733, and in-band)
Call lifecycle management

Implementation: Go with cgo bindings to PJSIP or a lightweight SIP stack.

Output: Normalized 16 kHz mono PCM stream per active call.
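
To make the normalization step concrete, the sketch below decodes G.711 μ-law bytes (sampled at 8 kHz) to 16-bit linear PCM and doubles the sample rate by linear interpolation. The function names are hypothetical, and the interpolation is a deliberate simplification: a production resampler would normally apply a low-pass filter, and the document does not say which method the Media Gateway uses.

```go
package main

import "fmt"

// DecodeMuLaw expands one G.711 mu-law byte to a 16-bit linear sample
// using the standard bias (0x84) and segment/mantissa layout.
func DecodeMuLaw(u byte) int16 {
	u = ^u // mu-law bytes are transmitted bit-inverted
	sign := u & 0x80
	exponent := (u >> 4) & 0x07
	mantissa := int(u & 0x0F)
	sample := ((mantissa << 3) + 0x84) << exponent
	sample -= 0x84
	if sign != 0 {
		return int16(-sample)
	}
	return int16(sample)
}

// Upsample2x doubles the sample rate (for example 8 kHz to 16 kHz) by
// inserting the midpoint between neighbouring samples. The final sample
// is repeated because it has no right-hand neighbour.
func Upsample2x(in []int16) []int16 {
	out := make([]int16, 0, len(in)*2)
	for i, s := range in {
		next := s
		if i+1 < len(in) {
			next = in[i+1]
		}
		mid := int16((int32(s) + int32(next)) / 2)
		out = append(out, s, mid)
	}
	return out
}

// MuLawTo16k converts an RTP payload of mu-law bytes to 16 kHz PCM.
func MuLawTo16k(payload []byte) []int16 {
	pcm := make([]int16, len(payload))
	for i, b := range payload {
		pcm[i] = DecodeMuLaw(b)
	}
	return Upsample2x(pcm)
}

func main() {
	// A 20 ms G.711 packet carries 160 bytes; 0xFF encodes silence.
	packet := make([]byte, 160)
	for i := range packet {
		packet[i] = 0xFF
	}
	fmt.Println("output samples:", len(MuLawTo16k(packet)))
}
```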

Speech Gateway

The Speech Gateway handles all speech processing — both recognition (ASR) and synthesis (TTS).

Responsibilities:

Real-time speech recognition using whisper.cpp
Voice activity detection (VAD) and audio segmentation
Partial transcript streaming (interim results)
Final transcript delivery
Text-to-speech rendering
Per-call worker pool management

Implementation: Go service wrapping whisper.cpp via cgo. Optional faster-whisper (Python) backend for GPU-heavy workloads.

API: Internal ConnectRPC (high-performance binary protocol). External-facing APIs use REST/JSON — no special client tooling required.
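
As an illustration of voice activity detection and segmentation, the sketch below marks frames as voiced when their RMS energy crosses a threshold and groups consecutive voiced frames into segments. This energy-threshold approach is a stand-in chosen for brevity; the document does not specify which VAD algorithm the Speech Gateway uses, and `SegmentSpeech`, `Segment`, and the threshold value are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// Segment is a half-open range of frame indices [Start, End) judged to be speech.
type Segment struct{ Start, End int }

// rms returns the root-mean-square amplitude of a frame.
func rms(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		sum += float64(s) * float64(s)
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// SegmentSpeech splits pcm into frames of frameLen samples and returns the
// runs of consecutive frames whose RMS is at or above threshold. Trailing
// samples that do not fill a whole frame are ignored.
func SegmentSpeech(pcm []int16, frameLen int, threshold float64) []Segment {
	if frameLen <= 0 {
		return nil
	}
	var segs []Segment
	start := -1 // index of the first frame of the open segment, or -1
	n := len(pcm) / frameLen
	for f := 0; f < n; f++ {
		voiced := rms(pcm[f*frameLen:(f+1)*frameLen]) >= threshold
		switch {
		case voiced && start < 0:
			start = f
		case !voiced && start >= 0:
			segs = append(segs, Segment{Start: start, End: f})
			start = -1
		}
	}
	if start >= 0 {
		segs = append(segs, Segment{Start: start, End: n})
	}
	return segs
}

func main() {
	// Four two-sample frames: silence, speech, speech, silence.
	pcm := []int16{0, 0, 1000, -1000, 1000, -1000, 0, 0}
	fmt.Println(SegmentSpeech(pcm, 2, 500))
}
```

In a real pipeline each segment would be handed to the recognizer, with interim results streamed while a segment is still open.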

Conversation Runtime

The Conversation Runtime is the core differentiator. It is not a chatbot builder — it is a deterministic runtime for dialog execution.

Responsibilities:

Turn detection (endpoint detection, barge-in handling)
Dialog state machine execution
Tool/action invocation
DTMF-driven menus
Timeout handling
Optional LLM node evaluation

Conversation Model:

# A dialog is a finite state machine
Dialog:
  name: string
  states:
    state_name:
      on_enter: Action[]
      transitions:
        - event: EventType
          condition: Expression  # optional
          target: StateName
          actions: Action[]      # optional
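
A minimal Go sketch of how such a dialog definition could be executed follows. The types mirror the schema above, but several details are assumptions: events and actions are reduced to plain strings, conditions are Go functions over an event payload, and transition actions are run before the target state's `on_enter` actions.

```go
package main

import "fmt"

// Transition mirrors one entry under `transitions` in the schema.
type Transition struct {
	Event     string
	Condition func(payload string) bool // optional; nil means always true
	Target    string
	Actions   []string // optional
}

// State mirrors one entry under `states`.
type State struct {
	OnEnter     []string
	Transitions []Transition
}

// Dialog is a finite state machine keyed by state name.
type Dialog struct {
	Name   string
	States map[string]State
}

// Step evaluates one event against the current state. It returns the next
// state, the actions to run, and whether any transition matched. The first
// matching transition wins, which keeps execution deterministic.
func (d Dialog) Step(current, event, payload string) (string, []string, bool) {
	st, ok := d.States[current]
	if !ok {
		return current, nil, false
	}
	for _, t := range st.Transitions {
		if t.Event != event {
			continue
		}
		if t.Condition != nil && !t.Condition(payload) {
			continue
		}
		// Assumed ordering: transition actions first, then on_enter of the target.
		actions := append([]string{}, t.Actions...)
		actions = append(actions, d.States[t.Target].OnEnter...)
		return t.Target, actions, true
	}
	return current, nil, false
}

func main() {
	d := Dialog{
		Name: "main_menu",
		States: map[string]State{
			"menu": {
				OnEnter: []string{"play_menu"},
				Transitions: []Transition{
					{Event: "dtmf", Condition: func(p string) bool { return p == "1" }, Target: "billing"},
					{Event: "timeout", Target: "goodbye"},
				},
			},
			"billing": {OnEnter: []string{"play_billing"}},
			"goodbye": {OnEnter: []string{"hangup"}},
		},
	}
	next, actions, ok := d.Step("menu", "dtmf", "1")
	fmt.Println(next, actions, ok)
}
```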

Key Design Decisions:

Deterministic by default — LLM nodes are opt-in, not the foundation
State machine is serializable — calls survive restarts
Per-call state isolation — no shared mutable state between calls
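
The serializability decision can be illustrated with a snapshot that round-trips through JSON. This is a sketch under stated assumptions: the `Snapshot` type, its fields, and the choice of JSON as the encoding are hypothetical, since the document does not describe the actual persistence format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Snapshot holds the minimum needed to resume a call after a restart:
// which call, which dialog, and which FSM state it was in.
type Snapshot struct {
	SessionID string `json:"session_id"`
	Dialog    string `json:"dialog"`
	State     string `json:"state"`
}

// Save serializes a snapshot for an external state store.
func Save(s Snapshot) ([]byte, error) {
	return json.Marshal(s)
}

// Restore rebuilds a snapshot from its serialized form.
func Restore(data []byte) (Snapshot, error) {
	var s Snapshot
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	before := Snapshot{SessionID: "call-1", Dialog: "main_menu", State: "billing"}
	data, err := Save(before)
	if err != nil {
		panic(err)
	}
	after, err := Restore(data)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip preserved state:", before == after)
}
```

Because each snapshot carries only its own call's data, restoring one session never touches another, which fits the per-call isolation rule.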

Integration Gateway

The Integration Gateway is the boundary between voicetyped and customer systems.

Responsibilities:

Outbound REST/HTTP calls to customer backends
Authentication (mTLS, API keys, OAuth2)
Retry with exponential backoff
Rate limiting (per-service, per-call)
Circuit breaking (closed, open, and half-open states)
Request/response logging

Implementation: Go service with configurable backends.
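
As an example of the retry responsibility, the sketch below computes an exponentially growing, capped delay and retries a call until it succeeds or attempts run out. `BackoffDelay`, `Retry`, and `ErrTransient` are hypothetical names, and the base delay, cap, and attempt count are illustrative values rather than voicetyped defaults. Production implementations usually add random jitter so that many callers do not retry in lockstep.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTransient stands in for a retryable backend failure.
var ErrTransient = errors.New("transient backend failure")

// BackoffDelay returns base * 2^attempt, capped at max. Attempt 0 is the
// delay after the first failure.
func BackoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max // stop early, which also avoids overflow
		}
	}
	if d > max {
		return max
	}
	return d
}

// Retry invokes call up to maxAttempts times (maxAttempts must be >= 1),
// sleeping with exponential backoff between failures. It returns nil on
// the first success, or the last error once attempts are exhausted.
func Retry(maxAttempts int, base, max time.Duration, call func() error) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = call(); err == nil {
			return nil
		}
		if attempt < maxAttempts-1 {
			time.Sleep(BackoffDelay(attempt, base, max))
		}
	}
	return err
}

func main() {
	calls := 0
	err := Retry(5, time.Millisecond, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return ErrTransient // fail the first two attempts
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```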

Scaling Strategy

Horizontal Scaling

Each service can be scaled independently:

Service	Scale Factor	Strategy
Media Gateway	Active calls	1 instance per ~100 concurrent calls
Speech Gateway	ASR workload	GPU instances, worker pool sizing
Conversation Runtime	Active sessions	Stateless with external state store
Integration Gateway	Backend call volume	Standard horizontal scaling

Kubernetes Scaling

In Kubernetes, the Helm chart configures Horizontal Pod Autoscalers:

autoscaling:
  mediaGateway:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPU: 70
  speechGateway:
    enabled: true
    minReplicas: 1
    maxReplicas: 5
    targetGPUUtilization: 80
  runtime:
    enabled: true
    minReplicas: 2
    maxReplicas: 20
    targetCPU: 60
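
For intuition about how these values interact, the function below applies the core formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × currentMetric / targetMetric), and clamps the result to the configured bounds. It is a simplified sketch: the real controller also applies a tolerance band and stabilization windows, and `DesiredReplicas` is a hypothetical helper, not part of the Helm chart.

```go
package main

import (
	"fmt"
	"math"
)

// DesiredReplicas applies the basic HPA scaling formula and clamps the
// result to [minReplicas, maxReplicas]. Metrics are percentages here,
// matching targetCPU in the values above; targetMetric must be non-zero.
func DesiredReplicas(current int, currentMetric, targetMetric float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(float64(current) * currentMetric / targetMetric))
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// mediaGateway: 4 pods averaging 90% CPU against a 70% target, bounds 2..10.
	fmt.Println("mediaGateway:", DesiredReplicas(4, 90, 70, 2, 10))
	// runtime: 2 pods averaging 90% CPU against a 60% target, bounds 2..20.
	fmt.Println("runtime:", DesiredReplicas(2, 90, 60, 2, 20))
}
```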

Data Flow Guarantees

Audio never leaves the deployment — all ASR processing is local
Transcripts are ephemeral — stored only in call session memory unless explicitly persisted
Actions are idempotent: the Integration Gateway delivers each action at least once and deduplicates retries, so a repeated delivery has no additional effect
State is recoverable — call sessions can be serialized and restored after restarts
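
The deduplication guarantee can be sketched with an idempotency key: the first delivery of a key executes the action, and any redelivery returns the recorded result without executing again. The `Deduper` type and the key format are hypothetical. The in-memory map is also an assumption made for brevity; deduplication that survives restarts would need a durable store.

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper remembers the result of each action by idempotency key.
type Deduper struct {
	mu      sync.Mutex
	results map[string]string
}

func NewDeduper() *Deduper {
	return &Deduper{results: make(map[string]string)}
}

// Do runs exec only the first time key is seen. It returns the stored
// result and whether exec actually ran on this call. The lock is held
// while exec runs, which keeps the sketch simple but serializes all
// actions; a real gateway would likely lock per key.
func (d *Deduper) Do(key string, exec func() string) (string, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if r, ok := d.results[key]; ok {
		return r, false
	}
	r := exec()
	d.results[key] = r
	return r, true
}

func main() {
	d := NewDeduper()
	backendCalls := 0
	charge := func() string {
		backendCalls++
		return "charged"
	}
	// The same action is delivered twice, as at-least-once delivery allows.
	first, ran1 := d.Do("call-1/action-7", charge)
	second, ran2 := d.Do("call-1/action-7", charge)
	fmt.Println(first, ran1, second, ran2, "backend calls:", backendCalls)
}
```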

Next Steps

Media Gateway — deep dive into SIP and RTP handling
Speech Gateway — configure ASR models and tuning
Conversation Runtime — build dialog state machines
Integration Gateway — connect to your backend