Architecture Overview
How voicetyped's four core services work together to process voice calls.
voicetyped is composed of four main services that form a complete voice processing pipeline. Each service has a single responsibility and communicates with adjacent services via well-defined interfaces. This architecture enables independent scaling, testing, and replacement of any component.
High-Level Architecture
       SIP / WebRTC
             │
             ▼
    ┌─────────────────┐
    │  Media Gateway  │ ← SIP endpoint, RTP audio, codec handling
    └────────┬────────┘
             │ PCM stream
             ▼
    ┌─────────────────┐
    │ Speech Gateway  │ ← Local ASR (whisper.cpp), TTS
    └────────┬────────┘
             │ Transcripts
             ▼
    ┌─────────────────┐
    │  Conversation   │ ← Turn detection, dialog FSM, tool invocation
    │     Runtime     │
    └────────┬────────┘
             │ Actions
             ▼
    ┌─────────────────┐
    │   Integration   │ ← REST/HTTP to customer backend
    │     Gateway     │
    └─────────────────┘
             │
             ▼
     Customer Backend
Service Interactions
Call Flow
When an inbound call arrives, the services interact in this sequence:
- Media Gateway receives the SIP INVITE, negotiates codecs, and begins extracting RTP audio
- Speech Gateway receives the PCM audio stream and begins producing transcripts
- Conversation Runtime receives transcript events and evaluates them against the active dialog FSM
- Integration Gateway executes any actions that require calling external systems
- Results flow back through the stack: Integration → Runtime → Speech (TTS) → Media → Caller
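The forward half of this sequence can be pictured as a chain of streaming stages. The sketch below is only an illustration in Go using channels: the type names (`PCMChunk`, `Transcript`, `Action`, `Result`), the `stage` helper, and the placeholder stage bodies are all assumptions made for the example, and the real services communicate over ConnectRPC rather than in-process channels.

```go
package main

import "fmt"

// Hypothetical payload types, one per hop in the call flow.
type PCMChunk []int16
type Transcript struct{ Text string }
type Action struct{ Name string }
type Result struct{ Reply string }

// source emits each chunk on a channel and then closes it.
func source(chunks []PCMChunk) <-chan PCMChunk {
	out := make(chan PCMChunk)
	go func() {
		defer close(out)
		for _, c := range chunks {
			out <- c
		}
	}()
	return out
}

// stage applies f to every value from in and forwards the result, preserving order.
func stage[In, Out any](in <-chan In, f func(In) Out) <-chan Out {
	out := make(chan Out)
	go func() {
		defer close(out)
		for v := range in {
			out <- f(v)
		}
	}()
	return out
}

// RunPipeline wires Media -> Speech -> Runtime -> Integration with placeholder logic.
func RunPipeline(chunks []PCMChunk) []Result {
	pcm := source(chunks) // stands in for the Media Gateway output
	transcripts := stage(pcm, func(c PCMChunk) Transcript {
		return Transcript{Text: fmt.Sprintf("utterance of %d samples", len(c))}
	})
	actions := stage(transcripts, func(t Transcript) Action {
		return Action{Name: "lookup:" + t.Text}
	})
	results := stage(actions, func(a Action) Result {
		return Result{Reply: "done " + a.Name}
	})
	var out []Result
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	for _, r := range RunPipeline([]PCMChunk{make(PCMChunk, 160), make(PCMChunk, 320)}) {
		fmt.Println(r.Reply)
	}
}
```

Each stage only knows the type it consumes and the type it produces, which mirrors the claim that a service talks only to its neighbours.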
Communication Protocols
| From | To | Protocol | Data |
|---|---|---|---|
| Media Gateway | Speech Gateway | Internal ConnectRPC stream | PCM audio chunks |
| Speech Gateway | Conversation Runtime | Internal ConnectRPC stream | Transcript events |
| Conversation Runtime | Integration Gateway | Internal ConnectRPC | Action requests |
| Integration Gateway | Customer Backend | REST / HTTP | Business logic calls |
| Conversation Runtime | Speech Gateway | Internal ConnectRPC | TTS requests |
| Speech Gateway | Media Gateway | Internal ConnectRPC stream | Audio playback |
Call Session Model
Every active call is represented as a CallSession object that flows through the system:
    CallSession
    ├── SessionID (unique per call)
    ├── CallerInfo (SIP headers, caller ID)
    ├── State (current FSM state)
    ├── Events[]
    │   ├── SpeechEvent (transcript, confidence, timing)
    │   ├── DTMFEvent (digit, duration)
    │   ├── TimeoutEvent (elapsed time)
    │   └── BackendResultEvent (response from integration)
    └── Actions[]
        ├── PlayTTS (text, voice)
        ├── Transfer (target SIP URI)
        ├── Hangup (reason code)
        └── CallHook (service, method, payload)
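One way to read this tree is as a plain Go struct that can be encoded and decoded. The sketch below is a guess at a shape, not the actual voicetyped types: it flattens the four event kinds and four action kinds into single `Event` and `Action` structs with a `Kind` field, and the JSON field names and the `Snapshot`/`Restore` helpers are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a flattened stand-in for SpeechEvent, DTMFEvent, TimeoutEvent
// and BackendResultEvent. Kind says which optional fields are meaningful.
type Event struct {
	Kind       string  `json:"kind"` // "speech", "dtmf", "timeout", "backend_result"
	Transcript string  `json:"transcript,omitempty"`
	Confidence float64 `json:"confidence,omitempty"`
	Digit      string  `json:"digit,omitempty"`
}

// Action is a flattened stand-in for PlayTTS, Transfer, Hangup and CallHook.
type Action struct {
	Kind   string `json:"kind"` // "play_tts", "transfer", "hangup", "call_hook"
	Text   string `json:"text,omitempty"`
	Target string `json:"target,omitempty"`
}

// CallSession mirrors the top level of the tree above.
type CallSession struct {
	SessionID  string            `json:"session_id"`
	CallerInfo map[string]string `json:"caller_info"`
	State      string            `json:"state"`
	Events     []Event           `json:"events"`
	Actions    []Action          `json:"actions"`
}

// Snapshot serializes a session (JSON is an assumption; any encoding would do).
func Snapshot(s CallSession) ([]byte, error) { return json.Marshal(s) }

// Restore rebuilds a session from a snapshot.
func Restore(data []byte) (CallSession, error) {
	var s CallSession
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	s := CallSession{
		SessionID:  "call-0001",
		CallerInfo: map[string]string{"caller_id": "+15550100"},
		State:      "menu",
		Events:     []Event{{Kind: "dtmf", Digit: "1"}},
		Actions:    []Action{{Kind: "play_tts", Text: "Connecting you to billing."}},
	}
	data, err := Snapshot(s)
	if err != nil {
		panic(err)
	}
	restored, err := Restore(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(restored.SessionID, restored.State, len(restored.Events))
}
```

Keeping the whole session in one serializable value is what makes it possible to hand a call from one service instance to another.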
Service Details
Media Gateway
The Media Gateway is the telephony boundary of the system. It speaks SIP and RTP so the rest of the platform does not have to.
Responsibilities:
- SIP endpoint (INVITE, BYE, CANCEL, re-INVITE, hold/resume)
- RTP audio reception and transmission
- Codec negotiation and transcoding (G.711 μ-law, G.711 A-law, Opus)
- Jitter buffer management
- DTMF detection (RFC 2833 telephone events, now specified by RFC 4733, and in-band tones)
- Call lifecycle management
Implementation: Go with cgo bindings to PJSIP or a lightweight SIP stack.
Output: Normalized 16 kHz mono PCM stream per active call.
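To make the normalization step concrete, here is a minimal Go sketch of turning G.711 μ-law bytes (8 kHz) into 16-bit linear PCM and doubling the sample rate. The decode follows the standard G.711 μ-law expansion; the function names are hypothetical, and the linear-interpolation upsampler is a deliberately naive stand-in, since the document does not say how the gateway resamples and a real resampler would apply a low-pass filter.

```go
package main

import "fmt"

// ulawBias is the offset added during G.711 μ-law encoding.
const ulawBias = 0x84

// DecodeULaw converts one G.711 μ-law byte to a 16-bit linear PCM sample.
func DecodeULaw(b byte) int16 {
	b = ^b // μ-law bytes are transmitted bit-inverted
	exponent := (b >> 4) & 0x07
	mantissa := b & 0x0F
	magnitude := ((int(mantissa) << 3) + ulawBias) << exponent
	magnitude -= ulawBias
	if b&0x80 != 0 {
		return int16(-magnitude)
	}
	return int16(magnitude)
}

// DecodeFrame decodes a whole RTP payload of μ-law bytes.
func DecodeFrame(payload []byte) []int16 {
	out := make([]int16, len(payload))
	for i, b := range payload {
		out[i] = DecodeULaw(b)
	}
	return out
}

// Upsample2x doubles the sample rate (for example 8 kHz to 16 kHz) by
// inserting the midpoint between neighbouring samples. Naive by design.
func Upsample2x(in []int16) []int16 {
	out := make([]int16, 0, len(in)*2)
	for i, s := range in {
		next := s // the last sample is simply repeated
		if i+1 < len(in) {
			next = in[i+1]
		}
		out = append(out, s, int16((int(s)+int(next))/2))
	}
	return out
}

func main() {
	pcm8k := DecodeFrame([]byte{0xFF, 0x80, 0x00})
	fmt.Println(pcm8k, Upsample2x(pcm8k))
}
```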
Speech Gateway
The Speech Gateway handles all speech processing — both recognition (ASR) and synthesis (TTS).
Responsibilities:
- Real-time speech recognition using whisper.cpp
- Voice activity detection (VAD) and audio segmentation
- Partial transcript streaming (interim results)
- Final transcript delivery
- Text-to-speech rendering
- Per-call worker pool management
Implementation: Go service wrapping whisper.cpp via cgo. Optional faster-whisper (Python) backend for GPU-heavy workloads.
API: Internal ConnectRPC (high-performance binary protocol). External-facing APIs use REST/JSON — no special client tooling required.
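The voice activity detection and segmentation responsibility can be illustrated with a simple energy threshold. This is a sketch under stated assumptions, not the Speech Gateway's actual detector: the document does not describe the VAD algorithm, and the `Segmenter` type, its `Threshold` and `Hangover` fields, and the RMS rule are hypothetical choices made for the example.

```go
package main

import (
	"fmt"
	"math"
)

// frameRMS returns the root-mean-square amplitude of one PCM frame.
func frameRMS(frame []int16) float64 {
	if len(frame) == 0 {
		return 0
	}
	var sum float64
	for _, s := range frame {
		v := float64(s)
		sum += v * v
	}
	return math.Sqrt(sum / float64(len(frame)))
}

// Segmenter groups voiced frames into utterances for the recognizer.
type Segmenter struct {
	Threshold float64 // RMS at or above this counts as speech (tuning value, assumed)
	Hangover  int     // consecutive silent frames that close an utterance

	silent  int
	current []int16
}

// Push feeds one frame. It returns a finished utterance when enough
// silence has followed speech, and nil otherwise.
func (s *Segmenter) Push(frame []int16) []int16 {
	if frameRMS(frame) >= s.Threshold {
		s.current = append(s.current, frame...)
		s.silent = 0
		return nil
	}
	if len(s.current) == 0 {
		return nil // still in leading silence
	}
	s.silent++
	if s.silent < s.Hangover {
		s.current = append(s.current, frame...) // keep short pauses inside the utterance
		return nil
	}
	segment := s.current
	s.current, s.silent = nil, 0
	return segment
}

func main() {
	seg := &Segmenter{Threshold: 500, Hangover: 2}
	loud, quiet := []int16{1000, -1000}, []int16{0, 0}
	for _, f := range [][]int16{quiet, loud, loud, quiet, quiet} {
		if utterance := seg.Push(f); utterance != nil {
			fmt.Println("utterance ready:", len(utterance), "samples")
		}
	}
}
```

The hangover count trades latency for robustness: a larger value avoids cutting a speaker off mid-pause but delays the final transcript.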
Conversation Runtime
The Conversation Runtime is the core differentiator. It is not a chatbot builder — it is a deterministic runtime for dialog execution.
Responsibilities:
- Turn detection (endpoint detection, barge-in handling)
- Dialog state machine execution
- Tool/action invocation
- DTMF-driven menus
- Timeout handling
- Optional LLM node evaluation
Conversation Model:
    # A dialog is a finite state machine
    Dialog:
      name: string
      states:
        state_name:
          on_enter: Action[]
          transitions:
            - event: EventType
              condition: Expression  # optional
              target: StateName
              actions: Action[]      # optional
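The schema above maps naturally onto a small table-driven state machine. The following Go sketch is one possible reading of it, not the runtime's real implementation: the `Step` method, the rule that the first matching transition wins, the use of strings for events and actions, and the sample `main_menu` dialog are all assumptions.

```go
package main

import "fmt"

// Transition mirrors one entry under "transitions" in the schema.
type Transition struct {
	Event     string
	Condition func(payload string) bool // nil means the transition always applies
	Target    string
	Actions   []string
}

// State mirrors one entry under "states".
type State struct {
	OnEnter     []string
	Transitions []Transition
}

// Dialog is a named finite state machine.
type Dialog struct {
	Name   string
	States map[string]State
}

// Step evaluates one event against the current state. It returns the next
// state plus the transition actions followed by the target's on_enter actions.
// Events that match no transition leave the state unchanged.
func (d Dialog) Step(current, event, payload string) (string, []string) {
	for _, t := range d.States[current].Transitions {
		if t.Event != event {
			continue
		}
		if t.Condition != nil && !t.Condition(payload) {
			continue
		}
		actions := append([]string{}, t.Actions...)
		actions = append(actions, d.States[t.Target].OnEnter...)
		return t.Target, actions
	}
	return current, nil
}

// exampleDialog builds a tiny DTMF menu used for illustration only.
func exampleDialog() Dialog {
	pressedOne := func(p string) bool { return p == "1" }
	return Dialog{
		Name: "main_menu",
		States: map[string]State{
			"menu": {
				OnEnter: []string{"play_tts:menu"},
				Transitions: []Transition{
					{Event: "dtmf", Condition: pressedOne, Target: "billing", Actions: []string{"log_choice"}},
					{Event: "timeout", Target: "goodbye"},
				},
			},
			"billing": {OnEnter: []string{"play_tts:billing"}},
			"goodbye": {OnEnter: []string{"hangup"}},
		},
	}
}

func main() {
	d := exampleDialog()
	next, actions := d.Step("menu", "dtmf", "1")
	fmt.Println(next, actions)
}
```

Because `Step` is a pure function of the current state and the event, the same input always yields the same output, which is the sense in which such a runtime is deterministic.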
Key Design Decisions:
- Deterministic by default — LLM nodes are opt-in, not the foundation
- State machine is serializable — calls survive restarts
- Per-call state isolation — no shared mutable state between calls
Integration Gateway
The Integration Gateway is the boundary between voicetyped and customer systems.
Responsibilities:
- Outbound REST/HTTP calls to customer backends
- Authentication (mTLS, API keys, OAuth2)
- Retry with exponential backoff
- Rate limiting (per-service, per-call)
- Circuit breaking (closed, open, and half-open states)
- Request/response logging
Implementation: Go service with configurable backends.
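Two of the responsibilities listed for this service, retry with exponential backoff and circuit breaking, are easier to follow in code. The Go sketch below is a simplified illustration rather than the gateway's implementation: `BackoffDelay`, `Breaker`, and their fields are hypothetical names, the backoff omits jitter, and the breaker is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// BackoffDelay returns the wait before retry number attempt (0-based),
// doubling from base each time and never exceeding max.
func BackoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// BreakerState is one of the three classic circuit breaker states.
type BreakerState int

const (
	Closed   BreakerState = iota // calls pass through
	Open                         // calls are rejected immediately
	HalfOpen                     // one probe call is allowed
)

// ErrOpen is returned while the breaker is rejecting calls.
var ErrOpen = errors.New("circuit open")

// Breaker opens after MaxFailures consecutive failures and allows a
// probe once Cooldown has elapsed.
type Breaker struct {
	MaxFailures int
	Cooldown    time.Duration

	state    BreakerState
	failures int
	openedAt time.Time
}

// State reports the current breaker state.
func (b *Breaker) State() BreakerState { return b.state }

// Call runs fn unless the breaker is open. The clock is passed in so the
// behaviour is easy to test.
func (b *Breaker) Call(now time.Time, fn func() error) error {
	if b.state == Open {
		if now.Sub(b.openedAt) < b.Cooldown {
			return ErrOpen
		}
		b.state = HalfOpen // cooldown elapsed: let one probe through
	}
	err := fn()
	if err == nil {
		b.state, b.failures = Closed, 0
		return nil
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.MaxFailures {
		b.state, b.openedAt = Open, now
	}
	return err
}

func main() {
	b := &Breaker{MaxFailures: 2, Cooldown: 5 * time.Second}
	now := time.Now()
	down := func() error { return errors.New("backend unavailable") }
	for i := 0; i < 3; i++ {
		err := b.Call(now, down)
		fmt.Println(err, "| next retry in", BackoffDelay(i, 200*time.Millisecond, 2*time.Second))
	}
}
```

In the demo the third call never reaches the backend: the breaker has already opened, so the caller gets `ErrOpen` straight away instead of waiting on a failing dependency.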
Scaling Strategy
Horizontal Scaling
Each service can be scaled independently:
| Service | Scales With | Strategy |
|---|---|---|
| Media Gateway | Active calls | 1 instance per ~100 concurrent calls |
| Speech Gateway | ASR workload | GPU instances, worker pool sizing |
| Conversation Runtime | Active sessions | Stateless with external state store |
| Integration Gateway | Backend call volume | Standard horizontal scaling |
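The Media Gateway row gives a rule of thumb that is easy to turn into a capacity estimate. A minimal sketch, assuming the roughly 100 calls per instance figure from the table is used as a hard ceiling (the function name is hypothetical, and real sizing would leave headroom):

```go
package main

import "fmt"

// MediaGatewayInstances returns how many instances are needed for the given
// number of concurrent calls, rounding up to a whole instance.
func MediaGatewayInstances(concurrentCalls, callsPerInstance int) int {
	if concurrentCalls <= 0 || callsPerInstance <= 0 {
		return 0
	}
	return (concurrentCalls + callsPerInstance - 1) / callsPerInstance
}

func main() {
	for _, calls := range []int{80, 250, 1000} {
		fmt.Println(calls, "calls need", MediaGatewayInstances(calls, 100), "instance(s)")
	}
}
```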
Kubernetes Scaling
In Kubernetes, the Helm chart configures Horizontal Pod Autoscalers:
    autoscaling:
      mediaGateway:
        enabled: true
        minReplicas: 2
        maxReplicas: 10
        targetCPU: 70
      speechGateway:
        enabled: true
        minReplicas: 1
        maxReplicas: 5
        targetGPUUtilization: 80
      runtime:
        enabled: true
        minReplicas: 2
        maxReplicas: 20
        targetCPU: 60
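To see how these targets turn into replica counts, it helps to work through the scaling rule that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric), bounded by the configured minimum and maximum. The Go sketch below reproduces only that core formula; it ignores the autoscaler's tolerance band and stabilization window, and the function name is hypothetical. Note also that GPU utilization is not a built-in HPA resource metric, so a value like `targetGPUUtilization` presumably relies on a custom metrics pipeline.

```go
package main

import (
	"fmt"
	"math"
)

// DesiredReplicas applies the core HPA formula and clamps the result to
// the [minReplicas, maxReplicas] range from the chart values.
func DesiredReplicas(current int, currentMetric, targetMetric float64, minReplicas, maxReplicas int) int {
	desired := int(math.Ceil(float64(current) * currentMetric / targetMetric))
	if desired < minReplicas {
		desired = minReplicas
	}
	if desired > maxReplicas {
		desired = maxReplicas
	}
	return desired
}

func main() {
	// Runtime: 4 replicas averaging 90% CPU against a 60% target.
	fmt.Println(DesiredReplicas(4, 90, 60, 2, 20))
	// Media Gateway: 8 replicas at 140% of requested CPU against a 70% target.
	fmt.Println(DesiredReplicas(8, 140, 70, 2, 10))
}
```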
Data Flow Guarantees
- Audio never leaves the deployment: all ASR processing is local
- Transcripts are ephemeral: they are stored only in call session memory unless explicitly persisted
- Actions are idempotent: the Integration Gateway delivers each action at least once and deduplicates repeats
- State is recoverable: call sessions can be serialized and restored after restarts
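At-least-once delivery with deduplication is commonly built on an idempotency key per action. The sketch below shows that general technique in Go; it is not a description of how the Integration Gateway does it. The `Deduper` type, the key format, and the in-memory map are assumptions, and holding the lock while the action runs is a simplification that serializes all calls.

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper remembers the result of each action by its idempotency key.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]string // idempotency key -> cached result
}

// NewDeduper returns an empty Deduper.
func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]string)}
}

// Do runs fn the first time a key is seen and returns the cached result on
// repeats. Failures are not cached, so a retry with the same key runs again.
func (d *Deduper) Do(key string, fn func() (string, error)) (string, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if result, ok := d.seen[key]; ok {
		return result, nil
	}
	result, err := fn()
	if err != nil {
		return "", err
	}
	d.seen[key] = result
	return result, nil
}

func main() {
	d := NewDeduper()
	backendCalls := 0
	createTicket := func() (string, error) {
		backendCalls++
		return "ticket-42", nil
	}
	// The same action delivered twice, for example after a retry.
	for i := 0; i < 2; i++ {
		result, _ := d.Do("call-0001/action-3", createTicket)
		fmt.Println(result)
	}
	fmt.Println("backend calls:", backendCalls)
}
```

A key derived from the session and the action's position in it (as in the hypothetical `call-0001/action-3`) lets a retried delivery be recognized without the backend having to be idempotent itself.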
Next Steps
- Media Gateway — deep dive into SIP and RTP handling
- Speech Gateway — configure ASR models and tuning
- Conversation Runtime — build dialog state machines
- Integration Gateway — connect to your backend