Today we are publicly introducing voicetyped, a self-hosted service that terminates calls and exposes them as programmable, real-time sessions with local speech recognition and dialog control.
This is not another AI voice agent. It is not a chatbot builder. It is not a SaaS platform that processes your audio in someone else’s data center.
voicetyped is voice infrastructure — the backend layer that handles the hard problems of telephony, speech processing, and dialog execution so that engineering teams can focus on building their actual application.
The Problem
Organizations across healthcare, finance, government, and critical infrastructure want to add voice automation to their existing systems. The requirements are clear:
- Audio and transcripts must stay on-premises
- Integration with existing backend services is required
- No SaaS lock-in
- Must run in Kubernetes or on a single VM
- Must operate even with unreliable or no internet connectivity
Today, meeting these requirements means stitching together Asterisk or FreeSWITCH, cloud ASR APIs, custom LLM adapters, bespoke media pipelines, and fragile glue scripts. There is no modern, developer-first, self-hosted voice stack.
Until now.
What We Built
voicetyped is composed of four purpose-built services:
Media Gateway — Terminates SIP calls, handles RTP audio, and manages codec transcoding. This is the telephony boundary. It speaks SIP so the rest of your stack doesn’t have to.
Speech Gateway — Runs whisper.cpp locally for real-time speech recognition. No audio ever leaves your infrastructure. GPU acceleration is supported but not required.
Conversation Runtime — A deterministic dialog engine based on finite state machines. Not a chatbot — a runtime. Dialogs are defined in YAML, state transitions are logged, and LLM nodes are optional, not foundational.
Integration Gateway — Connects to your backend systems via REST and webhooks with built-in retry logic, circuit breaking, and rate limiting.
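The circuit-breaking behavior the Integration Gateway applies to backend calls can be sketched roughly like this. This is illustrative Python, not the shipped implementation; the class and parameter names are ours:

```python
import time

class CircuitBreaker:
    """Illustrative circuit breaker: after `threshold` consecutive
    failures, refuse further calls until `cooldown` seconds pass,
    then allow a single probe attempt (half-open)."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable for testing
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def allow(self):
        """Return True if a call to the backend may proceed."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Half-open: let one attempt through to probe recovery.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, ok):
        """Record the outcome of an attempted backend call."""
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
```

The point of the pattern is that a struggling backend stops receiving traffic for a cooldown window instead of being hammered by retries.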
Why a State Machine?
We deliberately chose finite state machines over LLM-driven conversation because:
- Deterministic — The same input always produces the same output. You can reason about every possible path.
- Auditable — Every state transition is logged. Compliance teams can review exactly what happened in every call.
- Fast — No LLM inference on the critical path, so response latency is bounded by audio and state handling, not model inference.
- Reliable — No hallucinations and no prompt-injection surface on the core path, because model output never drives control flow.
- Serializable — Call state survives service restarts. Calls don’t drop during deployments.
LLM nodes are available as optional components for states that genuinely need natural language understanding. But the foundation is deterministic.
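As a sketch, a dialog defined in this style might look like the following. The field names and schema here are hypothetical, for illustration only, not the actual voicetyped format:

```yaml
# Illustrative only — field names are hypothetical, not the shipped schema.
dialog: appointment_confirmation
initial: greet
states:
  greet:
    say: "Hello, this is a call to confirm your appointment."
    next: ask_confirm
  ask_confirm:
    say: "Say 'yes' to confirm or 'no' to cancel."
    on:
      "yes": confirmed
      "no": cancelled
  confirmed:
    webhook: appointment.confirmed
    say: "Your appointment is confirmed. Goodbye."
    next: end
  cancelled:
    webhook: appointment.cancelled
    next: end
```

Every transition in a definition like this is enumerable up front, which is what makes the audit and compliance story possible.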
Why Self-Hosted?
Every voice SaaS platform eventually hits the same wall: customers in regulated industries cannot send call audio to an external service. The data sovereignty requirements are non-negotiable.
voicetyped is designed for these environments from day one:
- Air-gapped deployment — Offline installer bundle with preloaded models. Zero external dependencies.
- Single-node deployment — One binary, systemd service. No Kubernetes required.
- Kubernetes deployment — Helm chart with GPU scheduling, autoscaling, and HA.
- mTLS everywhere — Mutual TLS between all components, per-client certificates.
- No phone-home — No license server, no telemetry, no analytics unless you configure it.
The API Surface
voicetyped exposes clean REST APIs that engineering teams actually want to integrate with:
GET /v1/calls/events # SSE stream of real-time call events
POST /v1/calls/{id}/tts # Play text-to-speech
POST /v1/calls/{id}/hangup # Hang up a call
POST /v1/calls/{id}/transfer # Transfer a call
POST /v1/speech/transcribe # Transcribe audio files
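The events endpoint uses the standard server-sent events wire format: `data:` lines terminated by a blank line. A minimal parser sketch in Python, assuming JSON event payloads (the `type` and `call_id` fields in the test data are hypothetical, not a documented schema):

```python
import json

def sse_events(lines):
    """Yield parsed JSON payloads from an SSE line stream,
    e.g. the body of GET /v1/calls/events."""
    data = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("data:"):
            # An event's payload may span multiple data: lines.
            data.append(line[5:].lstrip())
        elif line == "" and data:
            # Blank line terminates the event.
            yield json.loads("\n".join(data))
            data = []
```

In practice you would feed this from an HTTP client that keeps the response open and reconnects on disconnect.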
Your team implements a small webhook endpoint. voicetyped POSTs JSON to your server whenever a dialog event fires. Everything else — SIP, RTP, ASR, TTS, state management, retry logic — is handled.
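A webhook receiver can be this small. The sketch below uses only the Python standard library; the event payload shape (a `type` field with a `dialog.completed` value) is an assumption for illustration, not a documented contract:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def handle_event(event):
    """Decide what to do with a dialog event.
    The event shape here is hypothetical."""
    if event.get("type") == "dialog.completed":
        # e.g. record the call outcome in your backend here
        return {"status": "ok"}
    return {"status": "ignored"}

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(handle_event(event)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("", 8080), WebhookHandler).serve_forever()
```

Keeping the business logic in a plain function (`handle_event`) separate from the HTTP plumbing makes the handler trivially unit-testable.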
Who Is This For?
voicetyped is built for engineering teams at organizations that:
- Cannot send calls to the cloud
- Have legacy PBX infrastructure they need to integrate with
- Want programmable voice automation without SaaS lock-in
- Need to operate in air-gapped or unreliable network environments
- Value deterministic, auditable behavior over “AI magic”
The primary buyers are IT platform teams, digital transformation teams, and engineering managers — not call center managers.
Open-Core Model
We believe in earning adoption through open source:
Open Source (Apache 2.0):
- Speech Gateway (local ASR)
- Basic Media Gateway
- CLI tools and client libraries
Commercial License:
- Conversation Runtime
- Integration Gateway
- Enterprise security features
- Multi-tenant support
Start with the free components. Scale to enterprise when you’re ready.
What’s Next
We are currently in the MVP phase, focused on the core pipeline: SIP termination → ASR → dialog FSM → integration → TTS. From there, the roadmap extends through enterprise features and an optional intelligence layer.
If you’re building voice automation for internal operations, IT helpdesk intake, appointment confirmation, or any scenario where calls need to stay on-premises, we’d like to talk.
