STT
Local
- Whisper should run in near real time if the iGPU can be passed through.
- whisper.cpp small.en on an Intel HD 630: WER ~9-11%, RTF ~0.05
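
Real-time factor (RTF) is processing time divided by audio duration, so anything below 1 is faster than real time. A minimal sketch of what the ~0.05 figure would imply, taking that number at face value; the function name and the 10-second utterance are illustrative assumptions, not measurements:

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated transcription time, using RTF = processing time / audio duration."""
    return audio_seconds * rtf


# Hypothetical 10 s utterance at the RTF ~0.05 noted above.
print(f"{processing_seconds(10.0, 0.05):.2f} s")  # prints "0.50 s"
```

At that rate the transcription step itself would add well under a second per utterance, before any capture or model-load overhead.
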
Cloud
| Provider | Latency | WER | Cost | Free Tier | Notes |
|---|---|---|---|---|---|
| Deepgram Nova-3 | ~200ms | ~5-8% | $0.0077/min | $200 credit | WebSocket streaming, interim results |
| AssemblyAI | ~300ms | ~8.1% | $0.0037/min | $50 credit | Cheapest streaming, good accuracy |
| OpenAI Whisper API | 1-3s | ~7-9% | $0.006/min | None | Batch only; poor fit for conversational use |
| ElevenLabs Scribe | <150ms | Good | ~$0.012/min | Limited | Fastest, but pricier |
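
The per-minute rates above convert directly into a monthly figure. A rough sketch, assuming 30 minutes of audio per day, a 30-day month, and no free credits applied; the rates are copied from the table, and the helper name is made up for illustration:

```python
# USD per audio minute, copied from the table above.
RATES_PER_MIN = {
    "Deepgram Nova-3": 0.0077,
    "AssemblyAI": 0.0037,
    "OpenAI Whisper API": 0.006,
    "ElevenLabs Scribe": 0.012,  # listed as approximate in the table
}


def monthly_cost(rate_per_min: float, minutes_per_day: float, days: int = 30) -> float:
    """Monthly cost in USD for a flat per-minute rate (ignores free-tier credits)."""
    return rate_per_min * minutes_per_day * days


for name, rate in RATES_PER_MIN.items():
    # Assumed usage: 30 min/day.
    print(f"{name}: ${monthly_cost(rate, 30):.2f}/month")
```

Under these assumptions every provider lands between roughly $3 and $11 per month, so latency and streaming support matter more than price at this volume.
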
TTS
Local
- Kokoro looks really good and can run on CPU, but not on my current hardware.
Cloud
Best cloud TTS options for voice agents:
| Provider | Expressiveness | Latency (TTFA) | Cost | Voice Cloning | Why |
|---|---|---|---|---|---|
| ElevenLabs | Best-in-class naturalness | ~75ms inference / ~135ms P90 | $5/mo (30K chars) / $22/mo (100K chars) | Instant on all tiers | Gold standard |
| Cartesia | Realistic (breathing, laughter) | 40ms Turbo / 90ms Sonic 3 | ~$0.03/min (usage-based) | Instant + Pro cloning | Fastest TTFA |
| Hume AI | Detects emotion from context | ~100-200ms | $7.60/1M chars | Via EVI | Best value |
Cost
Assuming average usage of about 30 min/day for TTS agents:
| Provider | Est. Monthly |
|---|---|
| Hume AI | $0-2 |
| ElevenLabs Starter | $5 |
| Cartesia | ~$5-10 |
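
As a rough cross-check on the usage-based row, a monthly budget can be converted back into minutes of synthesized audio. A sketch assuming Cartesia's ~$0.03/min rate from the TTS table and flat per-minute billing; the helper name is hypothetical:

```python
def minutes_for_budget(budget_usd: float, rate_per_min: float) -> float:
    """Minutes of synthesized audio a monthly budget buys at a flat per-minute rate."""
    return budget_usd / rate_per_min


# Cartesia's ~$5-10 monthly estimate at ~$0.03/min (rate taken from the TTS table).
low = minutes_for_budget(5, 0.03)
high = minutes_for_budget(10, 0.03)
print(f"{low:.0f}-{high:.0f} min/month")  # prints "167-333 min/month"
```

Dividing the result by 30 gives the implied minutes of agent speech per day.
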