TTS & STT

STT

Local

Whisper can probably run in near real time if the iGPU can be passed through.
whisper.cpp small.en on an Intel HD 630: WER ~9-11%, RTF ~0.05 (real-time factor = processing time / audio duration, so lower is faster)
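
To make the RTF figure concrete, here is a minimal sketch of how real-time factor relates processing time to audio length. The function names are made up for illustration, and the 10 s utterance is an assumed example, not a measurement.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(audio_seconds: float, rtf: float) -> float:
    """Expected transcription time for a clip at a given RTF."""
    return audio_seconds * rtf


# Assumed example: a 10 s utterance at the RTF of ~0.05 noted above.
estimate = processing_time(10.0, 0.05)
print(f"{estimate:.2f} s of compute for 10 s of audio")
```

Anything with RTF below 1.0 keeps up with live audio; at ~0.05 the transcription step itself should add little delay on top of waiting for the speaker to finish.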

Cloud

Provider	Latency	WER	Cost	Free Tier	Notes
Deepgram Nova-3	~200ms	~5-8%	$0.0077/min	$200 credit	WebSocket streaming, interim results
AssemblyAI	~300ms	~8.1%	$0.0037/min	$50 credit	Cheapest streaming, good accuracy
OpenAI Whisper API	1-3s	~7-9%	$0.006/min	None	Batch only; poor fit for conversational use
ElevenLabs Scribe	<150ms	Good	~$0.012/min	Limited	Fastest, but pricier
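
For reference, the WER column is word error rate: (substitutions + deletions + insertions) / number of words in the reference transcript. Below is a rough sketch of the standard word-level edit-distance calculation. The lowercasing and whitespace splitting are simplifying assumptions; published benchmarks usually apply heavier text normalization, so numbers from this sketch will not match vendor figures exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one dropped word and one wrong word out of six.
wer = word_error_rate(
    "turn on the kitchen lights please",
    "turn on kitchen light please",
)
print(f"WER = {wer:.1%}")
```

Running the same recorded test phrases through each provider and scoring them this way would give a like-for-like comparison on my own voice and microphone.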

TTS

Local

Kokoro looks really good and can run on CPU, but not on my current hardware.

Cloud

The best AI TTS providers suitable for agents.

Provider	Expressiveness	Latency (TTFA)	Cost	Voice Cloning	Why
ElevenLabs	Best-in-class naturalness	~75ms inference / ~135ms P90	$5/mo (30K chars) / $22/mo (100K chars)	Instant on all tiers	Gold standard
Cartesia	Realistic (breathing, laughter)	40ms Turbo / 90ms Sonic 3	~$0.03/min (usage-based)	Instant + Pro cloning	Fastest TTFA
Hume AI	Detects emotion from context	~100-200ms	$7.60/1M chars	Via EVI	Best value
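
The Cost column mixes units (monthly plans, per-minute, per-million-characters), which makes it hard to compare at a glance. As a rough sketch, the character-based prices can be put on one scale of USD per 1M characters. This assumes the ElevenLabs monthly quota is used in full and ignores overage pricing; Cartesia is left out because converting per-minute to per-character needs a speaking-rate assumption that is not in these notes.

```python
def usd_per_million_chars(price_usd: float, chars: int) -> float:
    """Effective price per 1M characters for a plan with a character quota."""
    return price_usd / chars * 1_000_000


# Figures copied from the table above.
effective_rates = {
    "ElevenLabs $5/mo (30K chars)": usd_per_million_chars(5, 30_000),
    "ElevenLabs $22/mo (100K chars)": usd_per_million_chars(22, 100_000),
    "Hume AI (listed per 1M chars)": 7.60,
}

for plan, rate in sorted(effective_rates.items(), key=lambda kv: kv[1]):
    print(f"{plan}: ${rate:.2f} per 1M chars")
```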

Cost

Assuming average usage of 30 min/day for STT agents:

Provider	Est. Monthly
Hume AI	$0-2
ElevenLabs Starter	$5
Cartesia	~$5-10
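
The same usage assumption can be applied to the per-minute STT rates from the cloud STT table. This is a back-of-the-envelope sketch: it reads the usage figure as 30 minutes per day, assumes a 30-day month, and ignores free credits, so it is an upper bound for the first months.

```python
MINUTES_PER_DAY = 30  # usage assumption from the notes above
DAYS_PER_MONTH = 30   # assumption: a 30-day month

# Per-minute rates copied from the cloud STT table.
STT_RATE_PER_MIN = {
    "Deepgram Nova-3": 0.0077,
    "AssemblyAI": 0.0037,
    "OpenAI Whisper API": 0.006,
    "ElevenLabs Scribe": 0.012,
}


def monthly_cost(rate_per_min: float,
                 minutes_per_day: int = MINUTES_PER_DAY,
                 days: int = DAYS_PER_MONTH) -> float:
    """Monthly cost in USD for a flat per-minute rate."""
    return rate_per_min * minutes_per_day * days


monthly = {name: monthly_cost(rate) for name, rate in STT_RATE_PER_MIN.items()}
for name, cost in sorted(monthly.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${cost:.2f}/mo")
```

At 900 minutes a month this works out to roughly $3.33 for AssemblyAI, $5.40 for the OpenAI Whisper API, $6.93 for Deepgram Nova-3, and $10.80 for ElevenLabs Scribe.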