TTS & STT

STT

Local

  • Whisper probably runs in near real time if the iGPU can be passed through
  • whisper.cpp small.en + HD 630 = WER ~9-11%, RTF ~0.05
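
For reference, real-time factor (RTF) is processing time divided by audio duration, so an RTF of ~0.05 means a clip transcribes in about 5% of its own length. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of whisper.cpp.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(audio_seconds: float, rtf: float) -> float:
    """Expected transcription time for a clip at a given RTF."""
    return audio_seconds * rtf


# At the ~0.05 RTF noted above, a 60 s clip needs roughly 3 s of compute.
print(f"{processing_time(60.0, 0.05):.1f} s")
```

Anything well below an RTF of 1.0 leaves headroom for streaming, which is why ~0.05 counts as comfortably near real time.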

Cloud

| Provider | Latency | WER | Cost | Free Tier | Notes |
|---|---|---|---|---|---|
| Deepgram Nova-3 | ~200ms | ~5-8% | $0.0077/min | $200 credit | WebSocket streaming, interim results |
| AssemblyAI | ~300ms | ~8.1% | $0.0037/min | $50 credit | Cheapest streaming, good accuracy |
| OpenAI Whisper API | 1-3s | ~7-9% | $0.006/min | None | Batch only; bad for conversational |
| ElevenLabs Scribe | <150ms | Good | ~$0.012/min | Limited | Fastest, but pricier |
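
To compare the per-minute rates above on a monthly basis, here is a rough sketch. The 30 min/day workload is an assumed parameter, free credits are ignored, and the function name is illustrative.

```python
# Per-minute rates copied from the table above (USD).
STT_RATES_PER_MIN = {
    "Deepgram Nova-3": 0.0077,
    "AssemblyAI": 0.0037,
    "OpenAI Whisper API": 0.006,
    "ElevenLabs Scribe": 0.012,
}


def monthly_cost(rate_per_min: float, minutes_per_day: float, days: int = 30) -> float:
    """Monthly spend for a flat per-minute rate, ignoring free credits."""
    return rate_per_min * minutes_per_day * days


# minutes_per_day=30 is an assumed workload, not a measured one.
for provider, rate in sorted(STT_RATES_PER_MIN.items(), key=lambda kv: kv[1]):
    print(f"{provider}: ${monthly_cost(rate, 30):.2f}/month")
```

At that workload every provider stays under about $11/month, so latency and streaming support likely matter more than price.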

TTS

Local

  • Kokoro looks really good and can run on CPU, though not on my current hardware.

Cloud

The best AI options suitable for agents:

| Provider | Expressiveness | Latency (TTFA) | Cost | Voice Cloning | Why |
|---|---|---|---|---|---|
| ElevenLabs | Best-in-class naturalness | ~75ms inference / ~135ms P90 | $5/mo (30K chars) / $22/mo (100K) | Instant on all tiers | Gold standard |
| Cartesia | Realistic (breathing, laughter) | 40ms Turbo / 90ms Sonic 3 | ~$0.03/min (usage-based) | Instant + Pro cloning | Fastest TTFA |
| Hume AI | Detects emotion from context | ~100-200ms | $7.60/1M chars | Via EVI | Best value |

Cost

Assuming an average usage of 30 min/day for TTS agents:

| Provider | Est. Monthly |
|---|---|
| Hume AI | $0-2 |
| ElevenLabs Starter | $5 |
| Cartesia | ~$5-10 |
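
Estimates like these depend heavily on how much of each session is actually synthesized speech. The sketch below shows one way to turn the listed rates into a monthly figure. The speaking fraction (25%) and speech rate (900 characters per minute) are my assumptions, not values from these notes, and the function names are illustrative.

```python
# Rates copied from the TTS table above (USD); Cartesia's is approximate.
HUME_USD_PER_MILLION_CHARS = 7.60
CARTESIA_USD_PER_MIN = 0.03


def spoken_minutes(session_min_per_day: float, speaking_fraction: float, days: int = 30) -> float:
    """Minutes of synthesized audio per month."""
    return session_min_per_day * speaking_fraction * days


def per_char_cost(minutes: float, chars_per_min: float, usd_per_million_chars: float) -> float:
    """Cost when billing is by characters sent to the TTS engine."""
    return minutes * chars_per_min * usd_per_million_chars / 1_000_000


def per_minute_cost(minutes: float, usd_per_min: float) -> float:
    """Cost when billing is by minutes of generated audio."""
    return minutes * usd_per_min


# ASSUMPTIONS (not from the notes): the agent speaks 25% of each 30 min
# session, and speech runs at about 900 characters per minute.
minutes = spoken_minutes(30, speaking_fraction=0.25)
print(f"Hume AI:  ${per_char_cost(minutes, 900, HUME_USD_PER_MILLION_CHARS):.2f}/month")
print(f"Cartesia: ${per_minute_cost(minutes, CARTESIA_USD_PER_MIN):.2f}/month")
```

With these assumed values both figures land inside the ranges listed above, though the original estimates may have been derived differently.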