The complete market map and weighted assessment of 48 text-to-speech and speech-to-text providers for AI agents: benchmark data, real pricing, latency measurements, and the 10 best ranked by what actually matters.
Voice AI funding surged 8x to $2.1 billion in 2025. ElevenLabs hit $330 million ARR and an $11 billion valuation - TechCrunch. Deepgram raised a $130 million Series C at $1.3 billion - Deepgram. Production voice agent deployments grew 340% year-over-year, and Gartner predicts conversational AI will reduce contact center labor costs by $80 billion in 2026 - Gartner. The broader voice AI market hit $22.5 billion, and the voice agent segment is accelerating at 34.8% CAGR toward $47.5 billion by 2034 - Market.us.
This is not a future trend. Voice agents are in production now, at scale. The question for builders is not whether to add voice to their agents. It is which TTS and STT provider delivers the best quality at the right price for their specific use case.
We mapped 48 providers across the TTS and STT landscape: 22 cloud TTS APIs, 12 cloud STT APIs, 5 voice agent platforms, and 10 open-source models. We scored them on 7 weighted criteria derived from first principles (what does an AI agent actually need from voice?), verified every price against official pricing pages, and cross-referenced quality claims against independent benchmarks from CodeSOTA, Artificial Analysis Speech Arena, and Inworld's 2026 evaluation. As we explored in our guide to the top 10 agent capabilities, TTS and STT are among the most important external capabilities for production agents.
Contents
- The Master Ranking: Top 10 TTS and STT APIs, Weighted and Ranked
- Assessment Criteria and Weight Rationale
- The Top 10: Detailed Profiles
- The Full Provider Directory (48 Services)
- TTS Pricing Comparison
- STT Pricing Comparison
- Benchmark Data: Quality, Latency, and Accuracy
- Open Source: The Quality Gap Has Closed
- Voice Agent Platforms: The Full-Stack Option
- How to Choose: Decision Framework
1. The Master Ranking: Top 10 TTS and STT APIs, Weighted and Ranked
| # | Provider | What It Does | Type | Price | Quality (25%) | Cost (20%) | Latency (20%) | Agent Ready (15%) | Languages (10%) | Cloning (5%) | Scale (5%) | Final /10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ElevenLabs | MOS 4.8 TTS + Scribe STT, voice cloning, full agent platform, $11B valuation | TTS+STT | $60-120/1M chars | 10 | 4 | 7 | 10 | 9 | 10 | 9 | 7.9 |
| 2 | Deepgram | Nova-3 STT (2.2% WER) + Aura TTS + Voice Agent API, $200 free credit | TTS+STT | $0.46/hr STT, $30/1M TTS | 8 | 8 | 8 | 9 | 6 | 2 | 8 | 7.7 |
| 3 | Cartesia | 40ms TTFA (industry fastest), Sonic-3 TTS + Ink-Whisper STT, voice agent Line | TTS+STT | ~$47/1M chars | 9 | 5 | 10 | 8 | 7 | 9 | 7 | 7.7 |
| 4 | AssemblyAI | Universal-2 STT (2.4% WER), 99 languages, richest feature set (sentiment, NER, medical) | STT only | $0.15-0.45/hr | 9 | 8 | 7 | 8 | 9 | 0 | 8 | 7.2 |
| 5 | OpenAI | TTS-1 (MOS 4.7) + Whisper STT, 13 voices, cheapest STT at $0.18/hr mini | TTS+STT | $15/1M TTS, $0.36/hr STT | 9 | 7 | 6 | 7 | 9 | 0 | 9 | 7.0 |
| 6 | Google Cloud | TTS + STT + Gemini TTS, 125+ STT languages, most generous free tier (4M chars/mo) | TTS+STT | $16/1M TTS, $0.96/hr STT | 8 | 6 | 5 | 7 | 10 | 3 | 10 | 6.8 |
| 7 | Gladia | Solaria-1 STT (2.5% WER), 103ms latency, 100+ languages, all features bundled | STT only | $0.20-0.61/hr | 8 | 7 | 8 | 6 | 9 | 0 | 7 | 6.6 |
| 8 | Hume AI | First emotion-aware TTS (Octave 2), $7.60/1M chars, speech-to-speech (EVI) | TTS | $7.60/1M chars | 8 | 8 | 7 | 7 | 3 | 5 | 6 | 6.6 |
| 9 | Azure Speech | TTS + STT, 140+ TTS languages (widest), custom neural voice training, enterprise | TTS+STT | $16/1M TTS, $1/hr STT | 7 | 6 | 5 | 7 | 10 | 8 | 10 | 6.6 |
| 10 | PlayHT | PlayHT 3.0 TTS (MOS 4.6), instant voice cloning, API-first, unlimited on $99/mo | TTS only | $39-99/mo flat | 8 | 7 | 6 | 6 | 5 | 9 | 6 | 6.5 |
How to read this table: Each cell is a raw score (0-10). The Final Score is the weighted average: (Quality x 0.25) + (Cost x 0.20) + (Latency x 0.20) + (Agent Ready x 0.15) + (Languages x 0.10) + (Cloning x 0.05) + (Scale x 0.05). The "What It Does" column captures each provider's identity and standout metric. Ordered by final score, best first.
Criteria definitions and weight rationale:
- Quality (25%): MOS scores for TTS, WER for STT, based on CodeSOTA and Artificial Analysis benchmarks. Highest weight because bad voice quality kills user experience instantly.
- Cost (20%): Normalized to $/1M chars for TTS, $/hr for STT. 10 = cheapest tier, 1 = most expensive. High weight because voice agents run at volume (thousands of minutes/month).
- Latency (20%): Time to first audio (TTS) or streaming factor (STT). 10 = sub-100ms, 7 = 100-300ms, 5 = 300-500ms, 3 = 500ms-1s, 1 = 1s+. Weighted equal to cost because conversational agents need sub-300ms responses.
- Agent Readiness (15%): Voice agent platform, WebSocket streaming, SDKs, telephony integration. 10 = full voice agent stack, 5 = REST API + streaming, 1 = REST only.
- Languages (10%): Number of supported languages. 10 = 100+, 7 = 40-99, 5 = 15-39, 3 = 5-14, 1 = English only.
- Voice Cloning (5%): Instant cloning, custom voice training. 10 = instant + professional, 5 = basic cloning, 0 = not available. Lower weight because not every agent needs custom voices.
- Scale/Reliability (5%): Enterprise SLA, concurrency limits, uptime guarantees. 10 = SOC2/HIPAA + 99.99%, 5 = standard cloud, 1 = no SLA.
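As a concrete check, the weighting formula can be run in a few lines of Python. The weights and Deepgram's raw scores come straight from the ranking table; rounding the result to one decimal gives the reported 7.7.

```python
# The weighted-average formula from the ranking table, applied to
# Deepgram's row (all numbers are taken directly from the table).
WEIGHTS = {
    "quality": 0.25, "cost": 0.20, "latency": 0.20,
    "agent_ready": 0.15, "languages": 0.10,
    "cloning": 0.05, "scale": 0.05,
}

def final_score(raw: dict) -> float:
    """Weighted average of raw 0-10 criterion scores."""
    return sum(raw[k] * w for k, w in WEIGHTS.items())

deepgram = {"quality": 8, "cost": 8, "latency": 8,
            "agent_ready": 9, "languages": 6, "cloning": 2, "scale": 8}
print(f"{final_score(deepgram):.2f}")  # 7.65, reported as 7.7 in the table
```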
The ranking shows a clear top tier: ElevenLabs (7.9) leads on quality and agent platform completeness. Deepgram (7.7) and Cartesia (7.7) tie for second through different strengths: Deepgram on price-performance and full-stack voice agents, Cartesia on raw latency (40ms, lowest in the industry). AssemblyAI (7.2) is the STT specialist with the richest feature set.
For a broader view of how voice fits into the agent capability stack, see our guide to the top 10 agent capabilities. For cost analysis across all agent infrastructure, see our AI agent cost analysis.
2. Assessment Criteria and Weight Rationale
The weights above are derived from first principles: what does an AI agent fundamentally need from voice capabilities?
An AI agent that speaks to humans needs three things above all else. First, the voice must sound natural enough that users do not disengage. A robotic voice triggers an immediate trust deficit that no amount of intelligence can overcome. This is why quality gets 25%: it is the gatekeeper for everything else. The MOS (Mean Opinion Score) benchmarks from CodeSOTA show the top providers now score 4.5-4.8 out of 5.0, within the range that is "generally indistinguishable from human speech in blind tests" - CodeSOTA. Below 4.0, users notice. Below 3.5, they leave.
Second, the agent must respond quickly. In a phone call, silence longer than 500 milliseconds feels like the agent is broken. In a customer support interaction, every additional second of latency increases abandonment. This is why latency gets 20%, equal to cost. The range across providers is enormous: from 40ms (Cartesia Sonic Turbo) to over 800ms (hyperscaler defaults). For conversational agents, anything above 300ms creates perceptible delay.
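To make the budget concrete, a voice turn's perceived delay is roughly the sum of STT finalization, LLM time-to-first-token, and TTS time-to-first-audio. A sketch, where the 150ms endpointing and 200ms LLM figures are illustrative assumptions and only the TTS numbers come from the measurements above:

```python
# Illustrative end-to-end turn budget for a voice agent pipeline.
# STT endpointing (150ms) and LLM time-to-first-token (200ms) are
# assumptions; the TTS TTFA figures are from this article (40ms for
# Cartesia Sonic Turbo, ~800ms for hyperscaler defaults).
def turn_latency_ms(stt_final_ms, llm_ttft_ms, tts_ttfa_ms):
    """Time from end of user speech to first agent audio."""
    return stt_final_ms + llm_ttft_ms + tts_ttfa_ms

fast = turn_latency_ms(150, 200, 40)   # Cartesia-class TTS -> 390ms total
slow = turn_latency_ms(150, 200, 800)  # hyperscaler default -> 1150ms total
print(fast, slow)
```

The point of the sketch: even a 40ms TTS leaves only ~350ms for everything else before the turn crosses the 500ms "broken" threshold.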
Third, the agent must be affordable at scale. A voice agent handling 10,000 minutes per month (typical for a mid-sized customer support deployment) will cost between $100 and $1,500/month in TTS alone, depending on the provider. STT adds another $15 to $1,000/month for the same volume. This is why cost gets 20%: at production volume, the difference between providers is material.
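That range can be reproduced with a simple cost model. The ~1,000 spoken characters per minute figure is an assumption (about 150 wpm at 6-7 characters per word); the prices are from this guide's comparison tables:

```python
# Monthly cost model for the 10,000-minute/month deployment described
# above. chars_per_min (~1,000 spoken characters per minute) is an
# assumption; per-unit prices are from this article.
def tts_monthly_usd(minutes, usd_per_million_chars, chars_per_min=1_000):
    return minutes * chars_per_min * usd_per_million_chars / 1_000_000

def stt_monthly_usd(minutes, usd_per_hour):
    return minutes / 60 * usd_per_hour

print(tts_monthly_usd(10_000, 15.0))   # OpenAI TTS-1: $150/mo
print(tts_monthly_usd(10_000, 120.0))  # ElevenLabs v3: $1,200/mo
print(stt_monthly_usd(10_000, 0.46))   # Deepgram Nova-3: ~$77/mo
```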
The remaining criteria (agent readiness at 15%, languages at 10%, cloning and scale at 5% each) matter for specific use cases but are not universally load-bearing. A US-only English agent does not need 100+ languages. A content generation agent does not need voice cloning. The weights reflect the general case.
3. The Top 10: Detailed Profiles
3.1 ElevenLabs (Score: 7.9/10)
ElevenLabs is the market leader in TTS quality and the only provider that combines best-in-class TTS with competitive STT in a single platform. The company reached $330 million in ARR, raised a $500 million Series D at an $11 billion valuation in February 2026, and serves over 1 million developers - ElevenLabs.
The TTS quality is objectively the best available. ElevenLabs Turbo v2.5 scores MOS 4.8 on CodeSOTA benchmarks, the highest of any provider - CodeSOTA. The v3 model ranks ELO 1,179 on the Artificial Analysis Speech Arena. Flash models (lower quality but faster) score MOS 4.6 at roughly half the latency.
Pricing spans two tiers: Flash/Turbo at $60/1M characters and Multilingual v2/v3 at $120/1M characters. Subscription plans range from free (10,000 credits) to $990/month (Business). The STT product, Scribe, costs $0.22/hour for batch and $0.39/hour for real-time streaming - ElevenLabs Pricing.
The Conversational AI 2.0 platform is what pushes ElevenLabs to the top for agent builders. It provides a complete voice agent stack: STT, LLM orchestration, TTS, with WebSocket streaming, Twilio/Genesys/Vonage telephony integrations, and SOC 2/HIPAA/GDPR compliance. If you are building a voice agent and want one vendor for the full pipeline, ElevenLabs is the most complete option - ElevenLabs.
The main weakness is cost. At $60-120/1M characters, ElevenLabs is 4-8x more expensive than OpenAI TTS ($15/1M) and 8-16x more than Hume ($7.60/1M). For high-volume, cost-sensitive deployments, the quality premium may not justify the price difference, especially as competitors narrow the quality gap.
3.2 Deepgram (Score: 7.7/10)
Deepgram is the most complete price-performance play in voice AI. Its Nova-3 STT model achieves 2.2% WER on LibriSpeech clean audio (lowest of any commercial API) with a 54.3% WER reduction over competitors in streaming scenarios - Deepgram. The company raised a $130 million Series C at a $1.3 billion valuation in January 2026 - TechCrunch.
STT pricing is aggressive: Nova-3 at $0.46/hour pay-as-you-go, dropping to $0.39/hour on Growth plans. The Aura-2 TTS costs $30/1M characters with 90-200ms time-to-first-byte - Deepgram Pricing. The $200 free credit (enough for 46,000+ minutes of transcription) is the most generous trial in the STT market.
The Voice Agent API is Deepgram's strongest differentiator for agent builders. At $0.05-0.075/minute, it provides a bundled STT + TTS pipeline optimized for real-time conversation. Combined with Python, JavaScript, Go, and .NET SDKs plus Twilio integration, it is the fastest path to a production voice agent for developers who want to own the stack without using a full platform like Vapi or Retell.
The weakness is TTS quality. Aura-2 is competent but does not match ElevenLabs, Cartesia, or OpenAI on naturalness. For agents where voice quality is the primary differentiator (luxury brand, premium support), Deepgram's TTS may not be sufficient.
3.3 Cartesia (Score: 7.7/10)
Cartesia owns the latency crown. Sonic Turbo achieves 40ms time-to-first-audio, which is the fastest commercial TTS available, and Sonic-3 runs at approximately 90ms in production streaming - Inworld Benchmarks. The quality is strong at MOS 4.7 (Sonic 2), placing it in the top tier alongside ElevenLabs and OpenAI.
Pricing uses a credit system: Sonic-3 TTS costs 15 credits per second of audio. Plans range from free (20,000 credits) to Scale ($239/month for 8 million credits). The effective cost works out to approximately $47/1M characters - Cartesia Pricing. Ink-Whisper STT costs 1 credit per second, roughly $0.13/hour at Scale tier.
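For budgeting, credit pricing is easiest to reason about as dollars per hour of generated audio. A sketch using the Scale-tier numbers above, assuming the full credit allotment is consumed (per-character equivalents additionally depend on speaking rate, which is why the ~$47/1M figure is approximate):

```python
# Convert credit pricing into dollars per hour of generated audio, using
# the Scale-tier numbers quoted above (15 credits/sec of TTS audio, $239
# for 8M credits). Assumes the full monthly allotment is consumed.
def usd_per_audio_hour(credits_per_second, plan_usd, plan_credits):
    return credits_per_second * 3600 * plan_usd / plan_credits

sonic3 = usd_per_audio_hour(15, 239, 8_000_000)
print(f"${sonic3:.2f}/hr of TTS audio")  # ~$1.61/hr at full utilization
```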
The Line voice agent platform positions Cartesia as a full-stack voice AI company. Instant voice cloning from 3 seconds of audio, 40+ language support, and WebSocket streaming make it purpose-built for the real-time conversational use case. If your agent's primary requirement is the fastest possible voice response (phone banking, emergency services, gaming), Cartesia is the clear choice.
The limitation is ecosystem maturity. Cartesia is newer than ElevenLabs or Deepgram, with fewer framework integrations and a smaller developer community. The credit-based pricing can also be harder to predict than simple per-character or per-hour models.
3.4 AssemblyAI (Score: 7.2/10)
AssemblyAI is the STT specialist with the richest feature set in the market. Universal-2 achieves 2.4% WER on clean audio across 99 languages. The Universal-3 Pro model is the best commercial streaming STT, with the lowest WER in real-time scenarios - AssemblyAI Benchmarks.
What sets AssemblyAI apart is the audio intelligence layer built on top of transcription. Diarization, sentiment analysis, entity detection, summarization, topic detection, PII redaction, content moderation, and a medical mode are all available as add-ons. For agents that need to understand conversations (not just transcribe them), this feature depth is unmatched.
Pricing is competitive: Universal-2 at $0.15/hour (batch), U3 Pro at $0.21/hour (batch) and $0.45/hour (streaming). Add-ons range from $0.02/hour (diarization) to $0.15/hour (Medical Mode). The $50 free credit covers initial evaluation - AssemblyAI Pricing.
For agents that need pure STT with maximum intelligence, from meeting transcription bots to compliance monitoring systems, AssemblyAI's combination of accuracy, features, and price is the strongest in the market.
3.5 OpenAI (Score: 7.0/10)
OpenAI provides both TTS and STT with the simplest integration path for teams already using the OpenAI API. TTS-1 scores MOS 4.7 and TTS-HD reaches even higher quality. Whisper STT at $0.36/hour ($0.006/minute) and GPT-4o Mini Transcribe at $0.18/hour make OpenAI the cheapest major STT provider for batch processing - OpenAI Pricing.
TTS pricing is competitive at $15/1M characters for TTS-1 (standard) and $30/1M for TTS-HD, with 13 built-in voices and no voice cloning. The gpt-4o-mini-tts model combines TTS with LLM intelligence, letting the model control speech style and emphasis dynamically.
The Realtime API enables full voice agent workflows with integrated STT + LLM + TTS in a single WebSocket connection. For teams building on GPT-4o, this is the zero-friction path to voice agents. The main limitation is the lack of voice cloning and the relatively small voice library (13 voices vs hundreds from ElevenLabs or Google).
3.6-3.10: Google Cloud, Gladia, Hume AI, Azure Speech, PlayHT
The remaining five in the top 10 each serve a specific niche. Google Cloud (6.8) offers the most generous free tier (4M characters/month TTS) and broadest language coverage (125+ STT languages). Gladia (6.6) bundles all features (diarization, NER, sentiment) into the base price with no add-on charges, at 103ms streaming latency. Hume AI (6.6) is the first emotion-aware TTS at a remarkably low $7.60/1M characters. Azure Speech (6.6) has the widest TTS language support (140+) with enterprise compliance. PlayHT (6.5) offers unlimited TTS generation on the $99/month Premium plan with instant voice cloning.
4. The Full Provider Directory (48 Services)
Beyond the top 10, we mapped 38 additional providers across four categories. For brevity, we list them with their key differentiator and pricing anchor.
Cloud TTS APIs (12 more): Amazon Polly ($4.80-30/1M chars, cheapest standard voices), LMNT (150ms latency, no rate limits), Resemble AI (emotion control focus, $0.006/sec), WellSaid Labs ($0.0025/min enterprise), Murf AI ($10/1M chars Falcon model), Speechify ($10/1M chars PAYG), Inworld (#1 ELO 1,236, $25-50/1M chars), Rime AI (sub-200ms, on-premise option), Smallest AI ($0.01/min, cheapest real-time), Neets AI ($1/1M chars, cheapest cloud TTS), Unreal Speech ($49/mo for 1M chars), MiniMax (ELO 1,156, $60-100/1M chars).
Cloud STT APIs (4 more): Soniox ($0.10/hr, cheapest commercial STT), Amazon Transcribe ($1.44/hr, deep AWS integration), Rev AI ($0.003/min API tier, hybrid AI+human option), Speechmatics ($0.21-0.45/hr, best diarization).
Voice Agent Platforms (5): Vapi ($0.07-0.33/min total, provider-agnostic orchestration), Bland AI ($0.09/min connected calls, phone specialist), Retell AI ($0.07-0.08/min, no platform fees), LiveKit ($0.01/min, open-source WebRTC), Vocode (free open-source core, modular).
Open Source (10): Whisper/whisper.cpp/faster-whisper (2.5% WER, runs on CPU), Kokoro 82M ($0.70/1M hosted, MOS 4.5, Apache 2.0), Sesame CSM (MOS 4.7, conversational focus), Orpheus 3B (MOS 4.6, emotion tags), Fish Speech S2 (MOS 4.4, 15K+ style tags), Qwen3-TTS (97ms latency, 10 languages), Dia 1.6B (multi-speaker dialogue), Bark (music + speech, MIT), Piper (30ms edge, Raspberry Pi), XTTS v2 (voice cloning from 6s, community-maintained).
5. TTS Pricing Comparison
The 120x price gap between Neets AI ($1/1M chars) and ElevenLabs v3 ($120/1M chars) is the most extreme in any AI API category. The question is whether the quality difference justifies it. On MOS benchmarks, the gap is 4.8 vs approximately 4.0, a meaningful but not transformative difference for most agent use cases.
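Applied to a concrete volume, the spread is stark. The 5M characters/month workload here is an illustrative assumption; the prices come from the directory above:

```python
# The 120x price spread at a concrete volume. The 5M chars/month figure
# is an illustrative assumption; per-million prices are from this article.
volume_chars = 5_000_000
tiers = {"Neets AI": 1.0, "Kokoro (hosted)": 0.70,
         "OpenAI TTS-1": 15.0, "ElevenLabs v3": 120.0}
costs = {name: volume_chars / 1e6 * usd for name, usd in tiers.items()}
print(costs)  # $5/mo on Neets vs $600/mo on ElevenLabs for the same output
```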
6. STT Pricing Comparison
The hyperscalers (Google, Azure, Amazon) are consistently the most expensive for STT, often 3-10x more than specialized providers. This is because their pricing was set for the pre-agent era when transcription was a low-volume enterprise feature. The agent-native providers (Soniox, AssemblyAI, Deepgram) have priced for volume from day one.
7. Benchmark Data: Quality, Latency, and Accuracy
The independent benchmark landscape for voice AI has matured significantly in 2026. Three sources provide the most reliable quality data.
CodeSOTA maintains a continuously updated leaderboard of TTS MOS scores and STT WER measurements across standardized test sets. Their key finding: the top six TTS models (ElevenLabs, Sesame CSM, OpenAI HD, Gemini, Cartesia, ElevenLabs Flash) all score within a 0.2 MOS range (4.6-4.8). At this level, quality differences are subtle and often preference-dependent rather than objectively measurable - CodeSOTA.
Artificial Analysis Speech Arena uses ELO ratings from blind human comparisons. Inworld TTS-1.5-Max leads at ELO 1,236, followed by ElevenLabs v3 (1,179), MiniMax HD (1,156), and OpenAI TTS-1 (1,106). The ELO gaps are more meaningful than MOS gaps because they reflect head-to-head human preference rather than absolute quality scores - Inworld Benchmarks.
For STT, the critical insight from AssemblyAI's research is that "WER is fundamentally broken" as a metric - AssemblyAI. A 2% WER on clean LibriSpeech audio translates to 8-15% WER on real-world conversational audio with background noise, accents, and crosstalk. When evaluating STT providers, test on YOUR data, not on benchmarks.
8. Open Source: The Quality Gap Has Closed
The most significant development in voice AI in 2025-2026 is the collapse of the quality gap between open-source and commercial TTS models. In 2023, the best open-source TTS scored approximately 1.0 MOS below the best commercial API. By April 2026, that gap has narrowed to 0.1 MOS: Sesame CSM (open source) at 4.7 vs ElevenLabs at 4.8 - CodeSOTA.
Kokoro 82M deserves particular attention. With only 82 million parameters (small enough to run on any hardware), it achieves MOS 4.5, ranks #1 on HuggingFace TTS Spaces Arena, and costs roughly $0.70/1M characters when hosted, 85x cheaper than ElevenLabs. The Apache 2.0 license means full commercial use with no restrictions - HuggingFace.
For STT, Whisper Large v3 Turbo (809M params) achieves 2.5% WER with 8x speed improvement over the original, and faster-whisper (CTranslate2 port) adds another 4x speed boost. Self-hosted Whisper on a modern GPU processes audio at 50-100x real-time, meaning an hour of audio transcribes in under a minute.
The practical implication: if you have GPU infrastructure and the engineering capacity to self-host, open-source models now deliver 90%+ of commercial quality at 1-5% of the cost. The commercial APIs' value proposition has shifted from quality superiority to operational convenience (no infrastructure to manage, SLAs, compliance certifications).
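The 1-5% claim is easy to sanity-check. The $1.00/hour GPU rate and 50x real-time factor below are assumptions (a rented inference GPU running faster-whisper); the $0.46/hour comparison point is Deepgram Nova-3:

```python
# Back-of-envelope self-hosting economics for the claim above. The GPU
# rate ($1.00/hr rented) and 50x real-time factor are assumptions; the
# $0.46/hr API comparison point is Deepgram Nova-3 from this article.
def self_host_usd_per_audio_hour(gpu_usd_per_hour, realtime_factor):
    return gpu_usd_per_hour / realtime_factor

self_host = self_host_usd_per_audio_hour(1.00, 50)  # $0.02 per audio-hour
api = 0.46
print(f"self-hosting at {self_host / api:.1%} of API cost")  # ~4.3%
```

Under these assumptions self-hosted STT lands at roughly 4% of API cost, squarely inside the 1-5% range, before counting the engineering time to run it.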
9. Voice Agent Platforms: The Full-Stack Option
For teams that want voice agents without assembling the STT + LLM + TTS stack themselves, five platforms offer turnkey solutions. The economics are different from buying components individually.
Vapi ($0.07-0.33/minute total) is the most flexible: bring your own STT, LLM, and TTS providers, and Vapi orchestrates the pipeline. Retell AI ($0.07-0.08/minute base, ~$0.13-0.31/minute with LLM and telephony) is the simplest, with no monthly platform fee and all features included. Bland AI ($0.09/minute connected calls) specializes in phone call automation. LiveKit ($0.01/minute agent sessions) provides open-source WebRTC infrastructure for self-hosted voice agents. Vocode is fully open-source with the most modular architecture.
The build-vs-buy calculus: assembling Deepgram STT + Claude + Cartesia TTS yourself costs roughly $0.05-0.10/minute in API fees but requires significant engineering effort. A platform like Retell adds $0.07-0.08/minute in orchestration fees but eliminates months of infrastructure work. For most teams, the platform premium pays for itself in time-to-market.
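A sketch of that calculus, using figures from this guide. The ~1,000 chars/min speech rate and the $0.02/minute LLM cost are assumptions:

```python
# Per-minute API cost of a self-assembled pipeline, using figures from
# this guide. The ~1,000 chars/min speech rate and the $0.02/min LLM
# cost are assumptions for the sketch.
def diy_usd_per_min(stt_usd_per_hr, tts_usd_per_million_chars,
                    llm_usd_per_min, chars_per_min=1_000):
    stt = stt_usd_per_hr / 60
    tts = chars_per_min * tts_usd_per_million_chars / 1_000_000
    return stt + tts + llm_usd_per_min

diy = diy_usd_per_min(0.46, 47, 0.02)  # Deepgram STT + Cartesia TTS + LLM
retell = diy + 0.075                   # plus Retell's mid-range fee
print(f"DIY ${diy:.3f}/min vs platform ${retell:.3f}/min")
```

Under these assumptions the component bill lands at roughly $0.075/minute, inside the $0.05-0.10 range quoted above, and the platform roughly doubles the per-minute cost in exchange for the saved engineering time.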
As we covered in our guide to building a Claude chatbot, the voice layer adds significant complexity to the conversation engine. For teams building unified APIs that abstract this complexity, platforms like Suprsonic provide TTS and STT as part of a broader capability set, alongside search, scraping, enrichment, and 15+ other agent capabilities through a single API key.
10. How to Choose: Decision Framework
If quality is everything (premium brand, luxury, high-stakes): ElevenLabs. MOS 4.8, full agent platform, voice cloning. Accept the premium pricing.
If latency is critical (real-time phone calls, gaming, emergency): Cartesia. 40ms TTFA is 5-10x faster than most competitors. Sonic-3 at 90ms for production.
If cost matters most (high volume, background processing): Deepgram for STT ($0.46/hr), OpenAI TTS-1 ($15/1M chars) or Hume ($7.60/1M) for TTS. Or self-host Kokoro ($0.70/1M) and Whisper (free) if you have GPU infra.
If you need the full pipeline (STT + LLM + TTS, one vendor): ElevenLabs Conversational AI 2.0, Deepgram Voice Agent API, or a platform like Retell/Vapi.
If STT features matter (diarization, sentiment, medical): AssemblyAI. Richest feature set, best streaming accuracy, medical mode.
If language coverage is primary (global deployment): Azure (140+ TTS languages), Google (125+ STT languages), AssemblyAI (99 STT languages).
If you want open source (full control, privacy, edge): TTS: Kokoro (best quality/size), Sesame CSM (best MOS), Orpheus (best emotion). STT: faster-whisper (best speed), Whisper Large v3 (best accuracy).
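For teams that want this framework in code, it reduces to a lookup. The picks mirror the text; treat the result as a starting shortlist, not a verdict:

```python
# The decision framework above as a lookup table. Picks mirror the text;
# the result is a starting shortlist, not a final selection.
SHORTLISTS = {
    "quality": ["ElevenLabs"],
    "latency": ["Cartesia"],
    "cost": ["Deepgram", "OpenAI TTS-1", "Hume", "Kokoro", "Whisper"],
    "full_pipeline": ["ElevenLabs Conversational AI 2.0",
                      "Deepgram Voice Agent API", "Retell", "Vapi"],
    "stt_features": ["AssemblyAI"],
    "languages": ["Azure", "Google", "AssemblyAI"],
    "open_source": ["Kokoro", "Sesame CSM", "Orpheus",
                    "faster-whisper", "Whisper Large v3"],
}

def shortlist(priority: str) -> list:
    return SHORTLISTS.get(priority, [])

print(shortlist("latency"))  # ['Cartesia']
```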
Yuma Heymans (@yumahey), who builds agent infrastructure at O-mega and has deployed voice agents across customer support and sales workflows, notes that the most common mistake teams make is optimizing for TTS quality when their bottleneck is actually STT accuracy or LLM latency. In a voice agent pipeline, the slowest and least accurate component determines the user experience, regardless of how good the other components are.
This guide reflects TTS and STT pricing and capabilities as of April 2026. Voice AI is evolving rapidly. Verify current details on official pricing pages before purchasing.