The complete market map and weighted assessment of 48 text-to-speech and speech-to-text providers for AI agents: benchmark data, real pricing, latency measurements, and the 10 best ranked by what actually matters.
Voice AI funding surged 8x to $2.1 billion in 2025. ElevenLabs hit $330 million ARR and an $11 billion valuation - TechCrunch. Deepgram raised a $130 million Series C at $1.3 billion - Deepgram. Production voice agent deployments grew 340% year-over-year, and Gartner predicts conversational AI will reduce contact center labor costs by $80 billion in 2026 - Gartner. The broader voice AI market hit $22.5 billion, and the voice agent segment is accelerating at 34.8% CAGR toward $47.5 billion by 2034 - Market.us.
This is not a future trend. Voice agents are in production now, at scale. The question for builders is not whether to add voice to their agents. It is which TTS and STT provider delivers the best quality at the right price for their specific use case.
We mapped 48 providers across the TTS and STT landscape: 22 cloud TTS APIs, 12 cloud STT APIs, 5 voice agent platforms, and 10 open-source models. We scored them on 7 weighted criteria derived from first principles (what does an AI agent actually need from voice?), verified every price against official pricing pages, and cross-referenced quality claims against independent benchmarks from CodeSOTA, Artificial Analysis Speech Arena, and Inworld's 2026 evaluation. As we explored in our guide to the top 10 agent capabilities, TTS and STT are among the most important external capabilities for production agents.
Contents
- The Master Ranking: Top 10 TTS and STT APIs, Weighted and Ranked
- Assessment Criteria and Weight Rationale
- The Top 10: Detailed Profiles
- The Full Provider Directory (48 Services)
- TTS Pricing Comparison
- STT Pricing Comparison
- Benchmark Data: Quality, Latency, and Accuracy
- Open Source: The Quality Gap Has Closed
- Voice Agent Platforms: The Full-Stack Option
- How to Choose: Decision Framework
1. The Master Ranking: Top 10 TTS and STT APIs, Weighted and Ranked
| # | Provider | What It Does | Type | Price | Quality (25%) | Cost (20%) | Latency (20%) | Agent Ready (15%) | Languages (10%) | Cloning (5%) | Scale (5%) | Final /10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ElevenLabs | MOS 4.8 TTS + Scribe STT, voice cloning, full agent platform, $11B valuation | TTS+STT | $60-120/1M chars | 10 | 4 | 7 | 10 | 9 | 10 | 9 | 7.9 |
| 2 | Deepgram | Nova-3 STT (2.2% WER) + Aura TTS + Voice Agent API, $200 free credit | TTS+STT | $0.46/hr STT, $30/1M TTS | 8 | 8 | 8 | 9 | 6 | 2 | 8 | 7.7 |
| 3 | Cartesia | 40ms TTFA (industry fastest), Sonic-3 TTS + Ink-Whisper STT, voice agent Line | TTS+STT | ~$47/1M chars | 9 | 5 | 10 | 8 | 7 | 9 | 7 | 7.7 |
| 4 | AssemblyAI | Universal-2 STT (2.4% WER), 99 languages, richest feature set (sentiment, NER, medical) | STT only | $0.15-0.45/hr | 9 | 8 | 7 | 8 | 9 | 0 | 8 | 7.2 |
| 5 | OpenAI | TTS-1 (MOS 4.7) + Whisper STT, 13 voices, cheapest STT at $0.18/hr mini | TTS+STT | $15/1M TTS, $0.36/hr STT | 9 | 7 | 6 | 7 | 9 | 0 | 9 | 7.0 |
| 6 | Google Cloud | TTS + STT + Gemini TTS, 125+ STT languages, most generous free tier (4M chars/mo) | TTS+STT | $16/1M TTS, $0.96/hr STT | 8 | 6 | 5 | 7 | 10 | 3 | 10 | 6.8 |
| 7 | Gladia | Solaria-1 STT (2.5% WER), 103ms latency, 100+ languages, all features bundled | STT only | $0.20-0.61/hr | 8 | 7 | 8 | 6 | 9 | 0 | 7 | 6.6 |
| 8 | Hume AI | First emotion-aware TTS (Octave 2), $7.60/1M chars, speech-to-speech (EVI) | TTS | $7.60/1M chars | 8 | 8 | 7 | 7 | 3 | 5 | 6 | 6.6 |
| 9 | Azure Speech | TTS + STT, 140+ TTS languages (widest), custom neural voice training, enterprise | TTS+STT | $16/1M TTS, $1/hr STT | 7 | 6 | 5 | 7 | 10 | 8 | 10 | 6.6 |
| 10 | PlayHT | PlayHT 3.0 TTS (MOS 4.6), instant voice cloning, API-first, unlimited on $99/mo | TTS only | $39-99/mo flat | 8 | 7 | 6 | 6 | 5 | 9 | 6 | 6.5 |
How to read this table: Each cell is a raw score (0-10). The Final Score is the weighted average: (Quality x 0.25) + (Cost x 0.20) + (Latency x 0.20) + (Agent Ready x 0.15) + (Languages x 0.10) + (Cloning x 0.05) + (Scale x 0.05). The "What It Does" column captures each provider's identity and standout metric. Ordered by final score, best first.
Criteria definitions and weight rationale:
- Quality (25%): MOS scores for TTS, WER for STT, based on CodeSOTA and Artificial Analysis benchmarks. Highest weight because bad voice quality kills user experience instantly.
- Cost (20%): Normalized to $/1M chars for TTS, $/hr for STT. 10 = cheapest tier, 1 = most expensive. High weight because voice agents run at volume (thousands of minutes/month).
- Latency (20%): Time to first audio (TTS) or streaming factor (STT). 10 = sub-100ms, 7 = 100-300ms, 5 = 300-500ms, 3 = 500ms-1s, 1 = 1s+. Weighted equal to cost because conversational agents need sub-300ms responses.
- Agent Readiness (15%): Voice agent platform, WebSocket streaming, SDKs, telephony integration. 10 = full voice agent stack, 5 = REST API + streaming, 1 = REST only.
- Languages (10%): Number of supported languages. 10 = 100+, 7 = 40-99, 5 = 15-39, 3 = 5-14, 1 = English only.
- Voice Cloning (5%): Instant cloning, custom voice training. 10 = instant + professional, 5 = basic cloning, 0 = not available. Lower weight because not every agent needs custom voices.
- Scale/Reliability (5%): Enterprise SLA, concurrency limits, uptime guarantees. 10 = SOC2/HIPAA + 99.99%, 5 = standard cloud, 1 = no SLA.
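As a concrete check, the weighting formula can be run in a few lines of Python. The weights and Deepgram's raw scores come straight from the ranking table; rounding the result to one decimal gives the reported 7.7.

```python
# The weighted-average formula from the ranking table, applied to
# Deepgram's row (all numbers are taken directly from the table).
WEIGHTS = {
    "quality": 0.25, "cost": 0.20, "latency": 0.20,
    "agent_ready": 0.15, "languages": 0.10,
    "cloning": 0.05, "scale": 0.05,
}

def final_score(raw: dict) -> float:
    """Weighted average of raw 0-10 criterion scores."""
    return sum(raw[k] * w for k, w in WEIGHTS.items())

deepgram = {"quality": 8, "cost": 8, "latency": 8,
            "agent_ready": 9, "languages": 6, "cloning": 2, "scale": 8}
print(f"{final_score(deepgram):.2f}")  # 7.65, reported as 7.7 in the table
```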
The ranking shows a clear top tier: ElevenLabs (7.9) leads on quality and agent platform completeness. Deepgram (7.7) and Cartesia (7.7) tie for second through different strengths: Deepgram on price-performance and full-stack voice agents, Cartesia on raw latency (40ms, lowest in the industry). AssemblyAI (7.2) is the STT specialist with the richest feature set.
For a broader view of how voice fits into the agent capability stack, see our guide to the top 10 agent capabilities. For cost analysis across all agent infrastructure, see our AI agent cost analysis.
2. Assessment Criteria and Weight Rationale
The weights above are derived from first principles: what does an AI agent fundamentally need from voice capabilities?
An AI agent that speaks to humans needs three things above all else. First, the voice must sound natural enough that users do not disengage. A robotic voice triggers an immediate trust deficit that no amount of intelligence can overcome. This is why quality gets 25%: it is the gatekeeper for everything else. The MOS (Mean Opinion Score) benchmarks from CodeSOTA show the top providers now score 4.5-4.8 out of 5.0, within the range that is "generally indistinguishable from human speech in blind tests" - CodeSOTA. Below 4.0, users notice. Below 3.5, they leave.
Second, the agent must respond quickly. In a phone call, silence longer than 500 milliseconds feels like the agent is broken. In a customer support interaction, every additional second of latency increases abandonment. This is why latency gets 20%, equal to cost. The range across providers is enormous: from 40ms (Cartesia Sonic Turbo) to over 800ms (hyperscaler defaults). For conversational agents, anything above 300ms creates perceptible delay.
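To make the budget concrete, a voice turn's perceived delay is roughly the sum of STT finalization, LLM time-to-first-token, and TTS time-to-first-audio. A sketch, where the 150ms endpointing and 200ms LLM figures are illustrative assumptions and only the TTS numbers come from the measurements above:

```python
# Illustrative end-to-end turn budget for a voice agent pipeline.
# STT endpointing (150ms) and LLM time-to-first-token (200ms) are
# assumptions; the TTS TTFA figures are from this article (40ms for
# Cartesia Sonic Turbo, ~800ms for hyperscaler defaults).
def turn_latency_ms(stt_final_ms, llm_ttft_ms, tts_ttfa_ms):
    """Time from end of user speech to first agent audio."""
    return stt_final_ms + llm_ttft_ms + tts_ttfa_ms

fast = turn_latency_ms(150, 200, 40)   # Cartesia-class TTS -> 390ms total
slow = turn_latency_ms(150, 200, 800)  # hyperscaler default -> 1150ms total
print(fast, slow)
```

The point of the sketch: even a 40ms TTS leaves only ~350ms for everything else before the turn crosses the 500ms "broken" threshold.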
Third, the agent must be affordable at scale. A voice agent handling 10,000 minutes per month (typical for a mid-sized customer support deployment) will cost between $100 and $1,500/month in TTS alone, depending on the provider. STT adds another $15 to $1,000/month for the same volume. This is why cost gets 20%: at production volume, the difference between providers is material.
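That range can be reproduced with a simple cost model. The ~1,000 spoken characters per minute figure is an assumption (about 150 wpm at 6-7 characters per word); the prices are from this guide's comparison tables:

```python
# Monthly cost model for the 10,000-minute/month deployment described
# above. chars_per_min (~1,000 spoken characters per minute) is an
# assumption; per-unit prices are from this article.
def tts_monthly_usd(minutes, usd_per_million_chars, chars_per_min=1_000):
    return minutes * chars_per_min * usd_per_million_chars / 1_000_000

def stt_monthly_usd(minutes, usd_per_hour):
    return minutes / 60 * usd_per_hour

print(tts_monthly_usd(10_000, 15.0))   # OpenAI TTS-1: $150/mo
print(tts_monthly_usd(10_000, 120.0))  # ElevenLabs v3: $1,200/mo
print(stt_monthly_usd(10_000, 0.46))   # Deepgram Nova-3: ~$77/mo
```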
The remaining criteria (agent readiness at 15%, languages at 10%, cloning and scale at 5% each) matter for specific use cases but are not universally load-bearing. A US-only English agent does not need 100+ languages. A content generation agent does not need voice cloning. The weights reflect the general case.
3. The Top 10: Detailed Profiles
3.1 ElevenLabs (Score: 7.9/10)
ElevenLabs is the market leader in TTS quality and the only provider that combines best-in-class TTS with competitive STT in a single platform. The company reached $330 million in ARR, raised a $500 million Series D at an $11 billion valuation in February 2026, and serves over 1 million developers - ElevenLabs.
The TTS quality is objectively the best available. ElevenLabs Turbo v2.5 scores MOS 4.8 on CodeSOTA benchmarks, the highest of any provider - CodeSOTA. The v3 model ranks ELO 1,179 on the Artificial Analysis Speech Arena. Flash models (lower quality but faster) score MOS 4.6 at roughly half the latency.
Pricing spans two tiers: Flash/Turbo at $60/1M characters and Multilingual v2/v3 at $120/1M characters. Subscription plans range from free (10,000 credits) to $990/month (Business). The STT product, Scribe, costs $0.22/hour for batch and $0.39/hour for real-time streaming - ElevenLabs Pricing.
The Conversational AI 2.0 platform is what pushes ElevenLabs to the top for agent builders. It provides a complete voice agent stack: STT, LLM orchestration, TTS, with WebSocket streaming, Twilio/Genesys/Vonage telephony integrations, and SOC 2/HIPAA/GDPR compliance. If you are building a voice agent and want one vendor for the full pipeline, ElevenLabs is the most complete option - ElevenLabs.
The main weakness is cost. At $60-120/1M characters, ElevenLabs is 4-8x more expensive than OpenAI TTS ($15/1M) and 8-16x more than Hume ($7.60/1M). For high-volume, cost-sensitive deployments, the quality premium may not justify the price difference, especially as competitors narrow the quality gap.
3.2 Deepgram (Score: 7.7/10)
Deepgram is the most complete price-performance play in voice AI. Its Nova-3 STT model achieves 2.2% WER on LibriSpeech clean audio (lowest of any commercial API) with a 54.3% WER reduction over competitors in streaming scenarios - Deepgram. The company raised a $130 million Series C at a $1.3 billion valuation in January 2026 - TechCrunch.
STT pricing is aggressive: Nova-3 at $0.46/hour pay-as-you-go, dropping to $0.39/hour on Growth plans. The Aura-2 TTS costs $30/1M characters with 90-200ms time-to-first-byte - Deepgram Pricing. The $200 free credit (enough for 46,000+ minutes of transcription) is the most generous trial in the STT market.
The Voice Agent API is Deepgram's strongest differentiator for agent builders. At $0.05-0.075/minute, it provides a bundled STT + TTS pipeline optimized for real-time conversation. Combined with Python, JavaScript, Go, and .NET SDKs plus Twilio integration, it is the fastest path to a production voice agent for developers who want to own the stack without using a full platform like Vapi or Retell.
The weakness is TTS quality. Aura-2 is competent but does not match ElevenLabs, Cartesia, or OpenAI on naturalness. For agents where voice quality is the primary differentiator (luxury brand, premium support), Deepgram's TTS may not be sufficient.
3.3 Cartesia (Score: 7.7/10)
Cartesia owns the latency crown. Sonic Turbo achieves 40ms time-to-first-audio, which is the fastest commercial TTS available, and Sonic-3 runs at approximately 90ms in production streaming - Inworld Benchmarks. The quality is strong at MOS 4.7 (Sonic 2), placing it in the top tier alongside ElevenLabs and OpenAI.
Pricing uses a credit system: Sonic-3 TTS costs 15 credits per second of audio. Plans range from free (20,000 credits) to Scale ($239/month for 8 million credits). The effective cost works out to approximately $47/1M characters - Cartesia Pricing. Ink-Whisper STT costs 1 credit per second, roughly $0.13/hour at Scale tier.
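For budgeting, credit pricing is easiest to reason about as dollars per hour of generated audio. A sketch using the Scale-tier numbers above, assuming the full credit allotment is consumed (per-character equivalents additionally depend on speaking rate, which is why the ~$47/1M figure is approximate):

```python
# Convert credit pricing into dollars per hour of generated audio, using
# the Scale-tier numbers quoted above (15 credits/sec of TTS audio, $239
# for 8M credits). Assumes the full monthly allotment is consumed.
def usd_per_audio_hour(credits_per_second, plan_usd, plan_credits):
    return credits_per_second * 3600 * plan_usd / plan_credits

sonic3 = usd_per_audio_hour(15, 239, 8_000_000)
print(f"${sonic3:.2f}/hr of TTS audio")  # ~$1.61/hr at full utilization
```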
The Line voice agent platform positions Cartesia as a full-stack voice AI company. Instant voice cloning from 3 seconds of audio, 40+ language support, and WebSocket streaming make it purpose-built for the real-time conversational use case. If your agent's primary requirement is the fastest possible voice response (phone banking, emergency services, gaming), Cartesia is the clear choice.
The limitation is ecosystem maturity. Cartesia is newer than ElevenLabs or Deepgram, with fewer framework integrations and a smaller developer community. The credit-based pricing can also be harder to predict than simple per-character or per-hour models.
3.4 AssemblyAI (Score: 7.2/10)
AssemblyAI is the STT specialist with the richest feature set in the market. Universal-2 achieves 2.4% WER on clean audio across 99 languages. The Universal-3 Pro model is the best commercial streaming STT, with the lowest WER in real-time scenarios - AssemblyAI Benchmarks.
What sets AssemblyAI apart is the audio intelligence layer built on top of transcription. Diarization, sentiment analysis, entity detection, summarization, topic detection, PII redaction, content moderation, and a medical mode are all available as add-ons. For agents that need to understand conversations (not just transcribe them), this feature depth is unmatched.
Pricing is competitive: Universal-2 at $0.15/hour (batch), U3 Pro at $0.21/hour (batch) and $0.45/hour (streaming). Add-ons range from $0.02/hour (diarization) to $0.15/hour (Medical Mode). The $50 free credit covers initial evaluation - AssemblyAI Pricing.
For agents that need pure STT with maximum intelligence, from meeting transcription bots to compliance monitoring systems, AssemblyAI's combination of accuracy, features, and price is the strongest in the market.
3.5 OpenAI (Score: 7.0/10)
OpenAI provides both TTS and STT with the simplest integration path for teams already using the OpenAI API. TTS-1 scores MOS 4.7 and TTS-HD reaches even higher quality. Whisper STT at $0.36/hour ($0.006/minute) and GPT-4o Mini Transcribe at $0.18/hour make OpenAI the cheapest major STT provider for batch processing - OpenAI Pricing.
TTS pricing is competitive at $15/1M characters for TTS-1 (standard) and $30/1M for TTS-HD, with 13 built-in voices and no voice cloning. The gpt-4o-mini-tts model combines TTS with LLM intelligence, letting the model control speech style and emphasis dynamically.
The Realtime API enables full voice agent workflows with integrated STT + LLM + TTS in a single WebSocket connection. For teams building on GPT-4o, this is the zero-friction path to voice agents. The main limitation is the lack of voice cloning and the relatively small voice library (13 voices vs hundreds from ElevenLabs or Google).
3.6-3.10: Google Cloud, Gladia, Hume AI, Azure Speech, PlayHT
The remaining five in the top 10 each serve a specific niche. Google Cloud (6.8) offers the most generous free tier (4M characters/month TTS) and broadest language coverage (125+ STT languages). Gladia (6.6) bundles all features (diarization, NER, sentiment) into the base price with no add-on charges, at 103ms streaming latency. Hume AI (6.6) is the first emotion-aware TTS at a remarkably low $7.60/1M characters. Azure Speech (6.6) has the widest TTS language support (140+) with enterprise compliance. PlayHT (6.5) offers unlimited TTS generation on the $99/month Premium plan with instant voice cloning.
4. The Full Provider Directory (48 Services)
Beyond the top 10, we mapped 38 additional providers across four categories. For brevity, we list them with their key differentiator and pricing anchor.
Cloud TTS APIs (12 more): Amazon Polly ($4.80-30/1M chars, cheapest standard voices), LMNT (150ms latency, no rate limits), Resemble AI (emotion control focus, $0.006/sec), WellSaid Labs ($0.0025/min enterprise), Murf AI ($10/1M chars Falcon model), Speechify ($10/1M chars PAYG), Inworld (#1 ELO 1,236, $25-50/1M chars), Rime AI (sub-200ms, on-premise option), Smallest AI ($0.01/min, cheapest real-time), Neets AI ($1/1M chars, cheapest cloud TTS), Unreal Speech ($49/mo for 1M chars), MiniMax (ELO 1,156, $60-100/1M chars).
Cloud STT APIs (4 more): Soniox ($0.10/hr, cheapest commercial STT), Amazon Transcribe ($1.44/hr, deep AWS integration), Rev AI ($0.003/min API tier, hybrid AI+human option), Speechmatics ($0.21-0.45/hr, best diarization).
Voice Agent Platforms (5): Vapi ($0.07-0.33/min total, provider-agnostic orchestration), Bland AI ($0.09/min connected calls, phone specialist), Retell AI ($0.07-0.08/min, no platform fees), LiveKit ($0.01/min, open-source WebRTC), Vocode (free open-source core, modular).
Open Source (10): Whisper/whisper.cpp/faster-whisper (2.5% WER, runs on CPU), Kokoro 82M ($0.70/1M hosted, MOS 4.5, Apache 2.0), Sesame CSM (MOS 4.7, conversational focus), Orpheus 3B (MOS 4.6, emotion tags), Fish Speech S2 (MOS 4.4, 15K+ style tags), Qwen3-TTS (97ms latency, 10 languages), Dia 1.6B (multi-speaker dialogue), Bark (music + speech, MIT), Piper (30ms edge, Raspberry Pi), XTTS v2 (voice cloning from 6s, community-maintained).
5. TTS Pricing Comparison
The 120x price gap between Neets AI ($1/1M chars) and ElevenLabs v3 ($120/1M chars) is the most extreme in any AI API category. The question is whether the quality difference justifies it. On MOS benchmarks, the gap is 4.8 vs approximately 4.0, a meaningful but not transformative difference for most agent use cases.
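Applied to a concrete volume, the spread is stark. The 5M characters/month workload here is an illustrative assumption; the prices come from the directory above:

```python
# The 120x price spread at a concrete volume. The 5M chars/month figure
# is an illustrative assumption; per-million prices are from this article.
volume_chars = 5_000_000
tiers = {"Neets AI": 1.0, "Kokoro (hosted)": 0.70,
         "OpenAI TTS-1": 15.0, "ElevenLabs v3": 120.0}
costs = {name: volume_chars / 1e6 * usd for name, usd in tiers.items()}
print(costs)  # $5/mo on Neets vs $600/mo on ElevenLabs for the same output
```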
6. STT Pricing Comparison
The hyperscalers (Google, Azure, Amazon) are consistently the most expensive for STT, often 3-10x more than specialized providers. This is because their pricing was set for the pre-agent era when transcription was a low-volume enterprise feature. The agent-native providers (Soniox, AssemblyAI, Deepgram) have priced for volume from day one.
7. Benchmark Data: Quality, Latency, and Accuracy
The independent benchmark landscape for voice AI has matured significantly in 2026. Three sources provide the most reliable quality data.
CodeSOTA maintains a continuously updated leaderboard of TTS MOS scores and STT WER measurements across standardized test sets. Their key finding: the top six TTS models (ElevenLabs, Sesame CSM, OpenAI HD, Gemini, Cartesia, ElevenLabs Flash) all score within a 0.2 MOS range (4.6-4.8). At this level, quality differences are subtle and often preference-dependent rather than objectively measurable - CodeSOTA.
Artificial Analysis Speech Arena uses ELO ratings from blind human comparisons. Inworld TTS-1.5-Max leads at ELO 1,236, followed by ElevenLabs v3 (1,179), MiniMax HD (1,156), and OpenAI TTS-1 (1,106). The ELO gaps are more meaningful than MOS gaps because they reflect head-to-head human preference rather than absolute quality scores - Inworld Benchmarks.
For STT, the critical insight from AssemblyAI's research is that "WER is fundamentally broken" as a metric - AssemblyAI. A 2% WER on clean LibriSpeech audio translates to 8-15% WER on real-world conversational audio with background noise, accents, and crosstalk. When evaluating STT providers, test on YOUR data, not on benchmarks.
8. Open Source: The Quality Gap Has Closed
The most significant development in voice AI in 2025-2026 is the collapse of the quality gap between open-source and commercial TTS models. In 2023, the best open-source TTS scored approximately 1.0 MOS below the best commercial API. By April 2026, that gap has narrowed to 0.1 MOS: Sesame CSM (open source) at 4.7 vs ElevenLabs at 4.8 - CodeSOTA.
Kokoro 82M deserves particular attention. With only 82 million parameters (small enough to run on any hardware), it achieves MOS 4.5, ranks #1 on HuggingFace TTS Spaces Arena, and costs roughly $0.70/1M characters when hosted, 85x cheaper than ElevenLabs. The Apache 2.0 license means full commercial use with no restrictions - HuggingFace.
For STT, Whisper Large v3 Turbo (809M params) achieves 2.5% WER with 8x speed improvement over the original, and faster-whisper (CTranslate2 port) adds another 4x speed boost. Self-hosted Whisper on a modern GPU processes audio at 50-100x real-time, meaning an hour of audio transcribes in under a minute.
The practical implication: if you have GPU infrastructure and the engineering capacity to self-host, open-source models now deliver 90%+ of commercial quality at 1-5% of the cost. The commercial APIs' value proposition has shifted from quality superiority to operational convenience (no infrastructure to manage, SLAs, compliance certifications).
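The 1-5% claim is easy to sanity-check. The $1.00/hour GPU rate and 50x real-time factor below are assumptions (a rented inference GPU running faster-whisper); the $0.46/hour comparison point is Deepgram Nova-3:

```python
# Back-of-envelope self-hosting economics for the claim above. The GPU
# rate ($1.00/hr rented) and 50x real-time factor are assumptions; the
# $0.46/hr API comparison point is Deepgram Nova-3 from this article.
def self_host_usd_per_audio_hour(gpu_usd_per_hour, realtime_factor):
    return gpu_usd_per_hour / realtime_factor

self_host = self_host_usd_per_audio_hour(1.00, 50)  # $0.02 per audio-hour
api = 0.46
print(f"self-hosting at {self_host / api:.1%} of API cost")  # ~4.3%
```

Under these assumptions self-hosted STT lands at roughly 4% of API cost, squarely inside the 1-5% range, before counting the engineering time to run it.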
9. Voice Agent Platforms: The Full-Stack Option
For teams that want voice agents without assembling the STT + LLM + TTS stack themselves, five platforms offer turnkey solutions. The economics are different from buying components individually.
Vapi ($0.07-0.33/minute total) is the most flexible: bring your own STT, LLM, and TTS providers, and Vapi orchestrates the pipeline. Retell AI ($0.07-0.08/minute base, ~$0.13-0.31/minute with LLM and telephony) is the simplest, with no monthly platform fee and all features included. Bland AI ($0.09/minute connected calls) specializes in phone call automation. LiveKit ($0.01/minute agent sessions) provides open-source WebRTC infrastructure for self-hosted voice agents. Vocode is fully open-source with the most modular architecture.
The build-vs-buy calculus: assembling Deepgram STT + Claude + Cartesia TTS yourself costs roughly $0.05-0.10/minute in API fees but requires significant engineering effort. A platform like Retell adds $0.07-0.08/minute in orchestration fees but eliminates months of infrastructure work. For most teams, the platform premium pays for itself in time-to-market.
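A sketch of that calculus, using figures from this guide. The ~1,000 chars/min speech rate and the $0.02/minute LLM cost are assumptions:

```python
# Per-minute API cost of a self-assembled pipeline, using figures from
# this guide. The ~1,000 chars/min speech rate and the $0.02/min LLM
# cost are assumptions for the sketch.
def diy_usd_per_min(stt_usd_per_hr, tts_usd_per_million_chars,
                    llm_usd_per_min, chars_per_min=1_000):
    stt = stt_usd_per_hr / 60
    tts = chars_per_min * tts_usd_per_million_chars / 1_000_000
    return stt + tts + llm_usd_per_min

diy = diy_usd_per_min(0.46, 47, 0.02)  # Deepgram STT + Cartesia TTS + LLM
retell = diy + 0.075                   # plus Retell's mid-range fee
print(f"DIY ${diy:.3f}/min vs platform ${retell:.3f}/min")
```

Under these assumptions the component bill lands at roughly $0.075/minute, inside the $0.05-0.10 range quoted above, and the platform roughly doubles the per-minute cost in exchange for the saved engineering time.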
As we covered in our guide to building a Claude chatbot, the voice layer adds significant complexity to the conversation engine. For teams building unified APIs that abstract this complexity, platforms like Suprsonic provide TTS and STT as part of a broader capability set, alongside search, scraping, enrichment, and 15+ other agent capabilities through a single API key.
10. How to Choose: Decision Framework
If quality is everything (premium brand, luxury, high-stakes): ElevenLabs. MOS 4.8, full agent platform, voice cloning. Accept the premium pricing.
If latency is critical (real-time phone calls, gaming, emergency): Cartesia. 40ms TTFA is 5-10x faster than most competitors. Sonic-3 at 90ms for production.
If cost matters most (high volume, background processing): Deepgram for STT ($0.46/hr), OpenAI TTS-1 ($15/1M chars) or Hume ($7.60/1M) for TTS. Or self-host Kokoro ($0.70/1M) and Whisper (free) if you have GPU infra.
If you need the full pipeline (STT + LLM + TTS, one vendor): ElevenLabs Conversational AI 2.0, Deepgram Voice Agent API, or a platform like Retell/Vapi.
If STT features matter (diarization, sentiment, medical): AssemblyAI. Richest feature set, best streaming accuracy, medical mode.
If language coverage is primary (global deployment): Azure (140+ TTS languages), Google (125+ STT languages), AssemblyAI (99 STT languages).
If you want open source (full control, privacy, edge): TTS: Kokoro (best quality/size), Sesame CSM (best MOS), Orpheus (best emotion). STT: faster-whisper (best speed), Whisper Large v3 (best accuracy).
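For teams that want this framework in code, it reduces to a lookup. The picks mirror the text; treat the result as a starting shortlist, not a verdict:

```python
# The decision framework above as a lookup table. Picks mirror the text;
# the result is a starting shortlist, not a final selection.
SHORTLISTS = {
    "quality": ["ElevenLabs"],
    "latency": ["Cartesia"],
    "cost": ["Deepgram", "OpenAI TTS-1", "Hume", "Kokoro", "Whisper"],
    "full_pipeline": ["ElevenLabs Conversational AI 2.0",
                      "Deepgram Voice Agent API", "Retell", "Vapi"],
    "stt_features": ["AssemblyAI"],
    "languages": ["Azure", "Google", "AssemblyAI"],
    "open_source": ["Kokoro", "Sesame CSM", "Orpheus",
                    "faster-whisper", "Whisper Large v3"],
}

def shortlist(priority: str) -> list:
    return SHORTLISTS.get(priority, [])

print(shortlist("latency"))  # ['Cartesia']
```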
Yuma Heymans (@yumahey), who builds agent infrastructure at O-mega and has deployed voice agents across customer support and sales workflows, notes that the most common mistake teams make is optimizing for TTS quality when their bottleneck is actually STT accuracy or LLM latency. In a voice agent pipeline, the slowest and least accurate component determines the user experience, regardless of how good the other components are.
This guide reflects TTS and STT pricing and capabilities as of April 2026. Voice AI is evolving rapidly. Verify current details on official pricing pages before purchasing.