Cartesia
Builds real-time voice AI models using state space architecture for ultra-low-latency text-to-speech and speech-to-text.
Updated April 2026
Overview
- Website: cartesia.ai
- Segment: Audio & Speech
Product overview
Cartesia develops Sonic text-to-speech (TTS) models (e.g., Sonic-3, Turbo) for ultra-low-latency speech synthesis, with 40-90 ms time-to-first-audio, and Ink speech-to-text (STT) models. Both are built on a state space model (SSM) architecture, which is distinct from transformers and designed for efficiency in real-time applications. The models power conversational AI agents, customer support, content creation, and gaming, and are used by over 10,000 customers, including Quora, Cresta, Rasa, and Forethought. Its SSMs enable on-device operation, better long-context handling, and lower compute costs than competitors such as OpenAI TTS.
Revenue model
Subscription tiers with included credits, plus usage-based billing. The tiers are Free ($0), Pro ($4/mo), Startup ($39/mo), Scale ($239/mo), and Enterprise (custom pricing). Usage is metered in credits: 1 credit per character for TTS (Sonic) and 1 credit per second for STT (Ink). Telephony is billed at $0.014-$0.06 per minute. Yearly plans save 20%.
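As a rough illustration of how the usage rates above add up, the following sketch estimates credit consumption and the telephony cost range for a workload. The function names are hypothetical and are not part of any Cartesia SDK. Two behaviors are assumptions rather than documented billing rules: every character, including spaces and punctuation, is counted, and partial STT seconds are rounded up.

```python
import math
from typing import Tuple

# Rates taken from the pricing summary above.
TTS_CREDITS_PER_CHARACTER = 1
STT_CREDITS_PER_SECOND = 1
TELEPHONY_USD_PER_MINUTE = (0.014, 0.06)  # (low, high) end of the quoted range


def estimate_credits(tts_text: str, stt_seconds: float) -> int:
    """Estimate credits for synthesizing tts_text and transcribing stt_seconds of audio.

    Assumption: all characters count, and partial seconds are rounded up.
    """
    tts_credits = len(tts_text) * TTS_CREDITS_PER_CHARACTER
    stt_credits = math.ceil(stt_seconds) * STT_CREDITS_PER_SECOND
    return tts_credits + stt_credits


def telephony_cost_range(minutes: float) -> Tuple[float, float]:
    """Return the (low, high) telephony cost in USD for the given call minutes."""
    low, high = TELEPHONY_USD_PER_MINUTE
    return (minutes * low, minutes * high)


if __name__ == "__main__":
    credits = estimate_credits("Hello, world!", stt_seconds=30.0)
    low, high = telephony_cost_range(100)
    print(f"Estimated credits: {credits}")
    print(f"Telephony cost for 100 min: ${low:.2f} to ${high:.2f}")
```

The sketch leaves out the credits included in each tier, since those amounts are not listed here; comparing an estimate against a plan would require the allowance from the current pricing page.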
Moat
Cartesia's key competitive moat is its proprietary state space model (SSM) architecture, invented by its founders at the Stanford AI Lab. The company has scaled this architecture into voice models such as Sonic 2.0, which it positions as the fastest and most realistic available, with 90 ms latency, fine-grained controllability for voice cloning and editing, and efficient on-device deployment. It claims these models outperform transformer-based rivals in speed, quality, and real-time multimodal capabilities. This first-mover technical lead, combined with a robust API infrastructure offering 99.9% uptime and enterprise compliance, creates high switching costs for customers that rely on its ultra-low-latency, customizable TTS performance.
Headwinds
State space models may not prove superior to transformers at scale, and the company faces intense competition from well-funded foundation model labs.