SpaceXAI — the combined entity formed after xAI's integration into SpaceX in May 2026 — has launched its voice APIs on the Vercel AI Gateway. The move gives developers a direct path to production-ready voice technology, from real-time speech agents to text-to-speech and transcription, without needing a separate xAI API key. Here's everything worth knowing.

1. Three Distinct Voice Models Are Available
The integration ships three separate models for different use cases. xai/grok-voice-think-fast-1.0 handles real-time, bidirectional speech-to-speech interactions — it does not process transcription or translation directly. For text-to-speech, there's xai/grok-tts, and for speech-to-text transcription, xai/grok-stt. Developers can mix and match depending on whether they're building a live voice agent, a narration tool, or a transcription pipeline.
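For teams wiring this up, the split between the three models can be captured as a small lookup. This is a minimal sketch: the model IDs are the ones named above, while the `VoiceTask` enum and `model_for` helper are hypothetical names chosen for illustration, not part of any SDK.

```python
from enum import Enum


class VoiceTask(Enum):
    """Hypothetical task labels for the three use cases described above."""
    REALTIME_AGENT = "realtime_agent"    # live, bidirectional speech-to-speech
    TEXT_TO_SPEECH = "text_to_speech"    # narration and other generated audio
    SPEECH_TO_TEXT = "speech_to_text"    # transcription pipelines


# Model IDs as listed in the article; the mapping itself is illustrative.
MODEL_FOR_TASK = {
    VoiceTask.REALTIME_AGENT: "xai/grok-voice-think-fast-1.0",
    VoiceTask.TEXT_TO_SPEECH: "xai/grok-tts",
    VoiceTask.SPEECH_TO_TEXT: "xai/grok-stt",
}


def model_for(task: VoiceTask) -> str:
    """Return the gateway model ID for a given voice task."""
    return MODEL_FOR_TASK[task]


print(model_for(VoiceTask.SPEECH_TO_TEXT))
```

Keeping the mapping in one place means an application that combines a live agent with transcription only has one spot to update if a model ID changes.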
2. Real-Time Voice Runs Over WebSockets with Sub-Second Latency
The real-time voice model is designed specifically for low-latency, bidirectional voice agents communicating over WebSockets. According to xAI's documentation, the architecture targets sub-second latency — the kind of responsiveness required for natural conversational agents rather than turn-based voice interfaces. The integration runs on Vercel's AI SDK 7.
3. No Separate xAI API Keys Required
One of the more developer-friendly details: routing through the Vercel AI Gateway means you don't need to manage a separate xAI API key. Authentication is handled at the gateway level, reducing setup friction for teams already working within the Vercel ecosystem.
4. Pricing Is Granular — Here's the Full Breakdown
SpaceXAI has published detailed pricing for each capability. The listed rates are:

| Capability | Price |
|---|---|
| Real-time Voice Agents | $0.05/min ($3.00/hr) |
| Real-time Text Input (voice agents) | $0.004/message |
| Text-to-Speech | $15.00 / 1M characters |
| Speech-to-Text (REST) | $0.10/hr |
| Speech-to-Text (Streaming) | $0.20/hr |
| Tool Invocations (Web/X Search) | $5.00 / 1,000 calls |

At $0.05 per minute, an hour of real-time voice costs $3.00. Tool invocations such as Web Search or X Search are billed separately on top of that rate.
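To see how these rates add up for a given workload, a rough estimator can help. The sketch below uses the rates from the table and assumes simple linear billing with no minimums, free tiers, or per-call rounding, none of which the published pricing confirms. The function and parameter names are hypothetical.

```python
from decimal import Decimal

# Rates in dollars, copied from the pricing table above.
RATES = {
    "voice_per_minute": Decimal("0.05"),
    "text_per_message": Decimal("0.004"),
    "tts_per_million_chars": Decimal("15.00"),
    "stt_rest_per_hour": Decimal("0.10"),
    "stt_streaming_per_hour": Decimal("0.20"),
    "tools_per_thousand_calls": Decimal("5.00"),
}


def _d(value) -> Decimal:
    """Convert ints or floats to Decimal without binary float noise."""
    return Decimal(str(value))


def estimate_cost(
    voice_minutes=0,
    text_messages=0,
    tts_characters=0,
    stt_rest_hours=0,
    stt_streaming_hours=0,
    tool_calls=0,
) -> Decimal:
    """Estimate total cost, assuming usage is billed linearly (an assumption)."""
    total = (
        RATES["voice_per_minute"] * _d(voice_minutes)
        + RATES["text_per_message"] * _d(text_messages)
        + RATES["tts_per_million_chars"] * _d(tts_characters) / Decimal(1_000_000)
        + RATES["stt_rest_per_hour"] * _d(stt_rest_hours)
        + RATES["stt_streaming_per_hour"] * _d(stt_streaming_hours)
        + RATES["tools_per_thousand_calls"] * _d(tool_calls) / Decimal(1_000)
    )
    return total.quantize(Decimal("0.01"))


# 1,000 voice minutes + 500,000 TTS characters + 200 tool calls
print(estimate_cost(voice_minutes=1000, tts_characters=500_000, tool_calls=200))  # 58.50
```

In that example the voice minutes account for $50.00, text-to-speech for $7.50, and tool calls for $1.00, which illustrates how real-time voice time tends to dominate the bill for agent-style workloads.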
5. 80+ Voices, 28 Languages, and Voice Cloning in Under Two Minutes
The TTS API offers 5 expressive voices alongside more than 80 natural voices spanning 28 languages. The STT model supports transcription in 25 languages. Perhaps the most striking capability: voice cloning. According to xAI's documentation, developers can create a custom voice in under two minutes — a feature that significantly lowers the barrier for branded or personalized audio applications.
6. Audio Data Is Never Stored or Used for Training
xAI has confirmed that all audio processed through the voice APIs is handled in real time and is never stored or used to train models. For developers building applications in regulated industries or handling sensitive conversations, this is a meaningful compliance detail.
7. The SpaceXAI Branding Reflects a Structural Shift
The APIs are branded under "SpaceXAI" — not xAI alone. That's because xAI completed its integration into SpaceX in May 2026, merging the two entities under a unified structure. The voice API launch is one of the first major developer-facing products to carry that combined branding, signaling that the SpaceX-xAI merger is now producing tangible product output rather than just organizational changes.
For developers already on Vercel, the path to integrating production-grade voice is now considerably shorter. The bigger question is how quickly third-party applications, potentially including voice-enabled interfaces in Tesla's ecosystem, will adopt these capabilities now that they're available through a mainstream deployment platform.

Sarah focuses on Tesla Energy, SpaceX missions, and the broader Musk AI portfolio. Former data analyst in clean energy. Based in San Francisco.







