Spark - iFlytek's Multimodal AI

iFlytek's multimodal AI for voice, text, and image understanding, with particularly strong speech-to-text and audio processing.

What is Spark?

Spark (讯飞星火) is iFlytek’s advanced large language model with deep roots in speech recognition. It combines text understanding, real-time voice processing, and image analysis in one platform, making it well suited to voice-first applications and conversational AI in Chinese.

Key Features

Multimodal mastery: Text + speech + images in one model
Real-time speech processing: Ultra-low latency audio understanding
Chinese speech recognition: Industry-leading accuracy for Mandarin and regional dialects
Dialogue-focused: Optimized for natural conversations
Cloud + edge: Deploy on cloud or edge devices (smart speakers, cars)
API + SDK: Simple integration for developers

Versions & Plans

Spark Web Chat (Free)

Access: https://xinghuo.xfyun.cn (free with iFlytek account)
Tier: Generous free tier; upgrades available
Models: Spark-Pro, Spark-Pro-128K

Spark API (Paid)

Pricing: ¥0.005-0.02 per 1K input tokens; ¥0.015-0.06 per 1K output tokens
Speech API: ¥0.005 per minute of audio (approximate)
Models:
- Spark-3.0 (balanced)
- Spark-4.0 (advanced reasoning)
- Spark-Voice (audio-optimized)

Strengths

Voice-first: Best-in-class speech-to-text and audio understanding for Mandarin
Low latency: Optimized for real-time conversations and voice apps
Multimodal integration: Handle voice, text, and images seamlessly
Chinese dialects: Supports various regional accents and speech patterns
Edge deployment: Works on IoT devices, cars, smart speakers
Free tier: Generous limits for experimentation
Industry experience: iFlytek has 20+ years in speech AI

Limitations

English voice: Weaker than Chinese for English speech recognition
Text reasoning: Slightly trails Qwen/Claude on pure text analysis
Small global community: Limited English tutorials; docs mainly in Chinese
Niche positioning: Best for voice/audio; less ideal if you only need text
API rate limits: Lower throughput than Baidu/Alibaba on the free tier
Signup barriers: May require a Chinese ID or phone number for full features

Pricing (Typical)

Service	Cost
Text API, input (per 1K tokens)	¥0.005-0.02
Text API, output (per 1K tokens)	¥0.015-0.06
Speech Recognition (per min)	¥0.005-0.01
Image Upload (per call)	¥0.002
Voice Synthesis (per char)	¥0.0001-0.0005

Prices as of Jan 2026.
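
To turn these rates into a rough budget, multiply expected usage by each per-unit price. The sketch below is an illustrative estimator, not an official calculator: the function name and the example workload are made up, and the output-token rate is taken from the Spark API pricing line under Versions & Plans.

```python
# Price ranges (CNY) quoted in this guide; treat them as approximate.
TEXT_INPUT_PER_1K = (0.005, 0.02)    # per 1K input tokens
TEXT_OUTPUT_PER_1K = (0.015, 0.06)   # per 1K output tokens
SPEECH_PER_MIN = (0.005, 0.01)       # per minute of recognized audio


def estimate_cost(input_tokens, output_tokens, audio_minutes):
    """Return a (low, high) cost estimate in CNY for one workload."""
    def scaled(rate, units):
        return rate[0] * units, rate[1] * units

    parts = [
        scaled(TEXT_INPUT_PER_1K, input_tokens / 1000),
        scaled(TEXT_OUTPUT_PER_1K, output_tokens / 1000),
        scaled(SPEECH_PER_MIN, audio_minutes),
    ]
    low = sum(p[0] for p in parts)
    high = sum(p[1] for p in parts)
    return round(low, 4), round(high, 4)


# Hypothetical workload: 1,000 calls of 3 minutes each, with about
# 600 input tokens and 200 output tokens of summarization per call.
low, high = estimate_cost(600_000, 200_000, 3_000)
print(f"Estimated cost: CNY {low:.2f} to {high:.2f}")
```

For this made-up workload the estimate comes to CNY 21.00 to 54.00, with speech recognition accounting for most of the total.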

Core Capabilities

Conversation

Natural dialogue with voice or text
Intent recognition
Multi-turn context awareness
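
With a chat-style API, multi-turn context usually means the client resends earlier turns with each request. The sketch below keeps a bounded history in the same `messages` format used by the Text API example under Getting Started. The class name and the `max_messages` cutoff are assumptions for illustration and are not part of any iFlytek SDK.

```python
class Conversation:
    """Keeps the most recent messages in role/content format."""

    def __init__(self, max_messages=10):
        self.max_messages = max_messages
        self.messages = []

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        # Drop the oldest messages once the history exceeds the cap.
        overflow = len(self.messages) - self.max_messages
        if overflow > 0:
            del self.messages[:overflow]

    def payload(self, model="spark-pro"):
        """Build a request body that carries the retained context."""
        return {"model": model, "messages": list(self.messages)}


chat = Conversation(max_messages=4)
chat.add("user", "What's the weather in Chengdu today?")
chat.add("assistant", "Light rain, around 18 degrees Celsius.")
chat.add("user", "And tomorrow?")  # only makes sense with the earlier turns
print(chat.payload()["messages"])
```

Capping by message count is the simplest policy; a production client would more likely cap by token count to stay inside the model's context window.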

Voice & Audio

Real-time speech-to-text (streaming)
Accent and dialect understanding
Noise filtering and enhancement
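
Streaming recognition generally works by sending audio in short fixed-duration frames rather than as one file. The helper below slices raw 16-bit mono PCM into frames. The 40 ms frame length is an assumption for illustration, so check which frame size the speech API actually expects.

```python
def pcm_frames(pcm, sample_rate=16000, frame_ms=40, sample_width=2):
    """Yield consecutive frames of raw mono PCM bytes.

    The last frame may be shorter than the others.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * sample_width
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


# One second of silence at 16 kHz, 16-bit mono = 32,000 bytes.
silence = bytes(32000)
frames = list(pcm_frames(silence))
print(len(frames), len(frames[0]))
```

At the default settings each frame is 1,280 bytes, so one second of audio yields 25 frames.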

Content Understanding

Image captioning and Q&A
Document analysis
Chart interpretation

Voice Synthesis (Text-to-Speech)

Natural-sounding Mandarin voices
Multiple styles and speakers
Real-time streaming output

Common Workflows

Scenario 1: Smart speaker developer

Goal: Build a voice assistant that understands the Sichuan dialect
Tool: Spark API (voice mode) + edge deployment
Result: Low-latency, accurate voice interaction in a regional accent

Scenario 2: Customer service center

Goal: Auto-transcribe and summarize customer calls
Tool: Spark speech API + text analysis
Result: Real-time transcripts, key point extraction, agent coaching
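
Once a call is transcribed, the summarization step in this scenario is an ordinary text request. The sketch below assembles speaker-labelled transcript segments into a chat payload. The segment format, the prompt wording, and the `spark-pro` model name are assumptions; sending the payload would follow the Text API example under Getting Started.

```python
def build_summary_request(segments, model="spark-pro"):
    """Turn (speaker, text) transcript segments into a chat request body."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in segments)
    prompt = (
        "Summarize this customer call in three bullet points and "
        "list any follow-up actions for the agent.\n\n" + transcript
    )
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


# Hypothetical transcript segments, as a speech-to-text step might produce.
segments = [
    ("Customer", "My router keeps dropping the connection."),
    ("Agent", "I'll send a replacement; it should arrive in two days."),
]
request_body = build_summary_request(segments)
print(request_body["messages"][0]["content"])
```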

Scenario 3: Accessibility app for elderly users

Goal: Voice-only interface for news and weather
Tool: Spark-Voice (multimodal understanding) + TTS
Result: Elderly users can ask questions aloud and get spoken answers

Scenario 4: Car infotainment system

Goal: Hands-free navigation and music control
Tool: Spark edge deployment (on-device processing)
Result: No cloud round-trip latency; works offline; privacy preserved

Comparison

Aspect	Spark	ChatGPT	Qwen
Speech recognition	⭐⭐⭐⭐	Limited	Limited
Voice synthesis	⭐⭐⭐⭐	Limited	Limited
Chinese text	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐
Reasoning	⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐
Edge deployment	✅	❌	⚠️
Cost	💰	💰💰💰	💰

Privacy & Security

API mode: Data sent to iFlytek servers (Beijing); subject to Chinese data laws
Edge mode: Processing stays on-device, so audio and text do not need to leave the device
Compliance: SOC 2 Type II certified; enterprise data protection available
Data retention: Configurable; can be deleted on request

Getting Started

Try It Free

Visit https://xinghuo.xfyun.cn
Sign up with mobile number or WeChat
Start chatting (text or voice)

Use Text API (Python)

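import requests

# Endpoint and model name as given in this guide; confirm both against the
# official docs, since they can differ by account and model version.
url = "https://spark-api.xf-yun.com/v1/chat/completions"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",  # replace with your own key
    "Content-Type": "application/json"
}
payload = {
    "model": "spark-pro",
    "messages": [
        {"role": "user", "content": "你好，讯飞星火！"}  # "Hello, iFlytek Spark!"
    ]
}
response = requests.post(url, json=payload, headers=headers, timeout=30)
response.raise_for_status()  # fail loudly on HTTP errors
print(response.json())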
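
The request above prints the raw JSON. If the endpoint returns an OpenAI-style chat completion body (an assumption here, since this guide does not show the response schema), the reply text can be pulled out defensively as sketched below. The `sample` dictionary is a made-up response used only to exercise the helper.

```python
def extract_reply(body):
    """Return the assistant text from a chat-completion style response.

    Assumes the shape {"choices": [{"message": {"content": ...}}]};
    returns None when that shape is missing instead of raising.
    """
    try:
        return body["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError):
        return None


# Made-up response bodies for illustration only.
sample = {"choices": [{"message": {"role": "assistant", "content": "Hello!"}}]}
print(extract_reply(sample))
print(extract_reply({"error": "rate limited"}))
```

Returning `None` on an unexpected shape lets the caller decide whether to retry, log the body, or surface an error.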

Use Voice API (Real-time Speech-to-Text)

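import json

import pyaudio
import websocket  # pip install websocket-client

RATE, CHUNK = 16000, 1280  # 16 kHz mono, 16-bit; 1280 samples = 80 ms per chunk

# Endpoint and message format as given in this guide; confirm them, and the
# required authentication, against the official docs before use.
ws = websocket.create_connection("wss://spark-api.xf-yun.com/v1/asr", timeout=10)
ws.send(json.dumps({"sample_rate": RATE, "channels": 1}))  # audio parameters

audio = pyaudio.PyAudio()
mic = audio.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
try:
    for _ in range(5 * RATE // CHUNK):   # stream about 5 seconds of audio
        ws.send_binary(mic.read(CHUNK))  # send one raw PCM chunk
    print(ws.recv())                     # read one transcript message
finally:
    mic.stop_stream()
    mic.close()
    audio.terminate()
    ws.close()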

Integrate Voice into Your App

Use official SDKs (Python, Java, iOS, Android)
Docs: https://www.xfyun.cn/document

Resources

Official: https://xinghuo.xfyun.cn
Developer Docs: https://www.xfyun.cn/document
GitHub: https://github.com/iFlytek
Community: iFlytek Developer Forum, WeChat groups

What’s New (Jan 2026)

Spark-4.0 released with enhanced reasoning
Improved Sichuan/Cantonese dialect support
Faster real-time speech processing
New voice synthesis voices and styles

Summary

Spark is the go-to choice if your application needs voice at its core. Whether you’re building smart home devices, customer service automation, or accessibility tools for Chinese speakers, Spark’s combination of speech recognition strength and multimodal AI is hard to match.

Best for: Smart device developers, voice app creators, customer service centers, accessibility platforms, anyone building for the Chinese voice market.

Try it: Start free at https://xinghuo.xfyun.cn; upgrade to the API as you scale.