Gradium today launched two real-time speech translation models: stt-translate and s2s-translate. Both run in five languages, and the results are streamed live to your browser.
Gradium claims a better accuracy/latency tradeoff than gpt-realtime-translate and gemini-3.5-live-translate. It also adds output voice controls, including cloning, which gpt-realtime-translate lacks.
TL;DR
- Gradium introduced two real-time speech translation models: stt-translate (speech → text) and s2s-translate (speech → speech).
- They cover 5 languages (EN, FR, DE, ES, PT) and 20 pairs, folding the usual three-model cascade into two.
- On accuracy, Gradium leads gemini-3.5-live-translate on BLEU and MetricX, and beats gpt-realtime-translate on BLEU (on par on MetricX).
- Latency averages 3.0 seconds, better than gpt-realtime-translate (3.6 seconds) and just behind gemini-3.5-live-translate (2.9 seconds).
- Unlike gpt-realtime-translate, it lets you select the output voice or clone your own over one duplex WebSocket.
stt-translate
stt-translate receives audio in one language and returns text in another. It supports English (EN), French (FR), German (DE), Spanish (ES), and Portuguese (PT).
Any source maps to any target within that set. That gives a total of 20 language pairs, counting each direction separately.
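The count of 20 follows from taking every ordered (source, target) combination of the five languages with source and target distinct, that is 5 × 4. A minimal sketch of that arithmetic; the lowercase language codes are an illustrative choice and not necessarily the identifiers the API expects:

```python
from itertools import permutations

# Languages listed in the article (lowercase codes are an assumption here).
LANGUAGES = ["en", "fr", "de", "es", "pt"]

# Every ordered (source, target) pair where source != target.
PAIRS = list(permutations(LANGUAGES, 2))

print(len(PAIRS))  # 5 * 4 = 20
```

Both ("fr", "es") and ("es", "fr") appear in the list, which is why the total is 20 rather than 10.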
An important design choice is combining two steps into one. Transcription and translation happen in a single pass inside the speech model. There are no intermediate transcripts to wait for, and no handoffs between systems.
According to Gradium, this approach builds on the Hibiki-Zero framework. The model is optimized for low latency and high accuracy at the same time through reinforcement learning. The result is fewer moving parts in the pipeline.
s2s-translate
s2s-translate converts speech in one language to speech in another, end to end. It is built on stt-translate and combines it with the Gradium TTS model in a single service.
You stream audio over a WebSocket and receive both the synthesized output audio and the generated translated transcript.
This eliminates the integration work. You do not have to wire STT and TTS together yourself or manage the connection between the two. The server runs the pipeline and streams the results back.
Input audio is 24 kHz PCM, 16-bit signed mono. Output audio is 48 kHz PCM, 16-bit signed mono. WAV, Opus, mu-law, and A-law are also supported.
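For raw PCM, the size of a streaming chunk follows directly from the sample rate, sample width, and channel count given above. The sketch below works that out; the helper name `chunk_bytes` and the 40 ms chunk duration are illustrative assumptions, not values required by the service:

```python
def chunk_bytes(sample_rate_hz: int, chunk_ms: int,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Bytes needed to hold chunk_ms of raw PCM audio."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * bytes_per_sample * channels


# 16-bit signed mono, as described in the article.
INPUT_CHUNK = chunk_bytes(24_000, 40)   # 960 samples * 2 bytes = 1920
OUTPUT_CHUNK = chunk_bytes(48_000, 40)  # 1920 samples * 2 bytes = 3840

print(INPUT_CHUNK, OUTPUT_CHUNK)
```

The same duration of audio takes twice as many bytes on the output side because the output sample rate is double the input rate.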
How Gradium measures quality: BLEU and MetricX
Translation quality is not a single number, so Gradium reports two complementary metrics.
BLEU (Bilingual Evaluation Understudy) is a long-standing machine translation standard (Papineni et al.). It measures the n-gram overlap between model output and human reference translations. Values range from 0 to 100, with higher values being better.
BLEU is fast, reproducible, and comparable across systems. Its limitation is that it rewards surface-level word matches. Correct translations that use different wording may be penalized.
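To make the surface-matching limitation concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU. It is not full BLEU, which combines several n-gram orders with a brevity penalty over a whole corpus, and the example sentences are invented for illustration:

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Fraction of hypothesis n-grams found in the reference.

    Each n-gram is credited at most as often as it occurs in the reference.
    """
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


reference = "the weather is nice today"
print(clipped_precision("the weather is nice today", reference))  # 1.0
print(clipped_precision("it is a lovely day", reference))         # 0.2
```

The second hypothesis is a reasonable paraphrase, yet it shares only the word "is" with the reference, so its unigram precision drops to 0.2.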
MetricX is a learned neural quality metric developed by Google (Juraska et al.). It predicts how humans would rate a translation. It is an error score, so lower is better, and it tracks human judgment more closely than BLEU.
The two capture failures differently. BLEU checks lexical fidelity. MetricX checks semantic adequacy.
Benchmarks
Gradium benchmarks on its own dataset of conversational audio. The data reflects everyday topics such as work, travel, and weather, rather than scripted text.
Against gemini-3.5-live-translate, Gradium leads on both BLEU and MetricX. Against gpt-realtime-translate, Gradium leads on BLEU and is on par on MetricX.
| Capability | Gradium | gpt-realtime-translate | gemini-3.5-live-translate |
|---|---|---|---|
| Average latency (all pairs) | 3.0 seconds | 3.6 seconds | 2.9 seconds |
| BLEU (higher is better) | Leads both | Lower than Gradium | Lower than Gradium |
| MetricX (lower error is better) | On par with GPT, leads Gemini | On par with Gradium | Higher error than Gradium |
| Select output voice | Yes (catalog) | No | Not listed |
| Clone your voice | Yes | No | Not listed |
| Languages | 5 languages, 20 pairs | Not listed | Not listed |
Accuracy (BLEU and MetricX) is measured on stt-translate translations. Latency is for the full s2s-translate pipeline. Read this as a tradeoff, not a clean sweep. Gemini is slightly faster. Gradium is more accurate and adds voice control.
Why two models are better than three
A standard speech translation stack uses three models: speech-to-text, then text-to-text translation, then text-to-speech. Each stage is a separate inference call. Each adds processing time and a handoff.
Gradium uses two. stt-translate performs transcription and translation in a single pass. The dedicated text-to-text stage disappears entirely.
This removes one full model, along with its latency and handoffs, from the critical path. The end-to-end path is shorter than a three-model cascade of comparable quality.
The numbers support the design. s2s-translate averages 3.0 seconds across all language pairs. It beats gpt-realtime-translate at 3.6 seconds and sits close to gemini-3.5-live-translate at 2.9 seconds.
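The size of those gaps is easy to work out from the three reported averages. A small sketch of the arithmetic; the function name `gap_vs_gradium` is hypothetical, and the only inputs are the latencies quoted in the article:

```python
# Average latencies in seconds, as reported in the article.
LATENCY_S = {
    "s2s-translate": 3.0,
    "gpt-realtime-translate": 3.6,
    "gemini-3.5-live-translate": 2.9,
}


def gap_vs_gradium(other: str) -> tuple[float, float]:
    """Return (seconds, percent of the other model's latency).

    Positive values mean s2s-translate is slower; negative means faster.
    """
    delta = LATENCY_S["s2s-translate"] - LATENCY_S[other]
    return round(delta, 2), round(100 * delta / LATENCY_S[other], 1)


print(gap_vs_gradium("gpt-realtime-translate"))     # (-0.6, -16.7)
print(gap_vs_gradium("gemini-3.5-live-translate"))  # (0.1, 3.4)
```

On these figures, s2s-translate is about 0.6 seconds (roughly 17%) faster than gpt-realtime-translate and about 0.1 seconds (roughly 3%) slower than gemini-3.5-live-translate.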
Use cases and examples
- Live dubbing and localization: Clone the presenter's voice once. Translate a French keynote into Spanish that sounds like the original speaker.
- Multilingual voice agents: Route support calls through s2s-translate. An English-speaking agent hears the German caller in English and responds in German.
- Real-time conferencing: Pipe microphone audio over WebSocket. Each participant receives translated speech and a transcript in their own language.
- Accessibility and captions: Use stt-translate alone if you only need text. Render live translated captions without generating audio.
Translate in a few lines of code
The Python SDK streams audio through the speech-to-speech endpoint and returns translated audio and a transcript.
import asyncio

import numpy as np
from gradium import client as gradium_client

grc = gradium_client.GradiumClient()  # reads GRADIUM_API_KEY from the environment

setup = {
    "model_name": "s2s-translate",
    "input_format": "pcm_24000",  # 24 kHz, 16-bit signed mono input
    "output_format": "pcm_48000",  # 48 kHz, 16-bit signed mono output
    "voice_id": "cLONiZ4hQ8VpQ4Sz",  # must be a voice in the target language
    "stt_model_name": "stt-translate",
    "tts_model_name": "default",
    "target_language": "en",
}

# Raw 24 kHz, 16-bit mono PCM bytes (from a file, buffer, or microphone).
with open("input_24k_mono.pcm", "rb") as f:
    pcm = f.read()


async def main() -> np.ndarray:
    audio_out: list[bytes] = []
    async with grc.s2s_realtime(wait_for_ready_on_start=True, **setup) as s2s:

        async def send_loop():
            for i in range(0, len(pcm), 1920):  # 1920 bytes = 40 ms at 24 kHz
                await s2s.send_audio(pcm[i : i + 1920])
            await s2s.send_eos()  # signal end of input

        async def recv_loop():
            async for msg in s2s:
                if msg["type"] == "audio":
                    audio_out.append(msg["audio"])  # translated speech (bytes)
                elif msg["type"] == "text":
                    print(msg["text"], end=" ", flush=True)  # translated transcript
                elif msg["type"] == "end_of_stream":
                    break

        # Send and receive concurrently; the task group waits for both loops.
        async with asyncio.TaskGroup() as tg:
            tg.create_task(send_loop())
            tg.create_task(recv_loop())

    return np.frombuffer(b"".join(audio_out), dtype=np.int16)  # 48 kHz mono PCM


translated_pcm = asyncio.run(main())
The SDK exposes three ways to drive S2S: use s2s_realtime for live sources, s2s_stream for finite iterables, and s2s for buffered data. All three talk to wss://api.gradium.ai/api/speech/s2s.
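The streaming example yields raw 48 kHz, 16-bit mono PCM, which most players will not open directly. One way to make it playable is to wrap it in a WAV container with Python's standard `wave` module. This is a sketch rather than part of the Gradium SDK; the helper name `write_wav` and the output filename are illustrative assumptions:

```python
import wave


def write_wav(path: str, pcm_bytes: bytes, sample_rate_hz: int = 48_000) -> None:
    """Wrap raw 16-bit signed mono PCM bytes in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 16-bit samples = 2 bytes each
        wav.setframerate(sample_rate_hz)
        wav.writeframes(pcm_bytes)
```

With the NumPy array returned by the example, a call such as `write_wav("translated.wav", translated_pcm.tobytes())` would produce a file that standard audio tools can play.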
Pros and cons
Strengths
- Single-pass stt-translate removes one model from the latency path
- Leads gemini-3.5-live-translate on both BLEU and MetricX
- Selectable and clonable output voices, which gpt-realtime-translate lacks
- One duplex WebSocket replaces manually wired STT plus TTS pipelines
Weaknesses
- Available in 5 languages at launch, only 20 pairs across the full set
- gemini-3.5-live-translate has slightly lower latency at 2.9 seconds
- MetricX is only on par with, not better than, gpt-realtime-translate
- Benchmarks use proprietary datasets, so external replication is limited
Interactive explainer
You can test real-time translation in your browser at gradium.ai/translation. For more information on integration, see the API documentation.

