The generative audio landscape is shifting toward efficiency. A new open-source contender, Kani-TTS-2, released by the team at nineninesix.ai, represents a departure from heavyweight, computationally expensive TTS systems. Instead, it treats audio as a language and delivers high-fidelity speech synthesis in a very small footprint.
Kani-TTS-2 offers a lean, high-performance alternative to closed-source APIs. It is currently available on Hugging Face in both English and Portuguese versions.
Architecture: LFM2 and NanoCodec
Kani-TTS-2 follows an "audio as a language" philosophy. The model does not use a traditional mel-spectrogram pipeline; instead, it uses a neural codec to convert raw audio into discrete tokens.
This approach relies on a two-step process.
- Language backbone: The model is built on LiquidAI's LFM2 (350M) architecture. This backbone generates "audio intent" by predicting the next audio token. LFM (Liquid Foundation Model) is designed for efficiency, providing a fast alternative to standard transformers.
- Neural codec: NVIDIA's NanoCodec then converts these tokens into a 22kHz waveform.
By using this architecture, the model captures human-like prosody (the rhythm and intonation of speech) without the "robotic" artifacts found in older TTS systems.
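To make the "audio as a language" idea concrete, here is a toy sketch of codec-style tokenization: raw audio frames are mapped to the nearest entry in a small codebook, yielding a discrete token sequence a language backbone could model. This is not NanoCodec (real neural codecs learn their codebooks with neural networks); the codebook values below are hypothetical.

```python
# Toy illustration of codec tokenization: quantize raw audio frames to
# discrete codebook tokens. NOT NanoCodec; real codecs learn the codebook.
import math

def frames(signal, size):
    """Split a waveform into fixed-size frames (any partial tail is dropped)."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]

def nearest_code(frame, codebook):
    """Index of the codebook vector closest to the frame (squared L2 distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook]
    return dists.index(min(dists))

def encode(signal, codebook, size=4):
    """Raw samples -> discrete token sequence."""
    return [nearest_code(f, codebook) for f in frames(signal, size)]

def decode(tokens, codebook):
    """Discrete tokens -> approximate waveform (codebook lookup)."""
    return [s for t in tokens for s in codebook[t]]

# A tiny hand-made codebook of 4-sample patterns (hypothetical values).
codebook = [
    [0.0, 0.0, 0.0, 0.0],      # silence
    [0.5, 1.0, 0.5, 0.0],      # positive pulse
    [-0.5, -1.0, -0.5, 0.0],   # negative pulse
]

# A short sine burst stands in for "raw audio".
wave = [math.sin(2 * math.pi * i / 8) for i in range(16)]
tokens = encode(wave, codebook)
print(tokens)  # the discrete sequence a language backbone would predict
```

In the real model, the LFM2 backbone predicts the next token in such a sequence and NanoCodec turns the predicted tokens back into a 22kHz waveform.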
Efficiency: 10,000 hours in 6 hours
Kani-TTS-2's training metrics are a masterclass in optimization. The English model was trained on 10,000 hours of high-quality audio data.
The scale is impressive, but the training speed is the real story. The research team trained the model in just 6 hours using a cluster of 8x NVIDIA H100 GPUs. Combined with an efficient architecture like LFM2, this proves that large datasets do not require weeks of compute time.
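The throughput implied by those numbers can be checked with simple arithmetic:

```python
# Back-of-the-envelope throughput from the reported figures:
# 10,000 hours of audio in 6 wall-clock hours on 8x NVIDIA H100 GPUs.
dataset_hours = 10_000
wallclock_hours = 6
num_gpus = 8

audio_hours_per_wallclock_hour = dataset_hours / wallclock_hours
audio_hours_per_gpu_hour = dataset_hours / (wallclock_hours * num_gpus)

print(f"{audio_hours_per_wallclock_hour:.0f} audio-hours per wall-clock hour")
print(f"{audio_hours_per_gpu_hour:.1f} audio-hours per GPU-hour")
```

That is roughly 1,667 hours of audio processed per wall-clock hour, or about 208 audio-hours per GPU-hour.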
Zero-shot voice cloning and performance
The standout feature for developers is zero-shot voice cloning. Unlike earlier models that require fine-tuning for new voices, Kani-TTS-2 uses speaker embeddings.
- Input: Provide a short reference audio clip.
- Result: The model extracts the distinctive characteristics of that voice and immediately applies them to the generated text.
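The two steps above can be sketched as a pipeline. The embedding below is a toy statistic (mean and energy of the reference samples), not the model's learned neural embedding, and all function names are illustrative only; the point is the flow from reference clip to embedding to conditioned generation.

```python
# Conceptual sketch of the zero-shot cloning flow. The "embedding" is a toy
# statistic; the real model learns a neural speaker embedding. All names
# here are hypothetical, for illustration only.
def speaker_embedding(reference_clip):
    """Toy 2-dim 'embedding': mean sample value and average energy."""
    n = len(reference_clip)
    mean = sum(reference_clip) / n
    energy = sum(s * s for s in reference_clip) / n
    return (mean, energy)

def synthesize(text, embedding):
    """Stand-in for generation: a real TTS decoder would condition its
    audio-token prediction on `embedding` to match the target voice."""
    return {"text": text, "speaker": embedding}

ref = [0.0, 0.2, 0.4, 0.2, 0.0, -0.2, -0.4, -0.2]  # hypothetical clip samples
emb = speaker_embedding(ref)
out = synthesize("Hello from Kani-TTS-2.", emb)
print(out["speaker"])
```

No gradient updates happen anywhere in this flow, which is what makes the cloning "zero-shot": the reference clip is consumed at inference time only.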
From an implementation perspective, the model is very accessible.
- Parameter count: 400M (0.4B) parameters.
- Speed: It achieves a real-time factor (RTF) of 0.2, meaning 10 seconds of audio can be generated in about 2 seconds.
- Hardware: It requires only 3GB of VRAM, making it compatible with consumer-grade GPUs such as the RTX 3060 and RTX 4050.
- License: Released under the Apache 2.0 license, permitting commercial use.
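The RTF figure translates directly into latency budgets:

```python
# Real-time factor (RTF) = generation_time / audio_duration.
# An RTF of 0.2 means each second of audio takes 0.2 s of compute.
def generation_time(audio_seconds, rtf=0.2):
    """Seconds of compute needed to synthesize `audio_seconds` of speech."""
    return audio_seconds * rtf

print(generation_time(10))  # 10 s of audio -> 2.0 s of compute at RTF 0.2
```

Any RTF below 1.0 is faster than real time, so at 0.2 the model can comfortably feed a live audio stream.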
Key points
- Efficient architecture: The model has 400M parameters built on LiquidAI's LFM2 (350M) backbone. Its "audio as language" approach treats speech as discrete tokens, allowing faster processing and more human-like intonation than traditional architectures.
- Rapid training at scale: Kani-TTS-2-EN was trained on 10,000 hours of high-quality audio data in just 6 hours using 8x NVIDIA H100 GPUs.
- Instant zero-shot cloning: No fine-tuning is needed to reproduce a specific voice. Given a short reference audio clip, the model uses speaker embeddings to synthesize text directly in the target speaker's voice.
- High performance on edge hardware: With a real-time factor (RTF) of 0.2, the model can generate 10 seconds of audio in about 2 seconds. It needs only 3GB of VRAM and runs fully on consumer-grade GPUs such as the RTX 3060.
- Developer-friendly license: Released under the Apache 2.0 license, Kani-TTS-2 is ready for commercial integration. It offers a local-first, low-latency alternative to expensive closed-source TTS APIs.
Michal Sutter is a data science professional with a master's degree in data science from the University of Padova. With a strong foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


