Neuphonic has released NeuTTS Air, an open-source text-to-speech (TTS) speech language model designed to run locally in real time on CPU. The Hugging Face model card lists 748M parameters (Qwen2 architecture) and GGUF quantizations (Q4/Q8), enabling inference via llama.cpp/llama-cpp-python with no cloud dependency. It is licensed under Apache-2.0 and ships with a working demo and examples.
So, what’s new?
NeuTTS Air couples a 0.5B-class Qwen backbone with Neuphonic's NeuCodec audio codec. Neuphonic positions the system as a "super-realistic, on-device" TTS language model that clones a voice from ~3 seconds of reference audio and synthesizes speech in that style, targeting voice agents and privacy-sensitive applications. The model card and repository explicitly highlight real-time CPU generation and a small footprint.
Key features
- Realism at sub-1B scale: a ~0.7B-parameter (Qwen2-class) speech language model that preserves human-like prosody and tone when generating from text.
- On-device deployment: distributed as GGUF (Q4/Q8) with a CPU-first path, suitable for laptops, phones, and Raspberry Pi-class boards.
- Instant speaker cloning: style transfer from ~3 seconds of reference audio (reference WAV + transcript).
- Compact LM+codec stack: the Qwen 0.5B backbone is paired with NeuCodec (0.8 kbps / 24 kHz) to balance latency, footprint, and output quality.
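To put the 0.8 kbps / 24 kHz figure in perspective, here is a back-of-envelope comparison against a raw PCM baseline. The sample rate and bitrate come from the model card; the 16-bit mono PCM reference point and the 16-bits-per-token assumption are our own illustrative choices, not published specs.

```python
# Back-of-envelope: what a 0.8 kbps codec implies relative to raw PCM.
# 0.8 kbps and 24 kHz come from the model card; the 16-bit mono PCM
# baseline and per-token bit width below are illustrative assumptions.

SAMPLE_RATE_HZ = 24_000   # NeuCodec output sample rate
BITS_PER_SAMPLE = 16      # assumed PCM baseline (mono)
CODEC_BPS = 800           # 0.8 kbps

raw_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE    # 384,000 bits/s of raw PCM
compression_ratio = raw_bps / CODEC_BPS       # how much smaller the codec stream is

# If each acoustic token carried 16 bits (an assumption), the LM would
# only need to emit this many tokens per second of audio:
tokens_per_sec = CODEC_BPS / 16

print(f"raw PCM:  {raw_bps / 1000:.0f} kbps")
print(f"codec:    {CODEC_BPS / 1000:.1f} kbps ({compression_ratio:.0f}x smaller)")
print(f"token rate at 16 bits/token: {tokens_per_sec:.0f} tokens/s")
```

The takeaway is that the LM's generation budget is tiny compared to raw audio, which is what makes real-time CPU synthesis plausible.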
Model architecture and runtime path
- Backbone: Qwen 0.5B serves as a lightweight LM that conditions speech generation. The hosted artifact is reported on Hugging Face as 748M parameters under the Qwen2 architecture.
- Codec: NeuCodec provides low-bitrate acoustic tokenization and decoding. Its 0.8 kbps / 24 kHz target yields a compact representation suited to efficient on-device use.
- Quantization and format: prebuilt GGUF backbones (Q4/Q8) are available. The repository includes instructions for llama-cpp-python and an optional ONNX decoder path.
- Dependencies: espeak is used for phonemization. Examples and Jupyter notebooks are provided for end-to-end synthesis.
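The runtime path above (phonemize, generate acoustic tokens with the LM, decode with the codec) can be sketched structurally. Every function body below is a stand-in: the real pipeline phonemizes with espeak, runs the GGUF backbone via llama-cpp-python, and decodes with NeuCodec. Function names, the per-phoneme token count, and the 50 Hz token rate are illustrative assumptions, not the published API.

```python
# Structural sketch of the runtime path described above. All bodies are
# stand-ins; only the stage ordering and the 24 kHz output rate reflect
# the actual system. Names and numeric choices here are hypothetical.
from typing import List

FRAME_RATE_HZ = 50       # assumed acoustic-token rate (illustrative)
SAMPLE_RATE_HZ = 24_000  # NeuCodec output rate per the model card

def phonemize(text: str) -> List[str]:
    """Stand-in for espeak-based phonemization (real code emits IPA)."""
    return text.lower().split()

def generate_codec_tokens(phonemes: List[str]) -> List[int]:
    """Stand-in for the Qwen backbone emitting acoustic tokens."""
    return [0] * (len(phonemes) * 10)  # pretend ~10 frames per phoneme

def decode_to_waveform(tokens: List[int]) -> List[float]:
    """Stand-in for NeuCodec decoding tokens to 24 kHz samples."""
    samples_per_token = SAMPLE_RATE_HZ // FRAME_RATE_HZ
    return [0.0] * (len(tokens) * samples_per_token)

def synthesize(text: str) -> List[float]:
    return decode_to_waveform(generate_codec_tokens(phonemize(text)))

wav = synthesize("hello on device speech")
print(f"{len(wav)} samples ~= {len(wav) / SAMPLE_RATE_HZ:.2f} s of audio")
```

The point of the sketch is the division of labor: the LM never touches raw samples, and the codec never sees text, which keeps each stage small enough for CPU execution.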
On-device performance focus
NeuTTS Air showcases "real-time generation on mid-range devices" with CPU-first defaults; the GGUF quantizations target laptops and single-board computers. The model card does not publish FPS/RTF numbers, but the stated distribution target is local inference with no GPU, demonstrated through the provided examples and workflows.
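Since the card omits a real-time factor (RTF) figure, anyone benchmarking locally can compute one themselves: RTF is wall-clock synthesis time divided by the duration of the generated audio, and values below 1.0 mean faster-than-playback generation. The `synthesize` placeholder below is ours; substitute a real NeuTTS Air invocation to get a meaningful number.

```python
# Measuring real-time factor (RTF) for a local TTS run.
# RTF = synthesis_time / audio_duration; RTF < 1.0 is real-time.
# synthesize() is a placeholder stub, not the NeuTTS Air API.
import time

SAMPLE_RATE_HZ = 24_000

def synthesize(text: str) -> list:
    # Placeholder: emit one second of silence per five words.
    n_sec = max(1, len(text.split()) // 5)
    return [0.0] * (n_sec * SAMPLE_RATE_HZ)

def real_time_factor(text: str) -> float:
    start = time.perf_counter()
    wav = synthesize(text)
    elapsed = time.perf_counter() - start
    audio_sec = len(wav) / SAMPLE_RATE_HZ
    return elapsed / audio_sec

rtf = real_time_factor("a short sentence to synthesize locally")
print(f"RTF = {rtf:.4f} ({'real-time' if rtf < 1 else 'slower than real-time'})")
```

Reporting RTF alongside the CPU model and quantization level (Q4 vs Q8) would make results comparable across devices.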
Voice Cloning Workflow
NeuTTS Air needs (1) a reference WAV and (2) the transcript text for that reference. It encodes the reference into style tokens and can then synthesize any text in the reference speaker's voice. The Neuphonic team recommends 3–15 seconds of clean, mono audio, and provides pre-encoded samples.
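The 3–15 second mono recommendation can be checked before encoding a reference clip. The helper below is our own pre-flight check built on Python's standard-library `wave` module; NeuTTS Air does not ship it.

```python
# Pre-flight check for a voice-cloning reference clip, based on the
# 3-15 s clean mono recommendation above. This helper is our own
# illustration, not part of the NeuTTS Air package.
import io
import wave

def check_reference(wav_bytes: bytes,
                    min_sec: float = 3.0, max_sec: float = 15.0) -> list:
    """Return a list of problems; an empty list means the clip looks OK."""
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        duration = w.getnframes() / w.getframerate()
        if w.getnchannels() != 1:
            problems.append(f"expected mono, got {w.getnchannels()} channels")
        if not (min_sec <= duration <= max_sec):
            problems.append(f"duration {duration:.1f}s outside {min_sec}-{max_sec}s")
    return problems

# Build a synthetic 5-second, 24 kHz mono clip to demonstrate the check.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(24_000)
    w.writeframes(b"\x00\x00" * 24_000 * 5)

print(check_reference(buf.getvalue()))  # → []
```

Catching a stereo or too-short reference before synthesis avoids debugging degraded cloning output after the fact.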
Privacy, responsibility, and watermarking
Neuphonic frames the model around on-device privacy (audio and text need not leave the machine without user approval), and all generated audio carries a Perth (Perceptual Threshold) watermark to support responsible use and provenance.
How does it compare?
Open, local TTS systems already exist (e.g., GGUF-based pipelines), but NeuTTS Air is notable for packaging a small LM plus neural codec with instant cloning, CPU-first quantization, and watermarking under a permissive license. The phrase "the world's first super-realistic, on-device speech language model" is a vendor claim; the verifiable facts are the model size, formats, cloning procedure, license, and provided runtimes.
The focus is on system trade-offs. A ~0.7B Qwen2-class backbone paired with GGUF quantization and the 0.8 kbps / 24 kHz NeuCodec is a practical real-time recipe. The Apache-2.0 license and built-in watermarking are deployment-friendly, but publishing RTF/latency figures on commodity CPUs, along with curves of cloning quality versus reference length, would allow rigorous benchmarking against existing local pipelines. Operationally, an offline path with minimal dependencies (espeak, llama.cpp/ONNX) reduces privacy and compliance risk for edge agents.
Check out the Hugging Face model card and the GitHub page for tutorials, code, and notebooks.
Michal Sutter is a data science professional with a Master's degree in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.

