oLLM is a lightweight Python library built on top of Hugging Face Transformers and PyTorch that runs large-context inference on NVIDIA GPUs by aggressively offloading weights and the KV cache to a fast local SSD. The project targets offline, single-GPU workloads and uses FP16/BF16 weights with FlashAttention-2 and a disk-backed KV cache to handle contexts of up to 100K tokens while keeping VRAM within 8-10 GB.
So, what is new?
(1) KV cache reads and writes bypass mmap to reduce host RAM usage; (2) DiskCache support for Qwen3-Next-80B; (3) FlashAttention-2 enabled for Llama-3 for stability; (4) GPT-OSS memory reduction via a "flash-attention-like" kernel and chunked MLP. Tables published by the maintainer report end-to-end memory/I/O footprints on an RTX 3060 Ti (8 GB):
- Qwen3-Next-80B (BF16, 160 GB of weights, 50K context) → ~7.5 GB VRAM + ~180 GB SSD; reported throughput ≈ 1 tok / 2 s.
- GPT-OSS-20B (packed BF16, 10K context) → ~7.3 GB VRAM + 15 GB SSD.
- Llama-3.1-8B (FP16, 100K context) → ~6.6 GB VRAM + 69 GB SSD.
How it works
oLLM streams weights directly from the SSD to the GPU, offloads the attention KV cache to the SSD, and can optionally offload layers to the CPU. Because it uses FlashAttention-2 with online softmax, the full attention matrix is never materialized, and large MLP projections are chunked to bound peak memory. This shifts the bottleneck from VRAM to storage bandwidth and latency, which is why the project highlights NVMe-class SSDs and KvikIO/cuFile (GPUDirect Storage) for high-throughput file I/O.
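To make the chunking idea concrete, here is a minimal PyTorch sketch of running an MLP projection over a long sequence in slices so that only one slice's intermediate activations are resident at a time. This illustrates the general technique, not oLLM's actual kernel, and the layer sizes are assumptions.

```python
# Illustrative sketch of chunked MLP execution (not oLLM's implementation).
# Processing the sequence in slices bounds peak activation memory for long contexts.
import torch
import torch.nn as nn

def chunked_mlp(x: torch.Tensor, mlp: nn.Module, chunk_size: int = 8192) -> torch.Tensor:
    # x: (seq_len, hidden_dim). Only `chunk_size` rows of the large intermediate
    # projection exist at any moment, instead of the full sequence at once.
    outputs = []
    for start in range(0, x.shape[0], chunk_size):
        outputs.append(mlp(x[start:start + chunk_size]))
    return torch.cat(outputs, dim=0)

# Example with assumed Llama-3.1-8B-like sizes (hidden 4096, intermediate 14336).
mlp = nn.Sequential(nn.Linear(4096, 14336), nn.SiLU(), nn.Linear(14336, 4096))
x = torch.randn(32_768, 4096)            # a 32K-token slice of activations
y = chunked_mlp(x, mlp, chunk_size=8192)
print(y.shape)                           # torch.Size([32768, 4096])
```

The same trick applies per transformer layer: the output has the same shape either way, but the peak memory scales with the chunk size rather than the full context length.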
Supported models and GPUs
Out of the box, the examples cover Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B. The library targets NVIDIA Ampere (RTX 30xx, A-series), Ada (RTX 40xx, L4), and Hopper GPUs. Qwen3-Next requires a development build of Transformers (4.57.0.dev or newer). Notably, Qwen3-Next-80B is a sparse MoE (80B total parameters, ~3B active) that vendors typically position for multi-A100/H100 deployments. oLLM's claim is that you can run it offline on a single consumer GPU by paying the SSD penalty and accepting low throughput. That contrasts with the vLLM docs, which suggest a multi-GPU server for the same model family.
Installation and minimal usage
The project is MIT-licensed and available on PyPI (pip install ollm), with an optional kvikio-cu{cuda_version} dependency for fast disk I/O. For Qwen3-Next models, install Transformers from GitHub. A short example in the README shows the Inference(...) and DiskCache(...) wiring and a generate(...) call with a streaming text callback. (At the time of writing, PyPI lists 0.4.1, while the README references changes for 0.4.2.)
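Based on that description, a minimal usage sketch could look like the following. Beyond the Inference(...).DiskCache(...) wiring and the generate(...) call mentioned in the README, the model identifier, argument names, and callback parameter are assumptions, so check the repository for the exact API.

```python
# Hypothetical sketch based on the README description; argument names are assumptions.
# Install first: pip install ollm kvikio-cu{cuda_version}
from ollm import Inference  # assumed import path

llm = Inference("llama3-8B", device="cuda:0")      # assumed model identifier
kv_cache = llm.DiskCache(cache_dir="./kv_cache")   # disk-backed KV cache on the local SSD

def stream(text: str) -> None:
    # Streaming text callback: print tokens as they arrive.
    print(text, end="", flush=True)

answer = llm.generate(
    "Summarize the following 80K-token log dump: ...",
    past_key_values=kv_cache,   # assumed wiring of the disk cache into generation
    max_new_tokens=256,
    stream_callback=stream,     # assumed name for the streaming callback parameter
)
print(answer)
```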
Performance expectations and trade-offs
- Throughput: the maintainer reports ~0.5 tok/s on Qwen3-Next-80B at a 50K context on an RTX 3060 Ti, which makes it suitable for batch/offline analysis rather than interactive chat; SSD latency dominates.
- Storage pressure: long contexts require very large KV caches, and oLLM writes them to the SSD to keep VRAM flat. This mirrors broader industry work on KV offloading (e.g., NVIDIA Dynamo/NIXL and community discussions), but the approach remains storage-bound and workload-specific; see the back-of-envelope estimate after this list.
- Hardware reality check: running Qwen3-Next-80B "on consumer hardware" is feasible under oLLM's disk-centric design, but typical high-throughput inference for this model still expects a multi-GPU server. Treat oLLM as a large-context, offline execution path rather than a drop-in replacement for production serving stacks such as vLLM/TGI.
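To see why the SSD footprint grows so quickly at long context, here is a back-of-envelope KV-cache estimate under assumed Llama-3.1-8B-like dimensions (32 layers, 128-dim heads, FP16); the exact on-disk figure depends on how the cache is laid out and whether grouped-query attention is exploited.

```python
# Back-of-envelope KV-cache size for a 100K-token context.
# The model dimensions below are assumptions, not figures taken from the oLLM repository.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    # Two tensors (K and V) per layer, each of shape (n_kv_heads, seq_len, head_dim).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

print(kv_cache_gb(n_layers=32, n_kv_heads=8,  head_dim=128, seq_len=100_000))  # ~13 GB with 8 GQA KV heads
print(kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=100_000))  # ~52 GB if all 32 heads are cached
# Adding roughly 16 GB of FP16 weights lands in the same tens-of-GB range
# as the ~69 GB SSD figure reported in the table above.
```

Either way, a long context quickly dwarfs the 8 GB of VRAM on the target GPU, which is the core motivation for spilling the cache to disk.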
Conclusion
oLLM makes a clear design point: keep precision high (FP16/BF16), push memory pressure onto the SSD, and make ultra-long contexts runnable on a single 8 GB NVIDIA GPU. It will not match data-center throughput, but for offline document/log analysis, compliance review, or large-context summarization, it is a practical way to step up to an 80B-class MoE if you can already run 8B-20B models comfortably and can tolerate 100-200 GB of fast local storage and sub-1-tok/s generation.
Check out the GitHub repo for tutorials, code, and notebooks.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news in a form that is both technically sound and easy for a wide audience to understand. The platform receives over 2 million views per month, a testament to its popularity among readers.