How do you design an AI system that may plan, purpose, and act on lengthy sequences of selections with out steady human steering? Moonshot AI has launched Kimi K2 Pondering, an open-source pondering agent mannequin that exposes the whole inference stream of the Kimi K2 Professional Blended Structure. It’s aimed toward workloads that require deep inference, long-term instrument utilization, and steady agent conduct over many steps.

What are you K2 pondering??
Kim K2 Pondering is described as the newest and most succesful model of Moonshot’s open supply pondering mannequin. It’s constructed as a pondering agent that infers step-by-step and dynamically invokes instruments throughout inference. The mannequin is designed to interleave chains of thought and performance calls, permitting you to learn, suppose, name a instrument, suppose once more, and repeat tons of of steps.
This mannequin establishes a brand new state-of-the-art in Humanity’s Final Examination and BrowseComp whereas sustaining constant conduct over roughly 200-300 consecutive instrument invocations with out human intervention.
On the identical time, K2 Pondering is launched as an open weight mannequin with a 256K token context window and native INT4 inference to cut back latency and GPU reminiscence utilization whereas sustaining benchmark efficiency.
K2 Pondering is already reside in chat mode on kimi.com and accessible via the Moonshot platform API. We plan to reveal the complete instrument utilization conduct in devoted agent mode.
Structure, MoE design, and context size
Kimi K2 Pondering continues the design of Kimi K2 Combination of Specialists. This mannequin makes use of a MoE structure with 1T complete parameters and 32B activated parameters per token. It has 61 layers with 1 dense layer, 384 specialists with 8 specialists chosen per token, 1 shared professional, 64 consideration heads, and 7168 consideration hidden dimensions. The hidden dimensions of MoE are 2048 per professional.
The vocabulary measurement is 160K tokens and the context size is 256K. The eye mechanism is multi-head latent consideration, and the activation perform is SwiGLU.
Take a look at time scaling and long-term pondering
Kim K2 Pondering is explicitly optimized for scaling check occasions. The mannequin is educated to develop the size of inference and the depth of instrument calls when confronted with tougher duties, fairly than counting on mounted, brief chains of thought.


In humanity’s final check in a tool-free atmosphere, K2 Pondering scored 23.9. Utilizing the instrument will increase the rating to 44.9, and on heavier settings it reaches 51.0. AIME25 with Python experiences 99.1 and HMMT25 with Python experiences 95.1. It has a rating of 78.6 on IMO AnswerBench and 84.5 on GPQA.
The check protocol limits the sinking token price range for HLE, AIME25, HMMT25, and GPQA to 96K. IMO AnswerBench, LiveCodeBench, and OJ Bench use 128K suppose tokens and Longform Writing makes use of 32K completion tokens. In HLE, the utmost step restrict is 120 and the inference price range per step is 48K. For agent search duties, the restrict is 300 steps with an inference price range of 24K per step.
Agent search and coding benchmark
For agent search duties utilizing instruments, K2 Pondering experiences 60.2 for BrowseComp, 62.3 for BrowseComp ZH, 56.3 for Seal 0, 47.4 for FinSearchComp T3, and 87.0 for Frames.
Common data benchmarks report 84.6 for MMLU Professional, 94.4 for MMLU Redux, 73.8 for Longform Writing, and 58.0 for HealthBench.
By way of coding, K2 Pondering scores 71.3 on Verified SWE Bench with instruments, 61.1 with SWE Bench Multilingual with instruments, 41.9 with Multi SWE Bench with instruments, 44.8 with SciCode, 83.1 with LiveCodeBenchV6, 48.7 with OJ Bench with C Plus Plus configuration, and Terminal with Simulated Instruments. He hit 47.1 on the bench.
The Moonshot group has additionally outlined a heavy mode that runs eight trajectories in parallel and aggregates them to supply the ultimate reply. That is utilized in some inference benchmarks to squeeze additional precision out of the identical fundamental mannequin.
Native INT4 quantization and enlargement
K2 Pondering is educated as a local INT4 mannequin. The researchers apply quantization-aware coaching within the post-training stage and use INT4 weight-only quantization within the MoE element. This will increase era pace by roughly 2x in low-latency mode and helps INT4 inference whereas sustaining state-of-the-art efficiency. All benchmark scores reported are obtained with INT4 precision.
Checkpoints are saved in compressed tensor format and will be decompressed to increased precision codecs akin to FP8 or BF16 utilizing official compressed tensor instruments. Really helpful inference engines embody vLLM, SGLang, and KTransformers.
Necessary factors
- Kimi K2 Pondering is an open-weight pondering agent that extends the blended structure of Kimi K2 specialists with brief chat-style responses in addition to express long-term reasoning and gear utilization.
- The mannequin makes use of a trillion-parameter MoE design with roughly tens of billions of energetic parameters per token, a 256K context window, and is educated as a local INT4 mannequin with quantization-aware coaching to ship roughly 2x sooner inference whereas preserving benchmark efficiency steady.
- K2 Pondering is optimized for check time scaling, can carry out tons of of consecutive instrument calls in a single activity, and is evaluated below giant suppose token budgets and strict step caps. That is vital when making an attempt to breed inference and agent outcomes.
- Public benchmarks lead or compete in inference, agent search, and coding duties akin to HLE with the instrument, BrowseComp, and SWE bench validation with the instrument, displaying that the thinking-oriented variant presents clear benefits over the bottom non-thinking K2 mannequin.
Kimi K2 Pondering is a powerful sign that scaling check time is a first-class design purpose for open supply inference fashions. Moonshot AI does this by not solely exposing a 1T parameter Combination of Specialists system with 32B energetic parameters and 256K context home windows, but in addition utilizing native INT4 quantization, quantization-aware coaching, and gear orchestration that runs in tons of of steps in a production-like setting. Total, Kimi K2 Pondering exhibits that with long-term planning and gear utilization, open-weight inference brokers have gotten extra of a working infrastructure than only a analysis demo.
Please test model weights and technical details. Please be happy to test it out GitHub page for tutorials, code, and notebooks. Additionally, be happy to comply with us Twitter Do not forget to affix us 100,000+ ML subreddits and subscribe our newsletter. grasp on! Are you on telegram? You can now also participate by telegram.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is dedicated to harnessing the potential of synthetic intelligence for social good. His newest endeavor is the launch of Marktechpost, a man-made intelligence media platform. It stands out for its thorough protection of machine studying and deep studying information, which is technically sound and simply understood by a large viewers. The platform boasts over 2 million views monthly, demonstrating its reputation amongst viewers.

