Every year, the countries competing in the International Mathematical Olympiad (IMO) arrive with a booklet containing their nation's best and most original problems. These booklets are shared among delegates and then quietly disappear. No one had ever gathered them in one place: not for the AI researchers testing the limits of mathematical reasoning, and not for the students around the world training for these competitions, largely on their own.
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and the AI company HUMAIN have now done just that.
MathNet is the largest high-quality dataset of proof-based math problems ever created. Comprising more than 30,000 expert-authored problems and solutions drawn from 47 countries, 17 languages, and 143 competitions, it is five times larger than the next-largest dataset of its kind. The research will be presented at the International Conference on Learning Representations (ICLR) in Brazil later this month.
What makes MathNet different is not just its size, but its breadth. To date, olympiad-level datasets have come almost entirely from competitions in the United States and China. MathNet spans dozens of countries on six continents, covers 17 languages, includes both text-based and image-based problems and solutions, and stretches across 40 years of competitive mathematics. The goal is to capture the full range of mathematical perspectives and problem-solving traditions that exist across the world's mathematical community, not just the most visible ones.
"Every country comes with a booklet of their most novel and most creative problems," says Shaden Alshammari, an MIT doctoral student and lead author of the paper. "They share the booklets with one another, but nobody has made the effort to collect them, organize them, and put them online."
Building MathNet required tracking down 1,595 PDF volumes totaling more than 25,000 pages, spanning digital documents in more than a dozen languages as well as scans from decades ago. Much of that archive came from an unexpected source: co-author Navid Safaei has been active in the IMO community for many years and has been collecting and scanning these booklets by hand since 2006. His personal archive formed much of the backbone of the dataset.
Sourcing matters as much as scale. While most existing mathematics datasets gather problems from community forums such as the Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition booklets. The solutions in these booklets are written by experts, peer-reviewed, and often span several pages in which the authors explain multiple approaches to the same problem. That depth gives AI models a much richer signal for learning mathematical reasoning than the short, informal solutions typical of community-sourced datasets. It also makes the dataset genuinely useful for students: anyone preparing for the IMO or a national olympiad now has access to a centralized, searchable collection of high-quality problems and expert solutions from around the world.
"I remember that for many students it was an individual effort; no one in their country was training them for this kind of competition," says Alshammari, who himself competed at the IMO as a student. "We hope this provides a central place for learning from high-quality problems and solutions."
The team has deep roots in the IMO community. Co-author Sultan Albarakati currently serves on the IMO Board, and the researchers are working on sharing the dataset directly with the IMO Foundation. To validate the dataset, they assembled an evaluation group of more than 30 human raters from countries including Armenia, Russia, Ukraine, Vietnam, and Poland, who together verified thousands of solutions.
"The MathNet database can be a great resource for both students and instructors who are tackling new problems or looking for solutions to difficult ones," said Tanish Patil, IMO deputy leader for Switzerland. "Although other archives of olympiad problems exist (notably the AoPS competition collection forum), those resources lack standardized formatting, validated solutions, and the detailed problem metadata needed to organize problems by topic and technique. It will also be interesting to see how this dataset can be used to improve the performance of reasoning models, and whether it can quickly and reliably answer a key question that arises when creating new olympiad problems: determining whether a problem is truly original."
MathNet also serves as a rigorous benchmark for AI performance, and its results reveal a more complicated picture than recent headlines about AI's mathematical capabilities suggest. Frontier models have made remarkable progress, with some reportedly reaching gold-medal performance at the IMO and now solving standard-benchmark problems that would trip most of us up. Still, MathNet shows that this progress is uneven. Even GPT-5, the best-performing model tested, averaged about 69.3 percent on MathNet's main benchmark of 6,400 problems, failing nearly a third of these olympiad-level problems. And when problems include figures, performance drops significantly, revealing visual reasoning to be a consistent weakness of even the most capable models.
Several open-source models scored 0 percent on the Mongolian-language questions, highlighting another area where current AI systems fall short despite their overall strength.
"GPT models perform about as well in English as in other languages," Alshammari says. "However, many open-source models don't work nearly as well in less common languages such as Mongolian."
MathNet's diversity is also designed to address a serious limitation in how AI models learn mathematics. If training data skews toward English- and Chinese-language problems, a model absorbs only a narrow slice of mathematical culture. A Romanian combinatorics problem and a Brazilian number theory problem may approach the same underlying concept from completely different angles. Exposed to that range, the researchers argue, both humans and AI systems become better mathematical thinkers.
Beyond problem solving, MathNet introduces a retrieval benchmark that asks whether a model can recognize when two problems share the same underlying mathematical structure. This capability matters both for AI development and for the mathematics community itself. Over the years, near-duplicate questions have appeared on the actual IMO exam, because spotting mathematical equivalence across different notations, languages, and formats is extremely difficult, even for a panel of human experts. The researchers tested eight state-of-the-art embedding models and found that even the strongest identified the correct match only about 5 percent of the time on the first try, and that the models often ranked structurally unrelated problems as more similar than equivalent ones.
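The evaluation loop behind that 5 percent figure can be sketched in a few lines. This is a minimal illustration, not the paper's code: the `embed` function below is a toy hashed bag-of-words stand-in for the neural embedding models the researchers actually evaluated, and `recall_at_1` is one plausible reading of the "correct match on the first try" metric.

```python
import zlib

import numpy as np


def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy stand-in for a neural embedding model: a hashed bag-of-words
    # vector, normalized to unit length so a dot product equals cosine
    # similarity. A real evaluation would call an embedding model here.
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[zlib.crc32(tok.encode()) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v


def recall_at_1(queries: list[str], corpus: list[str], gold: list[int]) -> float:
    # Fraction of queries whose top-ranked corpus problem (by cosine
    # similarity) is the annotated structural match -- the "first try" rate.
    C = np.stack([embed(t) for t in corpus])
    hits = 0
    for query, match_idx in zip(queries, gold):
        sims = C @ embed(query)
        hits += int(np.argmax(sims) == match_idx)
    return hits / len(queries)
```

The hard part, as the benchmark shows, is that surface-level similarity (shared words, shared notation) is a poor proxy for shared mathematical structure, which is exactly where current embedding models fall down.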
The dataset also includes a retrieval-augmented generation benchmark that tests whether performance improves when a model is shown structurally related problems before being asked to solve a new one. The benefit hinges on relevance: DeepSeek-V3.2-Speciale improved by up to 12 percentage points when the retrieved problems truly matched, but unrelated retrievals degraded its performance in about 22 percent of cases.
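A minimal sketch of that conditioning step, under stated assumptions: the helper name `build_prompt`, the similarity threshold `min_sim`, and the prompt wording are illustrative inventions rather than MathNet's actual pipeline. The relevance gate reflects the finding that only genuinely related retrievals help, while unrelated ones can hurt.

```python
def build_prompt(problem: str,
                 retrieved: list[tuple[str, float]],
                 min_sim: float = 0.8,
                 k: int = 3) -> str:
    # Keep at most k retrieved exemplars, and only those clearing a
    # relevance threshold; the benchmark's 22-percent degradation case
    # is what an ungated version of this function risks.
    ranked = sorted(retrieved, key=lambda pair: -pair[1])
    exemplars = [text for text, sim in ranked if sim >= min_sim][:k]
    parts = []
    if exemplars:
        parts.append("Related solved problems:")
        parts.extend(f"- {ex}" for ex in exemplars)
    parts.append(f"Now solve: {problem}")
    return "\n".join(parts)
```

With no sufficiently similar retrievals, the function falls back to the bare problem, which is the safer default the benchmark results suggest.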
Alshammari co-authored the paper with Safaei, HUMAIN AI engineer Abrar Zainal, KAUST Academy director Sultan Albarakati, MIT CSAIL master's student Kevin Wen SB '25, Microsoft principal engineering manager Mark Hamilton SM '22, PhD '25, and MIT professors William Freeman and Antonio Torralba. Their research was funded, in part, by a Schwarzman College of Computing fellowship and the National Science Foundation.
MathNet is available at mathnet.csail.mit.edu.

