Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models remain challenged (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5), while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
Primary: MIT
All Institutions: MIT
MathNet is poised to have a significant broader impact on the field of machine learning.

1. **Foundational Benchmark**: It provides the largest, highest-quality, multimodal, and multilingual Olympiad-level math dataset and benchmark to date, setting a new standard for evaluating mathematical reasoning and retrieval. This will become a go-to resource for researchers developing advanced AI systems.
2. **Driving Research Directions**: The stark findings regarding the failure of current embedding models to capture deep mathematical equivalence will spur significant research into novel representations and architectures for mathematical knowledge, potentially leading to advancements in symbolic AI and neuro-symbolic approaches.
3. **Improved RAG for Reasoning**: By highlighting the sensitivity of RAG to retrieval quality, MathNet will guide efforts to develop more "math-aware" retrieval systems, ultimately enabling more effective retrieval-augmented generation for complex reasoning tasks beyond mathematics.
4. **Educational and Societal Impact**: The dataset, comprising problems from diverse countries and languages, can serve as a valuable resource for mathematical education, talent development, and cross-cultural exchange in problem-solving. It could also aid in preparing students for international competitions.
5. **Understanding AI Capabilities**: MathNet offers a robust framework for probing the true reasoning capabilities of LLMs and LMMs, moving beyond superficial performance metrics to understand their underlying grasp of mathematical structure and analogy.

MathNet introduces a high-quality, large-scale, multimodal, and multilingual benchmark for Olympiad-level mathematical reasoning and retrieval, revealing critical limitations in current models' ability to capture deep mathematical equivalence.
This work provides a foundational resource and a rigorous evaluation framework that will drive future research in mathematical AI, retrieval-augmented generation, and symbolic reasoning.
The methodology for constructing MathNet is exceptionally rigorous and well designed. The core dataset, MathNet-Solve, comprises 30,676 expert-authored Olympiad-level problems sourced from official national competition booklets across 47 countries and 17 languages. This commitment to high-quality, expert-validated content from official sources, rather than community platforms, is a significant strength. The data-extraction pipeline is sophisticated, leveraging `dots-ocr` for multilingual document parsing and a novel LLM-based pipeline for problem-solution alignment: Gemini-2.5-Flash for segmentation, GPT-4.1 for extraction, and a three-stage verification process combining rule-based checks, GPT-4.1 as a judge, and human annotators. This hybrid approach ensures both scalability and accuracy, addressing the inherent challenges of heterogeneous mathematical documents. A key methodological innovation is the fine-grained taxonomy of mathematical similarity (Invariance, Resonance, Affinity), which underpins the Math-Aware Retrieval task. MathNet-Retrieve is built from 10,000 anchor problems; for each, GPT-5 generates one equivalent positive and three hard negatives, totaling 40,000 synthetic problems. The generation of hard negatives that preserve surface form while altering the underlying mathematics is crucial for truly testing deep understanding. MathNet-RAG, for Retrieval-Augmented Problem Solving, uses 35 expert-curated pairs of real Olympiad problems exhibiting structural resonance, providing a non-synthetic, high-quality evaluation for RAG. The evaluation protocols, including GPT-5-based grading (verified against human experts) and Recall@k for retrieval, are robust and well justified.
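The three-stage verification funnel described above can be sketched as a simple filter chain. This is a hedged illustration only: the `rule_based_check`, `llm_judge`, and `human_review` functions below are hypothetical stand-ins for the paper's actual checks, judge prompts, and annotation workflow, not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """A candidate problem-solution pair produced by the extraction stage."""
    problem: str
    solution: str


def rule_based_check(pair: Pair) -> bool:
    # Hypothetical stage 1: cheap structural checks, e.g. both fields are
    # non-empty and the "solution" is not just a copy of the problem.
    p, s = pair.problem.strip(), pair.solution.strip()
    return bool(p) and bool(s) and p != s


def llm_judge(pair: Pair) -> bool:
    # Stage 2 in the paper uses GPT-4.1 as a judge; stubbed here.
    raise NotImplementedError("call an LLM judge on the pair")


def human_review(pair: Pair) -> bool:
    # Stage 3: route surviving pairs to human annotators; stubbed here.
    raise NotImplementedError("queue the pair for expert annotation")


def verify(pairs, judge=llm_judge, human=human_review):
    """Run the three-stage funnel: rule-based filter -> LLM judge -> human."""
    kept = []
    for pair in pairs:
        if rule_based_check(pair) and judge(pair) and human(pair):
            kept.append(pair)
    return kept
```

Ordering the stages from cheapest to most expensive means the rule-based filter discards obvious extraction failures before any LLM or human time is spent.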
The experimental evaluation is comprehensive, benchmarking 27 state-of-the-art models (LLMs, LMMs, and embedding models) across the three defined tasks.

1. **Problem Solving on MathNet-Solve**: Results show a clear performance stratification, with frontier models such as Gemini-3.1-Pro and GPT-5 significantly outperforming the rest, yet still leaving substantial room for improvement (e.g., Gemini-3.1-Pro at 76.3% overall). This confirms the benchmark's challenging nature.
2. **Math-Aware Retrieval on MathNet-Retrieve**: This section reveals a critical and surprising finding: even strong general-purpose embedding models (e.g., Qwen3-embedding-4B, Gemini-embedding-001) achieve very low Recall@1 (around 5-6%). The analysis of cosine similarity distributions further shows that these models often prioritize superficial lexical overlap over true mathematical equivalence, leading to mis-ranking. This is a profound insight into the limitations of current learned representations of mathematical structure.
3. **Retrieval-Augmented Problem Solving on MathNet-RAG**: The experiments demonstrate that RAG performance is highly sensitive to retrieval quality. Expert-RAG, which uses human-curated relevant problems, yields significant gains (up to 12% for DeepSeek-V3.2-Speciale), while Embed-RAG, which uses current embedding models, shows limited or even negative impact. This underscores the need for improved math-aware retrieval systems to unlock the full potential of RAG for complex reasoning.

The comparison between LLM graders and human expert graders adds valuable meta-evaluation, and the detailed analysis of performance sensitivity to image presence and language further enriches the evaluation.
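The Recall@k metric used for the retrieval task reduces to a few lines over cosine similarities. The sketch below is a generic NumPy implementation under the usual definition (fraction of queries whose gold-equivalent item appears among the k nearest corpus items), not the authors' evaluation code; the embedding matrices are assumed inputs.

```python
import numpy as np


def recall_at_k(query_embs, corpus_embs, positive_idx, k=1):
    """Fraction of queries whose gold-equivalent corpus item appears in the
    top-k items ranked by cosine similarity.

    query_embs:   (n_queries, d) array of query embeddings.
    corpus_embs:  (n_corpus, d) array of corpus embeddings.
    positive_idx: length-n_queries list; positive_idx[i] is the corpus index
                  of query i's mathematically equivalent problem.
    """
    # L2-normalize so dot products equal cosine similarities.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = q @ c.T                           # (n_queries, n_corpus)
    topk = np.argsort(-sims, axis=1)[:, :k]  # indices of the k most similar
    hits = (topk == np.asarray(positive_idx)[:, None]).any(axis=1)
    return hits.mean()


# With orthonormal embeddings whose positives are the queries themselves,
# Recall@1 is trivially perfect:
e = np.eye(3)
print(recall_at_k(e, e, [0, 1, 2], k=1))  # → 1.0
```

The paper's finding that Recall@1 sits around 5-6% for strong general-purpose embedders means `hits` above is almost always false at k=1, even though the positives are mathematically equivalent to the queries.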
The paper demonstrates a strong commitment to reproducibility. The dataset and benchmark are publicly released at `https://mathnet.mit.edu`, `https://huggingface.co/datasets/ShadenA/MathNet`, and `https://github.com/ShadeAlsha/MathNet`. The appendix provides detailed prompts used for solution generation, grading, and metadata extraction, which is crucial for replicating the LLM-based components of the methodology. The description of the data collection, extraction, and annotation pipeline is thorough, allowing researchers to understand and potentially extend the process. The list of models evaluated and metrics used is also clearly stated.
While the paper is exceptionally strong, some limitations can be noted:

1. **Reliance on LLMs for Data Generation and Grading**: Although outputs are extensively verified by humans and rule-based systems, the use of GPT-5 both to generate equivalent positives and hard negatives for MathNet-Retrieve and to grade problem solutions introduces a dependency on proprietary models. The paper addresses this by comparing LLM graders to human graders, but the inherent biases or limitations of these models could still subtly influence the benchmark's characteristics.
2. **General-Purpose Embeddings**: The retrieval experiments primarily use general-purpose embedding models. While this highlights their current limitations, it would be interesting to see whether specialized mathematical embedding models (if they exist or are developed) could bridge the observed gap. The paper implicitly calls for this, but the current evaluation is limited by the available models.
3. **Scope of Multimodality**: While MathNet includes multimodal content, the paper notes that "limited gains from visual augmentation further suggest that multimodal integration for symbolic tasks remains underdeveloped." This indicates that the current multimodal aspect, while present, might not yet fully capture the potential or challenges of visual reasoning in Olympiad math.