Machine Learning Papers

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

Shaden Alshammari, Kevin Wen, Abrar Zainal ... · Proceedings of the International Conference on Learning Representations (ICLR), 2026

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.

Institutional Affiliations

Primary: MIT

All Institutions: MIT

ML Relevance Analysis (91)

MathNet is poised to have a significant broader impact on the field of machine learning.

1. **Foundational Benchmark**: It provides the largest, highest-quality, multimodal, and multilingual Olympiad-level math dataset and benchmark to date, setting a new standard for evaluating mathematical reasoning and retrieval. This will become a go-to resource for researchers developing advanced AI systems.
2. **Driving Research Directions**: The stark findings regarding the failure of current embedding models to capture deep mathematical equivalence will spur significant research into novel representations and architectures for mathematical knowledge, potentially leading to advancements in symbolic AI and neuro-symbolic approaches.
3. **Improved RAG for Reasoning**: By highlighting the sensitivity of RAG to retrieval quality, MathNet will guide efforts to develop more "math-aware" retrieval systems, ultimately enabling more effective retrieval-augmented generation for complex reasoning tasks beyond mathematics.
4. **Educational and Societal Impact**: The dataset, comprising problems from diverse countries and languages, can serve as a valuable resource for mathematical education, talent development, and cross-cultural exchange in problem-solving. It could also aid in preparing students for international competitions.
5. **Understanding AI Capabilities**: MathNet offers a robust framework for probing the true reasoning capabilities of LLMs and LMMs, moving beyond superficial performance metrics to understand their underlying grasp of mathematical structure and analogy.

MathNet introduces a high-quality, large-scale, multimodal, and multilingual benchmark for Olympiad-level mathematical reasoning and retrieval, revealing critical limitations in current models' ability to capture deep mathematical equivalence. This work provides a foundational resource and a rigorous evaluation framework that will drive future research in mathematical AI, retrieval-augmented generation, and symbolic reasoning.

Comprehensive Analysis

Methodology Assessment

The methodology for constructing MathNet is exceptionally rigorous and well-designed. The dataset, MathNet-Solve, comprises 30,676 expert-authored Olympiad-level problems sourced from official national competition booklets across 47 countries and 17 languages. This commitment to high-quality, expert-validated content from official sources, rather than community platforms, is a significant strength. The data extraction pipeline is sophisticated, leveraging `dots-ocr` for multilingual document parsing and a novel three-stage LLM-based pipeline (Gemini-2.5-Flash for segmentation, GPT-4.1 for extraction, and a three-stage verification process involving rule-based checks, GPT-4.1 as a judge, and human annotators) for problem-solution alignment. This hybrid approach ensures both scalability and accuracy, addressing the inherent challenges of heterogeneous mathematical documents. A key methodological innovation is the introduction of a fine-grained taxonomy of mathematical similarity (Invariance, Resonance, Affinity), which underpins the Math-Aware Retrieval task. MathNet-Retrieve is built from 10,000 anchor problems, with GPT-5 generating 1 equivalent positive and 3 hard negatives for each, totaling 40,000 synthetic problems. The generation of "hard negatives" that preserve surface form but alter underlying mathematics is crucial for truly testing deep understanding. MathNet-RAG, for Retrieval-Augmented Problem Solving, uses 35 expert-curated pairs of real Olympiad problems exhibiting structural resonance, providing a non-synthetic, high-quality evaluation for RAG. The evaluation protocols, including GPT-5-based grading (verified against human experts) and Recall@k for retrieval, are robust and well-justified.

Experimental Evaluation

The experimental evaluation is comprehensive, benchmarking 27 state-of-the-art models (LLMs, LMMs, and embedding models) across the three defined tasks.

1. **Problem Solving on MathNet-Solve**: Results show a clear performance stratification, with frontier models like Gemini-3.1-Pro and GPT-5 significantly outperforming others, yet still leaving substantial room for improvement (e.g., Gemini-3.1-Pro at 76.3% overall). This confirms the benchmark's challenging nature.
2. **Math-Aware Retrieval on MathNet-Retrieve**: This section reveals a critical and surprising finding: even strong general-purpose embedding models (e.g., Qwen3-embedding-4B, Gemini-embedding-001) achieve very low Recall@1 (around 5-6%). The analysis of cosine similarity distributions further highlights that these models often prioritize superficial lexical overlap over true mathematical equivalence, leading to mis-ranking. This is a profound insight into the limitations of current learned representations for mathematical structure.
3. **Retrieval-Augmented Problem Solving on MathNet-RAG**: The experiments demonstrate that RAG performance is highly sensitive to retrieval quality. Expert-RAG, using human-curated relevant problems, yields significant gains (up to 12% for DeepSeek-V3.2-Speciale), while Embed-RAG (using current embedding models) shows limited or even negative impact. This underscores the need for improved math-aware retrieval systems to unlock the full potential of RAG for complex reasoning.

The comparison between LLM graders and human expert graders also adds valuable meta-evaluation. The detailed analysis of performance sensitivity to image presence and language further enriches the evaluation.
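The Recall@k metric and the cosine-similarity ranking discussed above can be sketched as follows. This is a minimal illustration and not the benchmark's evaluation code: the candidate pool of one positive and three hard negatives per anchor mirrors the MathNet-Retrieve construction described earlier, but the actual retrieval corpus may be larger, and the 2-D vectors and the helper names `cosine` and `recall_at_k` are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(queries, k):
    """Fraction of queries whose positive appears among the top-k candidates.

    Each query is (anchor_vector, candidates), where candidates is a list of
    (vector, is_positive) pairs ranked by cosine similarity to the anchor.
    """
    hits = 0
    for anchor, candidates in queries:
        ranked = sorted(candidates, key=lambda c: cosine(anchor, c[0]), reverse=True)
        if any(is_positive for _, is_positive in ranked[:k]):
            hits += 1
    return hits / len(queries)

# Toy embeddings (hypothetical values, for illustration only).
queries = [
    # The equivalent problem is the nearest neighbour of the anchor.
    ([1.0, 0.0], [([0.9, 0.1], True), ([0.0, 1.0], False),
                  ([0.5, 0.5], False), ([-1.0, 0.0], False)]),
    # A hard negative sits closer to the anchor than the true positive.
    ([0.0, 1.0], [([0.6, 0.8], True), ([0.0, 0.9], False),
                  ([1.0, 0.0], False), ([0.7, 0.7], False)]),
]

print(recall_at_k(queries, 1))
print(recall_at_k(queries, 2))
```

In the second toy query the hard negative outranks the positive, which is the mis-ranking pattern the review attributes to embeddings that reward surface overlap.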

Reproducibility

The paper demonstrates a strong commitment to reproducibility. The dataset and benchmark are publicly released at `https://mathnet.mit.edu`, `https://huggingface.co/datasets/ShadenA/MathNet`, and `https://github.com/ShadeAlsha/MathNet`. The appendix provides detailed prompts used for solution generation, grading, and metadata extraction, which is crucial for replicating the LLM-based components of the methodology. The description of the data collection, extraction, and annotation pipeline is thorough, allowing researchers to understand and potentially extend the process. The list of models evaluated and metrics used is also clearly stated.

Limitations

While the paper is exceptionally strong, some limitations can be noted:

1. **Reliance on LLMs for Data Generation/Grading**: Although extensively verified by humans and rule-based systems, the use of GPT-5 for generating equivalent positives and hard negatives for MathNet-Retrieve, and GPT-5 for grading problem solutions, introduces a dependency on proprietary models. While the paper addresses this by comparing LLM graders to human graders, the inherent biases or limitations of these models could subtly influence the benchmark's characteristics.
2. **General-Purpose Embeddings**: The retrieval experiments primarily use general-purpose embedding models. While this highlights their current limitations, it would be interesting to see if specialized mathematical embedding models (if they exist or are developed) could bridge the observed gap. The paper implicitly calls for this, but the current evaluation is limited by the available models.
3. **Scope of Multimodality**: While MathNet includes multimodal content, the paper notes that "limited gains from visual augmentation further suggest that multimodal integration for symbolic tasks remains underdeveloped." This indicates that the current multimodal aspect, while present, might not yet fully capture the potential or challenges of visual reasoning in Olympiad math.

Broader Impact

Analysis: Full Paper • Full text: 41,160 characters

#2 TOP PAPER (Score: 89)

Relaxation-Informed Training of Neural Network Surrogate Models

Calvin Tsay · arXiv

ReLU neural networks trained as surrogate models can be embedded exactly in mixed-integer linear programs (MILPs), enabling global optimization over the learned function. The tractability of the resulting MILP depends on structural properties of the network, i.e., the number of binary variables in associated formulations and the tightness of the continuous LP relaxation. These properties are determined during training, yet standard training objectives (prediction loss with classical weight regularization) offer no mechanism to directly control them. This work studies training regularizers that directly target downstream MILP tractability. Specifically, we propose simple bound-based regularizers that penalize the big-M constants of MILP formulations and/or the number of unstable neurons. Moreover, we introduce an LP relaxation gap regularizer that explicitly penalizes the per-sample gap of the continuous relaxation at training points. We derive its associated gradient and provide an implementation from LP dual variables without custom automatic differentiation tools. We show that combining the above regularizers can approximate the full total derivative of the LP gap with respect to the network parameters, capturing both direct and indirect sensitivities. Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude relative to an unregularized baseline, while maintaining competitive surrogate model accuracy.

Institutional Affiliations

Primary: Imperial College London

All Institutions: Imperial College London

ML Relevance Analysis (89)

This paper has significant broader impact across several domains:

* **Mathematical Optimization:** It provides a powerful new tool for integrating neural network surrogates into global optimization problems, particularly those formulated as MILPs. This can unlock new capabilities in fields where complex black-box functions need to be optimized.
* **Engineering Design and Operations:** Applications in process design, energy systems, and planning, where NN surrogates are increasingly used, will directly benefit from the ability to train more tractable models. This can lead to faster design cycles and more efficient operational decisions.
* **Decision-Focused Learning:** The work contributes to the broader paradigm of training ML models with their downstream use in mind. While decision-focused learning often targets solution quality, this paper focuses on *computational tractability*, offering a complementary and equally important objective.
* **Certified Robustness and Verification:** The techniques share methodological roots with certified robustness, demonstrating how insights from that field can be repurposed for optimization tractability.
* **ML System Design:** It highlights the importance of considering the entire ML-to-optimization pipeline, suggesting that training objectives should be informed by the downstream application's computational characteristics. This could lead to more holistic ML system designs.

The dramatic speedups demonstrated could make previously intractable problems solvable within reasonable timeframes, thereby expanding the practical applicability of NN surrogates in optimization. This paper introduces novel regularization techniques that enable the training of ReLU neural network surrogate models which are dramatically more tractable for downstream Mixed-Integer Linear Program (MILP) optimization, achieving up to four orders of magnitude speedup in MILP solve times while maintaining competitive accuracy.

The work makes significant methodological contributions, including a novel LP relaxation gap regularizer with an elegant gradient derivation using LP dual variables and a practical straight-through estimator implementation, alongside a theoretical decomposition linking combined regularizers to the total derivative of the LP gap. This research provides a critical advancement for integrating machine learning models into mathematical optimization, with profound implications for engineering, design, and decision-making applications.

Comprehensive Analysis

Methodology Assessment

The paper proposes a family of novel regularization terms designed to improve the tractability of Mixed-Integer Linear Programs (MILPs) that embed ReLU neural network surrogate models. This addresses a critical bottleneck: while ReLU NNs can be exactly formulated as MILPs, the resulting optimization problems are often intractable. The methodology is well-grounded and comprises three main types of regularizers:

1. **Shrinkage Regularizers ($R_{L1}, R_{L2}$):** These are standard baselines, indirectly influencing MILP tractability by promoting smaller weights, which can lead to tighter bounds.
2. **Bound-based Regularizers ($R_{BW}, R_{SN}, R_{SN2}$):**
   * $R_{BW}$ (Bound-Width): Directly penalizes the mean width of Interval Bound Propagation (IBP) pre-activation bounds across all hidden neurons. This directly targets the big-M constants in MILP formulations, which are crucial for relaxation tightness. Its gradient is computed via automatic differentiation through the IBP forward pass.
   * $R_{SN}$ (Stable-Neuron): Penalizes the "distance to stability" for unstable neurons, encouraging them to become stably active or inactive, thus reducing the number of binary variables needed. It uses a piecewise-linear formulation with a clear subgradient.
   * $R_{SN2}$ (RS Loss): An alternative stability regularizer from prior work, included for comparison.
3. **LP Relaxation Gap Regularizer ($R_{LP}$):** This is the most novel and technically sophisticated contribution. It directly penalizes the per-sample continuous LP relaxation gap at training points. The paper elegantly derives its gradient using sensitivity analysis for parametric LPs, specifically leveraging LP dual variables. Crucially, it provides a practical implementation using a "straight-through estimator" to avoid custom automatic differentiation tools, making it accessible for standard ML frameworks like PyTorch.
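For reference, the big-M constants and binary variables that these regularizers target come from the usual encoding of a single ReLU neuron $y = \max(0, z)$. The statement below is the common textbook form. The symbols $z$, $y$, $\sigma$, $l$, and $u$ are notation chosen here for illustration and may not match the paper's.

$$
\begin{aligned}
& z = w^\top x + b, \qquad l \le z \le u, \\
& y \ge 0, \qquad y \ge z, \\
& y \le z - l\,(1 - \sigma), \qquad y \le u\,\sigma, \\
& \sigma \in \{0, 1\}.
\end{aligned}
$$

Setting $\sigma = 1$ forces $y = z \ge 0$ and setting $\sigma = 0$ forces $y = 0 \ge z$. When $l \ge 0$ or $u \le 0$ the neuron is stable, so $\sigma$ can be fixed and no binary variable is needed. When $l < 0 < u$, relaxing $\sigma$ to $[0, 1]$ gives a continuous relaxation whose tightness depends on how close $l$ and $u$ are to the true range of $z$.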
A significant theoretical contribution is Proposition 2, which demonstrates that the combined regularizer $R_{LP} + \lambda R_{BW}$ approximates the full total derivative of the LP gap with respect to network parameters. This decomposition captures both direct sensitivity (through constraint right-hand sides) and indirect sensitivity (through big-M constants via IBP), providing a strong theoretical justification for combining these regularizers. The methodology is robust, combining established concepts (IBP, MILP formulations) with novel gradient derivations and practical implementation strategies.
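The two quantities that the bound-based regularizers act on, the mean IBP pre-activation bound width and the number of unstable neurons, can be computed with plain interval arithmetic. The sketch below is an illustration only and not the paper's implementation: the function names and the toy network are hypothetical, and a trainable regularizer would express the same propagation in an autodiff framework such as PyTorch so that gradients flow through the bounds.

```python
def ibp_layer(W, b, lower, upper):
    """Propagate the box [lower, upper] through z = W x + b."""
    pre_l, pre_u = [], []
    for row, bias in zip(W, b):
        lo = hi = bias
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight keeps the bound order; a negative one swaps it.
            lo += w * l if w >= 0 else w * u
            hi += w * u if w >= 0 else w * l
        pre_l.append(lo)
        pre_u.append(hi)
    return pre_l, pre_u

def relu_bounds(pre_l, pre_u):
    """Post-activation bounds after applying ReLU elementwise."""
    return [max(0.0, l) for l in pre_l], [max(0.0, u) for u in pre_u]

def bound_stats(layers, lower, upper):
    """Return (mean pre-activation bound width, count of unstable neurons)."""
    widths, unstable = [], 0
    for W, b in layers:
        pre_l, pre_u = ibp_layer(W, b, lower, upper)
        widths += [u - l for l, u in zip(pre_l, pre_u)]
        # A neuron is unstable when its bounds straddle zero.
        unstable += sum(1 for l, u in zip(pre_l, pre_u) if l < 0.0 < u)
        lower, upper = relu_bounds(pre_l, pre_u)
    return sum(widths) / len(widths), unstable

# Toy network with two hidden layers (weights are made up for illustration).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.5]),
    ([[1.0, -1.0]], [0.0]),
]
mean_width, n_unstable = bound_stats(layers, [-1.0, -1.0], [1.0, 1.0])
print(mean_width, n_unstable)
```

In this toy network the second neuron of the first layer has bounds $[0.5, 2.5]$ and is stably active, while the other two neurons straddle zero and would each need a binary variable in the MILP.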

Experimental Evaluation

The experimental evaluation is comprehensive and compelling.

* **Benchmarks:** The methods are tested on standard non-convex benchmark functions (Himmelblau, Peaks, Ackley) and a more complex, real-world relevant problem: a two-stage stochastic programming problem with quantile neural network surrogates. This demonstrates applicability across different problem types.
* **Network Architectures:** Various network sizes (2, 3, 5 hidden layers, 25-50 neurons per layer) are explored, showing the robustness of the approach across different model complexities.
* **Metrics:** The evaluation uses a comprehensive set of metrics:
   * **Accuracy:** Normalized test MSE ratios are reported to assess the trade-off between tractability and prediction accuracy.
   * **MILP Tractability:** Key metrics include the number of unstable neurons, LP relaxation gap, MILP node count, and MILP solve time.
* **Results:** The results are outstanding. The proposed regularizers, especially combinations like $R_{BW}+R_{LP}$, achieve reductions in MILP solve times by *up to four orders of magnitude* (e.g., from hours to seconds) compared to unregularized baselines. This is achieved while maintaining competitive surrogate model accuracy, demonstrating a highly favorable trade-off. The paper shows that $R_{LP}$ is particularly effective at reducing the LP relaxation gap, while $R_{SN}$ and $R_{BW}$ contribute to reducing unstable neurons and tightening bounds, respectively. The computational overhead during training is analyzed, with $R_{LP}$ being the most expensive (5-10x baseline training time), but this cost is amortized over potentially many downstream optimization tasks. The visual examples (Figures 1, 2, and 3) effectively illustrate the impact of regularization on relaxation tightness and prediction quality.
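To give some intuition for the LP relaxation gap metric, the sketch below evaluates the gap for a single unstable ReLU neuron, where the continuous relaxation of the big-M encoding reduces to the well-known triangle envelope. This is a one-neuron illustration under that assumption, not the paper's network-level, per-sample gap, which is obtained by solving an LP. The function name and the bound values are hypothetical.

```python
def relu_relaxation_gap(z, l, u):
    """Gap between the relaxation's upper envelope and ReLU(z) for one neuron.

    Assumes valid pre-activation bounds l <= z <= u.
    """
    if l >= 0.0 or u <= 0.0:
        # Stable neuron: the relaxation coincides with the ReLU itself.
        return 0.0
    # Upper envelope: the chord joining (l, 0) and (u, u).
    upper_envelope = u * (z - l) / (u - l)
    return upper_envelope - max(0.0, z)

# Loose bounds versus tighter bounds at the same pre-activation value z = 0.
loose_gap = relu_relaxation_gap(0.0, -2.0, 2.0)
tight_gap = relu_relaxation_gap(0.0, -0.5, 2.0)
print(loose_gap, tight_gap)
```

The gap vanishes at both ends of the interval and is largest at $z = 0$, where it equals $-ul/(u - l)$, so tightening either bound shrinks it. This is one way to see why bound tightening and gap reduction are related objectives.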

Reproducibility

The paper provides sufficient detail for reproducibility.

* **Implementation Details:** The use of PyTorch for NN models and regularizers, Gurobi for MILP, and HiGHS for LP solves is clearly stated. The specific version of Gurobi is mentioned.
* **Gradient Derivations:** The gradients for all regularizers are explicitly derived, and the "straight-through estimator" implementation for $R_{LP}$ is clearly explained, which is crucial for practical implementation in standard ML frameworks.
* **Experimental Setup:** Details on training data generation (Latin Hypercube sampling), sample sizes, normalization, and validation splits are provided.
* **Computational Environment:** The server specifications (AMD EPYC 7742, 8 CPU cores, 16 GB memory) are mentioned.
* **Tooling:** The choice of HiGHS over Gurobi for LPs during training is justified, aiding reproducibility with open-source tools.

The acknowledgment of using Anthropic's Claude for server setup is unusual but transparent. Overall, the level of detail is high, making the work highly reproducible.

Limitations

* **Computational Cost of $R_{LP}$:** While the benefits are immense, the LP-based regularizer significantly increases training time (5-10x). This might be a barrier for very large networks or datasets, although the paper suggests GPU-based LP solvers as a future direction.
* **Reliance on IBP:** The bound-based regularizers and the indirect sensitivity path in Proposition 2 rely on IBP, which provides valid but often loose bounds. While the paper acknowledges this, more sophisticated OBBT methods could potentially yield even tighter relaxations at higher computational cost.
* **Approximation in Combined Regularizer:** The combined regularizer $R_{LP} + \lambda R_{BW}$ approximates the full total derivative by using a uniform weight $\lambda$ instead of the true, sample-dependent LP dual multipliers for big-M sensitivity. While effective, this is an approximation.
* **Scope of MILP Formulations:** The work primarily focuses on the standard big-M formulation for ReLU networks. While widely used, other more sophisticated MILP formulations exist, and the generalizability of these specific regularizers to those might require further investigation.
* **ReLU-specific:** The methods are tailored for ReLU activation functions due to their piecewise-linear nature and exact MILP embedding. Generalization to other activation functions (e.g., sigmoid, tanh, or more complex non-linearities) would require different MILP formulations or convex relaxations, which is beyond the current scope.

Broader Impact

Analysis: Full Paper • Full text: 50,026 characters