Last 7 Days (April 21 – April 27, 2026)
ReLU neural networks trained as surrogate models can be embedded exactly in mixed-integer linear programs (MILPs), enabling global optimization over the learned function. The tractability of the resulting MILP depends on structural properties of the network, namely the number of binary variables in the associated formulation and the tightness of its continuous LP relaxation. These properties are determined during training, yet standard training objectives (prediction loss with classical weight regularization) offer no mechanism to control them directly. This work studies training regularizers that directly target downstream MILP tractability. Specifically, we propose simple bound-based regularizers that penalize the big-M constants of the MILP formulation and/or the number of unstable neurons. Moreover, we introduce an LP relaxation gap regularizer that explicitly penalizes the per-sample gap of the continuous relaxation at training points. We derive its gradient and provide an implementation based on LP dual variables that requires no custom automatic differentiation tools. We show that combining the above regularizers can approximate the full total derivative of the LP gap with respect to the network parameters, capturing both direct and indirect sensitivities. Experiments on non-convex benchmark functions and on a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude relative to an unregularized baseline, while maintaining competitive surrogate model accuracy.
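For context, the tractability issue traces back to the standard big-M MILP encoding of a single ReLU neuron $y = \max(0, w^\top x + b)$ with valid pre-activation bounds $\ell \le w^\top x + b \le u$. The textbook formulation is sketched below; the paper's exact variant may differ in detail:

```latex
% Big-M encoding of y = max(0, a), a = w^T x + b, with bounds l <= a <= u:
\begin{align*}
  y &\ge 0, & y &\ge w^\top x + b, \\
  y &\le w^\top x + b - \ell\,(1 - z), & y &\le u\,z, \qquad z \in \{0, 1\}.
\end{align*}
% The bound width u - l sets the big-M constants: tighter bounds give a
% tighter LP relaxation, which is what the proposed regularizers target.
```

A neuron with $\ell \ge 0$ or $u \le 0$ is *stable* (always active or always inactive) and needs no binary variable $z$ at all, which is why penalizing unstable neurons reduces MILP size.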
Primary: Imperial College London
All Institutions: Imperial College London
The paper proposes a family of novel regularization terms designed to improve the tractability of Mixed-Integer Linear Programs (MILPs) that embed ReLU neural network surrogate models. This addresses a critical bottleneck: while ReLU NNs can be exactly formulated as MILPs, the resulting optimization problems are often intractable. The methodology is well-grounded and comprises three main types of regularizers:

1. **Shrinkage Regularizers ($R_{L1}, R_{L2}$):** These are standard baselines, indirectly influencing MILP tractability by promoting smaller weights, which can lead to tighter bounds.
2. **Bound-based Regularizers ($R_{BW}, R_{SN}, R_{SN2}$):** (see the sketch after this list)
   * $R_{BW}$ (Bound-Width): Directly penalizes the mean width of Interval Bound Propagation (IBP) pre-activation bounds across all hidden neurons. This directly targets the big-M constants in MILP formulations, which are crucial for relaxation tightness. Its gradient is computed via automatic differentiation through the IBP forward pass.
   * $R_{SN}$ (Stable-Neuron): Penalizes the "distance to stability" for unstable neurons, encouraging them to become stably active or inactive, thus reducing the number of binary variables needed. It uses a piecewise-linear formulation with a clear subgradient.
   * $R_{SN2}$ (RS Loss): An alternative stability regularizer from prior work, included for comparison.
3. **LP Relaxation Gap Regularizer ($R_{LP}$):** This is the most novel and technically sophisticated contribution. It directly penalizes the per-sample continuous LP relaxation gap at training points. The paper elegantly derives its gradient using sensitivity analysis for parametric LPs, specifically leveraging LP dual variables. Crucially, it provides a practical implementation using a "straight-through estimator" to avoid custom automatic differentiation tools, making it accessible for standard ML frameworks like PyTorch.

A significant theoretical contribution is Proposition 2, which demonstrates that the combined regularizer $R_{LP} + \lambda R_{BW}$ approximates the full total derivative of the LP gap with respect to network parameters. This decomposition captures both direct sensitivity (through constraint right-hand sides) and indirect sensitivity (through big-M constants via IBP), providing a strong theoretical justification for combining these regularizers. The methodology is robust, combining established concepts (IBP, MILP formulations) with novel gradient derivations and practical implementation strategies.
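To make the bound-based regularizers concrete, here is a minimal PyTorch sketch of $R_{BW}$ and $R_{SN}$ computed through an IBP forward pass over an MLP. The helper name `ibp_regularizers` and the exact penalty forms are illustrative assumptions, not the paper's reference implementation:

```python
import torch
import torch.nn as nn

def ibp_regularizers(layers, x_lo, x_hi):
    """Propagate interval bounds through an MLP (ReLU between layers assumed)
    and return (R_BW, R_SN). Hypothetical sketch, not the paper's code."""
    lo, hi = x_lo, x_hi
    bw_terms, sn_terms = [], []
    for i, layer in enumerate(layers):          # each layer: nn.Linear
        # IBP through the affine map: center +/- |W| * radius.
        center, radius = (lo + hi) / 2, (hi - lo) / 2
        mu = layer(center)                      # W c + b
        rad = radius @ layer.weight.abs().T     # |W| r
        l, u = mu - rad, mu + rad               # pre-activation bounds
        if i < len(layers) - 1:                 # hidden layer
            # R_BW: mean width of the pre-activation bounds (big-M size).
            bw_terms.append((u - l).mean())
            # R_SN: distance to stability for unstable neurons (l < 0 < u).
            unstable = (l < 0) & (u > 0)
            dist = torch.minimum(-l, u)
            sn_terms.append(torch.where(unstable, dist,
                                        torch.zeros_like(dist)).mean())
            lo, hi = l.clamp(min=0), u.clamp(min=0)  # ReLU on the interval
    return torch.stack(bw_terms).mean(), torch.stack(sn_terms).mean()
```

Both terms are differentiable (or subdifferentiable) in the weights via autograd, so training would add something like `loss = mse + alpha * r_bw + beta * r_sn`, with `alpha` and `beta` as tuning weights.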
The experimental evaluation is comprehensive and compelling.

* **Benchmarks:** The methods are tested on standard non-convex benchmark functions (Himmelblau, Peaks, Ackley) and on a more complex, practically relevant problem: a two-stage stochastic programming problem with quantile neural network surrogates. This demonstrates applicability across different problem types.
* **Network Architectures:** Various network sizes (2, 3, and 5 hidden layers; 25-50 neurons per layer) are explored, showing the robustness of the approach across different model complexities.
* **Metrics:** The evaluation uses a comprehensive set of metrics:
  * **Accuracy:** Normalized test MSE ratios are reported to assess the trade-off between tractability and prediction accuracy.
  * **MILP Tractability:** Key metrics include the number of unstable neurons, the LP relaxation gap, the MILP node count, and the MILP solve time.
* **Results:** The results are outstanding. The proposed regularizers, especially combinations like $R_{BW}+R_{LP}$, achieve reductions in MILP solve times of *up to four orders of magnitude* (e.g., from hours to seconds) compared to unregularized baselines, while maintaining competitive surrogate model accuracy, a highly favorable trade-off. The paper shows that $R_{LP}$ is particularly effective at reducing the LP relaxation gap, while $R_{SN}$ and $R_{BW}$ contribute to reducing unstable neurons and tightening bounds, respectively. The computational overhead during training is analyzed: $R_{LP}$ is the most expensive (5-10x baseline training time), but this cost is amortized over potentially many downstream optimization tasks. The visual examples (Figures 1-3) effectively illustrate the impact of regularization on relaxation tightness and prediction quality.
The paper provides sufficient detail for reproducibility.

* **Implementation Details:** The use of PyTorch for the NN models and regularizers, Gurobi for MILP solves, and HiGHS for LP solves is clearly stated; the specific Gurobi version is given.
* **Gradient Derivations:** The gradients of all regularizers are explicitly derived, and the "straight-through estimator" implementation for $R_{LP}$ is clearly explained, which is crucial for practical implementation in standard ML frameworks (see the sketch after this list).
* **Experimental Setup:** Details on training data generation (Latin hypercube sampling), sample sizes, normalization, and validation splits are provided.
* **Computational Environment:** The server specifications (AMD EPYC 7742, 8 CPU cores, 16 GB memory) are mentioned.
* **Tooling:** The choice of HiGHS over Gurobi for LP solves during training is justified, aiding reproducibility with open-source tools. The acknowledgment of using Anthropic's Claude for server setup is unusual but transparent.

Overall, the level of detail is high, making the work highly reproducible.
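As a rough illustration of the straight-through idea, the sketch below shows how an LP gap value from an external solver can be given a gradient through a dual-weighted surrogate. Here `solve_relaxation` is a hypothetical callback (e.g., wrapping HiGHS), and the exact constraint and dual bookkeeping is an assumption; the paper's implementation may differ:

```python
import torch

def lp_gap_straight_through(x, bounds, solve_relaxation):
    """Straight-through estimator for the per-sample LP relaxation gap.

    bounds: differentiable tensor of IBP big-M bounds (a function of the
    network parameters). solve_relaxation: hypothetical callback solving
    the LP relaxation at sample x; returns the gap value (float) and the
    dual multipliers of the bound constraints (array)."""
    with torch.no_grad():
        gap_value, duals = solve_relaxation(x, bounds.detach())
    duals = torch.as_tensor(duals, dtype=bounds.dtype)
    # LP sensitivity: the derivative of an LP's optimal value w.r.t. a
    # constraint right-hand side is the corresponding dual variable, so
    # this linear surrogate has the correct local gradient in the bounds.
    surrogate = (duals * bounds).sum()
    # Forward pass reports the true gap; backward pass differentiates the
    # surrogate -- the standard straight-through trick.
    return surrogate + (gap_value - surrogate.detach())
```

Because the bounds themselves come from a differentiable IBP pass, gradients flow from the LP gap all the way back to the network weights without any custom autograd.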
* **Computational Cost of $R_{LP}$:** While the benefits are immense, the LP-based regularizer significantly increases training time (5-10x). This might be a barrier for very large networks or datasets, although the paper suggests GPU-based LP solvers as a future direction.
* **Reliance on IBP:** The bound-based regularizers and the indirect sensitivity path in Proposition 2 rely on IBP, which provides valid but often loose bounds. While the paper acknowledges this, more sophisticated optimization-based bound tightening (OBBT) methods could potentially yield even tighter relaxations at higher computational cost.
* **Approximation in the Combined Regularizer:** The combined regularizer $R_{LP} + \lambda R_{BW}$ approximates the full total derivative by using a uniform weight $\lambda$ instead of the true, sample-dependent LP dual multipliers for the big-M sensitivity (see the schematic after this list). While effective, this is an approximation.
* **Scope of MILP Formulations:** The work primarily focuses on the standard big-M formulation for ReLU networks. While widely used, other more sophisticated MILP formulations exist, and the generalizability of these specific regularizers to those might require further investigation.
* **ReLU-specific:** The methods are tailored to ReLU activation functions due to their piecewise-linear nature and exact MILP embedding. Generalization to other activation functions (e.g., sigmoid, tanh, or more complex non-linearities) would require different MILP formulations or convex relaxations, which is beyond the current scope.
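Schematically (with notation assumed here rather than taken verbatim from the paper), the total derivative that Proposition 2 targets decomposes by the chain rule as:

```latex
% gap depends on theta directly (through the LP data at the sample) and
% indirectly through the IBP big-M bounds M_i(theta):
\frac{\mathrm{d}\,\mathrm{gap}}{\mathrm{d}\theta}
  = \underbrace{\frac{\partial\,\mathrm{gap}}{\partial\theta}}_{\text{direct term: } R_{LP}}
  + \sum_i \underbrace{\frac{\partial\,\mathrm{gap}}{\partial M_i}}_{\text{dual } \mu_i}
    \frac{\partial M_i}{\partial\theta},
  \qquad \mu_i \;\approx\; \lambda \ \text{(uniform weight in } R_{LP} + \lambda R_{BW}\text{)}.
```

Replacing the sample-dependent duals $\mu_i$ with a single uniform $\lambda$ is exactly the approximation noted above.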
This paper has significant broader impact across several domains:

* **Mathematical Optimization:** It provides a powerful new tool for integrating neural network surrogates into global optimization problems, particularly those formulated as MILPs. This can unlock new capabilities in fields where complex black-box functions need to be optimized.
* **Engineering Design and Operations:** Applications in process design, energy systems, and planning, where NN surrogates are increasingly used, will directly benefit from the ability to train more tractable models. This can lead to faster design cycles and more efficient operational decisions.
* **Decision-Focused Learning:** The work contributes to the broader paradigm of training ML models with their downstream use in mind. While decision-focused learning often targets solution quality, this paper focuses on *computational tractability*, offering a complementary and equally important objective.
* **Certified Robustness and Verification:** The techniques share methodological roots with certified robustness, demonstrating how insights from that field can be repurposed for optimization tractability.
* **ML System Design:** It highlights the importance of considering the entire ML-to-optimization pipeline, suggesting that training objectives should be informed by the downstream application's computational characteristics. This could lead to more holistic ML system designs.

The dramatic speedups demonstrated could make previously intractable problems solvable within reasonable timeframes, thereby expanding the practical applicability of NN surrogates in optimization. In summary, this paper introduces novel regularization techniques that enable the training of ReLU neural network surrogate models which are dramatically more tractable for downstream Mixed-Integer Linear Program (MILP) optimization, achieving up to four orders of magnitude speedup in MILP solve times while maintaining competitive accuracy. The work makes significant methodological contributions, including a novel LP relaxation gap regularizer with an elegant gradient derivation using LP dual variables and a practical straight-through estimator implementation, alongside a theoretical decomposition linking the combined regularizers to the total derivative of the LP gap. This research provides a critical advancement for integrating machine learning models into mathematical optimization, with profound implications for engineering, design, and decision-making applications.