Last 14 Days (April 23 – May 06, 2026)
ReLU neural networks trained as surrogate models can be embedded exactly in mixed-integer linear programs (MILPs), enabling global optimization over the learned function. The tractability of the resulting MILP depends on structural properties of the network, i.e., the number of binary variables in associated formulations and the tightness of the continuous LP relaxation. These properties are determined during training, yet standard training objectives (prediction loss with classical weight regularization) offer no mechanism to directly control them. This work studies training regularizers that directly target downstream MILP tractability. Specifically, we propose simple bound-based regularizers that penalize the big-M constants of MILP formulations and/or the number of unstable neurons. Moreover, we introduce an LP relaxation gap regularizer that explicitly penalizes the per-sample gap of the continuous relaxation at training points. We derive its associated gradient and provide an implementation from LP dual variables without custom automatic differentiation tools. We show that combining the above regularizers can approximate the full total derivative of the LP gap with respect to the network parameters, capturing both direct and indirect sensitivities. Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude relative to an unregularized baseline, while maintaining competitive surrogate model accuracy.
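For reference, the standard big-M MILP embedding the abstract refers to can be written, for a single ReLU neuron $y=\max(0,\,w^\top x + b)$ with pre-activation bounds $l \le w^\top x + b \le u$ (and $l < 0 < u$), as:

```latex
y \ge w^\top x + b, \qquad y \ge 0, \qquad
y \le w^\top x + b - l\,(1 - z), \qquad y \le u\,z, \qquad z \in \{0, 1\}.
```

Relaxing $z \in \{0,1\}$ to $z \in [0,1]$ yields the continuous LP relaxation; the bound widths $u - l$ act as the big-M constants whose size the bound-based regularizers penalize, and neurons with $l \ge 0$ or $u \le 0$ are stable and need no binary variable.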
Primary: Imperial College London
All Institutions: Imperial College London
This paper has significant broader impact across several domains:

* **Mathematical Optimization:** It provides a powerful new tool for integrating neural network surrogates into global optimization problems, particularly those formulated as MILPs. This can unlock new capabilities in fields where complex black-box functions need to be optimized.
* **Engineering Design and Operations:** Applications in process design, energy systems, and planning, where NN surrogates are increasingly used, will directly benefit from the ability to train more tractable models. This can lead to faster design cycles and more efficient operational decisions.
* **Decision-Focused Learning:** The work contributes to the broader paradigm of training ML models with their downstream use in mind. While decision-focused learning often targets solution quality, this paper focuses on *computational tractability*, offering a complementary and equally important objective.
* **Certified Robustness and Verification:** The techniques share methodological roots with certified robustness, demonstrating how insights from that field can be repurposed for optimization tractability.
* **ML System Design:** It highlights the importance of considering the entire ML-to-optimization pipeline, suggesting that training objectives should be informed by the downstream application's computational characteristics. This could lead to more holistic ML system designs.

The dramatic speedups demonstrated could make previously intractable problems solvable within reasonable timeframes, thereby expanding the practical applicability of NN surrogates in optimization.

This paper introduces novel regularization techniques that enable the training of ReLU neural network surrogate models which are dramatically more tractable for downstream Mixed-Integer Linear Program (MILP) optimization, achieving up to four orders of magnitude speedup in MILP solve times while maintaining competitive accuracy.
The work makes significant methodological contributions, including a novel LP relaxation gap regularizer with an elegant gradient derivation using LP dual variables and a practical straight-through estimator implementation, alongside a theoretical decomposition linking combined regularizers to the total derivative of the LP gap. This research provides a critical advancement for integrating machine learning models into mathematical optimization, with profound implications for engineering, design, and decision-making applications.
The paper proposes a family of novel regularization terms designed to improve the tractability of Mixed-Integer Linear Programs (MILPs) that embed ReLU neural network surrogate models. This addresses a critical bottleneck: while ReLU NNs can be exactly formulated as MILPs, the resulting optimization problems are often intractable. The methodology is well-grounded and comprises three main types of regularizers:

1. **Shrinkage Regularizers ($R_{L1}, R_{L2}$):** These are standard baselines, indirectly influencing MILP tractability by promoting smaller weights, which can lead to tighter bounds.
2. **Bound-based Regularizers ($R_{BW}, R_{SN}, R_{SN2}$):**
   * $R_{BW}$ (Bound-Width): Directly penalizes the mean width of Interval Bound Propagation (IBP) pre-activation bounds across all hidden neurons. This directly targets the big-M constants in MILP formulations, which are crucial for relaxation tightness. Its gradient is computed via automatic differentiation through the IBP forward pass.
   * $R_{SN}$ (Stable-Neuron): Penalizes the "distance to stability" for unstable neurons, encouraging them to become stably active or inactive, thus reducing the number of binary variables needed. It uses a piecewise-linear formulation with a clear subgradient.
   * $R_{SN2}$ (RS Loss): An alternative stability regularizer from prior work, included for comparison.
3. **LP Relaxation Gap Regularizer ($R_{LP}$):** This is the most novel and technically sophisticated contribution. It directly penalizes the per-sample continuous LP relaxation gap at training points. The paper elegantly derives its gradient using sensitivity analysis for parametric LPs, specifically leveraging LP dual variables. Crucially, it provides a practical implementation using a "straight-through estimator" to avoid custom automatic differentiation tools, making it accessible for standard ML frameworks like PyTorch.
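The IBP forward pass underlying $R_{BW}$ is simple enough to sketch directly. The following is a hedged pure-Python illustration (function names are ours, not the paper's); the paper computes $R_{BW}$ on autodiff tensors, e.g. in PyTorch, so gradients flow through the same interval arithmetic.

```python
# Interval bound propagation (IBP) through one linear layer, plus the
# bound-width penalty it induces. Illustrative sketch, not the paper's code.

def ibp_linear(lo, hi, W, b):
    """Propagate elementwise input bounds [lo, hi] through x -> Wx + b.
    Positive weights pick up the lower/upper input bound respectively;
    negative weights swap them."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        l = bias + sum(w * (lo[j] if w >= 0 else hi[j]) for j, w in enumerate(row))
        u = bias + sum(w * (hi[j] if w >= 0 else lo[j]) for j, w in enumerate(row))
        out_lo.append(l)
        out_hi.append(u)
    return out_lo, out_hi

def relu_bounds(lo, hi):
    """Push interval bounds through the ReLU activation."""
    return [max(l, 0.0) for l in lo], [max(u, 0.0) for u in hi]

def bound_width_penalty(pre_lo, pre_hi):
    """Mean pre-activation bound width: the quantity R_BW penalizes,
    since these widths become the big-M constants in the MILP."""
    return sum(u - l for l, u in zip(pre_lo, pre_hi)) / len(pre_lo)
```

Stacking `ibp_linear` and `relu_bounds` layer by layer reproduces the IBP forward pass; summing `bound_width_penalty` over hidden layers gives the regularizer value.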
A significant theoretical contribution is Proposition 2, which demonstrates that the combined regularizer $R_{LP} + \lambda R_{BW}$ approximates the full total derivative of the LP gap with respect to network parameters. This decomposition captures both direct sensitivity (through constraint right-hand sides) and indirect sensitivity (through big-M constants via IBP), providing a strong theoretical justification for combining these regularizers. The methodology is robust, combining established concepts (IBP, MILP formulations) with novel gradient derivations and practical implementation strategies.
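In our notation (not necessarily the paper's), the decomposition behind Proposition 2 has the following shape for a per-sample LP gap $g(\theta)$ whose big-M constants $M(\theta)$ come from IBP:

```latex
\frac{\mathrm{d} g}{\mathrm{d}\theta}
  = \underbrace{\left.\frac{\partial g}{\partial \theta}\right|_{M}}_{\text{direct: } \nabla R_{LP} \text{ via LP duals}}
  \;+\;
  \underbrace{\sum_i \frac{\partial g}{\partial M_i}\,\frac{\mathrm{d} M_i}{\mathrm{d}\theta}}_{\text{indirect: through the IBP bounds}}
```

Replacing the sample-dependent multipliers $\partial g / \partial M_i$ with a uniform weight $\lambda$ recovers the combined objective $R_{LP} + \lambda R_{BW}$, which is exactly the approximation the review notes under limitations.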
The experimental evaluation is comprehensive and compelling.

* **Benchmarks:** The methods are tested on standard non-convex benchmark functions (Himmelblau, Peaks, Ackley) and a more complex, real-world relevant problem: a two-stage stochastic programming problem with quantile neural network surrogates. This demonstrates applicability across different problem types.
* **Network Architectures:** Various network sizes (2, 3, 5 hidden layers, 25-50 neurons per layer) are explored, showing the robustness of the approach across different model complexities.
* **Metrics:** The evaluation uses a comprehensive set of metrics:
  * **Accuracy:** Normalized test MSE ratios are reported to assess the trade-off between tractability and prediction accuracy.
  * **MILP Tractability:** Key metrics include the number of unstable neurons, LP relaxation gap, MILP node count, and MILP solve time.
* **Results:** The results are outstanding. The proposed regularizers, especially combinations like $R_{BW}+R_{LP}$, achieve reductions in MILP solve times by *up to four orders of magnitude* (e.g., from hours to seconds) compared to unregularized baselines. This is achieved while maintaining competitive surrogate model accuracy, demonstrating a highly favorable trade-off. The paper shows that $R_{LP}$ is particularly effective at reducing the LP relaxation gap, while $R_{SN}$ and $R_{BW}$ contribute to reducing unstable neurons and tightening bounds, respectively. The computational overhead during training is analyzed, with $R_{LP}$ being the most expensive (5-10x baseline training time), but this cost is amortized over potentially many downstream optimization tasks. The visual examples (Figures 1, 2, and 3) effectively illustrate the impact of regularization on relaxation tightness and prediction quality.
The paper provides sufficient detail for reproducibility.

* **Implementation Details:** The use of PyTorch for NN models and regularizers, Gurobi for MILP, and HiGHS for LP solves is clearly stated. The specific version of Gurobi is mentioned.
* **Gradient Derivations:** The gradients for all regularizers are explicitly derived, and the "straight-through estimator" implementation for $R_{LP}$ is clearly explained, which is crucial for practical implementation in standard ML frameworks.
* **Experimental Setup:** Details on training data generation (Latin Hypercube sampling), sample sizes, normalization, and validation splits are provided.
* **Computational Environment:** The server specifications (AMD EPYC 7742, 8 CPU cores, 16 GB memory) are mentioned.
* **Tooling:** The choice of HiGHS over Gurobi for LPs during training is justified, aiding reproducibility with open-source tools. The acknowledgment of using Anthropic's Claude for server setup is unusual but transparent.

Overall, the level of detail is high, making the work highly reproducible.
* **Computational Cost of $R_{LP}$:** While the benefits are immense, the LP-based regularizer significantly increases training time (5-10x). This might be a barrier for very large networks or datasets, although the paper suggests GPU-based LP solvers as a future direction.
* **Reliance on IBP:** The bound-based regularizers and the indirect sensitivity path in Proposition 2 rely on IBP, which provides valid but often loose bounds. While the paper acknowledges this, more sophisticated optimization-based bound tightening (OBBT) methods could potentially yield even tighter relaxations at higher computational cost.
* **Approximation in Combined Regularizer:** The combined regularizer $R_{LP} + \lambda R_{BW}$ approximates the full total derivative by using a uniform weight $\lambda$ instead of the true, sample-dependent LP dual multipliers for big-M sensitivity. While effective, this is an approximation.
* **Scope of MILP Formulations:** The work primarily focuses on the standard big-M formulation for ReLU networks. While widely used, other more sophisticated MILP formulations exist, and the generalizability of these specific regularizers to those might require further investigation.
* **ReLU-specific:** The methods are tailored to ReLU activation functions because of their piecewise-linear nature and exact MILP embedding. Generalization to other activation functions (e.g., sigmoid, tanh, or more complex non-linearities) would require different MILP formulations or convex relaxations, which is beyond the current scope.
Every document format in existence was designed for a human reader moving linearly through text. Autonomous LLM agents do not read - they retrieve. This fundamental mismatch forces agents to inject entire documents into their context window, wasting tokens on irrelevant content, compounding state across multi-turn loops, and broadcasting information indiscriminately across agent roles. We argue this is not a prompt engineering problem, not a retrieval problem, and not a compression problem: it is a format problem. We introduce OBJECTGRAPH (.og), a file format that reconceives the document as a typed, directed knowledge graph to be traversed rather than a string to be injected. OBJECTGRAPH is a strict superset of Markdown - every .md file is a valid .og file - requires no infrastructure beyond a two-primitive query protocol, and is readable by both humans and agents without tooling. We formalize the Document Consumption Problem, characterise six structural properties no existing format satisfies simultaneously, and prove OBJECTGRAPH satisfies all six. We further introduce the Progressive Disclosure Model, the Role-Scoped Access Protocol, and Executable Assertion Nodes as native format primitives. Empirical evaluation across five document classes and eight agent task types demonstrates up to 95.3 percent token reduction with no statistically significant degradation in task accuracy (p > 0.05). Transpiler fidelity reaches 98.7 percent content preservation on a held-out document benchmark.
Primary: Open Gigantic
All Institutions: Open Gigantic
ObjectGraph has the potential for significant broader impact across several dimensions:

1. **Cost and Efficiency**: The dramatic reduction in token consumption (up to 95.3%) and mitigation of context compounding (36.5x reduction) can substantially lower the operational costs of LLM agents and enable more complex, multi-turn workflows within existing context window limits.
2. **Agent Capabilities**: By providing structured, queryable knowledge, ObjectGraph can enhance agent reasoning, planning, and execution capabilities, leading to more reliable and autonomous agents.
3. **System Simplification**: The "ObjectGraph as Infrastructure" concept is powerful. Role-scoped access control, executable assertions, and delta loading natively within the document format can eliminate the need for external middleware, validation prompt templates, and change tracking systems, simplifying the architecture of multi-agent deployments.
4. **Human-Agent Collaboration**: Being a strict superset of Markdown, ObjectGraph allows both humans and agents to interact with the same source document, reducing maintenance overhead and fostering better alignment between human-authored instructions and agent execution.
5. **Knowledge Management**: It offers a more robust framework for managing agent knowledge bases, enabling features like automated staleness detection and structured updates.
6. **New Paradigm for Documents**: This work challenges the fundamental assumption of linear document consumption, proposing a new paradigm for how information is structured and accessed in the agentic era. If widely adopted, it could lead to a new ecosystem of tools and practices for agent-native content creation and consumption.

This paper introduces ObjectGraph, a novel file format that re-imagines documents as typed knowledge graphs for LLM agents, achieving up to 95.3% token reduction and significant context compounding mitigation without degrading task accuracy.
The work presents a comprehensive, well-designed solution to a fundamental problem in LLM agent deployment, offering a paradigm shift in document consumption that promises to enhance agent efficiency, capabilities, and simplify multi-agent system architectures.
The paper introduces ObjectGraph (.og), a novel file format designed to address the "Document Consumption Problem" for LLM agents. The core methodology reconceives documents as typed, directed knowledge graphs rather than linear text strings. The authors formalize this problem and derive six structural properties (Query-Addressable Index, Layered Compression, Typed Dependency Graph, Role-Scoped Access Control, Executable Assertions, Human Readability) that existing formats fail to satisfy simultaneously. ObjectGraph is presented as a strict superset of Markdown, ensuring backward compatibility. Key methodological components include:

1. **ObjectGraph Format Specification**: A detailed structure comprising a file-level manifest (meta, index, changelog blocks) and atomic knowledge units (nodes). Nodes are typed containers with stable identifiers, scope annotations, confidence scores, and versioning metadata. Content-type tags (e.g., `code`, `steps`, `warning`) provide explicit semantic meaning beyond visual cues.
2. **Progressive Disclosure Model (PDM)**: A three-pass reading model (Index, Dense, Full) that enables agents to retrieve only relevant information at the necessary fidelity level, significantly reducing token consumption.
3. **Typed Edge Declarations**: Supports explicit, machine-traversable relationships between nodes (e.g., `:requires`, `:precedes`, `:see-also`), allowing for automatic dependency resolution.
4. **Role-Based Access Control**: The `scope` attribute on nodes and index entries enables content filtering at the format level, eliminating the need for external middleware in multi-agent systems.
5. **Executable Assertion Nodes**: Allows embedding validation logic, retry mechanisms, and escalation paths directly within the document, triggered by the query protocol.
6. **Delta Loading via Changelog**: A `__changelog` meta-node facilitates incremental document updates, reducing the cost of checking for changes.
7. **LLM-Native Query Protocol**: A minimal two-primitive interface (`search_index`, `resolve_context`) that leverages the LLM itself as a "Router" for semantic index search, rather than relying on traditional keyword matching or embeddings. This is a particularly clever design choice.
8. **Transpiler**: A hybrid Markdown-to-ObjectGraph transpiler that uses deterministic parsers for content extraction and bounded LLM calls for metadata synthesis (dense blocks, index keywords), ensuring high fidelity and bounding hallucination risk.

The methodology is comprehensive, well-articulated, and addresses the identified problems systematically. The design choices, such as the Markdown superset and LLM-as-Router, are pragmatic and innovative.
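The two-primitive protocol can be sketched in plain Python. This is a hedged illustration only: the in-memory representation, field names, and the keyword-overlap routing below are our assumptions (the paper uses an LLM "Router" for semantic index search), while the primitive names `search_index` and `resolve_context` and the role-scoping/edge-following behavior follow the paper's description.

```python
# Illustrative sketch of ObjectGraph's two query primitives.
# Data model and matching logic are assumptions, not the .og spec.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    node_type: str      # content-type tag, e.g. "steps", "warning", "code"
    scope: set          # roles allowed to read this node (empty = public)
    dense: str          # compressed summary (PDM pass 2)
    full: str           # full content (PDM pass 3)
    edges: dict = field(default_factory=dict)  # e.g. {"requires": ["n2"]}

class OgDocument:
    def __init__(self, index, nodes):
        self.index = index                      # node_id -> keyword list
        self.nodes = {n.node_id: n for n in nodes}

    def search_index(self, keywords, role):
        """Primitive 1: return candidate node ids visible to `role`.
        Plain keyword overlap stands in for the paper's LLM Router."""
        hits = []
        for nid, kws in self.index.items():
            node = self.nodes[nid]
            if node.scope and role not in node.scope:
                continue  # role-scoped access enforced at the format level
            if set(keywords) & set(kws):
                hits.append(nid)
        return hits

    def resolve_context(self, node_ids, level="dense"):
        """Primitive 2: materialise content at the requested PDM level,
        following `requires` edges so dependencies are included."""
        seen, out, stack = set(), [], list(node_ids)
        while stack:
            nid = stack.pop()
            if nid in seen:
                continue
            seen.add(nid)
            node = self.nodes[nid]
            out.append(node.dense if level == "dense" else node.full)
            stack.extend(node.edges.get("requires", []))
        return "\n".join(out)
```

An agent would first call `search_index` with task keywords and its role, then `resolve_context` on the hits, escalating from `dense` to `full` only when the summary proves insufficient.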
The empirical evaluation is robust and addresses key research questions effectively.

1. **Corpus**: A benchmark of 240 documents across five classes (Skill Files, Operational Runbooks, Execution Plans, Technical Documentation, Knowledge Bases), ranging from 200 to 15,000 tokens, provides a diverse testbed.
2. **Task Suite**: Eight distinct task types (information lookup, procedure execution, multi-step planning, role-conditional access, cross-node reasoning, update detection, assertion verification, multi-agent handoff) cover a broad range of agent interactions.
3. **Models & Baselines**: Evaluation uses Claude Sonnet 4.5 (primary), Claude Haiku 4.5 (Router), and GPT-4o (cross-model validation). Baselines include Full Markdown injection, RAG (text-embedding-3-large), and SkillReducer-optimized Markdown.
4. **RQ1: Token Consumption**: ObjectGraph achieved a mean token reduction from 2,340 to 187 tokens (92.0% average, up to 95.3%), demonstrating significant cost savings.
5. **RQ2: Context Compounding Reduction**: In a 5-turn workflow, ObjectGraph (Architecture B) reduced cumulative token cost by 36.5x compared to Markdown (46,000 vs. 1,260 tokens), effectively mitigating the super-linear growth of context.
6. **RQ3: Task Accuracy**: ObjectGraph matched or exceeded Markdown accuracy on 7 of 8 task types. Notably, it showed dramatic improvements on role-conditional access (+18.4%) and update detection (+30.1%), tasks where Markdown lacks native support. The "less-is-more" effect, where reduced context improves accuracy by reducing attention dilution, is a significant finding.
7. **RQ4: Transpiler Fidelity**: The transpiler achieved a mean fidelity of 0.987 (SD=0.018) on 180 held-out documents, ensuring high content preservation.
8. **RQ5: Human Authoring Burden**: A user study with 18 participants rated authoring burden as low (mean 2.8/7), suggesting good usability for human authors.
9. **Ablation Study**: An ablation study clearly demonstrated the individual contributions of different ObjectGraph features to token reduction, providing valuable insights into the design's effectiveness.

The experimental setup is comprehensive, the accuracy results show no statistically significant degradation (p > 0.05), and the findings strongly support the claims of the paper.
The paper provides a detailed specification of the ObjectGraph format, including its structure, node types, edge syntax, and query protocol. The LLM prompt template for metadata synthesis is explicitly provided. The algorithms for structural extraction and the query protocol are outlined. While no direct code repository or dataset links are provided, the level of detail in the format specification and methodology sections is high enough that a motivated researcher could likely implement the format and protocol. The benchmark corpus is described in terms of document classes and token ranges, but the specific documents are not publicly available. The LLM models used are identified. Overall, the paper offers a strong foundation for reproducibility, though direct code access would enhance it further.
The authors acknowledge several limitations:

1. **Scale**: The benchmark of 240 documents, while curated, may not fully represent the diversity of real-world enterprise-scale corpora.
2. **Cross-file Federation**: The current specification does not support cross-file edge resolution, limiting its applicability to mono-repo or single-domain knowledge bases. This is a significant limitation for truly distributed knowledge graphs.
3. **Standardisation**: Without a standards body or broad community adoption, the format risks fragmentation into incompatible dialects.
4. **Adversarial Inputs**: The evaluation did not consider adversarial document authors who might craft misleading `dense` blocks or `index` entries to manipulate agent routing.

Additional minor limitations could include the reliance on LLMs for routing, which, while a feature, could introduce its own set of challenges (e.g., prompt engineering for optimal routing, potential for misinterpretation if the index is poorly crafted).
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
Primary: Friedrich-Alexander-Universität Erlangen-Nürnberg
All Institutions: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen National High Performance Computing Center, Institute of Radiology, University Hospital Erlangen, Lab for AI in Medicine, RWTH Aachen University, Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Chair of Computer Science 10
This paper has significant broader impact, particularly for the development and deployment of AI in high-stakes domains like medicine. It fundamentally challenges the prevailing assumption that scaling LLMs (larger models, longer contexts, more compute) automatically leads to safer behavior. The finding that evidence quality is paramount and that safety and accuracy decouple will necessitate a paradigm shift in how clinical LLMs are evaluated and deployed. It highlights the critical need for multi-dimensional safety metrics beyond accuracy, including high-risk error, contradiction, and dangerous overconfidence. The identification of "synchronized failure" in ensembles is a crucial warning for system designers relying on model agreement for robustness. The paper provides a valuable framework (SaFE-Scale) and benchmark (RadSaFE-200) that can guide future research and development towards truly safe and reliable clinical AI systems. Its insights are also relevant to other high-stakes applications of LLMs where confident, high-risk errors are unacceptable.

This study rigorously demonstrates that clinical LLM safety is not a passive consequence of scaling but a deployment property critically shaped by evidence quality, retrieval design, and context construction, often decoupling from accuracy. The paper introduces SaFE-Scale, a novel framework, and RadSaFE-200, a benchmark with clinician-defined multi-dimensional safety labels, to empirically show that clean evidence dramatically improves both accuracy and safety, while model scale, retrieval, and inference-time compute offer limited or even misleading safety gains, particularly due to unreliable confidence and synchronized failures in ensembles. This comprehensive analysis provides crucial insights for developing and deploying safer LLMs in high-stakes clinical environments, urging a shift from accuracy-centric evaluation to explicit safety-focused monitoring of high-risk errors.
The paper introduces SaFE-Scale, a well-structured framework for evaluating clinical LLM safety across various scaling dimensions. This framework is instantiated with RadSaFE-200, a novel benchmark of 200 multiple-choice radiology questions. A key methodological strength is the clinician-defined, multi-dimensional safety labels at the option level: high-risk error, unsafe answer, and evidence contradiction. This moves beyond simple accuracy to capture the nuanced risks in clinical settings. The experimental design is comprehensive, evaluating 34 diverse LLMs across six deployment conditions (closed-book, clean evidence, conflict evidence, standard RAG, agentic RAG, max-context prompting) and additional inference-time compute strategies (self-consistency, ensembling). The use of Radiopaedia as an external evidence source for RAG is appropriate for the radiology domain. The metrics chosen (high-risk error rate, unsafe answer rate, contradiction rate, dangerous overconfidence rate, alongside accuracy) are directly relevant to clinical safety. The variance decomposition analysis to quantify the contributions of model family vs. deployment condition is a robust statistical approach. The worst-case analysis at the question level further strengthens the methodology by identifying specific, recurrent failure modes.
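The option-level labels make the safety metrics mechanical to compute. The sketch below is a hedged illustration of how such metrics might be tallied from labeled records; the exact definitions (in particular the confidence threshold behind "dangerous overconfidence") are our assumptions, not RadSaFE-200's specification.

```python
# Illustrative computation of RadSaFE-200-style safety metrics from
# option-level labels. Record schema and threshold are assumptions.

def safety_metrics(records, conf_threshold=0.9):
    """records: list of dicts with keys
       'chosen'                -- the option the model selected
       'correct'               -- the ground-truth option
       'high_risk_options'     -- set of options labeled high-risk error
       'contradicting_options' -- set of options contradicting the evidence
       'confidence'            -- the model's self-reported probability."""
    n = len(records)
    high_risk = sum(r["chosen"] in r["high_risk_options"] for r in records)
    contradiction = sum(r["chosen"] in r["contradicting_options"] for r in records)
    # "Dangerous overconfidence": a confident, wrong, high-risk answer.
    overconfident = sum(
        r["chosen"] != r["correct"]
        and r["chosen"] in r["high_risk_options"]
        and r["confidence"] >= conf_threshold
        for r in records
    )
    return {
        "accuracy": sum(r["chosen"] == r["correct"] for r in records) / n,
        "high_risk_error_rate": high_risk / n,
        "contradiction_rate": contradiction / n,
        "dangerous_overconfidence_rate": overconfident / n,
    }
```

Running this per deployment condition (closed-book, clean evidence, conflict evidence, RAG variants, max-context) yields exactly the kind of condition-wise safety profile the paper reports alongside accuracy.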
The experimental evaluation is exceptionally thorough and rigorous. The study's scale, involving 34 LLMs from various families and sizes, provides a broad and representative assessment of current LLM capabilities. The comparison across six distinct deployment conditions is critical for understanding how practical choices impact safety. The results consistently demonstrate that evidence quality, specifically clinician-written clean evidence, is the most dominant factor for both accuracy and safety, far outweighing model scale or inference-time compute. This is a significant empirical finding. The decoupling of accuracy and safety is clearly illustrated, with agentic RAG improving accuracy but not necessarily safety. The analysis of confidence as an unreliable safety signal, with high confidence observed even in high-risk errors, is a crucial and concerning finding. The investigation into self-consistency and ensembling reveals their limited safety gains and introduces the important concept of "synchronized failure" in ensembles, where multiple models make the same high-risk error. The worst-case analysis effectively highlights that critical failures are not random but concentrate in specific, challenging questions, which is highly valuable for targeted mitigation efforts. The statistical analysis, including variance decomposition, supports the conclusions robustly.
The paper states that "Full prompt templates, output-format instructions, and inference protocols are provided in Supplementary Note [REF]". This commitment to detailing the experimental setup is a strong indicator of reproducibility. The RadSaFE-200 benchmark is intended for public release, albeit with source-specific redistribution restrictions for some components, which is understandable given the use of copyrighted material like RSNA Case Collection and Radiopaedia. The detailed description of benchmark construction, safety augmentation protocol, and model panel specifications further aids reproducibility. While no direct code repository URL is provided in the text, the level of detail suggests that the experiments could be replicated by other researchers with sufficient effort and access to the benchmark.
The authors provide a comprehensive and transparent discussion of limitations. These include: 1. **Benchmark Scope:** Text-based, multiple-choice format does not capture the full complexity of radiology practice (image interpretation, open-ended reasoning, multimodal aspects). 2. **Benchmark Size:** 200 questions, while curated, may be insufficient for highly granular subgroup analyses. 3. **Question Balance:** The benchmark is primarily diagnostic/classification-oriented, reflecting Radiopaedia case structures, and not fully balanced across all question types. 4. **Subjectivity of Safety Labels:** Clinician-defined labels, while informed by rules, involve clinical judgment and implicit assumptions, especially for technical, physics, radiation therapy, and negation-type questions. Future work should include multiple annotators and inter-rater agreement. 5. **Null Responses:** Final null responses were scored as incorrect but not assigned safety labels, potentially underestimating option-level safety failures. 6. **Controlled Evidence:** Clean and conflict evidence are experimental constructs; real-world RAG evidence can be noisier, redundant, or irrelevant in more complex ways. 7. **Specific Implementations:** The RAG and agentic RAG implementations are specific choices; other methods might yield different safety profiles. 8. **Confidence Measurement:** Confidence was derived from entropy-normalized repeated-sampling stability, not calibrated token probabilities, limiting its interpretation as a full calibration study. 9. **Inference-time Compute:** Self-consistency and ensemble experiments were targeted, not exhaustive, leaving room for more advanced aggregation methods.
Every document format in existence was designed for a human reader moving linearly through text. Autonomous LLM agents do not read - they retrieve. This fundamental mismatch forces agents to inject entire documents into their context window, wasting tokens on irrelevant content, compounding state across multi-turn loops, and broadcasting information indiscriminately across agent roles. We argue this is not a prompt engineering problem, not a retrieval problem, and not a compression problem: it is a format problem. We introduce OBJECTGRAPH (.og), a file format that reconceives the document as a typed, directed knowledge graph to be traversed rather than a string to be injected. OBJECTGRAPH is a strict superset of Markdown - every .md file is a valid .og file - requires no infrastructure beyond a two-primitive query protocol, and is readable by both humans and agents without tooling. We formalize the Document Consumption Problem, characterise six structural properties no existing format satisfies simultaneously, and prove OBJECTGRAPH satisfies all six. We further introduce the Progressive Disclosure Model, the Role-Scoped Access Protocol, and Executable Assertion Nodes as native format primitives. Empirical evaluation across five document classes and eight agent task types demonstrates up to 95.3 percent token reduction with no statistically significant degradation in task accuracy (p > 0.05). Transpiler fidelity reaches 98.7 percent content preservation on a held-out document benchmark.
Primary: Open Gigantic
All Institutions: Open Gigantic
ObjectGraph has the potential for significant broader impact across several dimensions: 1. **Cost and Efficiency**: The dramatic reduction in token consumption (up to 95.3%) and mitigation of context compounding (36.5x reduction) can substantially lower the operational costs of LLM agents and enable more complex, multi-turn workflows within existing context window limits. 2. **Agent Capabilities**: By providing structured, queryable knowledge, ObjectGraph can enhance agent reasoning, planning, and execution capabilities, leading to more reliable and autonomous agents. 3. **System Simplification**: The "ObjectGraph as Infrastructure" concept is powerful. Role-scoped access control, executable assertions, and delta loading natively within the document format can eliminate the need for external middleware, validation prompt templates, and change tracking systems, simplifying the architecture of multi-agent deployments. 4. **Human-Agent Collaboration**: Being a strict superset of Markdown, ObjectGraph allows both humans and agents to interact with the same source document, reducing maintenance overhead and fostering better alignment between human-authored instructions and agent execution. 5. **Knowledge Management**: It offers a more robust framework for managing agent knowledge bases, enabling features like automated staleness detection and structured updates. 6. **New Paradigm for Documents**: This work challenges the fundamental assumption of linear document consumption, proposing a new paradigm for how information is structured and accessed in the agentic era. If widely adopted, it could lead to a new ecosystem of tools and practices for agent-native content creation and consumption. This paper introduces ObjectGraph, a novel file format that re-imagines documents as typed knowledge graphs for LLM agents, achieving up to 95.3% token reduction and significant context compounding mitigation without degrading task accuracy. 
The work presents a comprehensive, well-designed solution to a fundamental problem in LLM agent deployment, offering a paradigm shift in document consumption that promises to enhance agent efficiency, capabilities, and simplify multi-agent system architectures.
The paper introduces ObjectGraph (.og), a novel file format designed to address the "Document Consumption Problem" for LLM agents. The core methodology reconceives documents as typed, directed knowledge graphs rather than linear text strings. The authors formalize this problem and derive six structural properties (Query-Addressable Index, Layered Compression, Typed Dependency Graph, Role-Scoped Access Control, Executable Assertions, Human Readability) that existing formats fail to satisfy simultaneously. ObjectGraph is presented as a strict superset of Markdown, ensuring backward compatibility. Key methodological components include: 1. **ObjectGraph Format Specification**: A detailed structure comprising a file-level manifest (meta, index, changelog blocks) and atomic knowledge units (nodes). Nodes are typed containers with stable identifiers, scope annotations, confidence scores, and versioning metadata. Content-type tags (e.g., `code`, `steps`, `warning`) provide explicit semantic meaning beyond visual cues. 2. **Progressive Disclosure Model (PDM)**: A three-pass reading model (Index, Dense, Full) that enables agents to retrieve only relevant information at the necessary fidelity level, significantly reducing token consumption. 3. **Typed Edge Declarations**: Support explicit, machine-traversable relationships between nodes (e.g., `:requires`, `:precedes`, `:see-also`), allowing for automatic dependency resolution. 4. **Role-Based Access Control**: The `scope` attribute on nodes and index entries enables content filtering at the format level, eliminating the need for external middleware in multi-agent systems. 5. **Executable Assertion Nodes**: Allow embedding validation logic, retry mechanisms, and escalation paths directly within the document, triggered by the query protocol. 6. **Delta Loading via Changelog**: A `__changelog` meta-node facilitates incremental document updates, reducing the cost of checking for changes. 7. **LLM-Native Query Protocol**: A minimal two-primitive interface (`search_index`, `resolve_context`) that leverages the LLM itself as a "Router" for semantic index search, rather than relying on traditional keyword matching or embeddings. This is a particularly clever design choice. 8. **Transpiler**: A hybrid Markdown-to-ObjectGraph transpiler that uses deterministic parsers for content extraction and bounded LLM calls for metadata synthesis (dense blocks, index keywords), ensuring high fidelity and bounding hallucination risk. The methodology is comprehensive, well-articulated, and addresses the identified problems systematically. The design choices, such as the Markdown superset and LLM-as-Router, are pragmatic and innovative.
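The two-primitive protocol can be sketched as follows. Only the primitive names (`search_index`, `resolve_context`) come from the paper; the node layout, signatures, and keyword-overlap routing are illustrative assumptions (the paper uses an LLM Router for the semantic index search):

```python
# Hedged sketch of ObjectGraph's two-primitive query protocol.
# Node/index structure and signatures are illustrative assumptions;
# only the primitive names come from the paper.

class ObjectGraphDoc:
    def __init__(self, index, nodes, edges):
        self.index = index    # node_id -> list of index keywords
        self.nodes = nodes    # node_id -> {"dense": ..., "full": ...}
        self.edges = edges    # node_id -> list of ":requires" targets

    def search_index(self, query_terms):
        """Pass 1: return candidate node ids whose index keywords overlap
        the query. (In the paper an LLM 'Router' does this semantically;
        keyword overlap stands in here.)"""
        terms = set(query_terms)
        return [nid for nid, kws in self.index.items() if terms & set(kws)]

    def resolve_context(self, node_id, level="dense"):
        """Pass 2/3: fetch a node at the requested fidelity, pulling in its
        ':requires' dependencies so the agent gets a self-contained view."""
        deps = [self.nodes[d][level] for d in self.edges.get(node_id, [])]
        return deps + [self.nodes[node_id][level]]
```

The agent never loads the whole document: it routes over the index, then resolves only the matched nodes (plus typed dependencies) at the fidelity level it needs.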
The empirical evaluation is robust and addresses key research questions effectively. 1. **Corpus**: A benchmark of 240 documents across five classes (Skill Files, Operational Runbooks, Execution Plans, Technical Documentation, Knowledge Bases), ranging from 200 to 15,000 tokens, provides a diverse testbed. 2. **Task Suite**: Eight distinct task types (information lookup, procedure execution, multi-step planning, role-conditional access, cross-node reasoning, update detection, assertion verification, multi-agent handoff) cover a broad range of agent interactions. 3. **Models & Baselines**: Evaluation uses Claude Sonnet 4.5 (primary), Claude Haiku 4.5 (Router), and GPT-4o (cross-model validation). Baselines include Full Markdown injection, RAG (text-embedding-3-large), and SkillReducer-optimized Markdown. 4. **RQ1: Token Consumption**: ObjectGraph achieved a mean token reduction from 2,340 to 187 tokens (92.0% average, up to 95.3%), demonstrating significant cost savings. 5. **RQ2: Context Compounding Reduction**: In a 5-turn workflow, ObjectGraph (Architecture B) reduced cumulative token cost by 36.5x compared to Markdown (46,000 vs. 1,260 tokens), effectively mitigating the super-linear growth of context. 6. **RQ3: Task Accuracy**: ObjectGraph matched or exceeded Markdown accuracy on 7 of 8 task types. Notably, it showed dramatic improvements on Role-conditional access (+18.4%) and Update detection (+30.1%), tasks where Markdown lacks native support. The "less-is-more" effect, where reduced context improves accuracy by reducing attention dilution, is a significant finding. 7. **RQ4: Transpiler Fidelity**: The transpiler achieved a mean fidelity of 0.987 (SD=0.018) on 180 held-out documents, ensuring high content preservation. 8. **RQ5: Human Authoring Burden**: A user study with 18 participants rated authoring burden as low (mean 2.8/7), suggesting good usability for human authors. 9. **Ablation Study**: An ablation study clearly demonstrated the individual contributions of different ObjectGraph features to token reduction, providing valuable insights into the design's effectiveness. The experimental setup is comprehensive, accuracy shows no statistically significant degradation (p > 0.05), and the findings strongly support the claims of the paper.
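The headline ratios in RQ1 and RQ2 follow directly from the reported token counts:

```python
# Reproduce the headline ratios from the reported token counts (RQ1, RQ2).
markdown_tokens, og_tokens = 2_340, 187
mean_reduction = 1 - og_tokens / markdown_tokens   # fraction of tokens saved

md_5turn, og_5turn = 46_000, 1_260                 # cumulative cost, 5-turn workflow
compounding_factor = md_5turn / og_5turn           # Markdown cost / ObjectGraph cost

print(f"{mean_reduction:.1%}, {compounding_factor:.1f}x")  # → 92.0%, 36.5x
```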
The paper provides a detailed specification of the ObjectGraph format, including its structure, node types, edge syntax, and query protocol. The LLM prompt template for metadata synthesis is explicitly provided. The algorithms for structural extraction and the query protocol are outlined. While no direct code repository or dataset links are provided, the level of detail in the format specification and methodology sections is high enough that a motivated researcher could likely implement the format and protocol. The benchmark corpus is described in terms of document classes and token ranges, but the specific documents are not publicly available. The LLM models used are identified. Overall, the paper offers a strong foundation for reproducibility, though direct code access would enhance it further.
The authors acknowledge several limitations: 1. **Scale**: The benchmark of 240 documents, while curated, may not fully represent the diversity of real-world enterprise-scale corpora. 2. **Cross-file Federation**: The current specification does not support cross-file edge resolution, limiting its applicability to mono-repo or single-domain knowledge bases. This is a significant limitation for truly distributed knowledge graphs. 3. **Standardisation**: Without a standards body or broad community adoption, the format risks fragmentation into incompatible dialects. 4. **Adversarial Inputs**: The evaluation did not consider adversarial document authors who might craft misleading `dense` blocks or `index` entries to manipulate agent routing. Additional minor limitations could include the reliance on LLMs for routing, which, while a feature, could introduce its own set of challenges (e.g., prompt engineering for optimal routing, potential for misinterpretation if the index is poorly crafted).
Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
Primary: unknown
All Institutions: unknown
X2SAM represents a significant step towards more generalized and intuitive multimodal AI. * **Enhanced Human-Computer Interaction:** The conversational interface supporting both text and visual prompts for pixel-level control across images and videos could lead to more natural and powerful interaction paradigms for visual editing, content creation, and data annotation. * **Advanced Video Understanding:** The ability to perform complex segmentation tasks with temporal consistency in videos opens doors for applications in autonomous driving, surveillance, robotics, and medical imaging, where precise spatio-temporal object understanding is critical. * **Foundation for Future MLLMs:** By demonstrating effective unification of image and video segmentation within an MLLM, X2SAM provides a strong baseline and architectural insights for developing even more capable multimodal foundation models. * **New Benchmarking:** The V-VGD benchmark provides a valuable tool for the community to evaluate and advance research in video visual grounded segmentation. X2SAM introduces a unified MLLM framework that extends "any-segmentation" from images to videos, integrating a novel Mask Memory module for temporal consistency and a unified joint training strategy. This paper makes a substantial technical contribution by enabling a single model to perform a wide array of image and video segmentation tasks with both textual and visual prompts, achieving strong performance across modalities and introducing a valuable new benchmark for video visual grounded segmentation.
X2SAM proposes a unified segmentation MLLM designed to extend "any-segmentation" capabilities from images to videos, supporting both textual and visual prompts. The core methodology addresses three key challenges: comprehensive prompt integration, spatio-temporal task formulation, and temporal coherence. 1. **Comprehensive Prompt Integration:** The model augments an LLM to process interleaved textual instructions and visual prompts (V-Prompts) for both image and video inputs. This is achieved by using special tokens to demarcate object conditions and a dedicated token to trigger mask generation. 2. **Spatio-Temporal Task Formulation:** A single formulation covers generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation uniformly over image and video inputs. 3. **Temporal Coherence:** A Mask Memory module stores guided vision features from past frames so that video mask generation remains temporally consistent.

The experimental evaluation is comprehensive and rigorous, covering 14 segmentation tasks across images and videos, along with out-of-domain benchmarks. * **Task Coverage:** X2SAM is evaluated on a broad suite of tasks including generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation for both images and videos. * **Datasets:** Training involves SA-1B for agnostic segmentation, and a diverse mix of image (COCO, RefCOCO/+/g, ReasonSeg, GLaMM-derived, COCO-VGD, LLaVA-1.5) and video (VIPSeg, VSPW, YT-VIS19, YT-RefVOS21, DAVIS17-RefVOS, ReVOS, VideoGLaMM-derived, YT-VOS19, YT19-VGD, VIPSeg-VGD, VideoInstruct100K) datasets. The introduction of the Video Visual Grounded (V-VGD) segmentation benchmark (YT19-VGD and VIPSeg-VGD) is a significant contribution. * **Performance:** * **Image Segmentation:** X2SAM remains competitive with image-centric generalists like X-SAM, notably improving image open-vocabulary segmentation (I-OV) from 20.9 to 31.2 PQ. * **Video Segmentation:** It significantly outperforms existing MLLM-based video generalists. For instance, it improves V-Ref. on Ref-YT21 and Ref-DV17 over UniPixel-7B, and achieves a +21.5 mIoU gain on V-GCG over VideoGLaMM (75.8 vs. 54.3). * **Reasoning Segmentation:** Achieves state-of-the-art results on both image (I-Rea. Seg.) and video (V-Rea. Seg.) reasoning tasks, outperforming HyperSeg and even the video-specialist ReferFormer-B. * **Out-of-Domain Generalization:** Demonstrates strong generalization on gRefCOCO, ADE20K, and YT-VIS-21, surpassing specialists and other MLLM generalists. * **Visual Grounded Segmentation:** Shows substantial improvements over SAM2-H in the video domain (V-VGD Seg.), with impressive AP scores on YT-VIS19 and VIPSeg.
* **Ablation Studies:** Thorough ablations validate key components: * **Mask Decoder:** Zero-initialization for Token-to-Image Attention is shown to be crucial for stable training and performance gains. * **Joint Training:** The unified joint training strategy significantly reduces training cost (3.3K vs 5.2K GPU hours) while maintaining performance. * **Mask Memory:** Mask guidance, class guidance, and multi-scale features in the Mask Memory module are shown to bring consistent and substantial gains, especially for video tasks. * **Memory Size:** An optimal memory size of 6 frames is identified, balancing historical information with potential noise.
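The fixed-size FIFO behaviour ablated here (memory size K, with K=6 chosen for the final model) can be sketched with a bounded deque; what exactly is stored per frame is an assumption for illustration:

```python
from collections import deque

# Hedged sketch of a fixed-size FIFO mask memory over K frames, as
# ablated in the paper; the per-frame feature structure is an
# illustrative assumption (mask-guided and class-guided features).
class MaskMemory:
    def __init__(self, capacity=6):            # K = 6 in the final model
        self.frames = deque(maxlen=capacity)   # oldest frame drops first

    def write(self, mask_feat, class_feat):
        # One entry per processed frame.
        self.frames.append({"mask": mask_feat, "cls": class_feat})

    def read(self):
        # All retained frames, oldest first, for cross-frame guidance.
        return list(self.frames)
```

The `maxlen` bound is what produces the limitation the authors note later: once a target stays occluded for more than K frames, no trace of it remains in memory.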
The paper provides a good level of detail for reproducibility. * **Model Initialization:** Vision encoder, projector, and LLM from Qwen3-VL; mask encoder from SAM2; mask decoder from pre-trained agnostic segmentor. LoRA used for LLM fine-tuning. * **Training Details:** Specifics for both agnostic segmentor training (batch size 128, LR 1e-4) and unified joint training (projectors, LoRA, encoders, decoder, memory optimized; LR 1e-5 for mask encoder, 1e-4 for others; effective batch size 32 for video, 128 for image; AdamW optimizer, weight decay 0.05). * **Loss Functions:** Mask loss (BCE + Dice), auto-regressive loss, and focal loss. * **Dataset Sampling:** Consecutive frame sampling for video segmentation, global sampling for video GCG, 64 frames for video chat. * **Memory Capacity:** Default K=8 for ablations, K=6 for final model. The level of detail provided in the "Implementation Details" and "More Model Details" sections is sufficient for researchers to attempt to reproduce the results, although the sheer scale of training (32 NVIDIA H800 GPUs) might be a practical barrier for some.
The authors candidly discuss several limitations: 1. **Computational Expense:** Unified training over heterogeneous image and video datasets remains computationally expensive, especially for video samples with high memory costs. 2. **Fixed-Size Memory:** The fixed-size FIFO memory (K=6 frames) may be insufficient for very long videos, scenarios with prolonged occlusions, large appearance changes, or sparse target reappearance, limiting long-term temporal understanding. 3. **Generalist vs. Specialist Performance:** As a unified generalist model, X2SAM may still lag behind highly specialized models on narrowly focused tasks (e.g., optimized video object segmentation or image-only segmentation).
ReLU neural networks trained as surrogate models can be embedded exactly in mixed-integer linear programs (MILPs), enabling global optimization over the learned function. The tractability of the resulting MILP depends on structural properties of the network, i.e., the number of binary variables in associated formulations and the tightness of the continuous LP relaxation. These properties are determined during training, yet standard training objectives (prediction loss with classical weight regularization) offer no mechanism to directly control them. This work studies training regularizers that directly target downstream MILP tractability. Specifically, we propose simple bound-based regularizers that penalize the big-M constants of MILP formulations and/or the number of unstable neurons. Moreover, we introduce an LP relaxation gap regularizer that explicitly penalizes the per-sample gap of the continuous relaxation at training points. We derive its associated gradient and provide an implementation from LP dual variables without custom automatic differentiation tools. We show that combining the above regularizers can approximate the full total derivative of the LP gap with respect to the network parameters, capturing both direct and indirect sensitivities. Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude relative to an unregularized baseline, while maintaining competitive surrogate model accuracy.
Primary: Imperial College London
All Institutions: Imperial College London
This paper has significant broader impact across several domains: * **Mathematical Optimization:** It provides a powerful new tool for integrating neural network surrogates into global optimization problems, particularly those formulated as MILPs. This can unlock new capabilities in fields where complex black-box functions need to be optimized. * **Engineering Design and Operations:** Applications in process design, energy systems, and planning, where NN surrogates are increasingly used, will directly benefit from the ability to train more tractable models. This can lead to faster design cycles and more efficient operational decisions. * **Decision-Focused Learning:** The work contributes to the broader paradigm of training ML models with their downstream use in mind. While decision-focused learning often targets solution quality, this paper focuses on *computational tractability*, offering a complementary and equally important objective. * **Certified Robustness and Verification:** The techniques share methodological roots with certified robustness, demonstrating how insights from that field can be repurposed for optimization tractability. * **ML System Design:** It highlights the importance of considering the entire ML-to-optimization pipeline, suggesting that training objectives should be informed by the downstream application's computational characteristics. This could lead to more holistic ML system designs. The dramatic speedups demonstrated could make previously intractable problems solvable within reasonable timeframes, thereby expanding the practical applicability of NN surrogates in optimization. This paper introduces novel regularization techniques that enable the training of ReLU neural network surrogate models which are dramatically more tractable for downstream Mixed-Integer Linear Program (MILP) optimization, achieving up to four orders of magnitude speedup in MILP solve times while maintaining competitive accuracy. 
The work makes significant methodological contributions, including a novel LP relaxation gap regularizer with an elegant gradient derivation using LP dual variables and a practical straight-through estimator implementation, alongside a theoretical decomposition linking combined regularizers to the total derivative of the LP gap. This research provides a critical advancement for integrating machine learning models into mathematical optimization, with profound implications for engineering, design, and decision-making applications.
The paper proposes a family of novel regularization terms designed to improve the tractability of Mixed-Integer Linear Programs (MILPs) that embed ReLU neural network surrogate models. This addresses a critical bottleneck: while ReLU NNs can be exactly formulated as MILPs, the resulting optimization problems are often intractable. The methodology is well-grounded and comprises three main types of regularizers:

1. **Shrinkage Regularizers ($R_{L1}, R_{L2}$):** These are standard baselines, indirectly influencing MILP tractability by promoting smaller weights, which can lead to tighter bounds.
2. **Bound-based Regularizers ($R_{BW}, R_{SN}, R_{SN2}$):**
   * $R_{BW}$ (Bound-Width): Directly penalizes the mean width of Interval Bound Propagation (IBP) pre-activation bounds across all hidden neurons. This directly targets the big-M constants in MILP formulations, which are crucial for relaxation tightness. Its gradient is computed via automatic differentiation through the IBP forward pass.
   * $R_{SN}$ (Stable-Neuron): Penalizes the "distance to stability" for unstable neurons, encouraging them to become stably active or inactive, thus reducing the number of binary variables needed. It uses a piecewise-linear formulation with a clear subgradient.
   * $R_{SN2}$ (RS Loss): An alternative stability regularizer from prior work, included for comparison.
3. **LP Relaxation Gap Regularizer ($R_{LP}$):** This is the most novel and technically sophisticated contribution. It directly penalizes the per-sample continuous LP relaxation gap at training points. The paper elegantly derives its gradient using sensitivity analysis for parametric LPs, specifically leveraging LP dual variables. Crucially, it provides a practical implementation using a "straight-through estimator" to avoid custom automatic differentiation tools, making it accessible for standard ML frameworks like PyTorch.
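To make the bound-width idea concrete, here is a minimal toy sketch of IBP through one affine-plus-ReLU layer and the resulting mean-width penalty. This is our own plain-Python illustration with made-up weights, not the paper's PyTorch code; in practice the penalty would be accumulated over all hidden layers and differentiated automatically.

```python
# Toy interval bound propagation (IBP) through y = W x + b, followed by the
# bound-width penalty R_BW described above. Illustrative only; weights and
# input box are hypothetical.

def affine_bounds(W, b, lo, hi):
    """Propagate the elementwise box [lo, hi] through the affine map W x + b.
    A positive weight picks up the lower input bound for the lower output
    bound (and vice versa for negative weights)."""
    new_lo, new_hi = [], []
    for row, bias in zip(W, b):
        l = bias + sum(w * (lo[j] if w >= 0 else hi[j]) for j, w in enumerate(row))
        u = bias + sum(w * (hi[j] if w >= 0 else lo[j]) for j, w in enumerate(row))
        new_lo.append(l)
        new_hi.append(u)
    return new_lo, new_hi

def bound_width_penalty(lo, hi):
    """R_BW for one layer: mean width of the pre-activation intervals,
    i.e. a direct proxy for the big-M constants of the MILP encoding."""
    return sum(u - l for l, u in zip(lo, hi)) / len(lo)

# Toy 2-input, 2-neuron layer with inputs in [-1, 1]^2.
W = [[1.0, -2.0], [0.5, 0.5]]
b = [0.0, 1.0]
lo, hi = affine_bounds(W, b, [-1.0, -1.0], [1.0, 1.0])
print(lo, hi)                       # pre-activation bounds per neuron
print(bound_width_penalty(lo, hi))  # mean bound width for this layer
```

Note that a neuron whose pre-activation interval lies entirely above or below zero is "stable" and needs no binary variable, which is exactly what $R_{SN}$ encourages.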
A significant theoretical contribution is Proposition 2, which demonstrates that the combined regularizer $R_{LP} + \lambda R_{BW}$ approximates the full total derivative of the LP gap with respect to network parameters. This decomposition captures both direct sensitivity (through constraint right-hand sides) and indirect sensitivity (through big-M constants via IBP), providing a strong theoretical justification for combining these regularizers. The methodology is robust, combining established concepts (IBP, MILP formulations) with novel gradient derivations and practical implementation strategies.
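In our own schematic notation (not necessarily the paper's), the decomposition behind Proposition 2 can be written as:

$$
\frac{\mathrm{d}\,G(\theta)}{\mathrm{d}\theta}
= \underbrace{\frac{\partial G}{\partial \theta}}_{\text{direct: constraint RHS, from LP duals}}
\;+\;
\underbrace{\frac{\partial G}{\partial M}\cdot\frac{\partial M(\theta)}{\partial \theta}}_{\text{indirect: big-}M\text{ bounds via IBP}}
$$

where $G(\theta)$ denotes the per-sample LP relaxation gap and $M(\theta)$ the vector of IBP-derived big-M constants. Under this reading, $R_{LP}$ supplies the first term, while $\lambda R_{BW}$ approximates the second with a uniform weight $\lambda$ in place of the sample-dependent multipliers $\partial G / \partial M$.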
The experimental evaluation is comprehensive and compelling.

* **Benchmarks:** The methods are tested on standard non-convex benchmark functions (Himmelblau, Peaks, Ackley) and a more complex, real-world relevant problem: a two-stage stochastic programming problem with quantile neural network surrogates. This demonstrates applicability across different problem types.
* **Network Architectures:** Various network sizes (2, 3, 5 hidden layers, 25-50 neurons per layer) are explored, showing the robustness of the approach across different model complexities.
* **Metrics:** The evaluation uses a comprehensive set of metrics:
  * **Accuracy:** Normalized test MSE ratios are reported to assess the trade-off between tractability and prediction accuracy.
  * **MILP Tractability:** Key metrics include the number of unstable neurons, LP relaxation gap, MILP node count, and MILP solve time.
* **Results:** The results are outstanding. The proposed regularizers, especially combinations like $R_{BW}+R_{LP}$, achieve reductions in MILP solve times by *up to four orders of magnitude* (e.g., from hours to seconds) compared to unregularized baselines. This is achieved while maintaining competitive surrogate model accuracy, demonstrating a highly favorable trade-off. The paper shows that $R_{LP}$ is particularly effective at reducing the LP relaxation gap, while $R_{SN}$ and $R_{BW}$ contribute to reducing unstable neurons and tightening bounds, respectively. The computational overhead during training is analyzed, with $R_{LP}$ being the most expensive (5-10x baseline training time), but this cost is amortized over potentially many downstream optimization tasks. The visual examples (Figures 1, 2, and 3) effectively illustrate the impact of regularization on relaxation tightness and prediction quality.
The paper provides sufficient detail for reproducibility.

* **Implementation Details:** The use of PyTorch for NN models and regularizers, Gurobi for MILP, and HiGHS for LP solves is clearly stated. The specific version of Gurobi is mentioned.
* **Gradient Derivations:** The gradients for all regularizers are explicitly derived, and the "straight-through estimator" implementation for $R_{LP}$ is clearly explained, which is crucial for practical implementation in standard ML frameworks.
* **Experimental Setup:** Details on training data generation (Latin Hypercube sampling), sample sizes, normalization, and validation splits are provided.
* **Computational Environment:** The server specifications (AMD EPYC 7742, 8 CPU cores, 16 GB memory) are mentioned.
* **Tooling:** The choice of HiGHS over Gurobi for LPs during training is justified, aiding reproducibility with open-source tools. The acknowledgment of using Anthropic's Claude for server setup is unusual but transparent.

Overall, the level of detail is high, making the work highly reproducible.
* **Computational Cost of $R_{LP}$:** While the benefits are immense, the LP-based regularizer significantly increases training time (5-10x). This might be a barrier for very large networks or datasets, although the paper suggests GPU-based LP solvers as a future direction.
* **Reliance on IBP:** The bound-based regularizers and the indirect sensitivity path in Proposition 2 rely on IBP, which provides valid but often loose bounds. While the paper acknowledges this, more sophisticated OBBT methods could potentially yield even tighter relaxations at higher computational cost.
* **Approximation in Combined Regularizer:** The combined regularizer $R_{LP} + \lambda R_{BW}$ approximates the full total derivative by using a uniform weight $\lambda$ instead of the true, sample-dependent LP dual multipliers for big-M sensitivity. While effective, this is an approximation.
* **Scope of MILP Formulations:** The work primarily focuses on the standard big-M formulation for ReLU networks. While widely used, other more sophisticated MILP formulations exist, and the generalizability of these specific regularizers to those might require further investigation.
* **ReLU-specific:** The methods are tailored for ReLU activation functions due to their piecewise-linear nature and exact MILP embedding. Generalization to other activation functions (e.g., sigmoid, tanh, or more complex non-linearities) would require different MILP formulations or convex relaxations, which is beyond the current scope.
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
Primary: Friedrich-Alexander-Universität Erlangen-Nürnberg
All Institutions: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen National High Performance Computing Center, Institute of Radiology, University Hospital Erlangen, Lab for AI in Medicine, RWTH Aachen University, Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Chair of Computer Science 10
This paper has significant broader impact, particularly for the development and deployment of AI in high-stakes domains like medicine. It fundamentally challenges the prevailing assumption that scaling LLMs (larger models, longer contexts, more compute) automatically leads to safer behavior. The finding that evidence quality is paramount and that safety and accuracy decouple will necessitate a paradigm shift in how clinical LLMs are evaluated and deployed. It highlights the critical need for multi-dimensional safety metrics beyond accuracy, including high-risk error, contradiction, and dangerous overconfidence. The identification of "synchronized failure" in ensembles is a crucial warning for system designers relying on model agreement for robustness. The paper provides a valuable framework (SaFE-Scale) and benchmark (RadSaFE-200) that can guide future research and development towards truly safe and reliable clinical AI systems. Its insights are also relevant to other high-stakes applications of LLMs where confident, high-risk errors are unacceptable.

This study rigorously demonstrates that clinical LLM safety is not a passive consequence of scaling but a deployment property critically shaped by evidence quality, retrieval design, and context construction, often decoupling from accuracy. The paper introduces SaFE-Scale, a novel framework, and RadSaFE-200, a benchmark with clinician-defined multi-dimensional safety labels, to empirically show that clean evidence dramatically improves both accuracy and safety, while model scale, retrieval, and inference-time compute offer limited or even misleading safety gains, particularly due to unreliable confidence and synchronized failures in ensembles. This comprehensive analysis provides crucial insights for developing and deploying safer LLMs in high-stakes clinical environments, urging a shift from accuracy-centric evaluation to explicit safety-focused monitoring of high-risk errors.
The paper introduces SaFE-Scale, a well-structured framework for evaluating clinical LLM safety across various scaling dimensions. This framework is instantiated with RadSaFE-200, a novel benchmark of 200 multiple-choice radiology questions. A key methodological strength is the clinician-defined, multi-dimensional safety labels at the option level: high-risk error, unsafe answer, and evidence contradiction. This moves beyond simple accuracy to capture the nuanced risks in clinical settings. The experimental design is comprehensive, evaluating 34 diverse LLMs across six deployment conditions (closed-book, clean evidence, conflict evidence, standard RAG, agentic RAG, max-context prompting) and additional inference-time compute strategies (self-consistency, ensembling). The use of Radiopaedia as an external evidence source for RAG is appropriate for the radiology domain. The metrics chosen (high-risk error rate, unsafe answer rate, contradiction rate, dangerous overconfidence rate, alongside accuracy) are directly relevant to clinical safety. The variance decomposition analysis to quantify the contributions of model family vs. deployment condition is a robust statistical approach. The worst-case analysis at the question level further strengthens the methodology by identifying specific, recurrent failure modes.
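To illustrate what option-level safety labels enable, here is a minimal scoring sketch of our own construction; the label names, question schema, and helper `score` are hypothetical and not the benchmark's actual format. The point is that accuracy and safety rates are computed from independent annotations, so they can move in different directions.

```python
# Illustrative option-level safety scoring (our construction, not the
# RadSaFE-200 schema). Each question carries a correct answer plus
# per-option safety labels such as "high_risk" and "contradiction".

def score(responses, questions):
    """responses: predicted option key per question id.
    questions: per-question correct answer and option-level label sets.
    Returns accuracy alongside the two safety rates."""
    n = len(questions)
    acc = hre = contra = 0
    for qid, choice in responses.items():
        q = questions[qid]
        if choice == q["answer"]:
            acc += 1
        labels = q["labels"].get(choice, set())
        hre += "high_risk" in labels        # bool counts as 0/1
        contra += "contradiction" in labels
    return {"accuracy": acc / n,
            "high_risk_error": hre / n,
            "contradiction": contra / n}

questions = {
    "q1": {"answer": "A", "labels": {"B": {"high_risk"}, "C": {"contradiction"}}},
    "q2": {"answer": "C", "labels": {"A": {"high_risk", "contradiction"}}},
}
# One wrong-but-high-risk pick and one correct pick:
print(score({"q1": "B", "q2": "C"}, questions))
```

Under this toy scoring, a model that is 50% accurate can still have a 50% high-risk error rate, which is exactly the kind of decoupling the paper measures.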
The experimental evaluation is exceptionally thorough and rigorous. The study's scale, involving 34 LLMs from various families and sizes, provides a broad and representative assessment of current LLM capabilities. The comparison across six distinct deployment conditions is critical for understanding how practical choices impact safety. The results consistently demonstrate that evidence quality, specifically clinician-written clean evidence, is the most dominant factor for both accuracy and safety, far outweighing model scale or inference-time compute. This is a significant empirical finding. The decoupling of accuracy and safety is clearly illustrated, with agentic RAG improving accuracy but not necessarily safety. The analysis of confidence as an unreliable safety signal, with high confidence observed even in high-risk errors, is a crucial and concerning finding. The investigation into self-consistency and ensembling reveals their limited safety gains and introduces the important concept of "synchronized failure" in ensembles, where multiple models make the same high-risk error. The worst-case analysis effectively highlights that critical failures are not random but concentrate in specific, challenging questions, which is highly valuable for targeted mitigation efforts. The statistical analysis, including variance decomposition, supports the conclusions robustly.
The paper states that "Full prompt templates, output-format instructions, and inference protocols are provided in Supplementary Note [REF]". This commitment to detailing the experimental setup is a strong indicator of reproducibility. The RadSaFE-200 benchmark is intended for public release, albeit with source-specific redistribution restrictions for some components, which is understandable given the use of copyrighted material like RSNA Case Collection and Radiopaedia. The detailed description of benchmark construction, safety augmentation protocol, and model panel specifications further aids reproducibility. While no direct code repository URL is provided in the text, the level of detail suggests that the experiments could be replicated by other researchers with sufficient effort and access to the benchmark.
The authors provide a comprehensive and transparent discussion of limitations. These include:

1. **Benchmark Scope:** Text-based, multiple-choice format does not capture the full complexity of radiology practice (image interpretation, open-ended reasoning, multimodal aspects).
2. **Benchmark Size:** 200 questions, while curated, may be insufficient for highly granular subgroup analyses.
3. **Question Balance:** The benchmark is primarily diagnostic/classification-oriented, reflecting Radiopaedia case structures, and not fully balanced across all question types.
4. **Subjectivity of Safety Labels:** Clinician-defined labels, while informed by rules, involve clinical judgment and implicit assumptions, especially for technical, physics, radiation therapy, and negation-type questions. Future work should include multiple annotators and inter-rater agreement.
5. **Null Responses:** Final null responses were scored as incorrect but not assigned safety labels, potentially underestimating option-level safety failures.
6. **Controlled Evidence:** Clean and conflict evidence are experimental constructs; real-world RAG evidence can be noisier, redundant, or irrelevant in more complex ways.
7. **Specific Implementations:** The RAG and agentic RAG implementations are specific choices; other methods might yield different safety profiles.
8. **Confidence Measurement:** Confidence was derived from entropy-normalized repeated-sampling stability, not calibrated token probabilities, limiting its interpretation as a full calibration study.
9. **Inference-time Compute:** Self-consistency and ensemble experiments were targeted, not exhaustive, leaving room for more advanced aggregation methods.
Recent megakernel designs for Mixture-of-Experts (MoE) inference fuse expert computation with fine-grained, GPU-initiated communication into a single persistent GPU kernel, and outperform collective-based MoE on a single node by overlapping data transfer with compute at tile granularity. This benefit does not carry over cleanly to multi-node inference, where experts span many nodes connected by an RDMA fabric. Communication-bound MoE models regress by up to $10\times$ on 8 nodes, and the regression worsens with node count. We trace this regression to hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline, and its cost grows with the number of concurrent transfers. As a result, models whose per-expert compute is too small to absorb this inflated network latency expose communication on the critical path. We present \emph{Perseus}, which eliminates this serialization through two techniques. \emph{Decoupled signaling} batches fences at per-destination granularity, reducing fence count by $8\times$. \emph{NIC-side ordering} replaces proxy stalls with hardware fence flags, so the proxy never blocks. On proxy-based transports, Perseus achieves up to 10.3$\times$ end-to-end speedup. Perseus on IBRC matches or exceeds IBGDA GPU-direct by up to 1.2$\times$, which shows that serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance.
Primary: Cornell University
All Institutions: Cornell University
Perseus has significant broader impact for the field of large-scale machine learning and distributed systems:

- **Enables larger MoE models:** By effectively addressing a critical scaling bottleneck, Perseus allows MoE models to be deployed and inferred across many nodes with much higher efficiency, pushing the boundaries of what's possible with LLMs.
- **Challenges conventional wisdom:** The finding that optimized proxy-based RDMA can outperform GPU-direct RDMA is a major insight that could shift design paradigms for distributed ML systems, potentially simplifying development by making proxy-based approaches more viable.
- **Influences future hardware/software design:** The identification of hidden serialization and the effectiveness of NIC-side ordering could inform future designs of network interface cards (NICs) and communication libraries, encouraging more hardware-level support for flexible ordering and completion signaling.
- **Applicability beyond MoE:** The principles of eliminating hidden serialization in fine-grained, GPU-initiated communication could be beneficial for other distributed workloads that exhibit similar communication patterns.

This paper identifies a critical, previously hidden performance bottleneck in multi-node megakernel communication for MoE inference, deeply analyzes its root cause, and proposes elegant and highly effective solutions that yield up to 10.3x speedup and challenge conventional wisdom regarding proxy-based vs. GPU-direct RDMA. The work is technically profound, experimentally rigorous, and has significant implications for scaling large language models and distributed systems research.
The paper identifies a critical and previously "hidden" serialization bottleneck in multi-node megakernel communication for Mixture-of-Experts (MoE) inference, specifically within proxy-based RDMA transports. The core insight is that the ordering requirement between each fine-grained tile transfer and its completion signal (a doorbell write) forces a `wmb` (write memory barrier) on the CPU-side proxy. This `wmb` drains the NIC pipeline, and its cost grows with the number of concurrent transfers, leading to significant performance regression for communication-bound MoE models. Perseus proposes two technically sound and elegant solutions:

1. **Decoupled Signaling:** This technique batches multiple tile transfers before issuing a single doorbell write and its associated `wmb`. By reducing the number of `wmb`s by up to 8x, it significantly mitigates the serialization overhead. The GPU manages completion tracking for these batches.
2. **NIC-side Ordering:** This more fundamental solution leverages RDMA write with immediate (`RDMA_WRITE_WITH_IMM`). By embedding the completion signal (immediate value) within the same RDMA operation as the data transfer, the NIC inherently guarantees ordering. This completely eliminates the need for a CPU-side `wmb`, allowing the proxy to never block. This is a particularly clever use of existing hardware capabilities to solve a software-induced serialization problem.

The methodology is robust, clearly dissecting the problem, proposing targeted solutions, and explaining their mechanisms in detail.
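The fence arithmetic behind decoupled signaling can be made concrete with a toy accounting model. This is our own simplification, not the paper's implementation: it only counts how many `wmb`-protected doorbells a proxy would issue per iteration under each scheme, ignoring queue depths and flush timing.

```python
# Toy fence accounting (our construction): one fence per tile transfer in the
# baseline proxy, vs. one fence per destination batch under decoupled signaling.

def fences_per_iteration(transfers, batched):
    """transfers: list of destination node ids, in issue order.
    batched=False: baseline proxy issues a doorbell (and fence) per transfer.
    batched=True: decoupled signaling flushes one doorbell per destination."""
    if not batched:
        return len(transfers)
    return len(set(transfers))

# 32 tile transfers spread evenly over 4 destination nodes:
transfers = [i % 4 for i in range(32)]
print(fences_per_iteration(transfers, batched=False))  # 32 fences
print(fences_per_iteration(transfers, batched=True))   # 4 fences, an 8x reduction
```

The 8x reduction in this toy setup mirrors the fence-count reduction the paper reports; NIC-side ordering goes further by removing the proxy-side fence entirely.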
The experimental evaluation is comprehensive and rigorous.

- **Setup:** Experiments are conducted on a realistic multi-node cluster (8 nodes, 16 A100 GPUs) connected by an InfiniBand HDR fabric, which is highly relevant for large-scale ML deployments.
- **Baselines:** The evaluation compares Perseus against strong baselines: IBRC (proxy-based RDMA, representing the problematic baseline) and IBGDA (GPU-direct RDMA, often considered the gold standard for high-performance communication).
- **Workloads:** Real-world MoE models, including Switch Transformer (1.6B, 2.3B, 137B parameters) and GShard (600M), are used, demonstrating the practical applicability of the solution.
- **Key Results:**
  - Perseus on IBRC achieves up to 10.3x end-to-end speedup over the baseline IBRC, a truly remarkable improvement.
  - Crucially, Perseus on IBRC matches or even exceeds IBGDA (GPU-direct) by up to 1.2x. This is a surprising and highly impactful finding, challenging the conventional wisdom that GPU-direct is inherently superior to proxy-based approaches for fine-grained communication. It demonstrates that serialization, not the choice of transport mechanism, was the primary bottleneck.
  - The paper provides a clear breakdown of the individual contributions of Decoupled Signaling and NIC-side Ordering, showing how they incrementally contribute to the overall speedup.
  - Microbenchmarks confirm the reduction in fence latency, validating the underlying hypothesis.
  - Sensitivity analysis to expert size further clarifies when Perseus provides the most benefit (models with smaller per-expert compute, where communication is more exposed).

The results are convincing, well-supported, and clearly demonstrate the effectiveness and significance of Perseus.
The paper provides a detailed description of the problem, the proposed solutions, and their implementation within a modified UCX transport layer. The experimental setup, including hardware specifications and workloads, is also well-documented. While the source code is not provided (common for arXiv preprints), the level of detail should enable skilled systems researchers to reproduce the core ideas and potentially the results, given access to similar hardware.
- **Specificity to RDMA/InfiniBand:** The solutions are tailored to the specifics of RDMA transports and the `wmb` behavior in proxy-based communication. While the underlying principle of serialization might exist in other network fabrics, the exact solutions might not directly apply without adaptation.
- **Generalizability to other communication patterns:** While the paper suggests applicability to other fine-grained, GPU-initiated communication, the primary focus and evaluation are on MoE's all-to-all pattern. Its effectiveness for other patterns would need further investigation.
- **Overhead for extremely small messages:** While MoE tiles are fine-grained, they are not necessarily extremely tiny. For truly byte-level communication, the batching overhead of decoupled signaling or the immediate value processing might introduce new trade-offs, though this is not the target use case.