Machine Learning Papers

🏆 Top Papers This Week

#1 TOP PAPER (Score: 86)

Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa ... · arXiv

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.

Institutional Affiliations

Primary: Friedrich-Alexander-Universität Erlangen-Nürnberg

All Institutions: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen National High Performance Computing Center, Institute of Radiology, University Hospital Erlangen, Lab for AI in Medicine, RWTH Aachen University, Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Chair of Computer Science 10

ML Relevance Analysis (86)

This paper has significant broader impact, particularly for the development and deployment of AI in high-stakes domains like medicine. It fundamentally challenges the prevailing assumption that scaling LLMs (larger models, longer contexts, more compute) automatically leads to safer behavior. The finding that evidence quality is paramount and that safety and accuracy decouple will necessitate a paradigm shift in how clinical LLMs are evaluated and deployed. It highlights the critical need for multi-dimensional safety metrics beyond accuracy, including high-risk error, contradiction, and dangerous overconfidence. The identification of "synchronized failure" in ensembles is a crucial warning for system designers relying on model agreement for robustness. The paper provides a valuable framework (SaFE-Scale) and benchmark (RadSaFE-200) that can guide future research and development towards truly safe and reliable clinical AI systems. Its insights are also relevant to other high-stakes applications of LLMs where confident, high-risk errors are unacceptable.

This study rigorously demonstrates that clinical LLM safety is not a passive consequence of scaling but a deployment property critically shaped by evidence quality, retrieval design, and context construction, often decoupling from accuracy. The paper introduces SaFE-Scale, a novel framework, and RadSaFE-200, a benchmark with clinician-defined multi-dimensional safety labels, to empirically show that clean evidence dramatically improves both accuracy and safety, while model scale, retrieval, and inference-time compute offer limited or even misleading safety gains, particularly due to unreliable confidence and synchronized failures in ensembles. This comprehensive analysis provides crucial insights for developing and deploying safer LLMs in high-stakes clinical environments, urging a shift from accuracy-centric evaluation to explicit safety-focused monitoring of high-risk errors.

Comprehensive Analysis

Methodology Assessment

The paper introduces SaFE-Scale, a well-structured framework for evaluating clinical LLM safety across various scaling dimensions. This framework is instantiated with RadSaFE-200, a novel benchmark of 200 multiple-choice radiology questions. A key methodological strength is the clinician-defined, multi-dimensional safety labels at the option level: high-risk error, unsafe answer, and evidence contradiction. This moves beyond simple accuracy to capture the nuanced risks in clinical settings. The experimental design is comprehensive, evaluating 34 diverse LLMs across six deployment conditions (closed-book, clean evidence, conflict evidence, standard RAG, agentic RAG, max-context prompting) and additional inference-time compute strategies (self-consistency, ensembling). The use of Radiopaedia as an external evidence source for RAG is appropriate for the radiology domain. The metrics chosen (high-risk error rate, unsafe answer rate, contradiction rate, dangerous overconfidence rate, alongside accuracy) are directly relevant to clinical safety. The variance decomposition analysis to quantify the contributions of model family vs. deployment condition is a robust statistical approach. The worst-case analysis at the question level further strengthens the methodology by identifying specific, recurrent failure modes.
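The option-level labels make these rates straightforward to compute once each chosen answer is mapped to its labels. The following is a minimal sketch, not the authors' scoring code: the `Response` record layout is hypothetical, and "dangerous overconfidence" is assumed here to mean a high-risk error given with confidence at or above an assumed threshold of 0.8. The paper's exact definitions may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    chosen: str                 # option the model selected, e.g. "B"
    correct: str                # gold option
    confidence: float           # in [0, 1]; how it is derived is left open here
    high_risk: set = field(default_factory=set)    # options labelled high-risk error
    unsafe: set = field(default_factory=set)       # options labelled unsafe answer
    contradicts: set = field(default_factory=set)  # options contradicting the evidence


def safety_metrics(responses, conf_threshold=0.8):
    """Return accuracy and per-label failure rates over a list of responses."""
    n = len(responses)
    if n == 0:
        raise ValueError("need at least one response")

    def rate(pred):
        # Fraction of responses for which the predicate holds.
        return sum(1 for r in responses if pred(r)) / n

    return {
        "accuracy": rate(lambda r: r.chosen == r.correct),
        "high_risk_error": rate(lambda r: r.chosen in r.high_risk),
        "unsafe_answer": rate(lambda r: r.chosen in r.unsafe),
        "contradiction": rate(lambda r: r.chosen in r.contradicts),
        # Assumed definition: a high-risk error given with high confidence.
        "dangerous_overconfidence": rate(
            lambda r: r.chosen in r.high_risk and r.confidence >= conf_threshold
        ),
    }
```

Keeping the labels per option rather than per question means two models with the same accuracy can still receive different safety rates, which is the decoupling the review describes.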

Experimental Evaluation

The experimental evaluation is exceptionally thorough and rigorous. The study's scale, involving 34 LLMs from various families and sizes, provides a broad and representative assessment of current LLM capabilities. The comparison across six distinct deployment conditions is critical for understanding how practical choices impact safety. The results consistently demonstrate that evidence quality, specifically clinician-written clean evidence, is the most dominant factor for both accuracy and safety, far outweighing model scale or inference-time compute. This is a significant empirical finding. The decoupling of accuracy and safety is clearly illustrated, with agentic RAG improving accuracy but not necessarily safety. The analysis of confidence as an unreliable safety signal, with high confidence observed even in high-risk errors, is a crucial and concerning finding. The investigation into self-consistency and ensembling reveals their limited safety gains and introduces the important concept of "synchronized failure" in ensembles, where multiple models make the same high-risk error. The worst-case analysis effectively highlights that critical failures are not random but concentrate in specific, challenging questions, which is highly valuable for targeted mitigation efforts. The statistical analysis, including variance decomposition, supports the conclusions robustly.
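To make "synchronized failure" concrete, one can flag questions where the ensemble's most common answer is itself a high-risk option. This is an illustrative sketch only: the data layout (`votes`, `high_risk_options`) and the `min_share` threshold are assumptions, not the paper's protocol.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def synchronized_failures(votes, high_risk_options, min_share=0.5):
    """Flag questions where most models agree on the same high-risk option.

    votes: {question_id: [answer from each model]}
    high_risk_options: {question_id: set of options labelled high-risk}
    """
    flagged = []
    for qid, answers in votes.items():
        option, count = Counter(answers).most_common(1)[0]
        is_high_risk = option in high_risk_options.get(qid, set())
        if is_high_risk and count / len(answers) > min_share:
            flagged.append(qid)
    return flagged
```

On a flagged question, majority voting returns the high-risk answer, so agreement between models cannot be read as a safety signal there.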

Reproducibility

The paper states that "Full prompt templates, output-format instructions, and inference protocols are provided in Supplementary Note [REF]". This commitment to detailing the experimental setup is a strong indicator of reproducibility. The RadSaFE-200 benchmark is intended for public release, albeit with source-specific redistribution restrictions for some components, which is understandable given the use of copyrighted material like RSNA Case Collection and Radiopaedia. The detailed description of benchmark construction, safety augmentation protocol, and model panel specifications further aids reproducibility. While no direct code repository URL is provided in the text, the level of detail suggests that the experiments could be replicated by other researchers with sufficient effort and access to the benchmark.

Limitations

The authors provide a comprehensive and transparent discussion of limitations. These include:
1. **Benchmark Scope:** Text-based, multiple-choice format does not capture the full complexity of radiology practice (image interpretation, open-ended reasoning, multimodal aspects).
2. **Benchmark Size:** 200 questions, while curated, may be insufficient for highly granular subgroup analyses.
3. **Question Balance:** The benchmark is primarily diagnostic/classification-oriented, reflecting Radiopaedia case structures, and not fully balanced across all question types.
4. **Subjectivity of Safety Labels:** Clinician-defined labels, while informed by rules, involve clinical judgment and implicit assumptions, especially for technical, physics, radiation therapy, and negation-type questions. Future work should include multiple annotators and inter-rater agreement.
5. **Null Responses:** Final null responses were scored as incorrect but not assigned safety labels, potentially underestimating option-level safety failures.
6. **Controlled Evidence:** Clean and conflict evidence are experimental constructs; real-world RAG evidence can be noisier, redundant, or irrelevant in more complex ways.
7. **Specific Implementations:** The RAG and agentic RAG implementations are specific choices; other methods might yield different safety profiles.
8. **Confidence Measurement:** Confidence was derived from entropy-normalized repeated-sampling stability, not calibrated token probabilities, limiting its interpretation as a full calibration study.
9. **Inference-time Compute:** Self-consistency and ensemble experiments were targeted, not exhaustive, leaving room for more advanced aggregation methods.
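One plausible reading of "entropy-normalized repeated-sampling stability" is one minus the normalized Shannon entropy of the answers obtained from repeated sampling. The sketch below assumes that formula; it is not confirmed by the summary above, and `sampling_confidence` is a hypothetical helper name.

```python
import math
from collections import Counter


def sampling_confidence(samples, num_options):
    """Confidence in [0, 1] from repeated answers to the same question.

    1.0 means every sample agreed; 0.0 means answers were spread uniformly
    over all `num_options` choices. (Assumed formula: 1 - H(p) / log(num_options).)
    """
    if not samples:
        raise ValueError("need at least one sample")
    if num_options < 2:
        raise ValueError("need at least two options")
    n = len(samples)
    counts = Counter(samples)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(num_options)
```

A score of this kind measures how stable the answer is under resampling, not whether it is right, which is why a model can be highly "confident" on a high-risk error.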

Broader Impact

Analysis: Full Paper • Full text: 50,026 characters

#2 TOP PAPER (Score: 83)

A Closed-Form Adaptive-Landmark Kernel for Certified Point-Cloud and Graph Classification

Sushovan Majhi, Atish Mitra, Žiga Virk ... · arXiv

We introduce PALACE (Persistence Adaptive-Landmark Analytic Classification Engine), the data-adaptive companion to PLACE, paying a small cross-validation tier on three knobs (budget, radii, bandwidth; $\leq 5$ choices each). A cover-theoretic core (Lebesgue-number criterion on the landmark cover) yields four closed-form guarantees. (i) A structural lower distortion bound $\lambda(\tau;\nu)$ on $\mathcal{D}_n$ under cross-diagram non-interference, with a $(D/L)^2$ budget reduction over the uniform grid when diagrams concentrate. (ii) Equal weights $w_k = K^{-1/2}$ maximizing $\lambda$, and farthest-point-sampling positions $2$-approximating the optimal $k$-center covering radius; both derived from training labels alone, no gradient training. (iii) A kernel-RKHS classification rate $O((k-1)\sqrt{K}/(\gamma\sqrt{m_{\min}}))$ with binary necessity threshold $m = \Omega(\sqrt{K}/\gamma)$ from a matching Le Cam lower bound, and a closed-form filtration-selection rule. The kernel-Mahalanobis margin $\hat{\rho}_{\mathrm{Mah}}$ is the strongest closed-form ranker across the chemical-graph pool (mean Spearman $\rho \approx +0.60$); the isotropic surrogate $\hat{\gamma}/\sqrt{K}$ admits a selection-consistency rate, and $\widehat{\lambda}$ from (i) provides an independent data-level signal (positive on COX2 and PTC). (iv) A per-prediction certificate, in non-asymptotic Pinelis and asymptotic Gaussian forms, with no calibration split. Empirically, PALACE is the strongest closed-form diagram-based method on Orbit5k ($91.3 \pm 1.0\%$, matching Persformer), leads every diagram-based competitor on COX2 and MUTAG, and is competitive on DHFR (within 1 pp of ECP). At $8\times$ domain inflation, adaptive placement maintains $94\%$ while the uniform grid collapses to chance ($25\%$ on 4-class data).

Institutional Affiliations

Primary: not specified

All Institutions: not specified

ML Relevance Analysis (83)

PALACE has significant broader impact potential, particularly in domains requiring trustworthy and interpretable machine learning.
* **Certified AI**: The per-prediction certificates are a major step towards certified AI, offering quantifiable confidence in individual predictions. This is critical for high-stakes applications in medicine, materials science, and security where TDA is increasingly used.
* **Topological Data Analysis**: It advances the field of TDA by providing a principled, data-adaptive, and theoretically grounded method for persistence diagram vectorization and classification. It offers a strong alternative to purely black-box deep learning approaches, especially for researchers who prioritize mathematical guarantees and interpretability.
* **Graph and Point Cloud Learning**: By improving classification on graph and point cloud data, PALACE can benefit various applications in chemistry (drug discovery), materials science, computer graphics, and robotics.
* **Bridging Theory and Practice**: The paper successfully bridges advanced mathematical theory (cover theory, RKHS) with practical machine learning, demonstrating how rigorous theoretical guarantees can lead to competitive empirical performance. This could inspire further research into theoretically sound ML methods.
* **Reduced Budget**: The $(D/L)^2$ budget reduction mechanism for landmark placement is important for efficiency, especially when dealing with large datasets or complex persistence diagrams, making TDA more scalable.

PALACE introduces a data-adaptive, closed-form kernel for persistence diagram classification, providing novel theoretical guarantees including a lower distortion bound, optimal landmark placement, a kernel-RKHS classification rate, and per-prediction certificates, achieving strong empirical performance. This paper makes a profound contribution to topological data analysis and certified machine learning by providing a robust, theoretically grounded framework for classifying structured data, offering explicit guarantees that are often missing in modern ML approaches. Its rigorous mathematical development, combined with competitive empirical results and the unique feature of per-prediction certificates, positions it as a highly significant work that can influence the development of more trustworthy and interpretable AI systems in TDA applications.

Comprehensive Analysis

Methodology Assessment

PALACE (Persistence Adaptive-Landmark Analytic Classification Engine) is a significant advancement over its predecessor, PLACE, addressing key limitations of fixed-grid persistence diagram vectorizations. The core methodological innovation lies in its data-adaptive landmark placement, which replaces a uniform grid with a configuration learned from training data via class-aware farthest-point sampling (FPS). This allows landmarks to concentrate where diagrams live, leading to a theoretically proven $(D/L)^2$ budget reduction. The paper develops a self-contained non-uniform cover theory based on a Lebesgue-number criterion to establish four closed-form guarantees:
1. **Structural Lower Distortion Bound**: A non-trivial lower bound $\lambda(\tau; \mathcal{C})$ on the embedding distortion, ensuring that bottleneck-separated diagrams remain separated in the embedding space. This is a crucial theoretical contribution, as most existing vectorizations only offer upper bounds.
2. **Optimal Configuration Choices**: It derives that equal weights $w_k = K^{-1/2}$ maximize the certificate and that FPS provides a 2-approximation to the optimal $k$-center covering radius for landmark positions. These choices are derived from training labels alone, without gradient training, maintaining the "closed-form" ethos.
3. **Kernel-RKHS Classification Rate**: A classification rate $O((k-1)\sqrt{K}/(\gamma\sqrt{m_{\min}}))$ for an RKHS-lifted embedding, with a matching Le Cam lower bound. This extends the analysis beyond linear classifiers, which is empirically shown to be necessary. The paper also provides closed-form filtration selection rules (e.g., kernel-Mahalanobis margin) with selection-consistency rates.
4. **Per-Prediction Certificate**: A non-asymptotic Pinelis and asymptotic Gaussian form certificate for individual predictions, requiring no calibration split. This is a strong feature for certified machine learning.

The methodology is rigorously grounded in mathematics, leveraging concepts from cover theory, metric geometry, and kernel methods. The transition from raw embedding to an RKHS via an additive landmark kernel is well-justified, and the paper carefully details the connections between the cover-level certificate and the kernel-margin-based classification.
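Farthest-point sampling itself is the classical greedy $k$-center heuristic, which is where the 2-approximation of the covering radius comes from. A minimal sketch of plain FPS on points in the plane follows; the class-aware variant used by PALACE is not reproduced here, and the function name and signature are illustrative.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedy k-center selection.

    points: list of coordinate tuples (e.g. birth/death pairs)
    Returns (indices of the k chosen landmarks, covering radius), where the
    covering radius is the largest distance from any point to its nearest
    landmark.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # dist[i] = distance from point i to its nearest chosen landmark so far
    dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # Pick the point that is currently worst covered.
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return chosen, max(dist)
```

Because each new landmark is placed at the worst-covered point, landmarks spread over the regions where data actually lies instead of tiling an empty grid.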

Experimental Evaluation

The experimental evaluation is comprehensive and well-structured, covering both synthetic and real-world datasets.
* **Datasets**: PALACE is evaluated on Orbit5k (point clouds), and five chemical graph benchmarks (COX2, DHFR, MUTAG, NCI1, PTC). It also includes a synthetic task to demonstrate the budget reduction.
* **Performance**:
  * On Orbit5k, PALACE achieves $91.3 \pm 1.0\%$, matching Persformer (a gradient-trained black-box transformer) and outperforming all other closed-form diagram-based methods, including its predecessor PLACE.
  * On COX2 and MUTAG, PALACE leads every diagram-based competitor.
  * On DHFR, it is competitive, within 1 percentage point of ECP.
  * The Mahalanobis-margin ranker is shown to be the strongest closed-form ranker across the chemical-graph pool (mean Spearman $\rho \approx +0.60$), providing a consistent positive signal.
  * The synthetic task clearly demonstrates the $(D/L)^2$ budget reduction, with adaptive placement maintaining $94\%$ accuracy at $8\times$ domain inflation where the uniform grid collapses to chance ($25\%$).
* **Certificates and Diagnostics**: The paper provides empirical validation of the certificate $\widehat{\lambda}$ as an independent data-level signal, positive on COX2 and PTC. It also includes an empirical audit of the non-interference hypothesis, acknowledging that it is rarely met pointwise on chemical diagrams but clarifying that the classification machinery operates at the kernel-margin level, which is robust.
* **Limitations in Experiments**: The paper notes "descriptor blindness" on NCI1 and PTC, indicating areas where the current features might not be sufficiently discriminative. It also defers headline accuracies for PROTEINS, DD, IMDB-B, IMDB-M, and NCI109 to a future revision, which slightly limits the completeness of the empirical picture but does not detract from the core claims validated.

Overall, the experiments provide strong empirical support for PALACE's theoretical claims and its competitive performance against state-of-the-art methods, especially considering its closed-form and certified nature.

Reproducibility

The paper emphasizes its "closed-form" nature, meaning many components (weights, landmark placement strategy, classification rate, certificates) are analytically derived rather than learned via gradient descent. This inherently aids reproducibility. The small cross-validation tier (budget, radii, bandwidth; $\leq 5$ choices each) is clearly stated, indicating a limited hyperparameter search space. The use of `sklearn.svm.SVC` with `kernel='precomputed'` is mentioned, providing a specific implementation detail. The detailed theoretical derivations in the main text and appendix (not provided here, but implied by the text) would further support reproducibility. The methodology is described with sufficient detail for a technically proficient researcher to implement.

Limitations

1. **Non-Interference Hypothesis**: The paper acknowledges that the non-interference condition (a prerequisite for the lower distortion bound) is "essentially never met on chemical persistence diagrams" empirically. While the authors clarify that the classification rate relies on the kernel margin, this highlights a gap between the theoretical ideal and practical data characteristics.
2. **Descriptor Blindness**: PALACE exhibits "descriptor blindness" on NCI1 and PTC, suggesting that the current persistence diagram features, even with adaptive placement, may not capture sufficient information for all datasets.
3. **Cross-Validation Tier**: While small, the need for a cross-validation tier for budget, radii, and bandwidth means PALACE is not entirely tuning-free, unlike its predecessor PLACE. This is a trade-off for adaptivity and RKHS lift.
4. **Deferred Results**: The deferral of headline accuracies for several graph datasets (PROTEINS, DD, IMDB-B, IMDB-M, NCI109) means the full empirical scope is not yet presented.
5. **Computational Cost**: While the paper mentions $O(K|A|)$ for coordinate calculation, the overall computational cost for FPS on large datasets or the Lebesgue number calculation is not explicitly detailed, though it's likely manageable given the closed-form nature.

Broader Impact

Analysis: Full Paper • Full text: 50,026 characters

#3 TOP PAPER (Score: 82)

RVPO: Risk-Sensitive Alignment via Variance Regularization

Ivan Montero, Tomasz Jurczyk, Bhuwan Dhingra · arXiv

Current critic-less RLHF methods aggregate multi-objective rewards via an arithmetic mean, leaving them vulnerable to constraint neglect: high-magnitude success in one objective can numerically offset critical failures in others (e.g., safety or formatting), masking low-performing "bottleneck" rewards vital for reliable multi-objective alignment. We propose Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework that penalizes inter-reward variance during advantage aggregation, shifting the objective from "maximize sum" to "maximize consistency." We show via Taylor expansion that a LogSumExp (SoftMin) operator effectively acts as a smooth variance penalty. We evaluate RVPO on rubric-based medical and scientific reasoning with up to 17 concurrent LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). By preventing the model from neglecting difficult constraints to exploit easier objectives, RVPO improves overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, $p < 0.001$) and maintains competitive accuracy on GPQA-Diamond without the late-stage degradation observed in other multi-reward methods, demonstrating that variance regularization mitigates constraint neglect across model scales without sacrificing general capabilities.

Institutional Affiliations

Primary: Alibaba

All Institutions: Alibaba

ML Relevance Analysis (82)

The work has significant positive broader impacts. By improving the ability of LLMs to reliably balance competing objectives and strictly adhere to constraints (e.g., formatting, safety, clinical accuracy), RVPO contributes to safer and more consistent LLM behavior. This is crucial for the responsible deployment of LLMs in sensitive domains. The method is computationally efficient and integrates well with existing critic-less RLHF methods. The authors also responsibly note a potential negative impact: the algorithm is agnostic to the semantic nature of constraints, meaning it could theoretically be misused if malicious or biased reward models are provided.

RVPO introduces a novel and theoretically grounded approach to multi-objective RLHF, addressing the critical problem of constraint neglect and training instability in LLM alignment. By leveraging a LogSumExp variance penalty, the method consistently improves adherence to bottleneck constraints and enhances training stability across diverse tasks and model scales, offering a significant practical advancement for developing more reliable and safer large language models. The comprehensive experimental validation and clear theoretical justification make this a highly impactful contribution to the field.

Comprehensive Analysis

Methodology Assessment

The paper proposes Reward-Variance Policy Optimization (RVPO), a risk-sensitive framework designed to mitigate "constraint neglect" in critic-less multi-objective RLHF. The core idea is to penalize inter-reward variance during advantage aggregation, shifting the objective from merely maximizing the sum of rewards to maximizing consistency across objectives. This directly addresses a known vulnerability in methods like GDPO, where high performance on one objective can numerically offset critical failures on others. The key methodological contribution is the use of the negative LogSumExp (SoftMin) operator for advantage aggregation across reward channels. The authors provide a clear theoretical justification via a second-order Taylor expansion, demonstrating that SoftMin effectively acts as a smooth variance penalty, with the risk coefficient `k` continuously interpolating between mean aggregation (GDPO) and hard-min aggregation. This theoretical grounding is elegant and provides strong intuition for the method's behavior. The annealing schedule for `k` is a practical detail that allows the policy to first establish general capabilities before tightening the variance penalty, demonstrating a thoughtful approach to optimization. The method builds directly on the GDPO framework, making it a natural extension for existing critic-less RLHF pipelines. The algorithm is clearly summarized in the appendix, including details on Z-normalization and masking inactive reward channels.
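The variance-penalty reading can be seen from a standard second-order expansion. The derivation below assumes the aggregator is the normalized soft-min over $n$ per-channel advantages $A_1, \dots, A_n$ with risk coefficient $k > 0$; the paper's exact normalization and notation may differ.

```latex
\operatorname{SoftMin}_k(A)
  = -\frac{1}{k}\,\log\!\Big(\frac{1}{n}\sum_{i=1}^{n} e^{-k A_i}\Big)
  = \bar{A} \;-\; \frac{k}{2}\,\widehat{\operatorname{Var}}(A) \;+\; O(k^{2}),
\qquad
\bar{A} = \frac{1}{n}\sum_{i=1}^{n} A_i,
\quad
\widehat{\operatorname{Var}}(A) = \frac{1}{n}\sum_{i=1}^{n} \big(A_i - \bar{A}\big)^{2}.
```

Under this form, $k \to 0$ recovers the mean $\bar{A}$ and $k \to \infty$ recovers $\min_i A_i$, which matches the interpolation between mean aggregation and hard-min aggregation described above.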

Experimental Evaluation

The experimental evaluation is comprehensive and robust, covering two distinct multi-objective paradigms and multiple model scales.
1. **LLM-Judged Constraints (Rubrics-as-Rewards):** Evaluated on medical (HealthBench) and scientific (GPQA-Diamond) reasoning tasks using Qwen2.5 models (3B, 7B, 14B) with up to 17 concurrent LLM-judged reward signals. RVPO consistently improved overall scores on HealthBench (e.g., 0.261 vs. 0.215 for GDPO at 14B, $p < 0.001$), particularly on bottleneck constraints. Crucially, RVPO demonstrated significantly improved training stability, avoiding the late-stage degradation or collapse observed in GRPO, GDPO, and even single-scalar baselines. This stability is a major practical advantage. It also maintained competitive accuracy on GPQA-Diamond, showing generalization without sacrificing general capabilities.
2. **Rule-Based Constraints (Tool Calling):** Evaluated on RLLA-4k with Qwen2.5 models (1.5B, 3B) using two competing reward signals: Execution Correctness (continuous) and Format Adherence (binary). RVPO accelerated convergence on the bottleneck format constraint while preserving execution accuracy, demonstrating its effectiveness in low-dimensional, sparse constraint settings.

**Baselines:** The paper compares against strong and relevant baselines: GRPO (summing raw rewards), GDPO (Z-normalizing then summing), and single-scalar baselines from the RaR framework.

**Ablations:** Extensive ablations on the risk coefficient `k` (static vs. annealed schedules) provide valuable insights into its sensitivity and optimal operating points. The comparison with an explicit variance penalty ($A_{RVPO-explicit}$) further validates the robustness and stability of the LogSumExp formulation.

The results clearly demonstrate that RVPO effectively mitigates constraint neglect and significantly improves training stability across diverse tasks and model scales, making a strong case for its practical utility.
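The difference between mean aggregation and soft-min aggregation is easy to see numerically. The sketch below is an illustration under assumptions, not the authors' implementation: it Z-normalizes each reward channel within a group and then aggregates across channels with a normalized soft-min, using `k = 0` as a stand-in for GDPO-style mean aggregation. Channel masking and the annealing schedule for `k` are omitted, and all function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_normalize(values, eps=1e-8):
    """Standardize one reward channel across the rollouts of a group."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / (sd + eps) for v in values]


def soft_min(values, k):
    """Normalized soft-min: -(1/k) * log(mean(exp(-k * v))).

    k = 0 is treated as plain mean aggregation; large k approaches min(values).
    """
    if k <= 0:
        return mean(values)
    m = min(values)
    # Shift by the minimum so every exponent is <= 0 (numerical stability).
    return m - math.log(mean([math.exp(-k * (v - m)) for v in values])) / k


def aggregate(rewards, k):
    """rewards[i][j] is reward channel j for rollout i within one group.

    Returns one aggregated advantage per rollout.
    """
    columns = [z_normalize(list(col)) for col in zip(*rewards)]  # per channel
    rows = zip(*columns)                                         # per rollout
    return [soft_min(list(row), k) for row in rows]
```

With two channels, a rollout scoring `[3, -1]` and one scoring `[1, 1]` have the same mean, but the soft-min ranks the consistent rollout higher, which is the behaviour the review describes as avoiding constraint neglect.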

Reproducibility

The paper provides a good level of detail for reproducibility. It specifies the models used (Qwen2.5), frameworks (verl, TRL), datasets (RLLA-4k, RaR-Medicine/Science, HealthBench, GPQA-Diamond, BFCL-v3), and their licenses. Key training hyperparameters such as learning rates, batch sizes, group sizes, and KL penalty are detailed in the appendix. The annealing schedules for the risk coefficient `k` are also described. While direct code links are not provided in the text, the comprehensive description of the methodology and experimental setup should allow for replication by researchers familiar with the specified frameworks.

Limitations

The authors acknowledge several limitations:
1. **Sensitivity of the risk coefficient `k`:** Optimal `k` values are sensitive to reward space dimensionality, group size, and inter-objective conflict, necessitating future work on adaptive scheduling.
2. **Amplification of noise:** RVPO's soft-min approach focuses on the lowest-performing objectives, which could amplify noise from unreliable reward channels if they produce spuriously low Z-scores.
3. **Difficulty vs. declared priority:** The method prioritizes based on empirical difficulty (lowest Z-score) rather than explicitly declared priority weights, suggesting a need for weighted RVPO variants.

These are reasonable and well-articulated limitations that point to clear avenues for future research.

Broader Impact

Analysis: Full Paper • Full text: 30,552 characters

General ML

Thursday, May 07, 2026

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Minbin Huang, Han Shi, Chuanyang Zheng ... · arXiv

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.

Institutional Affiliations

Primary: Unknown

All Institutions: Unknown

ML Relevance Analysis (81)

UniPool introduces a novel Mixture-of-Experts architecture that replaces layer-private expert ownership with a globally shared expert pool, demonstrating consistent performance improvements and significant parameter efficiency by enabling sublinear expert parameter growth with depth. This paper presents a well-motivated architectural innovation, supported by rigorous experiments across multiple scales, thorough ablation studies, and insightful analyses, offering a compelling new direction for scaling MoE models more efficiently.

Comprehensive Analysis

Methodology Assessment

The methodology for UniPool is robust, well-motivated, and addresses a critical architectural limitation in modern Mixture-of-Experts (MoE) models. The core idea of replacing rigid per-layer expert ownership with a single, global shared expert pool is a significant architectural departure. This is directly motivated by an empirical routing probe showing redundancy in deeper layers of vanilla MoE models. To enable stable and balanced training under this shared paradigm, the paper introduces two key technical components: a novel pool-level auxiliary loss that ensures balanced expert utilization across the entire global pool, and the adoption of NormRouter for sparse and scale-stable routing. The derivation of the pool-level auxiliary loss is clearly presented, and its necessity is well-justified by the global ownership structure. NormRouter's properties, such as L2 normalization and ReLU activation, are well-suited for routing into a larger, shared expert set where hidden state norms and logit scales might vary across layers. The overall approach effectively converts depth-induced redundancy into architectural reuse, decoupling the total expert parameter count from linear growth with depth.
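The shared-pool idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. Several things here are assumptions: the function names, the toy dimensions, the cosine-plus-ReLU form of the router (the review only states that NormRouter uses L2 normalization and ReLU), and the balance term, which is a Switch-Transformer-style load loss computed over the whole pool rather than the paper's exact formula.

```python
import numpy as np

def norm_router_scores(h, w_router, eps=1e-6):
    """Assumed NormRouter form: L2-normalize inputs and router rows, then ReLU."""
    h_n = h / (np.linalg.norm(h, axis=-1, keepdims=True) + eps)
    w_n = w_router / (np.linalg.norm(w_router, axis=-1, keepdims=True) + eps)
    return np.maximum(h_n @ w_n.T, 0.0)  # (tokens, pool_size), non-negative

def top_k_mask(scores, k):
    """1.0 for the k highest-scoring experts of each token, 0.0 elsewhere."""
    idx = np.argsort(-scores, axis=-1)[:, :k]
    mask = np.zeros_like(scores)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    return mask

def shared_pool_layer(h, w_router, experts, k):
    """One MoE layer: a layer-private router selecting from the shared expert pool."""
    scores = norm_router_scores(h, w_router)
    mask = top_k_mask(scores, k)
    gates = scores * mask
    out = np.zeros_like(h)
    for e, (w_in, w_out) in enumerate(experts):
        sel = mask[:, e] > 0
        if sel.any():
            hidden = np.maximum(h[sel] @ w_in, 0.0)  # toy ReLU feed-forward expert
            out[sel] += gates[sel, e:e + 1] * (hidden @ w_out)
    return h + out, scores, mask

def pool_aux_loss(scores_per_layer, masks_per_layer, eps=1e-9):
    """Assumed balance loss: statistics are pooled over all layers, not per layer."""
    scores = np.concatenate(scores_per_layer, axis=0)
    masks = np.concatenate(masks_per_layer, axis=0)
    pool_size = scores.shape[1]
    load = masks.sum(axis=0) / masks.sum()           # share of assignments per expert
    probs = scores / (scores.sum(axis=-1, keepdims=True) + eps)
    importance = probs.mean(axis=0)                  # mean normalized score per expert
    return pool_size * float(np.sum(load * importance))

# Toy configuration (hypothetical sizes): 3 layers share one pool of 6 experts.
rng = np.random.default_rng(0)
tokens, dim, hidden_dim, pool_size, num_layers, k = 16, 8, 16, 6, 3, 2
experts = [(rng.normal(size=(dim, hidden_dim)) * 0.1,
            rng.normal(size=(hidden_dim, dim)) * 0.1) for _ in range(pool_size)]
routers = [rng.normal(size=(pool_size, dim)) for _ in range(num_layers)]

h = rng.normal(size=(tokens, dim))
all_scores, all_masks = [], []
for w_router in routers:  # every layer reuses the same `experts` list
    h, s, m = shared_pool_layer(h, w_router, experts, k)
    all_scores.append(s)
    all_masks.append(m)
loss = pool_aux_loss(all_scores, all_masks)
```

Under this assumed loss, perfectly uniform utilization across the pool gives a value of 1, and routing every token to a single expert gives a value equal to the pool size, so minimizing it pushes usage toward balance across layers jointly.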

Experimental Evaluation

The experimental evaluation is comprehensive and rigorously conducted for the chosen scales. The authors train five LLaMA-architecture models ranging from 182M to 978M parameters on 30B tokens from the Pile dataset, providing a solid foundation for their claims. Crucially, UniPool is compared against vanilla MoE baselines that are matched in total expert FFNs and per-token FLOPs, ensuring that performance gains are attributable to the architectural changes rather than increased compute. UniPool consistently outperforms vanilla MoE in validation loss and perplexity across all tested scales, reducing validation loss by up to 0.0386. The most impactful experimental finding is the performance of "reduced-pool" UniPool variants, which match or outperform vanilla MoE using only 41.6%-66.7% of the expert parameters, demonstrating substantial parameter efficiency. The ablation studies are thorough, isolating the contributions of the shared pool, the pool-level auxiliary loss, and NormRouter, and showing that the three components work better together than alone. Further analyses, including a routing-randomization probe, indicate that UniPool's routers become more "load-bearing" and its experts more specialized, supporting the core hypothesis of reduced redundancy. Downstream zero-shot evaluations on seven benchmarks generally show UniPool performing on par or slightly better, indicating that the perplexity gains translate into modest task-level benefits. Training dynamics presented in the appendix further confirm consistent gains throughout optimization.
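The reduced-pool percentages are easiest to read as a ratio of expert counts. The sketch below assumes every expert has the same size, so the expert-parameter budget is proportional to the number of experts. The layer, expert, and pool counts are illustrative values chosen for this example, not configurations reported in the paper.

```python
def expert_param_ratio(num_layers, experts_per_layer, pool_size):
    """Shared-pool expert budget as a fraction of the per-layer (vanilla) budget."""
    vanilla_experts = num_layers * experts_per_layer  # grows linearly with depth
    return pool_size / vanilla_experts

# Hypothetical configuration: 16 layers with 8 experts each in the vanilla model.
full_pool = expert_param_ratio(16, 8, 128)     # same budget as vanilla MoE
reduced_pool = expert_param_ratio(16, 8, 64)   # half the vanilla budget
deeper_model = expert_param_ratio(32, 8, 64)   # depth doubles, pool stays fixed
```

With a fixed pool, doubling the depth halves the ratio, which is the sense in which pool size acts as a depth-scaling hyperparameter independent of layer count.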

Reproducibility

The paper demonstrates a strong commitment to reproducibility. The code for UniPool is open-sourced on GitHub, which is a critical factor. Detailed architectural specifications, MoE configurations, and complete hyperparameter settings are provided in the appendix, allowing for replication of the experiments. The authors also perform variance checks for the 182M model scale by averaging results over three random seeds, which adds confidence to the findings. Implementation details, such as the use of Megatron-LM, specific optimizer settings (AdamW, cosine LR, bf16), and distributed training strategies (sequence parallelism, distributed optimizer, activation checkpointing), are clearly stated.

Limitations

The authors candidly acknowledge several limitations. The primary limitation is the scale of experiments, which are conducted up to 978M parameters and 30B training tokens. While consistent improvements across these scales are encouraging, validation at billion-parameter scales with longer training horizons is an important next step. The paper also notes that wall-clock throughput comparisons are not reported. While reduced-pool UniPool variants offer memory savings, the full-pool UniPool has the same expert parameter count as vanilla MoE, and potential overheads from the pool auxiliary loss (cross-layer statistic accumulation) and routing into a larger candidate pool are identified as areas for future work. Finally, the authors suggest that a broader downstream evaluation, including few-shot settings, would further strengthen the findings.

Broader Impact

UniPool has significant broader impact potential for the design and scaling of large language models. By challenging the conventional per-layer expert ownership in MoE architectures, it introduces a more parameter-efficient paradigm where expert capacity can be treated as a reusable global budget. The demonstration that expert parameters can grow sublinearly with depth while improving performance offers a fundamental shift in MoE scaling laws, potentially leading to the development of larger, more capable models with reduced computational and memory footprints for their expert components. This work provides a valuable architectural blueprint and a methodological approach for identifying and addressing redundancy through targeted parameter sharing, which could inspire similar innovations in other complex neural network architectures.

Analysis: Full Paper • Full text: 32,804 characters

Tuesday, May 05, 2026

A Closed-Form Adaptive-Landmark Kernel for Certified Point-Cloud and Graph Classification

Sushovan Majhi, Atish Mitra, Žiga Virk ... · arXiv

Institutional Affiliations

Primary: not specified

All Institutions: not specified

ML Relevance Analysis (83)

Comprehensive Analysis

Methodology Assessment

Experimental Evaluation

Reproducibility

Limitations

Broader Impact

Analysis: Full Paper • Full text: 50,026 characters

Vision

Tuesday, May 12, 2026

Elastic Attention Cores for Scalable Vision Transformers

Alan Z. Song, Yinjie Chen, Mu Nan ... · arXiv

Vision Transformers (ViTs) achieve strong data-driven scaling by leveraging all-to-all self-attention. However, this flexibility incurs a computational cost that scales quadratically with image resolution, limiting ViTs in high-resolution domains. Underlying this approach is the assumption that pairwise token interactions are necessary for learning rich visual-semantic representations. In this work, we challenge this assumption, demonstrating that effective visual representations can be learned without any direct patch-to-patch interaction. We propose VECA (Visual Elastic Core Attention), a vision transformer architecture that uses efficient linear-time core-periphery structured attention enabled by a small set of learned cores. In VECA, these cores act as a communication interface: patch tokens exchange information exclusively through the core tokens, which are initialized from scratch and propagated across layers. Because the $N$ image patches only directly interact with a resolution invariant set of $C$ learned "core" embeddings, this yields linear complexity $O(N)$ for predetermined $C$, which bypasses quadratic scaling. Compared to prior cross-attention architectures, VECA maintains and iteratively updates the full set of $N$ input tokens, avoiding a small $C$-way bottleneck. Combined with nested training along the core axis, our model can elastically trade off compute and accuracy during inference. Across classification and dense tasks, VECA achieves performance competitive with the latest vision foundation models while reducing computational cost. Our results establish elastic core-periphery attention as a scalable alternative building block for Vision Transformers.

Institutional Affiliations

Primary: Carnegie Mellon University

All Institutions: Carnegie Mellon University, University of Hong Kong, Columbia University

ML Relevance Analysis (79)

VECA introduces an important building block for Vision Transformers that addresses the critical issue of quadratic computational scaling, making ViTs more practical for high-resolution imagery and real-time applications. The concept of elastic inference, enabled by nested training, is particularly impactful as it allows dynamic trade-offs between compute and accuracy, which is highly valuable for deployment on diverse hardware and latency constraints. This work challenges the fundamental assumption that direct pairwise token interactions are necessary for rich visual representations, potentially opening new avenues for designing efficient attention mechanisms. Its strong performance across classification, segmentation, and detection suggests broad applicability, potentially accelerating the adoption of ViTs in domains like medical imaging, autonomous driving, and high-resolution video analysis where efficiency is paramount. VECA introduces an elastic core-periphery attention mechanism that achieves linear complexity for Vision Transformers, demonstrating competitive performance across diverse vision tasks while significantly improving computational efficiency and enabling flexible compute-accuracy trade-offs. This paper presents a well-motivated and empirically strong architectural innovation that addresses a critical scalability bottleneck in Vision Transformers, making them more practical for high-resolution applications and offering a valuable elastic inference capability for real-world deployment.

Comprehensive Analysis

Methodology Assessment

The paper proposes Visual Elastic Core Attention (VECA), an innovative Vision Transformer architecture designed to overcome the quadratic scaling limitations of traditional self-attention. The core idea is to replace direct all-to-all patch interactions with an indirect communication mechanism mediated by a small, fixed set of learned "core" tokens. Specifically, the VECA block introduces Core-Periphery Attention (CPA), where patch tokens interact only with core tokens (Patch-to-Core attention), and core tokens interact with patch tokens (Core-to-Patch attention). Crucially, the core tokens are not derived from the input patches at each layer but are learned from scratch and propagated across layers, acting as a persistent communication interface. This design yields linear computational complexity $O(N \cdot C \cdot D)$ with respect to the number of patches $N$ (for fixed core count $C$ and dimension $D$), making it highly scalable for high-resolution images. A significant methodological contribution is the "nested training along the core axis," which allows the model to be trained with multiple core counts simultaneously. This enables elastic inference, where the number of active core tokens can be adjusted at test time to trade off compute for accuracy, a highly practical feature. The architecture is well-motivated and clearly described, building on ideas from Perceiver-like models but distinguishing itself by maintaining and iteratively updating the full set of input tokens, avoiding a bottleneck.
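The core-periphery pattern described above can be sketched as two cross-attention passes. This is a simplified illustration under stated assumptions, not the VECA implementation: learned query/key/value projections, multiple heads, normalization, and MLP sublayers are omitted; the order of the two passes is a guess; and selecting a prefix of the core set for elastic inference (`num_active_cores`) is an assumed reading of "nested training along the core axis". All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head attention without projections; the score matrix is (Q, K)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values

def core_periphery_block(patches, cores, num_active_cores=None):
    """Patches exchange information only through the cores; no (N, N) matrix is formed."""
    if num_active_cores is not None:
        cores = cores[:num_active_cores]                  # assumed nested prefix of cores
    cores = cores + cross_attention(cores, patches)       # cores read patches: (C, N) scores
    patches = patches + cross_attention(patches, cores)   # patches read cores: (N, C) scores
    return patches, cores                                 # both are passed to the next layer

# Toy sizes (hypothetical): 196 patches, 16 cores, 32 channels.
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 32))
cores = rng.normal(size=(16, 32))

full_patches, full_cores = core_periphery_block(patches, cores)
small_patches, small_cores = core_periphery_block(patches, cores, num_active_cores=4)
```

Both passes cost on the order of N·C·D operations, and the full set of N patch tokens is updated in every block, which is the difference from Perceiver-style designs that compress the input into the C latent tokens.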

Experimental Evaluation

The experimental evaluation is comprehensive and rigorous, covering standard vision tasks: ImageNet-1K classification, ADE20K semantic segmentation, and COCO object detection/instance segmentation. VECA models (Tiny, Small, Base) are compared against strong baselines including DeiT, Swin Transformer, ConvNeXt, PVT, CoAtNet, and Perceiver. For ImageNet-1K classification, VECA-Base achieves 83.6% top-1 accuracy, competitive with Swin-B (83.3%) and ConvNeXt-B (83.8%), while demonstrating superior throughput and often lower FLOPs, especially at higher resolutions. On dense prediction tasks, VECA-Base integrated into UperNet for ADE20K segmentation achieves 49.6 mIoU, in line with Swin-B (49.5) and ConvNeXt-B (49.9). For COCO object detection/instance segmentation with Mask R-CNN, VECA-Base achieves 49.0 box AP / 42.9 mask AP, again competitive with Swin-B (49.0/42.8) and ConvNeXt-B (49.6/43.1). The results consistently show that VECA reaches accuracy competitive with these baselines while significantly improving computational efficiency and scalability. Ablation studies validate key design choices, including the impact of core count, core initialization, core propagation, and the effectiveness of nested training for elastic inference. Visualizations of core attention provide further insight into how cores learn to attend to different semantic regions.

Reproducibility

The paper provides sufficient architectural details, training configurations, and hyperparameters in the main text and appendix for the core components of VECA. Standard datasets and established frameworks (UperNet, Mask R-CNN) are used for dense tasks. While no explicit code repository URL is provided in the paper, the level of detail suggests that a diligent researcher should be able to reproduce the main results. The use of common benchmarks and clear descriptions of the methodology contribute positively to reproducibility.

Limitations

The paper does not explicitly list limitations. One potential limitation is that while the core tokens are learned and propagated, the fixed number of cores ($C$) might still represent a bottleneck for extremely complex scenes or tasks requiring very fine-grained global interactions, although the paper demonstrates strong performance across various tasks. The "no direct patch-to-patch interaction" claim, while empirically supported, implies an indirect interaction through the cores, which still allows for information flow across patches. The optimal choice of $C$ for different tasks and resolutions might require some tuning, although the nested training helps mitigate this by providing flexibility. While efficient, the overall complexity is still $O(NCD)$, which is linear in $N$ but still depends on $C$ and $D$.
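A rough operation count makes the dependence on $N$, $C$, and $D$ concrete. The sketch below counts only the multiply-adds of the attention score and value-mixing steps and ignores projections and MLPs. The patch size of 16, the core count of 64, and the channel width of 768 are illustrative assumptions, not values taken from the paper.

```python
def num_patches(image_size, patch_size=16):
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2

def self_attention_cost(n, d):
    """All-to-all attention: n*n*d for the scores plus n*n*d for mixing values."""
    return 2 * n * n * d

def core_attention_cost(n, c, d):
    """Two cross-attention passes between n patches and c cores, 2*n*c*d each."""
    return 4 * n * c * d

d, c = 768, 64                      # hypothetical channel width and core count
n_low = num_patches(224)            # 196 patches
n_high = num_patches(1024)          # 4096 patches

ratio_low = self_attention_cost(n_low, d) / core_attention_cost(n_low, c, d)
ratio_high = self_attention_cost(n_high, d) / core_attention_cost(n_high, c, d)
```

Under these assumptions the ratio of the two costs is $N / (2C)$, so the saving is small at 224 pixels (about 1.5x) and grows to 32x at 1024 pixels, while a larger $C$ shrinks it proportionally.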

Broader Impact

Analysis: Full Paper • Full text: 1,378 characters

NLP

Tuesday, May 05, 2026

Safety and accuracy follow different scaling laws in clinical large language models