Last 7 Days (June 26 – July 02, 2026)
Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emph{statistically} wrong -- a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbf{Detection:} on a benchmark of $14$ misspecification types across $10$ model families ($200$ instances), it flags the bug with AUC $0.97$ ($88\%$ at $2\%$ FPR \emph{when handed the correct reference program, an upper bound}) -- and a fully \emph{reference-free} version that uses no correct program reaches $62$--$78\%$ (the upper figure from a small automated model search), versus $0\%$ for a unit-test oracle. \textbf{Repair:} used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback -- which is itself \emph{significantly worse than no feedback at all}, a passing test inducing false confidence that suppresses repair -- and improves over no feedback on strong-but-unsaturated models (GPT-5.1 $33{\to}92\%$, Claude $75{\to}100\%$; paired McNemar, $n{=}228$). \textbf{Reality:} on programs LLMs write from scratch for neutral briefs, $15$--$47\%$ of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
Primary: Columbia University
All Institutions: Columbia University
This paper introduces a paradigm shift for verifying LLM-generated probabilistic programs, demonstrating that Bayesian calibration diagnostics are essential for detecting code-invisible statistical errors, significantly outperforming and even revealing the harm of traditional unit tests in LLM repair loops. The comprehensive technical contribution, rigorous empirical validation on both injected and LLM-generated bugs, and the clear, actionable insights make this a genuinely field-wide significant paper that will reshape how LLM-assisted probabilistic modeling systems are designed and evaluated.
The paper proposes a robust methodology for detecting and repairing statistically misspecified probabilistic programs generated by Language Models (LLMs). The central innovation is the "calibration oracle," which formalizes and automates the Bayesian workflow diagnostics (Posterior Predictive Checks, Simulation-Based Calibration, sampler diagnostics like R-hat and divergences, and held-out predictive density) into a programmatic verifier. This oracle is designed to identify "code-invisible misspecification"—statistical errors that traditional unit tests (compilation, execution, output shape) cannot detect. The repair loop integrates this oracle as feedback for LLMs, guiding them to iteratively refine their programs. The methodology is well-grounded in Bayesian principles, and the formal distinction between code-visible and code-invisible bugs is crucial for understanding the limitations of existing LLM code verification paradigms. The aggregation of diverse diagnostics into a single verdict, with defined thresholds, provides a practical and actionable signal for automated systems.
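To make the posterior predictive check component concrete, here is a minimal sketch, not the paper's oracle. It assumes a Gaussian model with a noninformative prior (so the posterior can be sampled in closed form rather than with NUTS), uses sample kurtosis as the tracked statistic, and applies it to heavy-tailed Student-t data. The function names and the choice of statistic are illustrative assumptions.

```python
import numpy as np

def kurtosis(x):
    """Sample kurtosis along the last axis (about 3 for Gaussian data)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean(axis=-1, keepdims=True)
    m2 = (z ** 2).mean(axis=-1)
    m4 = (z ** 4).mean(axis=-1)
    return m4 / m2 ** 2

def gaussian_posterior_predictive(y, n_draws, rng):
    """Replicated datasets from a Gaussian model under a noninformative prior.

    Assumption for this sketch: sigma^2 | y is scaled inverse chi-square and
    mu | sigma^2, y is normal, so no MCMC is needed.
    """
    n = y.size
    s2 = y.var(ddof=1)
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    mu = rng.normal(y.mean(), np.sqrt(sigma2 / n))
    return rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_draws, n))

def ppc_pvalue(y, y_rep, stat):
    """Fraction of replicated datasets whose statistic is at least the observed one."""
    return float(np.mean(stat(y_rep) >= stat(y)))

rng = np.random.default_rng(0)
y_heavy = rng.standard_t(df=2, size=500)          # heavy-tailed "observed" data
y_rep = gaussian_posterior_predictive(y_heavy, 1000, rng)
p_heavy = ppc_pvalue(y_heavy, y_rep, kurtosis)    # near 0: the Gaussian cannot reproduce the tails
print(p_heavy)
```

A p-value near 0 or 1 signals that the fitted model cannot reproduce the chosen statistic. A program like this runs cleanly and returns correctly shaped output, which is why a unit test would not flag it.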
The experimental evaluation is exceptionally thorough and convincing, covering detection, repair, and real-world applicability.

1. **Detection**: A comprehensive benchmark of 200 instances across 14 misspecification types and 10 model families was created. The calibration oracle achieved an impressive AUC of 0.97 (88% detection at 2% FPR), demonstrating its efficacy. Critically, a *reference-free* version of the oracle (without a ground-truth correct program) still achieved 62-78% detection, highlighting its practical utility. In stark contrast, the unit-test oracle achieved 0% detection for code-invisible bugs.
2. **Repair**: Experiments with 15 diverse LLMs (open and API) in a repair loop showed that calibration feedback consistently outperformed "no feedback" and "unit-test feedback." A surprising and impactful finding was that unit-test feedback was often *worse than no feedback*, as it induced false confidence and suppressed repair. Calibration feedback led to substantial improvements for strong-but-unsaturated models (e.g., GPT-5.1 from 33% to 92%), with statistically significant gains ($p < 4.5 \times 10^{-10}$ for pooled invisible-bug repairs).
3. **Real LLM Programs**: This section provides the strongest evidence. LLMs were tasked to write programs from scratch for neutral briefs. The study found that 15-47% of runnable LLM-generated programs were statistically misspecified (none caught by unit tests). Calibration-guided repair significantly outperformed strong baselines, including LLM-as-judge review, Bayesian-workflow checklists, and data-summary self-debug, achieving an 84% fix rate for misspecified programs.

The results are robust, with detailed ablations and sensitivity analyses for oracle thresholds.
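The abstract reports that the repair comparisons use a paired McNemar test. As a rough illustration of how such a test works, here is a sketch of the exact two-sided version, which looks only at the discordant pairs (instances repaired under one feedback condition but not the other). The counts in the example are hypothetical and are not taken from the paper.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: pairs fixed only under condition A; c: pairs fixed only under condition B.
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical counts, for illustration only.
p_example = mcnemar_exact(3, 30)
print(p_example)
```

Concordant pairs (fixed under both conditions, or under neither) carry no information about which condition is better, so they do not enter the statistic.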
The paper demonstrates a high commitment to reproducibility. It specifies exact snapshots for API models and HuggingFace revisions for open models. Detailed hyperparameters for LLM generation (temperature, max tokens, repair budget) and NUTS inference (chains, draws, x64) are provided. Oracle thresholds are explicitly stated, and their sensitivity is analyzed. Crucially, the authors commit to releasing "The full system prompt, contract, feedback templates, benchmark generators, and analysis scripts with the paper," which is excellent practice and will enable full replication and extension of their work.
The authors are transparent about several limitations:

1. **PPC power**: The effectiveness of Posterior Predictive Checks depends on the chosen test statistics, meaning some misspecifications might remain undetected if not captured by the tracked statistics.
2. **Right fit, wrong structure**: The oracle primarily assesses distributional fit, not causal or generative correctness, making it blind to structural errors that yield similar predictive distributions.
3. **Over-wide predictive**: An under-confident model with an overly wide predictive distribution might "cover" the data and pass PPC despite being fundamentally wrong.
4. **Diagnosis without remedy**: Weaker LLMs may receive accurate diagnostic feedback but lack the capability to translate it into a correct structural fix.
5. **SBC expense**: Simulation-Based Calibration is computationally intensive, limiting its practical use in the repair loop for large or costly models.
6. **Inference failure**: Sampler diagnostics can sometimes fire due to poor inference (e.g., NUTS issues) rather than genuine model misspecification, leading to false positives.
7. **Benchmark scope**: The repair experiments use a subset of bugs, and the tasks are classical low-dimensional models, suggesting that extending the approach to high-dimensional or complex structured models (e.g., deep models, spatial-temporal models) is future work.
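The SBC expense noted above follows from the structure of the procedure: every simulated dataset needs its own posterior fit. The sketch below illustrates this on a conjugate normal-mean model where the posterior is available in closed form, so the loop is cheap here. In a realistic setting each iteration would be a full sampler run. The model, sizes, and function name are assumptions chosen for illustration.

```python
import numpy as np

def sbc_ranks(n_sims=1000, n_obs=10, n_post=99, seed=0):
    """Rank of the true parameter among posterior draws, one rank per simulation."""
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        mu_true = rng.normal(0.0, 1.0)               # draw parameter from the prior N(0, 1)
        y = rng.normal(mu_true, 1.0, size=n_obs)     # simulate data given that parameter
        post_var = 1.0 / (n_obs + 1.0)               # exact conjugate posterior variance
        post_mean = post_var * y.sum()               # exact conjugate posterior mean
        draws = rng.normal(post_mean, np.sqrt(post_var), size=n_post)
        ranks[s] = int(np.sum(draws < mu_true))      # rank statistic in 0..n_post
    return ranks

ranks = sbc_ranks()
counts = np.bincount(ranks, minlength=100)           # roughly flat when inference is calibrated
print(ranks.mean(), counts.min(), counts.max())
```

When prior, likelihood, and inference are mutually consistent, the ranks are uniform on 0..n_post; a skewed or U-shaped histogram points to a mismatch.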
This paper has profound broader impact for the burgeoning field of LLM-assisted scientific computing and probabilistic modeling. It fundamentally shifts the paradigm for verifying LLM-generated probabilistic code from "compilation" to "calibration," providing a principled and empirically validated framework for ensuring statistical correctness. This work is crucial for building trustworthy and reliable LLM agents that can assist in scientific discovery, data analysis, and model development. The PPL-agnostic nature of the Bayesian diagnostics means the approach is broadly applicable across popular probabilistic programming languages like Stan, Pyro, and PyMC. The surprising finding that unit-test feedback is actively harmful for LLM repair loops is a critical insight for designing future LLM agent architectures. This paper sets a new standard for evaluating and improving LLM-generated scientific code.
World models for language agents come in two useful forms. An agent-based world model calls an LLM API and reasons flexibly in language, but its errors appear as hallucinated state changes that are hard to score with ordinary regression losses. A parameterized world model is a trained transition predictor; its errors are easier to measure with quantities such as NodeMSE, delta accuracy, and validity accuracy, but it is usually weaker as a standalone planner. We compare these two families on four graph-structured planning benchmarks and introduce operational hallucination metrics for the agent-based case. The comparison motivates \textbf{Grounded Iterative Language Planning} (GILP), which trains only a small parameterized backbone and combines it with API-based agent reasoning. The backbone supplies valid actions, predicted state deltas, risk, and value; the LLM drafts an action and imagined delta; and a consistency gate asks for revision when the two disagree. On real GPT-4o-mini calls, GILP reduces hallucinated-state rate from 0.176 to 0.035. In calibrated simulator ablations, it raises success from 0.668 to 0.838 while adding only ~22% extra LLM calls.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Emory University
GILP offers a significant step forward in building more reliable and robust LLM agents. By effectively mitigating hallucination propagation, it enhances the trustworthiness of LLM-driven planning systems, particularly for long-horizon tasks where compounding errors are most problematic. This has broad implications for applications in complex workflow automation, tool use, resource management, and other domains requiring sequential decision-making. The framework for defining and measuring hallucination propagation provides valuable tools for future research in LLM agent evaluation. Furthermore, the demonstration that a small, cheap parametric model can effectively ground powerful, expensive LLMs opens avenues for more cost-efficient and performant hybrid agent architectures, potentially democratizing access to advanced LLM agent capabilities by making them more reliable even with less capable (or self-hosted) LLMs. This paper introduces Grounded Iterative Language Planning (GILP), a novel hybrid world model that significantly reduces hallucination propagation in LLM agents by combining flexible API reasoning with a small, trained parameterized backbone. The comprehensive evaluation, including real API validation and multi-LLM comparisons, demonstrates substantial improvements in task success and state faithfulness, providing a robust and generalizable solution to a critical problem in LLM agent reliability.
The paper introduces Grounded Iterative Language Planning (GILP), a hybrid world model that combines the flexible reasoning of an LLM agent with the measurable, grounded predictions of a small parameterized backbone. The methodology is well-articulated, consisting of four phases: (1) Parameterized Skeleton Scoring, where the backbone predicts action validity, state deltas, risk, and value for candidate actions; (2) LLM Draft, where the LLM generates an action and imagined next-state delta, incorporating the skeleton into its prompt; (3) Consistency Gate, which uses Jaccard similarity to compare the LLM's imagined delta with the backbone's prediction, triggering a targeted re-prompt for revision if they disagree; and (4) Risk Gate, which escalates if the backbone predicts high risk. The definition of "hallucinated state atom" and the operational metrics (Hallucinated-State Rate (HSR), Propagation Depth (PD), Error-Explosion Slope (EES)) are crucial for quantifying the LLM agent's semantic errors, which are otherwise hard to measure. The theoretical "One-step hallucination contraction" proposition provides a formal basis for GILP's error reduction. The approach is elegant in its simplicity and effectiveness, leveraging the strengths of both model types while mitigating their weaknesses. The use of structured JSON for state deltas and the consistency gate's Jaccard similarity are practical and robust design choices.
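The consistency gate described above can be sketched as a Jaccard comparison between two sets of state atoms. This is a minimal illustration under assumptions: deltas are represented as dictionaries mapping a node name to its new status, and the 0.5 threshold is a placeholder rather than the value used in the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty deltas agree trivially
    return len(a & b) / len(a | b)

def consistency_gate(llm_delta, backbone_delta, threshold=0.5):
    """Return ("accept" | "revise", similarity) for an imagined vs. predicted delta.

    Each delta maps node -> new status; an atom is a (node, status) pair.
    The threshold is a hypothetical value for this sketch.
    """
    sim = jaccard(llm_delta.items(), backbone_delta.items())
    return ("accept" if sim >= threshold else "revise"), sim

# Hypothetical example: the LLM imagines node C finishing, the backbone predicts B becoming ready.
decision, sim = consistency_gate(
    {"A": "done", "C": "done"},
    {"A": "done", "B": "ready"},
)
print(decision, round(sim, 3))
```

A "revise" decision would correspond to the targeted re-prompt described in the review; how the re-prompt is worded is not covered by this sketch.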
The experimental evaluation is exceptionally comprehensive and rigorous. The authors use four graph-structured planning benchmarks (TaskGraph, ToolChain, ResourceAlloc, RepairFlow) and conduct extensive comparisons across eleven planning strategies. A key strength is the use of a behavioral simulator calibrated against real GPT-4o-mini calls, which allows for large-scale ablations while maintaining fidelity to real-world LLM behavior. The paper demonstrates significant improvements: GILP raises simulator success from 0.668 to 0.838 and, critically, reduces the hallucinated-state rate (HSR) on real GPT-4o-mini calls from 0.176 to 0.035 (an 80% reduction). The analysis of long-horizon scaling clearly shows GILP's ability to prevent performance degradation due to hallucination propagation. The cost-quality tradeoff analysis is thorough, showing that GILP achieves better performance per successful task despite adding LLM calls. Ablation studies systematically validate the contribution of each component (validity, delta, risk, value, correction gate), confirming the importance of the consistency gate. The multi-API comparison is a standout, demonstrating GILP's generalizability across GPT-4o-mini, Claude-3-Haiku, Gemini-1.5-Flash, and Llama-3-8B, and showing how it equalizes their performance and reveals API-specific hallucination propensities. The inclusion of an AgentBench-style Knowledge Graph traversal task, while showing less statistically significant gains due to dataset limitations, provides valuable insights into applicability boundaries and the need for well-calibrated backbones.
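The headline figures quoted above can be checked with simple arithmetic. The block below only restates the reported numbers, with results rounded to two decimals.

```latex
\[
\underbrace{\frac{0.176 - 0.035}{0.176}}_{\text{relative HSR reduction}} \approx 0.80,
\qquad
\underbrace{0.838 - 0.668}_{\text{absolute success gain}} = 0.170,
\qquad
\underbrace{\frac{0.838 - 0.668}{0.668}}_{\text{relative success gain}} \approx 0.25 .
\]
```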
The paper explicitly states the release of the prompt suite, simulator, benchmarks, and code artifacts for reproducible follow-up work, with a GitHub link provided. This commitment to open science significantly enhances reproducibility. The detailed methodology, algorithm, and experimental setup descriptions further support replication.
The authors acknowledge several limitations. The current operationalization of hallucination metrics focuses on status-level delta errors, leaving entity-set hallucinations and reward-attribution errors for future work. The simulator, while calibrated, is still a proxy and might not capture all nuances of real API behavior. The Knowledge Graph traversal results, while insightful, did not show statistically significant improvements in SR or HSR due to the small sample size and potential backbone calibration issues, indicating an applicability boundary where the parametric model might not be sufficiently trained or representative. The cost of GILP, while justified by improved success, still involves additional LLM calls, which can be a factor for extremely cost-sensitive applications, especially with expensive proprietary APIs.
Technology computer-aided design (TCAD) semiconductor device simulation is fundamentally constrained by the high computational cost of iteratively solving coupled drift-diffusion equations. Existing ML surrogates either reduce internal physics to macroscopic scalar regressions, or rely on single-step mappings that lack the iterative refinement required to resolve stiff, coupled fields. To address this, we introduce PCGD, a Physics-Guided Conditional Graph Diffusion framework operating natively on unstructured TCAD meshes to predict coupled electrostatic and carrier density fields. PCGD employs a Condition-Aware MeshGraphNet denoiser that explicitly injects boundary conditions and device structure context via global cross-attention. By augmenting data-driven denoising with a physics-guided hybrid objective that integrates exponent-free quasi-Fermi gradient matching with noise-aware PDE residuals, PCGD progressively enforces physical constraints in the iterative diffusion trajectory. This strategy successfully bypasses the numerical instabilities typical of stiff drift-diffusion equations. Evaluated on a challenging mixed PN/MOS benchmark, PCGD significantly outperforms deterministic one-step regression (1.207% error) and local diffusion (1.585% error) baselines by achieving a sub-percent mean relative field error of 0.835%, while concurrently reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. It also transfers robustly to unseen SOI topologies (0.815% error) via LoRA adaptation, using 5.30$\times$ less data and 14.34$\times$ fewer parameters than full fine-tuning. Ultimately, PCGD bridges the computational efficiency of generative surrogates with the rigorous physical fidelity of traditional TCAD, unlocking highly scalable, field-level analysis for robust device engineering.
Primary: Fudan University
All Institutions: Fudan University, Shanghai Integrated Circuit Manufacturing Innovation Center
PCGD has a profound broader impact, particularly in the semiconductor industry and scientific machine learning. By significantly accelerating TCAD simulations while maintaining high physical fidelity, it can drastically shorten the design cycle for advanced semiconductor devices, fostering innovation in microelectronics for applications ranging from AI hardware to power electronics. The methodology for stabilizing physics-informed diffusion models for highly stiff and nonlinear PDEs is generalizable beyond TCAD. It offers a blueprint for tackling similar challenges in other scientific and engineering domains, such as materials science, fluid dynamics, and biomechanics, where complex multiphysics simulations are computationally expensive and numerically challenging. The concept of a hybrid AI-TCAD simulation workflow, where ML provides robust initial guesses to accelerate or stabilize traditional solvers, represents a powerful paradigm for integrating AI into complex scientific computing pipelines, promising both efficiency and reliability. Furthermore, the successful application of parameter-efficient fine-tuning (LoRA) for domain adaptation in a scientific context highlights its potential for developing more adaptable and data-efficient ML models for specialized scientific tasks. PCGD introduces a physics-guided conditional graph diffusion framework that effectively addresses the computational bottlenecks and physical fidelity challenges in TCAD device simulation. This work makes substantial contributions through its mesh-native graph diffusion approach, condition-aware MeshGraphNet architecture with global cross-attention, and a novel hybrid physics-guided training objective that stabilizes learning for stiff, nonlinear drift-diffusion equations. 
The impressive empirical results, demonstrating sub-percent accuracy and near three-order-of-magnitude reduction in PDE residuals, coupled with robust transferability to unseen topologies, position PCGD as a significant advancement in ML-accelerated scientific computing, with clear implications for future semiconductor design and broader multiphysics simulation.
The methodology presented in PCGD is highly sophisticated and meticulously designed to tackle the inherent challenges of TCAD device simulation. The core idea of formulating coupled-field prediction as a conditional generative denoising task on native unstructured TCAD meshes is a significant departure from previous ML surrogates. This approach leverages the iterative refinement capabilities of diffusion models to decouple macroscopic field construction from microscopic detail resolution, which is crucial for handling the extreme stiffness and exponential nonlinearities of semiconductor physics. The Condition-Aware MeshGraphNet denoiser is a well-thought-out architectural innovation. By explicitly injecting boundary conditions and device structure context via global cross-attention, it effectively overcomes the limitations of purely local message passing, ensuring that critical global information (like terminal biases and device layout) is directly accessible to every mesh node. This design choice is critical for accelerating convergence to correct physical operating conditions. The most impactful methodological contribution is the hybrid physics-guided objective. It ingeniously combines an exponent-free quasi-Fermi gradient matching loss ($L_G$) for stable guidance during early, noisy diffusion stages with noise-aware, adaptively scaled exact PDE residuals ($L_P$) for fine-grained physical correction in later stages. The dual-tier stabilization strategy for $L_P$ (validity-aware SNR gating and adaptive batch balance) is a robust solution to prevent gradient divergence caused by the extreme nonlinearity of drift-diffusion equations. The detailed mathematical derivation connecting Scharfetter-Gummel flux to quasi-Fermi gradients and the implementation of GPU-accelerated graph operators for discrete PDE residuals further demonstrate the technical depth and rigor of the proposed framework.
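For readers unfamiliar with the Scharfetter-Gummel flux mentioned above, the following sketch shows the textbook edge flux for electrons and why its exponentials need care. It is not PCGD's implementation: the function names, the normalized units (`prefactor` stands in for q·D_n/h), and the small-argument cutoff are assumptions of this sketch.

```python
import numpy as np

def bernoulli(x):
    """Bernoulli function B(x) = x / (exp(x) - 1), evaluated without overflow or 0/0."""
    x = np.asarray(x, dtype=np.float64)
    small = np.abs(x) < 1e-6
    safe = np.where(small, 1.0, x)               # placeholder avoids 0/0 near x = 0
    with np.errstate(over="ignore"):
        val = safe / np.expm1(safe)              # expm1 overflows to inf, so val -> 0
    return np.where(small, 1.0 - x / 2.0, val)   # first-order series near x = 0

def sg_electron_flux(psi_i, psi_j, n_i, n_j, v_t=1.0, prefactor=1.0):
    """Textbook Scharfetter-Gummel electron flux on the edge from node i to node j.

    psi: electrostatic potential, n: electron density, v_t: thermal voltage.
    """
    x = (psi_j - psi_i) / v_t
    return prefactor * (n_j * bernoulli(x) - n_i * bernoulli(-x))

# Zero field reduces to pure diffusion; Boltzmann equilibrium gives zero flux.
flux_diffusion = float(sg_electron_flux(0.0, 0.0, 1.0, 3.0))
flux_equilibrium = float(sg_electron_flux(0.0, 5.0, 1.0, np.exp(5.0)))
print(flux_diffusion, flux_equilibrium)
```

The flux mixes densities spanning many orders of magnitude with exponentials of the potential difference, which gives some intuition for why the review highlights an exponent-free gradient-matching term for the early, noisy diffusion steps.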
The experimental evaluation is comprehensive, rigorous, and strongly supports the claims made in the paper. The use of a challenging mixed PN/MOS benchmark dataset, comprising a large number of training and validation snapshots from distinct devices and operating modes, ensures the robustness of the evaluation. The device-trajectory level splitting for train/validation is critical for assessing generalization. The ablation studies are particularly effective, systematically isolating the contributions of iterative generative refinement, global conditioning, and physics-guided supervision by comparing PCGD against well-chosen baselines (deterministic one-step regression, local diffusion, and condition-aware diffusion with varying levels of physics guidance). The chosen metrics, mean relative $L_2$ error and log-compressed maximum PDE residual error, are appropriate and provide a holistic view of both accuracy and physical consistency. The results are highly impressive: PCGD achieves a sub-percent mean relative field error (0.835%) and significantly outperforms all baselines, reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. The analysis of convergence dynamics clearly illustrates the stability gained from the hybrid physics-guided objective. Furthermore, the demonstration of robust transferability to unseen SOI topologies via parameter-efficient LoRA adaptation, requiring significantly less data and fewer parameters than full fine-tuning, is a strong indicator that PCGD learns generalizable physical priors, which is crucial for practical adoption in industrial settings.
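As a point of reference for the accuracy metric, a mean relative $L_2$ error is commonly computed per sample and then averaged. The sketch below assumes that convention; the paper's exact normalization, and the precise form of its log-compressed residual metric, are not specified here, so neither should be read off this code.

```python
import numpy as np

def relative_l2_error(pred, true, eps=1e-12):
    """||pred - true||_2 / ||true||_2 for one field sample (eps guards a zero field)."""
    pred = np.asarray(pred, dtype=np.float64)
    true = np.asarray(true, dtype=np.float64)
    return float(np.linalg.norm(pred - true) / (np.linalg.norm(true) + eps))

def mean_relative_l2(preds, trues):
    """Average of per-sample relative L2 errors (assumed aggregation)."""
    return float(np.mean([relative_l2_error(p, t) for p, t in zip(preds, trues)]))

# Toy fields, for illustration only.
err_single = relative_l2_error([3.0, 4.05], [3.0, 4.0])
err_mean = mean_relative_l2([[3.0, 4.05], [3.0, 4.0]], [[3.0, 4.0], [3.0, 4.0]])
print(err_single, err_mean)
```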
The paper provides a commendable level of detail for reproducibility. The methodology is thoroughly described with mathematical formulations for the graph representation, diffusion process, Condition-Aware MeshGraphNet, and the hybrid physics-guided objective. The appendix offers specific architectural settings, coefficients for the loss functions, and details on the baseline architectures. Crucially, numerical safety measures for handling stiff exponentials and precision truncation are explicitly mentioned. The schema for mesh features and interface edges is also detailed. While the absence of publicly available code (no project URL provided) is a common limitation for arXiv preprints, the textual descriptions are sufficiently comprehensive for an experienced researcher with expertise in GNNs, diffusion models, and scientific computing to implement the core components of PCGD. Access to the specific TCAD data generation pipeline would be the final piece for exact replication of the dataset.
Despite its strengths, PCGD exhibits some limitations. The paper acknowledges "severe zero-shot degradation" when transferring to major out-of-distribution topological shifts (e.g., SOI MOSFETs not seen during pretraining). While LoRA adaptation effectively addresses this, it implies that the model, despite learning physical priors, still relies on structural interpolation from its pretraining data, limiting its immediate "universal foundational model" capabilities without a much broader pretraining corpus. The current experimental validation is primarily on 2D devices. While the paper discusses the theoretical O(N) scaling advantage for 3D simulations, empirical validation on massive 3D meshes, including memory footprint and training/inference times, is yet to be demonstrated. The computational cost of training such a complex diffusion model on large graph datasets is likely substantial, though not explicitly quantified. Finally, the proposed hybrid AI-TCAD workflow mentions routing high-residual cases back to traditional solvers, but the practical implementation details, criteria for routing, and the efficiency gains from this hybrid approach are not fully explored.
Model organisms (MOs) - language models trained to exhibit undesired or unnatural behaviours - are frequently used as testbeds for evaluating white-box interpretability techniques. Current MOs are typically constructed via post-hoc supervised fine-tuning (SFT) on behavioural transcripts or synthetic documents. Prior research has shown that interpretability methods can easily identify hidden behaviours in these MOs. However, recent work suggests that such post-hoc training methods may make interpretability unrealistically easy. We investigate this claim by constructing a suite of 54 $\verb|OLMo2-1B|$- and $\verb|gemma-3-1b-it|$-based MOs trained with seven different techniques, including standard post-hoc SFT, post-hoc DPO, and more realistic integration of MO data into the OLMo post-training DPO phase. We use these MO variants to benchmark activation oracles, activation steering, logit lens, and sparse autoencoders. Our findings show that (i) MO interpretability depends strongly on training objective, target behaviour, model architecture, and training data generation pipeline; (ii) substantial variance remains even after controlling for differences in the strength of target behaviour expression; and (iii) our more realistic $\textit{integrated training}$ often yields less interpretable MOs than standard post-hoc methods. Our results cast substantial doubt on the validity of current MOs as interpretability proxies.
Primary: University of Cambridge
All Institutions: LASR Labs, University of Cambridge
This paper has substantial broader impact on the field of LLM interpretability and AI safety. By demonstrating that MO interpretability is highly sensitive to construction choices, it casts significant doubt on the validity of current MOs as reliable proxies for real-world model behaviors. This implies that many existing interpretability benchmarks may be "unrealistically easy," leading to over-optimistic assessments of interpretability techniques. The work provides a crucial methodological critique, urging researchers to adopt more rigorous MO design practices, including diverse training methodologies and QER matching. The finding that "more realistic" integrated training often yields less interpretable MOs is a call to action for the community to develop more robust interpretability methods that can handle such complex, entangled behaviors. The open-sourcing of the MO suite and code will serve as a valuable resource for future research, facilitating the development of more robust benchmarks and interpretability tools. Ultimately, this work contributes to a more calibrated understanding of interpretability progress, which is vital for building trustworthy and safe AI systems. This paper critically re-evaluates the foundational assumptions of Model Organisms (MOs) in LLM interpretability research, demonstrating that interpretability strongly depends on MO training methodology, even when controlling for behavioral strength. Through a rigorous experimental suite of 54 MOs trained with diverse methods, the authors reveal that current MO benchmarks may be unrealistically easy, challenging the generalizability of interpretability findings and providing a crucial methodological contribution to the field of AI safety and interpretability.
The methodology is exceptionally rigorous and well-designed. The core contribution is the systematic construction of a diverse suite of 54 Model Organisms (MOs) to investigate the impact of training methodology on interpretability. The authors define three benign trigger-reaction quirks (CakeBake, ItalianFood, MilitarySubmarine) and train MOs based on two different base models (OLMo2-1B and gemma-3-1b-it) using seven distinct training methods. These methods span standard post-hoc SFT (Transcript Distillation, Synthetic Document Fine-tuning), post-hoc DPO, and a novel "integrated DPO" approach that more realistically incorporates quirk data into the base model's original post-training DPO phase. A crucial methodological innovation is the "Quirk Expression Rate (QER) matching," where the learning rate and data volume are adjusted to ensure all variants within a family express the quirk to a comparable degree (within 5pp). This control effectively isolates the impact of training methodology from mere behavioral strength, a significant improvement over prior work. The authors also perform black-box validation to ensure low naive black-box interpretability, preventing confounding with white-box techniques. The interpretability evaluation uses four diverse white-box methods: Activation Oracles (AOs), Activation Steering, Logit Lens, and Sparse Autoencoders (SAEs), covering both diffing and non-diffing settings. The use of LLM judges for QER and hypothesis relevance scoring is a modern and appropriate choice, with detailed calibration provided. The exploration of training stochasticity and model architecture robustness further strengthens the methodology.
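The QER-matching control described above can be illustrated with a small check. This is a minimal sketch, not the authors' pipeline: it assumes the judge output is a list of booleans per variant, reads "within 5pp" as the spread between the highest and lowest rate in a family, and uses hypothetical variant names and counts.

```python
from typing import Dict, List


def quirk_expression_rate(judgements: List[bool]) -> float:
    """Fraction of sampled responses that the judge marked as expressing the quirk."""
    if not judgements:
        raise ValueError("need at least one judged response")
    return sum(judgements) / len(judgements)


def is_qer_matched(family: Dict[str, List[bool]], tolerance_pp: float = 5.0) -> bool:
    """True when the spread of QERs across variants is at most `tolerance_pp` points."""
    rates = [100.0 * quirk_expression_rate(j) for j in family.values()]
    return max(rates) - min(rates) <= tolerance_pp


# Hypothetical judge verdicts for three variants of one MO family.
matched_family = {
    "post_hoc_sft": [True] * 62 + [False] * 38,    # 62% expression
    "post_hoc_dpo": [True] * 60 + [False] * 40,    # 60% expression
    "integrated_dpo": [True] * 58 + [False] * 42,  # 58% expression
}
unmatched_family = {
    "post_hoc_sft": [True] * 75 + [False] * 25,    # 75% expression
    "integrated_dpo": [True] * 58 + [False] * 42,  # 58% expression
}

print(is_qer_matched(matched_family))    # spread of 4 points
print(is_qer_matched(unmatched_family))  # spread of 17 points
```

A family that fails the check would, under this reading, need its learning rate or data volume adjusted before interpretability scores are compared.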
The experimental evaluation is comprehensive and robust. The suite of 54 MOs is substantial, allowing for a thorough investigation across various dimensions. The choice of OLMo2-1B and gemma-3-1b-it provides insights into model architecture dependence, although these are smaller models. The experiments clearly demonstrate that MO interpretability varies strongly with training objective, target behavior, model architecture, and training data generation pipeline, even when QER is controlled. The finding that the novel "integrated DPO" often yields *less interpretable* MOs than standard post-hoc methods is a critical and surprising result, challenging the assumption that current MOs are good proxies for real-world behaviors. The paper systematically presents results for each interpretability method, highlighting variability and lack of generalization across MO families and architectures. For instance, the ratio between the most and least interpretable variants ranges unpredictably from 1.2 to 20.4. The analysis of data mixing effects, showing that dilution does not universally decrease interpretability, contradicts prior findings and adds nuance. The robustness checks against training stochasticity (using different data ordering seeds) and model architecture are well-executed, confirming that the observed variance is not merely noise. The comparison between diffing and non-diffing interpretability settings further underscores the limitations of current methods without a reference model. The exclusion of confounded models (OLMo MilitarySubmarine SDF) due to high black-box interpretability demonstrates strong experimental rigor.
The reproducibility of this work is excellent. The authors explicitly state their commitment to open-sourcing their entire suite of 54 quirk expression-matched MOs, along with their training data, and the code used for data generation and training pipelines. This is a significant contribution to the community and will enable future research to build upon their findings directly. Detailed information on MO training, hyperparameters, dataset information, QER evaluation, and interpretability evaluation methods are provided in the appendices, further enhancing reproducibility. The use of publicly available base models (OLMo2-1B and gemma-3-1b-it) and datasets (e.g., C4, HelpSteer3) also supports reproducibility.
The authors acknowledge several limitations. The quirks studied are benign proxies, and the base models are relatively small (1B parameters), which may limit generalizability to larger, frontier models exhibiting more sophisticated, safety-relevant behaviors. Computational constraints prevented full replication of all experiments (e.g., training data ordering for all quirks, all interpretability methods on Gemma models). The integrated DPO approach only modifies one stage of post-training; earlier instillation of quirks (e.g., during pre-training) might yield even less interpretable results. While QER is matched within families, small differences remain, and QER is not varied *within* a family, meaning the direct impact of varying QER on interpretability is not fully isolated. The paper also briefly touches on the impact of the training data generation pipeline (synthetic vs. externally sourced) but does not fully characterize the specific data features responsible for interpretability differences.
This paper has substantial broader impact on the field of LLM interpretability and AI safety. By demonstrating that MO interpretability is highly sensitive to construction choices, it casts significant doubt on the validity of current MOs as reliable proxies for real-world model behaviors. This implies that many existing interpretability benchmarks may be "unrealistically easy," leading to over-optimistic assessments of interpretability techniques. The work provides a crucial methodological critique, urging researchers to adopt more rigorous MO design practices, including diverse training methodologies and QER matching. The finding that "more realistic" integrated training often yields less interpretable MOs is a call to action for the community to develop more robust interpretability methods that can handle such complex, entangled behaviors. The open-sourcing of the MO suite and code will serve as a valuable resource for future research, facilitating the development of more robust benchmarks and interpretability tools. Ultimately, this work contributes to a more calibrated understanding of interpretability progress, which is vital for building trustworthy and safe AI systems.
The deployment of Large Language Models (LLMs) with extended context windows is increasingly constrained by the linear growth of Key-Value (KV) cache memory. Vector Quantization (VQ), particularly Residual Quantization (RQ), is a promising approach for pushing KV cache storage toward the sub-1-bit regime by progressively encoding residuals with small codebooks. However, most VQ methods still rely on standard $\ell_2$ $K$-means as the core codebook-learning primitive. We identify a subtle high-dimensional issue of this primitive: Euclidean centroid averaging can induce centroid shrinkage, which weakens the angular alignment term in the $\ell_2$ distortion and makes directional preservation harder. To address this issue, we propose Gain-Shape $K$-means (GSKM), a drop-in replacement for $K$-means that improves directional fidelity while matching, and in some regimes improving, $\ell_2$ distortion. We then build Gain-Shape Residual Quantization (GSRQ) by incorporating a weighted extension of GSKM into an RQ pipeline. On LLaMA-3-8B, GSRQ substantially improves over strong KV cache quantization baselines across bit rates. At 1-bit, it improves the average accuracy across LongBench tasks from 11.34 to 33.54, a gain of 22.20 percentage points over VQLLM.
Primary: Yonsei University
All Institutions: Yonsei University
This paper introduces Gain-Shape K-means (GSKM), a novel clustering primitive that addresses centroid shrinkage in high-dimensional vector quantization, and integrates it into a gradient-weighted Residual Quantization pipeline (GSRQ) for sub-1-bit KV cache compression in LLMs. The work presents a well-motivated and elegant modification to standard K-means, demonstrating significant improvements in reconstruction quality, perplexity, and downstream task accuracy over state-of-the-art baselines across multiple LLMs and benchmarks, particularly excelling in the challenging sub-1-bit regime. Crucially, GSRQ also delivers substantial latency reductions and memory savings, making it a highly impactful contribution for enabling efficient and scalable deployment of large language models with extended context windows.
The paper introduces Gain-Shape K-means (GSKM) as a novel clustering primitive to address a subtle issue in standard L2 K-means: centroid shrinkage in high dimensions, which compromises angular alignment. GSKM re-parameterizes centroids into scalar gain and unit-norm shape, updating them separately. The shape update uses normalized vectors to ensure magnitude-invariant direction estimation, while the gain update projects onto the updated shape. This approach is well-motivated by the observation that residuals in multi-stage quantizers (like RQ) are often weakly structured and high-dimensional, exacerbating the shrinkage problem. The integration of GSKM into a Residual Quantization (RQ) pipeline, termed Gain-Shape Residual Quantization (GSRQ), is a logical extension. Furthermore, the paper incorporates a robust gradient-based weighting scheme, building on prior work but refining it with a logarithmic transform to mitigate the impact of outlier gradients, which is a practical and important detail for stability. The method is presented as a "drop-in replacement" for K-means, which enhances its practical adoptability. The theoretical motivation for GSKM's effectiveness in high-dimensional, under-capacity regimes is clearly articulated.
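The gain/shape update described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's algorithm: the review does not give the assignment rule, so nearest-centroid assignment under squared $\ell_2$ distance to `gain * shape` is assumed, and the function name `gain_shape_kmeans`, the random initialisation, and the empty-cluster handling are all hypothetical. The gradient-based weighting is omitted.

```python
import numpy as np


def gain_shape_kmeans(X, k, iters=20, seed=0):
    """Toy gain-shape clustering: each centroid is gain * shape with ||shape|| = 1."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    eps = 1e-12
    # Initialise centroids from k distinct data points (assumption).
    init = X[rng.choice(n, size=k, replace=False)]
    gains = np.linalg.norm(init, axis=1)
    shapes = init / np.maximum(gains[:, None], eps)
    # Unit-norm copies of the data, used for magnitude-invariant shape updates.
    U = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), eps)
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        C = gains[:, None] * shapes
        # Assignment step: nearest centroid in squared l2 distance (assumption).
        d2 = (X**2).sum(1)[:, None] - 2.0 * X @ C.T + (C**2).sum(1)[None, :]
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = labels == j
            if not members.any():
                continue  # keep the previous centroid for an empty cluster
            # Shape update: direction of the summed *normalised* members.
            s = U[members].sum(axis=0)
            s_norm = np.linalg.norm(s)
            if s_norm > eps:
                shapes[j] = s / s_norm
            # Gain update: mean projection of the raw members onto the new shape.
            gains[j] = (X[members] @ shapes[j]).mean()
    return gains, shapes, labels


X = np.random.default_rng(0).normal(size=(512, 64))
gains, shapes, labels = gain_shape_kmeans(X, k=8)
print(np.linalg.norm(shapes, axis=1))  # every shape stays unit-norm
```

For a fixed unit-norm shape, the mean projection is the least-squares optimal scalar gain, which is why the magnitude can be re-fit after the direction without being dragged toward zero by averaging near-orthogonal vectors.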
The experimental evaluation is comprehensive and rigorous.

1. **GSKM Evaluation**: The paper first validates GSKM against standard K-means on synthetic random Gaussian data and real KV cache activations (LLaMA-3-8B). Metrics include MSE, gain error, and cosine similarity. Results consistently show GSKM improving directional fidelity and reducing gain error, often leading to lower MSE, especially in under-capacity regimes and for less structured value/residual vectors. This effectively validates the core hypothesis of GSKM.
2. **KV Compression Evaluation**:
   * **Perplexity**: GSRQ is evaluated on LLaMA-2-7B, LLaMA-3-8B, and Mistral-7B across WikiText-2 and C4 datasets at various bit rates (2, 1, 0.75, 0.375 BPA). GSRQ consistently achieves lower perplexity than strong baselines (CQ, AnTKV), with improvements becoming more pronounced at lower bit rates. The graceful degradation at sub-1-bit is a key strength.
   * **Downstream Benchmarks**: LLaMA-3-8B-Instruct and Mistral-7B-Instruct are evaluated on LM Evaluation Harness benchmarks (ARC-C, MMLU, TruthfulQA, Winogrande, HellaSwag, PIQA, MathQA, GSM8K) and LongBench tasks. GSRQ significantly outperforms VQLLM and KIVI, notably achieving higher average accuracy at 0.75-bit than VQLLM at 1-bit, demonstrating strong sub-1-bit performance.
   * **Robustness**: The benefits are shown to generalize across different LLM architectures (Qwen3-8B) and challenging tasks (AIME24/25, MATH-500, RULER).
3. **Ablation Studies**: An ablation confirms the individual contributions of GSKM and the logarithmic gradient weighting, showing that both are crucial for optimal performance.
4. **Efficiency Analysis**: Crucially, the paper includes decoding latency and memory footprint analysis. Using custom Triton kernels, GSRQ demonstrates significant layer-wise latency reductions (up to 3.4x speedup) and substantial end-to-end speedups (1.59x to 3.40x) at long contexts, while also dramatically reducing memory footprint and avoiding OOM errors where FP16 baselines fail.
5. **Convergence Analysis**: Empirical convergence of GSKM is demonstrated, showing rapid convergence within 40 iterations.

The experimental setup is thorough, using relevant models, datasets, and metrics, and providing strong evidence for the practical utility and superiority of GSRQ over existing methods, especially in the challenging sub-1-bit regime.
The paper provides a detailed algorithm for GSKM, including assignment and update rules. Key hyperparameters for the quantization pipeline (codebook size, subspace dimension, number of residual stages) are specified for different bit rates. The gradient weighting scheme is also described. However, the implementation relies on "fully custom PyTorch-based autoregressive decoding implementation utilizing high-performance custom Triton kernels." Without the actual code for these kernels and the overall framework, full reproduction of the latency and memory results might be challenging for an external researcher. The core GSKM algorithm itself is well-defined and should be reproducible. The absence of a project URL or code repository is a limitation for full reproducibility.
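To see how codebook size, subspace dimension, and the number of residual stages combine into a bit rate, the following sketch computes index bits per stored scalar for a residual quantizer. It reads BPA as bits per activation, ignores codebook storage and any side information, and uses hypothetical settings chosen only because they reproduce the bit rates named in the review; they are not the paper's configurations.

```python
import math


def bits_per_activation(codebook_size: int, subspace_dim: int, stages: int) -> float:
    """Index bits per scalar: each stage stores log2(K) bits per d-dimensional subvector."""
    return stages * math.log2(codebook_size) / subspace_dim


# Hypothetical (K, d, stages) settings, not taken from the paper.
print(bits_per_activation(256, 8, 1))   # 1.0
print(bits_per_activation(256, 64, 6))  # 0.75
print(bits_per_activation(256, 64, 3))  # 0.375
```

The formula makes the sub-1-bit regime concrete: the rate drops below one bit once the subspace dimension exceeds the total index bits spent across stages.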
1. **Code Availability**: The lack of publicly available code (especially for the custom Triton kernels) makes it difficult for others to fully reproduce the performance and efficiency gains, particularly the latency and memory measurements.
2. **Theoretical Convergence**: The paper explicitly states that GSKM is not an exact block coordinate descent for a single L2 objective and does not claim a monotonic decrease guarantee. While empirical convergence is shown, a deeper theoretical understanding of its convergence properties would strengthen the method.
3. **Generalization Beyond KV Cache**: While the paper focuses on KV cache quantization, the GSKM primitive itself could potentially be applied to other vector quantization tasks. The paper doesn't explore this broader applicability.
4. **Computational Overhead**: While GSKM has the same asymptotic complexity as K-means, the constant factors might differ slightly due to the separate gain/shape updates. This isn't explicitly discussed in terms of practical training time differences, though the focus is on inference.
This work has significant positive broader impact. By substantially improving KV cache compression for LLMs, it directly addresses a major bottleneck in deploying large language models, especially for long-context scenarios. This enables: 1. **Reduced Memory Footprint**: Allows LLMs to run on devices with less memory or process much longer contexts on existing hardware. 2. **Increased Throughput/Reduced Latency**: Improves the efficiency of LLM inference, leading to faster responses and higher serving capacity. 3. **Environmental Sustainability**: By reducing memory and computational requirements, it lowers the energy consumption associated with LLM inference, contributing to more environmentally sustainable AI practices. 4. **Accessibility**: Makes powerful LLMs more accessible for deployment on resource-constrained edge devices, broadening their application scope. The paper includes an explicit impact statement that aligns with these points.
Active learning reduces labeling cost by querying the most informative unlabeled samples, but standard coreset methods ignore known data symmetries and can waste budget on transformed versions of the same instance. We propose GRINCO, a group-invariant coreset framework that performs acquisition in the quotient space induced by a transformation group, so that selection operates on orbits rather than raw samples. The method uses either canonical representatives or learned orbit-separating invariant embeddings to define practical quotient metrics, and combines quotient-space k-center selection with invariant training through an orbit-averaged loss. We further derive a generalization bound that relates excess orbit-averaged risk to quotient-space coverage, label uncertainty, and intra-orbit variability. Experiments on synthetic scale-invariant data and image benchmarks with rotation-induced redundancy show that GRINCO improves orbit coverage and achieves stronger label efficiency than conventional coreset baselines, especially when group-induced redundancy is substantial.
Primary: Universidade Federal de Santa Catarina
All Institutions: Universidade Federal de Santa Catarina, French National Research Agency (ANR)
This paper presents a theoretically grounded and practically motivated framework for active learning that leverages group invariance to reduce labeling redundancy. By formulating coreset selection in the quotient space, it offers a principled way to handle data symmetries, showing improved efficiency in relevant benchmarks. While the core ideas build on existing geometric deep learning concepts, their specific application and integration into active learning coreset selection, along with the derived generalization bounds, constitute a valuable contribution to the field.
The paper proposes GRINCO, a framework for active learning that integrates group theory to handle data symmetries. The core innovation is performing coreset selection (specifically k-center) in the quotient space induced by a transformation group, rather than the raw input space. This is achieved by defining metrics on orbits, either via canonical representatives or learned invariant embeddings. The method couples this selection with an orbit-averaged loss for training. The theoretical contribution includes a generalization bound linking quotient-space coverage to excess risk. The methodology is mathematically rigorous and logically sound, addressing a specific gap in standard coreset methods which ignore invariance. However, the concept of operating in quotient spaces is not entirely new in geometric deep learning; the novelty here lies in the specific application to active learning coreset selection and the derivation of the associated bounds.
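The difference between selecting in the raw space and in the quotient space can be shown with a toy example. This is a minimal sketch, not GRINCO itself: it assumes the group is positive scaling, uses the unit-norm vector as the canonical representative of each orbit, and uses the standard farthest-first greedy heuristic for k-center; the helper names and the three-point dataset are hypothetical.

```python
import numpy as np


def canonical_scale_rep(X):
    """Map each point to a canonical representative of its orbit under positive scaling."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def greedy_k_center(Z, k, first=0):
    """Farthest-first traversal, the usual greedy heuristic for k-center selection."""
    selected = [first]
    dist = np.linalg.norm(Z - Z[first], axis=1)
    while len(selected) < k:
        nxt = int(dist.argmax())  # point farthest from the current selection
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(Z - Z[nxt], axis=1))
    return selected


# Points 0 and 2 lie on the same orbit (point 2 is point 0 scaled by 10).
X = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 0.0]])

raw_pick = greedy_k_center(X, k=2)
quotient_pick = greedy_k_center(canonical_scale_rep(X), k=2)
print(raw_pick)       # [0, 2]: both labels are spent on one orbit
print(quotient_pick)  # [0, 1]: the two labels cover two distinct orbits
```

In the raw space the scaled copy looks maximally far away and is queried; after mapping to canonical representatives it collapses onto point 0, so the budget goes to the genuinely new orbit.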
The experiments cover synthetic scale-invariant data and image benchmarks with rotation-induced redundancy. The results demonstrate that GRINCO improves orbit coverage and achieves better label efficiency compared to conventional coreset baselines, particularly when group-induced redundancy is substantial. The evaluation is appropriate for the claims made. However, the scope of experiments appears limited to specific symmetries (rotation, scaling) and standard image benchmarks. There is no extensive ablation study on the choice of quotient metrics (canonical vs. learned) or a broader comparison with state-of-the-art active learning methods that might implicitly handle redundancy through other means (e.g., deep metric learning). The results are promising but do not yet demonstrate a paradigm shift in active learning performance across diverse domains.
The paper provides detailed mathematical formulations, including definitions of groups, orbits, quotient spaces, and the specific algorithms for selection and training. The pseudo-code for the AL pipeline is clear. The description of the learned invariant embeddings suggests standard practices in representation learning. While the paper is well-written, reproducibility will depend on the specific implementation of the learned orbit-separating functions, which are referenced but not fully detailed in the provided text. The synthetic experiments are likely fully reproducible.
The method assumes knowledge of the transformation group $G$ and its action on the data. It does not address cases where symmetries are unknown or only approximate. The computational cost of computing quotient metrics, especially with learned embeddings or complex group actions, is not thoroughly analyzed. The reliance on a specific group structure limits the generalizability to domains where such symmetries are not well-defined or are complex (e.g., natural language). The paper acknowledges the need for future work on unknown symmetries.
This work has significant potential impact in domains where data symmetries are prevalent and labeling is expensive, such as medical imaging (rotation invariance in X-rays), remote sensing, and computer vision. By reducing labeling redundancy, it promotes more efficient data collection. The theoretical framework also contributes to the growing body of literature on geometric deep learning and invariant representation learning.
While Large Language Model (LLM) agents demonstrate proficiency in static benchmarks, their deployment in real-world scenarios is hindered by the dynamic nature of user queries, tool sets, and interaction dynamics. To address this generalization gap, we formalize OpenAgent (Tool-Use Agent in Open-World), a problem setting characterized by distributional shifts across query, action, observation, and domain dimensions. To systematically diagnose its impact, we construct a controlled sandbox environment where we define fine-grained environmental shifts across a four-tier hierarchy, Perception, Interaction, Reasoning, and Internalization, and conduct a comprehensive series of experiments. Our analysis yields a series of key insights, demonstrating that agents trained via both Supervised Fine-Tuning (SFT) and Reinforcement Learning suffer from varying degrees of performance degradation when confronting open environmental shifts. Building on these insights, we propose Perturbation-Augmented Fine-Tuning, a disturbance-based intervention strategy for SFT that lays the foundation for enhancing agent robustness and utility in realistic environments. Our code will be released at: https://github.com/LAMDA-NeSy/OpenAgent.
Primary: Nanjing University
All Institutions: Nanjing University, National Key Laboratory for Novel Software Technology
The paper provides a crucial diagnostic framework and empirical evidence for the fragility of LLM agents in open-world tool-use scenarios, proposing a perturbation-augmented fine-tuning method to mitigate these issues, thereby advancing the field's understanding of agent generalization and robustness.
The paper introduces "OpenAgent," a formalization of the open-world tool-use problem, moving beyond static benchmark evaluations. It proposes a four-tier hierarchy for environmental shifts (Perception, Interaction, Reasoning, Internalization) to systematically diagnose agent fragility. The core methodological contribution is "Perturbation-Augmented Fine-Tuning" (PAFT), a disturbance-based intervention strategy applied during Supervised Fine-Tuning (SFT) to enhance robustness. While the formalization is valuable, the method itself (perturbation augmentation) is conceptually similar to existing robustness training techniques (e.g., adversarial training, data augmentation) applied to the specific domain of LLM agents. It is not a fundamentally new algorithmic breakthrough but rather a targeted application and systematic evaluation framework.
The authors construct a controlled sandbox environment to test agents under various distributional shifts. They evaluate both SFT and RL-trained agents, demonstrating significant performance degradation in open-world settings compared to static benchmarks. The experiments are comprehensive in scope, covering the four defined tiers of shifts. However, the "results" described in the abstract are primarily diagnostic (showing fragility) rather than demonstrating a massive leap in absolute performance. The proposed PAFT method shows improvement over baselines, but the magnitude of this improvement and its generalizability to other domains or larger-scale models need rigorous scrutiny. The evaluation is solid for a diagnostic paper but lacks the "state-of-the-art" performance claims that often drive higher impact scores in top-tier venues.
The paper explicitly states that code will be released at a GitHub repository. The definition of the four-tier hierarchy and the sandbox environment suggests a structured experimental setup that should be reproducible if the code and environment are made available as promised. The clear definition of "open-world" shifts provides a standard for future reproducibility in this specific sub-field.
The paper focuses heavily on the *fragility* of current methods. While it proposes PAFT, the extent to which this solves the fundamental generalization gap in truly open-ended, unconstrained environments is debatable. The "sandbox" environment, while controlled, may not fully capture the chaotic nature of the real open world. Furthermore, the reliance on SFT with perturbation may not scale as effectively as RL-based approaches for complex, long-horizon tool use tasks, a limitation not deeply explored. The novelty of the method is incremental; the main contribution is the systematic evaluation and formalization.
This paper addresses a critical bottleneck in deploying LLM agents: the gap between benchmark performance and real-world reliability. By highlighting the fragility of static training, it encourages the community to shift focus towards robustness and generalization. The formalization of open-world shifts provides a common language and benchmark suite that could become standard for evaluating agent robustness, potentially guiding future research towards more resilient agent architectures.
Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emph{statistically} wrong -- a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbf{Detection:} on a benchmark of $14$ misspecification types across $10$ model families ($200$ instances), it flags the bug with AUC $0.97$ ($88\%$ at $2\%$ FPR \emph{when handed the correct reference program, an upper bound}) -- and a fully \emph{reference-free} version that uses no correct program reaches $62$--$78\%$ (the upper figure from a small automated model search), versus $0\%$ for a unit-test oracle. \textbf{Repair:} used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback -- which is itself \emph{significantly worse than no feedback at all}, a passing test inducing false confidence that suppresses repair -- and improves over no feedback on strong-but-unsaturated models (GPT-5.1 $33{\to}92\%$, Claude $75{\to}100\%$; paired McNemar, $n{=}228$). \textbf{Reality:} on programs LLMs write from scratch for neutral briefs, $15$--$47\%$ of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
Primary: Columbia University
All Institutions: Columbia University
This paper introduces a paradigm shift for verifying LLM-generated probabilistic programs, demonstrating that Bayesian calibration diagnostics are essential for detecting code-invisible statistical errors, significantly outperforming and even revealing the harm of traditional unit tests in LLM repair loops. The comprehensive technical contribution, rigorous empirical validation on both injected and LLM-generated bugs, and the clear, actionable insights make this a genuinely field-wide significant paper that will reshape how LLM-assisted probabilistic modeling systems are designed and evaluated.
The paper proposes a robust methodology for detecting and repairing statistically misspecified probabilistic programs generated by Language Models (LLMs). The central innovation is the "calibration oracle," which formalizes and automates the Bayesian workflow diagnostics (Posterior Predictive Checks, Simulation-Based Calibration, sampler diagnostics like R-hat and divergences, and held-out predictive density) into a programmatic verifier. This oracle is designed to identify "code-invisible misspecification"—statistical errors that traditional unit tests (compilation, execution, output shape) cannot detect. The repair loop integrates this oracle as feedback for LLMs, guiding them to iteratively refine their programs. The methodology is well-grounded in Bayesian principles, and the formal distinction between code-visible and code-invisible bugs is crucial for understanding the limitations of existing LLM code verification paradigms. The aggregation of diverse diagnostics into a single verdict, with defined thresholds, provides a practical and actionable signal for automated systems.
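One ingredient of such an oracle, a posterior predictive check, can be sketched without any probabilistic programming library. The example below is an illustration under stated assumptions rather than the paper's oracle: it fits a Gaussian model with the noninformative prior $p(\mu, \sigma^2) \propto 1/\sigma^2$ (so the posterior can be sampled in closed form), uses sample kurtosis as the test statistic, and flags the model when the posterior predictive p-value falls outside a hypothetical $[0.05, 0.95]$ band.

```python
import numpy as np


def kurtosis(y, axis=-1):
    """Sample kurtosis (fourth standardised moment; about 3 for Gaussian data)."""
    c = y - y.mean(axis=axis, keepdims=True)
    return (c**4).mean(axis=axis) / (c**2).mean(axis=axis) ** 2


def ppc_pvalue_gaussian(y, draws=2000, seed=0):
    """Posterior predictive p-value of kurtosis under a Gaussian likelihood."""
    rng = np.random.default_rng(seed)
    n, ybar, s2 = y.size, y.mean(), y.var(ddof=1)
    # Closed-form posterior: sigma^2 is scaled inverse chi-square, mu | sigma^2 is normal.
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=draws)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # One replicated dataset of size n per posterior draw.
    y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(draws, n))
    return float((kurtosis(y_rep, axis=1) >= kurtosis(y)).mean())


# Heavy-tailed data that a Gaussian likelihood would fit "successfully" yet wrongly.
heavy = np.random.default_rng(1).standard_t(df=2, size=500)
p_heavy = ppc_pvalue_gaussian(heavy)
flagged = p_heavy < 0.05 or p_heavy > 0.95  # hypothetical decision threshold
print(p_heavy, flagged)
```

The program runs and returns a posterior without error, which is exactly the situation a shape-and-execution unit test would pass; only the replicated-data comparison exposes that the tails do not match.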
The experimental evaluation is exceptionally thorough and convincing, covering detection, repair, and real-world applicability.

1. **Detection**: A comprehensive benchmark of 200 instances across 14 misspecification types and 10 model families was created. The calibration oracle achieved an impressive AUC of 0.97 (88% detection at 2% FPR), demonstrating its efficacy. Critically, a *reference-free* version of the oracle (without a ground-truth correct program) still achieved 62-78% detection, highlighting its practical utility. In stark contrast, the unit-test oracle achieved 0% detection for code-invisible bugs.
2. **Repair**: Experiments with 15 diverse LLMs (open and API) in a repair loop showed that calibration feedback consistently outperformed "no feedback" and "unit-test feedback." A surprising and impactful finding was that unit-test feedback was often *worse than no feedback*, as it induced false confidence and suppressed repair. Calibration feedback led to substantial improvements for strong-but-unsaturated models (e.g., GPT-5.1 from 33% to 92%), with statistically significant gains ($p < 4.5 \times 10^{-10}$ for pooled invisible-bug repairs).
3. **Real LLM Programs**: This section provides the strongest evidence. LLMs were tasked to write programs from scratch for neutral briefs. The study found that 15-47% of runnable LLM-generated programs were statistically misspecified (none caught by unit tests). Calibration-guided repair significantly outperformed strong baselines, including LLM-as-judge review, Bayesian-workflow checklists, and data-summary self-debug, achieving an 84% fix rate for misspecified programs.

The results are robust, with detailed ablations and sensitivity analyses for oracle thresholds.
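The paired comparisons above rest on McNemar's test, which looks only at the discordant pairs: instances repaired under one feedback condition but not the other. The sketch below implements the exact binomial form of the test; whether the paper uses the exact or the chi-square variant is not stated in the review, and the counts in the example are hypothetical, not the paper's data.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts.

    b: pairs fixed only under condition A; c: pairs fixed only under condition B.
    Under the null hypothesis each discordant pair falls in either cell with
    probability 1/2, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Hypothetical counts: 30 bugs fixed only with calibration feedback,
# 5 fixed only with unit-test feedback.
print(mcnemar_exact(30, 5))
print(mcnemar_exact(5, 5))  # balanced discordant pairs give no evidence of a difference
```

Pairs where both conditions succeed or both fail carry no information about which feedback is better, which is why they do not enter the statistic.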
The paper demonstrates a high commitment to reproducibility. It specifies exact snapshots for API models and HuggingFace revisions for open models. Detailed hyperparameters for LLM generation (temperature, max tokens, repair budget) and NUTS inference (chains, draws, x64) are provided. Oracle thresholds are explicitly stated, and their sensitivity is analyzed. Crucially, the authors commit to releasing "The full system prompt, contract, feedback templates, benchmark generators, and analysis scripts with the paper," which is excellent practice and will enable full replication and extension of their work.
The authors are transparent about several limitations:

1. **PPC power**: The effectiveness of Posterior Predictive Checks depends on the chosen test statistics, so a misspecification that the tracked statistics do not capture can go undetected.
2. **Right fit, wrong structure**: The oracle primarily assesses distributional fit, not causal or generative correctness, making it blind to structural errors that yield similar predictive distributions.
3. **Over-wide predictive**: A model whose predictive distribution is too wide (under-confident) can "cover" the data and pass PPC despite being fundamentally wrong.
4. **Diagnosis without remedy**: Weaker LLMs may receive accurate diagnostic feedback but lack the capability to translate it into a correct structural fix.
5. **SBC expense**: Simulation-Based Calibration is computationally intensive, limiting its practical use in the repair loop for large or costly models.
6. **Inference failure**: Sampler diagnostics can sometimes fire due to poor inference (e.g., NUTS issues) rather than genuine model misspecification, leading to false positives.
7. **Benchmark scope**: The repair experiments use a subset of bugs, and the tasks are classical low-dimensional models, so extending the approach to high-dimensional or complex structured models (e.g., deep models, spatial-temporal models) is left to future work.
This paper has profound broader impact for the burgeoning field of LLM-assisted scientific computing and probabilistic modeling. It fundamentally shifts the paradigm for verifying LLM-generated probabilistic code from "compilation" to "calibration," providing a principled and empirically validated framework for ensuring statistical correctness. This work is crucial for building trustworthy and reliable LLM agents that can assist in scientific discovery, data analysis, and model development. The PPL-agnostic nature of the Bayesian diagnostics means the approach is broadly applicable across popular probabilistic programming languages like Stan, Pyro, and PyMC. The surprising finding that unit-test feedback is actively harmful for LLM repair loops is a critical insight for designing future LLM agent architectures. This paper sets a new standard for evaluating and improving LLM-generated scientific code.
Technology computer-aided design (TCAD) semiconductor device simulation is fundamentally constrained by the high computational cost of iteratively solving coupled drift-diffusion equations. Existing ML surrogates either reduce internal physics to macroscopic scalar regressions, or rely on single-step mappings that lack the iterative refinement required to resolve stiff, coupled fields. To address this, we introduce PCGD, a Physics-Guided Conditional Graph Diffusion framework operating natively on unstructured TCAD meshes to predict coupled electrostatic and carrier density fields. PCGD employs a Condition-Aware MeshGraphNet denoiser that explicitly injects boundary conditions and device structure context via global cross-attention. By augmenting data-driven denoising with a physics-guided hybrid objective that integrates exponent-free quasi-Fermi gradient matching with noise-aware PDE residuals, PCGD progressively enforces physical constraints in the iterative diffusion trajectory. This strategy successfully bypasses the numerical instabilities typical of stiff drift-diffusion equations. Evaluated on a challenging mixed PN/MOS benchmark, PCGD significantly outperforms deterministic one-step regression (1.207% error) and local diffusion (1.585% error) baselines by achieving a sub-percent mean relative field error of 0.835%, while concurrently reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. It also transfers robustly to unseen SOI topologies (0.815% error) via LoRA adaptation, using 5.30$\times$ less data and 14.34$\times$ fewer parameters than full fine-tuning. Ultimately, PCGD bridges the computational efficiency of generative surrogates with the rigorous physical fidelity of traditional TCAD, unlocking highly scalable, field-level analysis for robust device engineering.
Primary: Fudan University
All Institutions: Fudan University, Shanghai Integrated Circuit Manufacturing Innovation Center
PCGD has a profound broader impact, particularly in the semiconductor industry and scientific machine learning. By significantly accelerating TCAD simulations while maintaining high physical fidelity, it can drastically shorten the design cycle for advanced semiconductor devices, fostering innovation in microelectronics for applications ranging from AI hardware to power electronics. The methodology for stabilizing physics-informed diffusion models for highly stiff and nonlinear PDEs is generalizable beyond TCAD. It offers a blueprint for tackling similar challenges in other scientific and engineering domains, such as materials science, fluid dynamics, and biomechanics, where complex multiphysics simulations are computationally expensive and numerically challenging. The concept of a hybrid AI-TCAD simulation workflow, where ML provides robust initial guesses to accelerate or stabilize traditional solvers, represents a powerful paradigm for integrating AI into complex scientific computing pipelines, promising both efficiency and reliability. Furthermore, the successful application of parameter-efficient fine-tuning (LoRA) for domain adaptation in a scientific context highlights its potential for developing more adaptable and data-efficient ML models for specialized scientific tasks. PCGD introduces a physics-guided conditional graph diffusion framework that effectively addresses the computational bottlenecks and physical fidelity challenges in TCAD device simulation. This work makes substantial contributions through its mesh-native graph diffusion approach, condition-aware MeshGraphNet architecture with global cross-attention, and a novel hybrid physics-guided training objective that stabilizes learning for stiff, nonlinear drift-diffusion equations. 
The impressive empirical results, demonstrating sub-percent accuracy and near three-order-of-magnitude reduction in PDE residuals, coupled with robust transferability to unseen topologies, position PCGD as a significant advancement in ML-accelerated scientific computing, with clear implications for future semiconductor design and broader multiphysics simulation.
The methodology presented in PCGD is highly sophisticated and meticulously designed to tackle the inherent challenges of TCAD device simulation. The core idea of formulating coupled-field prediction as a conditional generative denoising task on native unstructured TCAD meshes is a significant departure from previous ML surrogates. This approach leverages the iterative refinement capabilities of diffusion models to decouple macroscopic field construction from microscopic detail resolution, which is crucial for handling the extreme stiffness and exponential nonlinearities of semiconductor physics. The Condition-Aware MeshGraphNet denoiser is a well-thought-out architectural innovation. By explicitly injecting boundary conditions and device structure context via global cross-attention, it effectively overcomes the limitations of purely local message passing, ensuring that critical global information (like terminal biases and device layout) is directly accessible to every mesh node. This design choice is critical for accelerating convergence to correct physical operating conditions. The most impactful methodological contribution is the hybrid physics-guided objective. It ingeniously combines an exponent-free quasi-Fermi gradient matching loss ($L_G$) for stable guidance during early, noisy diffusion stages with noise-aware, adaptively scaled exact PDE residuals ($L_P$) for fine-grained physical correction in later stages. The dual-tier stabilization strategy for $L_P$ (validity-aware SNR gating and adaptive batch balance) is a robust solution to prevent gradient divergence caused by the extreme nonlinearity of drift-diffusion equations. The detailed mathematical derivation connecting Scharfetter-Gummel flux to quasi-Fermi gradients and the implementation of GPU-accelerated graph operators for discrete PDE residuals further demonstrate the technical depth and rigor of the proposed framework.
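For readers unfamiliar with why drift-diffusion residuals are numerically stiff, the sketch below evaluates the textbook Scharfetter-Gummel electron flux on a single mesh edge using a numerically safe Bernoulli function. It is not PCGD's GPU operator. The function names, the edge length, the mobility value, and the thermal voltage are assumptions chosen for illustration, and the flux is reported per unit charge.

```python
import numpy as np

def bernoulli(x):
    """B(x) = x / (exp(x) - 1), evaluated without 0/0 or overflow warnings."""
    x = np.asarray(x, dtype=np.float64)
    small = np.abs(x) < 1e-8
    safe = np.where(small, 1.0, x)          # avoid dividing by expm1(0) = 0
    with np.errstate(over="ignore"):        # expm1 -> inf gives B -> 0, as desired
        value = safe / np.expm1(safe)
    return np.where(small, 1.0 - x / 2.0, value)

def sg_electron_flux(n_i, n_j, psi_i, psi_j, h, mobility, v_t=0.025852):
    """Scharfetter-Gummel electron flux on edge i -> j, divided by the charge q."""
    delta = (psi_j - psi_i) / v_t           # potential drop in thermal voltages
    diffusivity = mobility * v_t            # Einstein relation D = mu * V_T
    return diffusivity / h * (n_j * bernoulli(delta) - n_i * bernoulli(-delta))

# Equilibrium check: with n proportional to exp(psi / V_T) the flux must vanish.
v_t = 0.025852
n_i, psi_i, psi_j = 1e16, 0.0, 0.3
n_j = n_i * np.exp((psi_j - psi_i) / v_t)
flux_eq = float(sg_electron_flux(n_i, n_j, psi_i, psi_j, h=1e-6, mobility=1400.0))

# Zero-field check: the flux reduces to plain diffusion D * (n_j - n_i) / h.
flux_diff = float(sg_electron_flux(1e16, 2e16, 0.0, 0.0, h=1e-6, mobility=1400.0))
print(flux_eq, flux_diff)
```

In the equilibrium case the two terms differ by a factor of roughly $10^5$ in carrier density yet cancel to rounding error, which is the kind of exponential sensitivity the review refers to. Under Boltzmann statistics the same current can be written as $J_n = -q \mu_n n \nabla \phi_n$ with $\phi_n$ the electron quasi-Fermi potential, a form with no explicit exponential of the potential difference.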
The experimental evaluation is comprehensive, rigorous, and strongly supports the claims made in the paper. The use of a challenging mixed PN/MOS benchmark dataset, comprising a large number of training and validation snapshots from distinct devices and operating modes, ensures the robustness of the evaluation. The device-trajectory level splitting for train/validation is critical for assessing generalization. The ablation studies are particularly effective, systematically isolating the contributions of iterative generative refinement, global conditioning, and physics-guided supervision by comparing PCGD against well-chosen baselines (deterministic one-step regression, local diffusion, and condition-aware diffusion with varying levels of physics guidance). The chosen metrics, mean relative $L_2$ error and log-compressed maximum PDE residual error, are appropriate and provide a holistic view of both accuracy and physical consistency. The results are highly impressive: PCGD achieves a sub-percent mean relative field error (0.835%) and significantly outperforms all baselines, reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. The analysis of convergence dynamics clearly illustrates the stability gained from the hybrid physics-guided objective. Furthermore, the demonstration of robust transferability to unseen SOI topologies via parameter-efficient LoRA adaptation, requiring significantly less data and fewer parameters than full fine-tuning, is a strong indicator that PCGD learns generalizable physical priors, which is crucial for practical adoption in industrial settings.
The paper provides a commendable level of detail for reproducibility. The methodology is thoroughly described with mathematical formulations for the graph representation, diffusion process, Condition-Aware MeshGraphNet, and the hybrid physics-guided objective. The appendix offers specific architectural settings, coefficients for the loss functions, and details on the baseline architectures. Crucially, numerical safety measures for handling stiff exponentials and precision truncation are explicitly mentioned. The schema for mesh features and interface edges is also detailed. While the absence of publicly available code (no project URL provided) is a common limitation for arXiv preprints, the textual descriptions are sufficiently comprehensive for an experienced researcher with expertise in GNNs, diffusion models, and scientific computing to implement the core components of PCGD. Access to the specific TCAD data generation pipeline would be the final piece for exact replication of the dataset.
Despite its strengths, PCGD exhibits some limitations. The paper acknowledges "severe zero-shot degradation" when transferring to major out-of-distribution topological shifts (e.g., SOI MOSFETs not seen during pretraining). While LoRA adaptation effectively addresses this, it implies that the model, despite learning physical priors, still relies on structural interpolation from its pretraining data, limiting its immediate "universal foundational model" capabilities without a much broader pretraining corpus. The current experimental validation is primarily on 2D devices. While the paper discusses the theoretical O(N) scaling advantage for 3D simulations, empirical validation on massive 3D meshes, including memory footprint and training/inference times, is yet to be demonstrated. The computational cost of training such a complex diffusion model on large graph datasets is likely substantial, though not explicitly quantified. Finally, the proposed hybrid AI-TCAD workflow mentions routing high-residual cases back to traditional solvers, but the practical implementation details, criteria for routing, and the efficiency gains from this hybrid approach are not fully explored.
LLM agents carry conclusions across steps and sessions in compressed memory, and memory products (e.g., mem0, LangMem) rewrite conversation into stored "facts" that later steps trust. We show this rewriting manufactures confidence: across our constructed agent settings, a casual, hedged remark becomes a confident, dated assertion the agent then obeys like a verified fact, granting every above-clearance request it faces. No attacker is needed: a role that was true once and never corrected is stored as a flat fact and acted on like a deliberate injection. We then isolate what the agent responds to. It is not the source: attributed, unattributed, and even forged "system of record" claims all grant alike. It is the confidence of the phrasing. A hedge is discounted, a flat assertion is obeyed, and this holds with no special keyword. Not all hedges are equal, though: the evidential register is the least-discounted, with "reportedly" obeyed like a flat assertion on most models. The obvious fixes fail. A passive "unverified" tag is ignored, and an active "do not trust this" instruction escalates even correct memory, so it is safe only by refusing to decide. The real fix lives in the store: keep the tentative phrasing rather than upgrade it. But that is hygiene, not a defense against an attacker who can simply write a confident lie. The deployable lesson is narrower and constructive: a single load-bearing memory is the hazard, and one redundant source restores correct decisions. We release the harness and demonstrations.
Primary: unknown
All Institutions: unknown
This paper has significant broader impact for the development and deployment of LLM agents. It identifies a fundamental, pervasive, and under-defended failure mode ("manufactured confidence") in how LLM agents process and store information in consolidated memory. This phenomenon can lead to agents being confidently wrong, granting unauthorized access, or making incorrect financial decisions, even without an explicit attacker. The findings challenge current memory consolidation practices and highlight the critical need for memory architectures that preserve epistemic status. The "hearsay blind spot" is a particularly alarming discovery, as it means agents may treat unverified reports as established facts. The paper provides crucial, actionable lessons for practitioners: avoid single load-bearing memories for consequential decisions, and implement redundant verification sources. While the proposed store-side fix is not a complete defense against adaptive attackers, it significantly raises the bar and improves hygiene. This work is essential for enhancing the safety, reliability, and trustworthiness of LLM agents in real-world applications. This paper identifies and rigorously diagnoses "manufactured confidence," a critical and pervasive failure mode in LLM agent memory consolidation where hedged remarks are rewritten into confident facts, leading to agents making confidently wrong decisions across diverse models and memory products. The comprehensive analysis reveals that agents obey the confidence of phrasing rather than its source, are blind to hearsay markers, and that common mitigations like passive tags or active distrust instructions are ineffective or lead to abdication, underscoring the necessity of preserving epistemic status in memory stores and employing redundant information sources for consequential decisions.
The methodology is exceptionally strong and well-designed for diagnosing the phenomenon of "manufactured confidence." The authors construct multi-step agent settings (access control, budget approval, running total) where memory is load-bearing, allowing for clear ground truth and legible impact. A crucial aspect is the use of real, shipped memory products (mem0, LangMem) alongside a verbatim control, which grounds the findings in practical agent deployments. The systematic probing involves varying how memory is presented (confident, passive tag, active instruction), dissecting the cues agents respond to (modality, hearsay, explicit non-verification), and testing the impact of source attribution (bare, attributed, forged authority). The inclusion of a "natural case" (staleness without injection) alongside adversarial injection strengthens the generalizability of the problem. The use of five diverse, state-of-the-art LLMs from four providers (Anthropic, Meta, OpenAI, Qwen) ensures the findings are not model-specific. The methodology also includes a symmetry test (over-denial) to rule out simple grant bias and a detailed analysis of the "laundering" process within memory products. The approach is comprehensive, rigorous, and effectively isolates the mechanisms behind manufactured confidence.
The experimental evaluation is thorough and provides compelling evidence for the paper's claims. Key findings include:

1. **Manufactured Confidence**: Memory consolidation rewrites hedged remarks into confident assertions, leading to high confident-wrong rates (0.50-1.00) across all models in consequential decisions.
2. **Source Invariance**: Agents obey the confidence of phrasing, not its source. Attributed, unattributed, and even forged "system of record" claims grant alike, demonstrating a critical blindness to provenance.
3. **Failure of Obvious Fixes**: Passive "unverified" tags are largely ignored, especially by non-Anthropic models. Active "do not trust this" instructions lead to abdication (escalating everything), not discrimination, costing all utility.
4. **Redundancy as a Fix**: A second, authoritative source allows agents to discriminate, turning distrust into selective caution rather than blanket abdication.
5. **Hearsay Blind Spot**: Evidential registers, particularly "reportedly," are the least-discounted hedges, often obeyed like flat assertions on most models. This is a critical, pervasive vulnerability.
6. **Symmetry**: The effect is symmetric, causing both over-granting and over-denial based on manufactured confidence, ruling out a simple grant bias.
7. **Consolidation, Not Vendor**: The laundering of hedges into confident facts is a property of LLM consolidation itself, not specific memory products or extraction LLMs.

The experiments are quantitatively presented with clear rates, using temperature 0 for deterministic behavior per scenario. The results are consistent across models, highlighting a systemic issue. The distinction between "belief" and "low threshold" based on rationale analysis adds a qualitative layer to the findings.
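Two of the lessons above, keeping the tentative phrasing in the store and never letting a single memory carry a consequential decision, can be sketched as a hard gate outside the model. This is an illustration only and not the released harness. The `MemoryEntry` type, the marker list, and the two-source rule are all hypothetical, and the sketch assumes every entry passed to `decide` concerns the same claim.

```python
from dataclasses import dataclass

# Hypothetical lexicon; note that it includes the evidential "reportedly".
HEDGE_MARKERS = ("i think", "maybe", "probably", "might",
                 "reportedly", "apparently", "not sure")

@dataclass(frozen=True)
class MemoryEntry:
    text: str    # stored verbatim, so hedges survive consolidation
    source: str  # who or what asserted the claim

    @property
    def hedged(self) -> bool:
        lowered = self.text.lower()
        return any(marker in lowered for marker in HEDGE_MARKERS)

def decide(entries, min_sources=2):
    """Grant only when enough distinct sources assert the claim without hedging."""
    confident_sources = {e.source for e in entries if not e.hedged}
    return "grant" if len(confident_sources) >= min_sources else "escalate"

casual = MemoryEntry("I think Dana might be an admin now", source="chat")
laundered = MemoryEntry("Dana is an admin", source="chat")
record = MemoryEntry("Dana is an admin", source="hr_directory")

print(decide([casual]))             # escalate: hedged remark only
print(decide([laundered]))          # escalate: single load-bearing memory
print(decide([laundered, record]))  # grant: two independent confident sources
```

As the paper's own caveat implies, a gate like this is hygiene rather than a defense. An attacker who can write to two sources, or forge one, still passes it.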
The paper demonstrates a high commitment to reproducibility. The authors explicitly state, "We release the harness, data, and demonstrations at https://github.com/collapseindex/manufactured-confidence." They provide detailed information on the models used (exact API identifiers, providers, access dates), temperature settings, agent system prompts, memory poisoning setup, and memory backend configurations. Specific scripts (e.g., `cues.py`, `forged.py`) are mentioned, indicating a well-structured codebase. This level of detail and code release makes the experiments highly reproducible.
The authors are commendably transparent about the limitations:

1. **Constructed Scenarios**: The tasks are decision-shaped but not live deployments, and even "natural staleness" sessions are constructed, meaning the base rate of this failure mode in the wild is not measured.
2. **Scope**: The study focuses on two memory products, four extractors, and five phrasings, with deep probes primarily in access control. While robust, it's not exhaustive. The Zep probe is limited.
3. **Belief vs. Threshold**: The distinction relies on verbalized rationales, which are not ground-truth processing.
4. **Non-Adaptive Threat Model**: The proposed store-side defense is not robust against an adaptive attacker who can directly supply confident, forged authority.
5. **Sample Sizes**: While effects are large and consistent, $n$ values (e.g., 15 for decisions, 10 for poisonings) are relatively small for statistical generalization, though the deterministic nature at temperature 0 mitigates this for the constructed scenarios.
6. **Fix is a Prompt**: The hedge-preserving extraction is demonstrated via a prompt, not a fully engineered production store.
Diffusion Language Models (DLMs) are typically trained under fixed context structures, restricting denoising to predetermined token subsets. This creates a mismatch between training and inference, where models must operate over arbitrary configurations, leading to degradation off the training grid. We propose Adaptive Block Diffusion (ABD), which resolves this mismatch by optimizing denoising risk over a distribution of prefix-window configurations. By treating the configuration as a stochastic variable, ABD trains a single model over the full configuration space without architectural changes. We show that generalization across decoding strategies is governed by the support of the training distribution, and that ABD guarantees denoising optimality for any inference policy whose configurations are covered during training. Empirically, ABD exhibits structural invariance across decoding scales, avoiding off-grid collapse and recovering a monotonic relationship between block size and perplexity, while matching or outperforming fixed-block specialists at their target scales.
Primary: Microsoft AI
All Institutions: Microsoft AI
Adaptive Block Diffusion has significant broader impact for the field of Diffusion Language Models and potentially for structured generative models more generally. By providing a principled and effective solution to the training-inference mismatch, ABD makes DLMs more robust, generalizable, and practical for real-world deployment. The ability of a single model to perform well across various decoding scales and strategies simplifies model management and enables more flexible inference scenarios, such as adaptive generation speeds or streaming decoding. This could accelerate the adoption and development of DLMs in diverse applications. The theoretical framework also offers a valuable new perspective on understanding generalization in structured generative models, potentially inspiring similar training paradigms in other domains beyond text. The demonstrated improved zero-shot generalization to domain-shifted text further suggests that ABD could lead to more versatile and less domain-specific language models. Adaptive Block Diffusion resolves the training-inference mismatch in Diffusion Language Models by optimizing denoising risk over a stochastic distribution of prefix-window configurations, leading to a single model that exhibits structural invariance and robust generalization across diverse decoding scales. This paper presents a theoretically grounded and empirically validated framework that significantly enhances the robustness and flexibility of Diffusion Language Models, addressing a critical limitation of prior fixed-structure approaches and paving the way for more adaptable and generalizable text generation.
The methodology is robust and elegantly addresses a core problem in Diffusion Language Models (DLMs): the training-inference mismatch caused by fixed context structures. Adaptive Block Diffusion (ABD) proposes a novel training objective that treats the denoising configuration (prefix length $k$ and window length $\ell$) as a stochastic variable, optimizing denoising risk over a distribution $\pi$ of these configurations. This approach is commendable for not requiring architectural changes, instead focusing on a principled modification to the training process. The theoretical analysis is a significant strength, formally defining conditional denoising risk and proving statistical consistency over the support of $\pi$. The "Training-Inference Alignment" theorem, leveraging the Radon-Nikodym theorem, rigorously demonstrates that if an inference policy's configuration distribution is covered by the training distribution's support, then denoising optimality is guaranteed. This provides a strong theoretical foundation for the empirical claims of structural invariance. The practical implementation details, particularly the attention mask construction and the `ABDBoundaryManager` for sampling block lengths, are clearly described in the appendix, showcasing a well-thought-out and implementable solution.
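The alignment argument can be sketched in one line. The notation here is assumed for illustration and may differ from the paper's: $c = (k, \ell)$ is a configuration, $R(f; c)$ the conditional denoising risk of a denoiser $f$, $\pi$ the training distribution over configurations, and $q$ the distribution induced by an inference policy. If $q$ is absolutely continuous with respect to $\pi$, the Radon-Nikodym derivative exists and

$$
R_q(f) \;=\; \int R(f; c)\, \mathrm{d}q(c) \;=\; \int R(f; c)\, \frac{\mathrm{d}q}{\mathrm{d}\pi}(c)\, \mathrm{d}\pi(c), \qquad q \ll \pi .
$$

Suppose $f^\star$ attains $\inf_f R(f; c)$ for $\pi$-almost every $c$. The set of configurations where it fails to do so has $\pi$-measure zero, hence $q$-measure zero, so $R_q(f^\star) \le R_q(f)$ for every $f$. When $q$ places mass outside the support of $\pi$, the density $\mathrm{d}q/\mathrm{d}\pi$ does not exist and nothing constrains $f^\star$ there, which matches the off-grid degradation described for fixed-block training.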
The experimental evaluation is comprehensive, well-designed, and provides strong empirical evidence supporting the theoretical claims. The authors use standard language modeling benchmarks (LM1B, OpenWebText) and ensure fair comparisons by using an identical transformer architecture to existing baselines (MDLM, BD3LM). The most compelling result is the demonstration of "structural invariance": ABD successfully recovers the monotonic relationship between block size and perplexity, a fundamental property for generative models, which fixed-block specialists fail to maintain off their training grid. This directly validates the core hypothesis that training over a broad configuration distribution leads to better generalization. Furthermore, ABD matches or outperforms fixed-block specialists at their target scales, indicating that multi-scale training acts as a regularizer rather than a compromise. The zero-shot generalization experiments on diverse datasets, including scientific text, show improved robustness and suggest that ABD learns a more configuration-invariant language representation. The ablations on configuration distribution types (categorical exponential, uniform, lognormal) and training budget allocation are particularly insightful, offering practical guidance on how to tune ABD for specific inference regimes and demonstrating the trade-offs involved.
The paper excels in reproducibility. The methodology is clearly articulated, and the appendix provides detailed pseudocode for the critical components, including the `abd_attention_mask` and `ABDBoundaryManager`. The authors explicitly state that they leverage the same codebase, datasets, architecture, likelihood evaluation, and inference setup as the previously published block diffusion work (Arriola et al., 2025), which significantly lowers the barrier to reproduction. Specific details regarding training budget allocation and configuration sampling strategies are also provided. This level of detail and reliance on a shared foundation is exemplary.
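The appendix pseudocode is not reproduced in this review, so the following is only a guess at the general shape of prefix-window training. It samples a configuration $(k, \ell)$ and builds a boolean attention mask for it. The function names, the candidate window sizes, the uniform draws, and the masking rule (causal prefix, bidirectional window that sees the prefix) are assumptions and should not be read as the authors' `abd_attention_mask` or `ABDBoundaryManager`.

```python
import numpy as np

def sample_config(rng, seq_len, window_sizes=(1, 2, 4, 8, 16)):
    """Draw (k, ell): window length from a categorical set, prefix from what fits."""
    fitting = [w for w in window_sizes if w <= seq_len]
    ell = int(rng.choice(fitting))
    k = int(rng.integers(0, seq_len - ell + 1))  # high is exclusive, so k + ell <= seq_len
    return k, ell

def prefix_window_mask(seq_len, k, ell):
    """mask[i, j] is True when position i may attend to position j."""
    idx = np.arange(seq_len)
    causal = idx[:, None] >= idx[None, :]
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Clean prefix: causal attention among the first k positions.
    mask[:k, :k] = causal[:k, :k]
    # Noised window: sees the whole prefix and itself bidirectionally.
    mask[k:k + ell, :k + ell] = True
    # Positions past the window stay fully masked; a real implementation would
    # also exclude them from the loss for this configuration.
    return mask

rng = np.random.default_rng(0)
k, ell = sample_config(rng, seq_len=8)
mask = prefix_window_mask(8, k=3, ell=2)
print(k, ell)
print(mask.astype(int))
```

Resampling `(k, ell)` for every batch is what spreads training over the configuration space. A fixed-block model corresponds to a point mass on one value of `ell`.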
The authors openly acknowledge several limitations. A key one is the dependence on the choice of the configuration distribution $\pi$. While $\pi$ offers a principled way to balance performance across decoding regimes, a suboptimal choice can bias the model towards frequently sampled configurations, potentially leading to uneven performance across scales. This implies that careful tuning of $\pi$ is necessary for specific application scenarios. Additionally, ABD does not directly address inference efficiency; while it enables flexible decoding, the selection of optimal inference-time policies remains an open problem. Finally, the theoretical analysis provides optimality guarantees under support coverage but does not offer finite-sample guarantees, meaning practical performance might still be influenced by the quality and density of training coverage in finite data regimes.
Adaptive Block Diffusion has significant broader impact for the field of Diffusion Language Models and potentially for structured generative models more generally. By providing a principled and effective solution to the training-inference mismatch, ABD makes DLMs more robust, generalizable, and practical for real-world deployment. The ability of a single model to perform well across various decoding scales and strategies simplifies model management and enables more flexible inference scenarios, such as adaptive generation speeds or streaming decoding. This could accelerate the adoption and development of DLMs in diverse applications. The theoretical framework also offers a valuable new perspective on understanding generalization in structured generative models, potentially inspiring similar training paradigms in other domains beyond text. The demonstrated improved zero-shot generalization to domain-shifted text further suggests that ABD could lead to more versatile and less domain-specific language models. Adaptive Block Diffusion resolves the training-inference mismatch in Diffusion Language Models by optimizing denoising risk over a stochastic distribution of prefix-window configurations, leading to a single model that exhibits structural invariance and robust generalization across diverse decoding scales. This paper presents a theoretically grounded and empirically validated framework that significantly enhances the robustness and flexibility of Diffusion Language Models, addressing a critical limitation of prior fixed-structure approaches and paving the way for more adaptable and generalizable text generation.
Retrieval-augmented generation (RAG) typically treats context selection as ranking chunks against a single query embedding. This assumption breaks down for complex queries, such as multi-hop or ambiguous questions, where top-k selection tends to over-cover one semantic aspect while ignoring critical sub-questions. We propose GeoRAG, which recasts context selection as Information Demand Coverage Optimization. GeoRAG builds a multi-dimensional demand distribution through diverse sub-query generation and reverse-validation weighting, then selects context by minimizing the Sinkhorn-Wasserstein distance between this demand distribution and the coverage of the selected set. The resulting demand-weighted facility-location objective is monotone submodular, giving a $1-1/e$ greedy guarantee, which we approximate with a Sinkhorn-based marginal-gain surrogate. The method is unsupervised, training-free, and retrieval-agnostic. We further show that single-point, query-proximity scorers cannot cover multi-modal demands, exposing a structural limit of ranking-based selection. On six open-domain QA benchmarks, GeoRAG improves exact match (EM) by +6.5 to +7.5 points over top-k truncation (up to +9.7 on HotpotQA and ASQA) and outperforms strong baselines including MMR, DPP, BGE-Reranker, SMART-RAG, and AdaGReS, with stable gains across context budgets and sub-query generators.
Primary: Singapore Management University
All Institutions: University of Shanghai for Science and Technology, Singapore Management University
GeoRAG presents a theoretically grounded, empirically superior method for RAG context selection by modeling information demand as a multi-dimensional distribution and optimizing for coverage via submodular optimization, significantly outperforming existing ranking and diversity-based approaches on complex QA benchmarks.
The paper proposes GeoRAG, a novel context selection framework for Retrieval-Augmented Generation (RAG) that moves beyond single-point query embeddings. The core innovation is reformulating context selection as an Information Demand Coverage Optimization problem. It constructs a multi-dimensional "Information Demand Proxy" distribution using diverse sub-query generation and reverse-validation weighting. The selection process minimizes the Sinkhorn-Wasserstein distance between this demand distribution and the coverage of the selected set. The authors prove that the resulting facility-location objective is monotone submodular, providing a theoretical $(1-1/e)$ greedy guarantee. They further demonstrate a structural limitation of existing ranking-based methods (query-proximity-monotone selectors) in handling bimodal information needs, providing a rigorous theoretical foundation for their approach. The method is unsupervised and training-free, making it broadly applicable.
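As a rough illustration of the selection step, the following minimal sketch runs plain greedy maximization of a demand-weighted facility-location objective. It is not the authors' implementation: the paper's deployed method uses a Sinkhorn-based marginal-gain surrogate, whereas this sketch uses the exact facility-location gain. The names `coverage`, `greedy_select`, `sim`, and `weights` are hypothetical, and `sim[j][i]` is assumed to be a nonnegative similarity between sub-query `j` and candidate chunk `i`.

```python
from typing import List, Sequence


def coverage(selected: Sequence[int], sim: List[List[float]],
             weights: Sequence[float]) -> float:
    """Demand-weighted facility-location value: sum_j w_j * max_{i in S} sim[j][i]."""
    if not selected:
        return 0.0
    return sum(w * max(row[i] for i in selected) for w, row in zip(weights, sim))


def greedy_select(sim: List[List[float]], weights: Sequence[float],
                  budget: int) -> List[int]:
    """Greedily add the chunk with the largest marginal coverage gain."""
    selected: List[int] = []
    remaining = list(range(len(sim[0])))
    while remaining and len(selected) < budget:
        base = coverage(selected, sim, weights)
        gains = [(coverage(selected + [i], sim, weights) - base, i) for i in remaining]
        best_gain, best = max(gains, key=lambda g: g[0])  # ties go to the lowest index
        if best_gain <= 0.0:  # nothing left adds coverage
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example (made-up numbers): chunks 0 and 1 both serve demand aspect A,
# chunk 2 is the only one that serves aspect B.
sim = [[0.9, 0.8, 0.1],   # sub-query for aspect A
       [0.1, 0.1, 0.7]]   # sub-query for aspect B
weights = [0.5, 0.5]
print(greedy_select(sim, weights, budget=2))  # [0, 2]
```

In this toy case a ranking by proximity to aspect A alone would pick chunks 0 and 1, while the coverage objective gives chunk 1 zero marginal gain once chunk 0 is selected and moves on to chunk 2.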
The experimental evaluation is comprehensive and robust. The authors test GeoRAG across six open-domain QA benchmarks (NQ, TriviaQA, HotpotQA, 2WikiMHQA, ASQA, FEVER) and six different retrieval backends (Dense, BM25, Hybrid RRF, HyDE, MultiQuery, GraphRAG). GeoRAG consistently outperforms strong baselines, including MMR, DPP, BGE-Reranker, SMART-RAG, and AdaGReS, with significant gains on multi-hop datasets (up to +9.7 EM on HotpotQA). The paper includes extensive ablation studies isolating the contributions of the demand distribution (Axis A) and the set-aware coverage selection (Axis B). Crucially, they perform a "Full-Wikipedia" experiment without gold-injection to prove the method's effectiveness in realistic, harder retrieval settings. They also provide direct measurements of demand-dimension coverage, empirically validating that GeoRAG successfully covers multiple semantic peaks where baselines fail.
The paper provides detailed algorithmic descriptions, including the specific steps for sub-query generation, reverse-validation, and the Sinkhorn-based marginal gain calculation. Hyperparameters are clearly listed. The use of standard benchmarks and open-source models (Qwen3-Embedding-8B, Qwen3-4B) enhances reproducibility. The code is not explicitly linked in the text provided, but the methodological details are sufficient for implementation.
The method relies on LLM-generated sub-queries, which introduces a dependency on the quality and diversity of the generator. While the paper shows robustness across different generators, poor sub-query generation could degrade performance. The reverse-validation step adds computational overhead, though the latency analysis suggests it is manageable. The theoretical guarantee applies to the exact facility-location objective, while the deployed method uses a Sinkhorn surrogate; the paper acknowledges this but shows the surrogate performs well. The method is primarily evaluated on open-domain QA; its performance on more complex reasoning tasks or non-QA RAG applications is less clear.
GeoRAG addresses a fundamental limitation in current RAG systems: the inability to handle complex, multi-faceted queries effectively. By providing a retrieval-agnostic, training-free solution that significantly improves answer quality, it has the potential to become a standard component in RAG pipelines. The theoretical insights into the limitations of single-point embeddings also contribute to a deeper understanding of information retrieval in the LLM era.
Diffusion and continuous-flow generative models achieve high-quality generation, and their deterministic sampling can be formulated as solving learned ODE dynamics. However, accurate ODE discretization often requires many steps, making efficient few-step generation a key challenge. Among acceleration strategies, reflow-based distillation simplifies teacher ODE trajectories so that a student model can approximate the teacher transport with fewer steps. We identify a theoretical limitation of this paradigm, namely that trajectory matching can under-determine the distribution induced by the student model. In particular, two student models can attain the same trajectory-matching loss while inducing different endpoint marginal distributions, which may lead to different generation quality. To address this limitation, we introduce a marginal-alignment regularizer that penalizes the discrepancy between the student-induced marginal and the corresponding teacher marginal at the endpoint of each distillation interval. The regularizer is computed by tracking log-density changes along the ODE induced by the student model and evaluating scores from the frozen teacher model, without requiring auxiliary trainable networks or adversarial optimization. The resulting framework applies uniformly to the reflow family, including vanilla reflow and piecewise reflow. We further prove a telescoping total-variation bound showing that local marginal alignment controls the final-time discrepancy between the student-induced and teacher-induced distributions. Experiments on benchmark backbones demonstrate the effectiveness of the proposed method for few-step generation.
Primary: Tsinghua University
All Institutions: Tsinghua University
The paper introduces a marginal-alignment regularizer for reflow-based distillation, theoretically justifying and empirically demonstrating that aligning endpoint marginals improves few-step generation quality in continuous-flow models.
The paper addresses a critical theoretical gap in reflow-based distillation for continuous-flow generative models. The authors correctly identify that minimizing a trajectory-matching loss (matching the vector fields or paths) does not guarantee that the induced marginal distributions at the endpoints match: two students can attain the same trajectory-matching loss while inducing different endpoint marginals, because trajectory matching is a local constraint while generation quality depends on the global marginal. The proposed solution, a marginal-alignment regularizer computed via log-density tracking and teacher scores, is theoretically sound and practically viable. It avoids the instability of adversarial training often seen in GAN-based distillation or score-matching approaches. The derivation of the telescoping total-variation bound provides a rigorous justification for why this regularizer helps, linking local alignment to global distributional fidelity. This is a significant methodological improvement over vanilla reflow.
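The log-density tracking mentioned above rests on the standard instantaneous change-of-variables identity for ODE flows, $\frac{d}{dt}\log p_t(x_t) = -\nabla\cdot v(x_t, t)$. The sketch below illustrates only that identity on a one-dimensional toy flow with Euler integration; it does not reproduce the paper's regularizer or its use of teacher scores. The function name `track_log_density`, the linear velocity field, and the standard-normal starting density are all assumptions made for illustration.

```python
import math


def track_log_density(x0, velocity, divergence, t_end=1.0, steps=1000):
    """Euler-integrate dx/dt = v(x, t) together with d(log p)/dt = -div v(x, t).

    Assumes x0 is drawn from a standard normal, so log p_0(x0) is known in closed form.
    """
    dt = t_end / steps
    x = x0
    logp = -0.5 * x0 * x0 - 0.5 * math.log(2.0 * math.pi)
    for k in range(steps):
        t = k * dt
        logp -= divergence(x, t) * dt  # density shrinks where the flow expands
        x += velocity(x, t) * dt
    return x, logp


# Toy flow v(x, t) = a * x, whose divergence is the constant a.
a = 0.5
x_end, logp_end = track_log_density(1.2, lambda x, t: a * x, lambda x, t: a)

# For this flow the pushforward of N(0, 1) at time 1 is N(0, exp(2a)),
# so the tracked value can be compared against the analytic log-density.
analytic = -x_end ** 2 / (2.0 * math.exp(2.0 * a)) - a - 0.5 * math.log(2.0 * math.pi)
print(abs(logp_end - analytic) < 1e-3)
```

In higher dimensions the divergence is usually estimated rather than computed exactly, which is one place where the numerical sensitivity raised in the reproducibility notes can enter.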
The experiments demonstrate the effectiveness of the proposed method on benchmark backbones for few-step generation. While the specific quantitative results (FID, IS, etc.) are not fully detailed in the abstract, the claim of improved generation quality for few steps is consistent with the theoretical motivation. The method applies uniformly to vanilla and piecewise reflow, suggesting broad applicability. The evaluation likely covers common image generation benchmarks (e.g., CIFAR-10, ImageNet subsets), as is typical for this type of work. The improvement in few-step generation is a highly relevant metric for practical deployment.
The method relies on tracking log-density changes along the student ODE and evaluating teacher scores. These are standard operations in continuous normalizing flow and diffusion literature. The lack of auxiliary trainable networks simplifies the implementation. The paper provides a clear algorithmic description, making it likely reproducible. However, the stability of log-density estimation can be sensitive to numerical integration errors, which might require careful hyperparameter tuning not always fully disclosed in short papers.
The primary limitation is the computational overhead of computing the log-density changes and evaluating teacher scores at every step of the distillation process. While this is done during training (distillation), it may slow down the distillation phase significantly compared to vanilla reflow. Additionally, the accuracy of the log-density estimation depends on the quality of the student model's flow; if the student is very poor, the density estimates might be unreliable, potentially destabilizing training. The paper does not explicitly discuss the trade-off between the regularization strength and the trajectory matching loss, which is a critical hyperparameter.
This work contributes to the democratization of high-quality generative models by making few-step generation more effective and stable. Efficient generation is crucial for real-time applications, mobile deployment, and reducing computational costs. By providing a theoretically grounded method to improve reflow distillation, it sets a new standard for how distillation should be performed in continuous-flow models, potentially influencing future research in this area.
World models for language agents come in two useful forms. An agent-based world model calls an LLM API and reasons flexibly in language, but its errors appear as hallucinated state changes that are hard to score with ordinary regression losses. A parameterized world model is a trained transition predictor; its errors are easier to measure with quantities such as NodeMSE, delta accuracy, and validity accuracy, but it is usually weaker as a standalone planner. We compare these two families on four graph-structured planning benchmarks and introduce operational hallucination metrics for the agent-based case. The comparison motivates \textbf{Grounded Iterative Language Planning} (GILP), which trains only a small parameterized backbone and combines it with API-based agent reasoning. The backbone supplies valid actions, predicted state deltas, risk, and value; the LLM drafts an action and imagined delta; and a consistency gate asks for revision when the two disagree. On real GPT-4o-mini calls, GILP reduces hallucinated-state rate from 0.176 to 0.035. In calibrated simulator ablations, it raises success from 0.668 to 0.838 while adding only ~22% extra LLM calls.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Emory University
This paper introduces Grounded Iterative Language Planning (GILP), a novel hybrid world model that significantly reduces hallucination propagation in LLM agents by combining flexible API reasoning with a small, trained parameterized backbone. The comprehensive evaluation, including real API validation and multi-LLM comparisons, demonstrates substantial improvements in task success and state faithfulness, providing a robust and generalizable solution to a critical problem in LLM agent reliability.
The paper introduces Grounded Iterative Language Planning (GILP), a hybrid world model that combines the flexible reasoning of an LLM agent with the measurable, grounded predictions of a small parameterized backbone. The methodology is well-articulated, consisting of four phases: (1) Parameterized Skeleton Scoring, where the backbone predicts action validity, state deltas, risk, and value for candidate actions; (2) LLM Draft, where the LLM generates an action and imagined next-state delta, incorporating the skeleton into its prompt; (3) Consistency Gate, which uses Jaccard similarity to compare the LLM's imagined delta with the backbone's prediction, triggering a targeted re-prompt for revision if they disagree; and (4) Risk Gate, which escalates if the backbone predicts high risk. The definition of "hallucinated state atom" and the operational metrics (Hallucinated-State Rate (HSR), Propagation Depth (PD), Error-Explosion Slope (EES)) are crucial for quantifying the LLM agent's semantic errors, which are otherwise hard to measure. The theoretical "One-step hallucination contraction" proposition provides a formal basis for GILP's error reduction. The approach is elegant in its simplicity and effectiveness, leveraging the strengths of both model types while mitigating their weaknesses. The use of structured JSON for state deltas and the consistency gate's Jaccard similarity are practical and robust design choices.
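The consistency gate in phase (3) can be pictured with a minimal sketch like the one below. It assumes a state delta is represented as a collection of `(node, status)` atoms and that revision is requested when the Jaccard similarity falls below a fixed threshold. The function names, the atom format, and the `0.5` threshold are hypothetical and are not taken from the paper.

```python
from typing import Hashable, Iterable, Set


def jaccard(a: Set[Hashable], b: Set[Hashable]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty deltas count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def needs_revision(llm_delta: Iterable[Hashable],
                   backbone_delta: Iterable[Hashable],
                   threshold: float = 0.5) -> bool:
    """Return True when the LLM's imagined delta disagrees with the backbone's prediction.

    The threshold value is an assumption made for this sketch.
    """
    return jaccard(set(llm_delta), set(backbone_delta)) < threshold


# The LLM imagines three status changes; the backbone predicts only one of them.
llm = [("task_a", "done"), ("task_b", "done"), ("task_c", "done")]
backbone = [("task_a", "done")]
print(needs_revision(llm, backbone))       # True  (similarity 1/3)
print(needs_revision(backbone, backbone))  # False (similarity 1.0)
```

A set-overlap score like this is cheap to compute and needs no extra LLM call, so the added cost comes only from the re-prompts it triggers.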
The experimental evaluation is exceptionally comprehensive and rigorous. The authors use four graph-structured planning benchmarks (TaskGraph, ToolChain, ResourceAlloc, RepairFlow) and conduct extensive comparisons across eleven planning strategies. A key strength is the use of a behavioral simulator calibrated against real GPT-4o-mini calls, which allows for large-scale ablations while maintaining fidelity to real-world LLM behavior. The paper demonstrates significant improvements: GILP raises simulator success from 0.668 to 0.838 and, critically, reduces the hallucinated-state rate (HSR) on real GPT-4o-mini calls from 0.176 to 0.035 (an 80% reduction). The analysis of long-horizon scaling clearly shows GILP's ability to prevent performance degradation due to hallucination propagation. The cost-quality tradeoff analysis is thorough, showing that GILP achieves better performance per successful task despite adding LLM calls. Ablation studies systematically validate the contribution of each component (validity, delta, risk, value, correction gate), confirming the importance of the consistency gate. The multi-API comparison is a standout, demonstrating GILP's generalizability across GPT-4o-mini, Claude-3-Haiku, Gemini-1.5-Flash, and Llama-3-8B, and showing how it equalizes their performance and reveals API-specific hallucination propensities. The inclusion of an AgentBench-style Knowledge Graph traversal task, while showing less statistically significant gains due to dataset limitations, provides valuable insights into applicability boundaries and the need for well-calibrated backbones.
The paper explicitly states the release of the prompt suite, simulator, benchmarks, and code artifacts for reproducible follow-up work, with a GitHub link provided. This commitment to open science significantly enhances reproducibility. The detailed methodology, algorithm, and experimental setup descriptions further support replication.
The authors acknowledge several limitations. The current operationalization of hallucination metrics focuses on status-level delta errors, leaving entity-set hallucinations and reward-attribution errors for future work. The simulator, while calibrated, is still a proxy and might not capture all nuances of real API behavior. The Knowledge Graph traversal results, while insightful, did not show statistically significant improvements in SR or HSR due to the small sample size and potential backbone calibration issues, indicating an applicability boundary where the parametric model might not be sufficiently trained or representative. The cost of GILP, while justified by improved success, still involves additional LLM calls, which can be a factor for extremely cost-sensitive applications, especially with expensive proprietary APIs.
GILP offers a significant step forward in building more reliable and robust LLM agents. By effectively mitigating hallucination propagation, it enhances the trustworthiness of LLM-driven planning systems, particularly for long-horizon tasks where compounding errors are most problematic. This has broad implications for applications in complex workflow automation, tool use, resource management, and other domains requiring sequential decision-making. The framework for defining and measuring hallucination propagation provides valuable tools for future research in LLM agent evaluation. Furthermore, the demonstration that a small, cheap parametric model can effectively ground powerful, expensive LLMs opens avenues for more cost-efficient and performant hybrid agent architectures, potentially democratizing access to advanced LLM agent capabilities by making them more reliable even with less capable (or self-hosted) LLMs.
Jailbreak attacks bypass LLM safety alignment, yet their mechanisms remain poorly understood. We provide evidence that attacks do not comprehensively eliminate safety features, but instead selectively suppress specific attention heads. We identify two functionally differentiated types: Adversarially Compromised Heads (ACHs) concentrated in early layers, which are suppressed under attacks, and Safety-Aligned Heads (SAHs) in mid-layers, which maintain robust activations even when attacks succeed. Ablation studies support the causal role of ACHs and the contribution of SAHs to robust activations: suppressing a small number of ACHs is sufficient to induce jailbreak-like behavior on normally refused inputs, while removing SAHs substantially weakens mid-layer safety activations. Token-level attribution further shows that ACH suppression is driven specifically by attack-template tokens, providing a mechanistic account of why attacks can bypass refusal decisions through ACH suppression while leaving internal safety signals sustained by SAHs -- a phenomenon we term Robust Harmful Features. To validate the practical significance of this robustness, we show that simply reading these persistent activations -- without any training -- yields competitive aggregate detection performance with strong adversarial robustness.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications
This paper provides a rigorous mechanistic explanation for jailbreak attacks, identifying specific attention heads responsible for safety suppression and robustness, and demonstrating a novel, training-free detection method based on these insights. The work significantly advances the field of mechanistic interpretability in LLMs, offering deep insights into how safety alignment functions internally and how it can be bypassed, thereby contributing to the development of more robust and understandable AI safety mechanisms.
The paper proposes a mechanistic interpretability framework to dissect the internal representations of Large Language Models (LLMs) under jailbreak attacks. The core methodological contribution is the identification and functional differentiation of two types of attention heads: Adversarially Compromised Heads (ACHs) and Safety-Aligned Heads (SAHs). The authors employ ablation studies and token-level attribution to establish causal links between these heads and model behavior. Specifically, they demonstrate that suppressing ACHs induces refusal failures, while SAHs maintain robust activation patterns even when the model outputs harmful content. This approach moves beyond black-box behavioral analysis to provide a granular, component-level understanding of safety mechanisms, leveraging techniques like activation patching and attribution mapping.
The experimental evaluation is comprehensive and rigorous. The authors conduct extensive ablation studies on multiple LLM architectures to validate the causal role of ACHs and SAHs. They perform token-level attribution to show that attack-template tokens specifically drive the suppression of ACHs. Furthermore, they develop a training-free detection method based on reading persistent SAH activations, demonstrating competitive aggregate performance and strong adversarial robustness against various jailbreak templates. The results are supported by 19 figures and detailed statistical analysis, providing strong empirical evidence for the "Robust Harmful Features" hypothesis. The evaluation covers both the mechanistic insights and the practical application of these insights for defense.
The paper provides detailed descriptions of the methodologies, including the specific attention heads analyzed, the ablation protocols, and the attribution methods used. The inclusion of 19 figures and the structured presentation of experiments suggest a high level of transparency. However, full reproducibility depends on the availability of the code and the specific model checkpoints used, which are not explicitly linked in the provided text (though this is common for pre-submission reviews). The methodology is sufficiently detailed for other researchers to replicate the mechanistic analysis if given access to the models.
The primary limitation is the requirement for white-box access to the models for the mechanistic analysis and ablation studies, which restricts the direct applicability of the *analysis* to black-box scenarios, although the resulting *detector* is training-free and potentially applicable to black-box models if the activations can be accessed or approximated. Additionally, the study focuses on specific types of jailbreak attacks; the generalizability to novel, unseen attack vectors that might target different mechanisms remains to be seen. The paper also notes that the barrier to defense is mechanistic understanding, implying that translating these insights into robust, scalable defenses is a future challenge.
This work has significant implications for AI safety and security. By elucidating the mechanisms behind jailbreak attacks, it provides a foundation for developing more effective, mechanistically-informed defenses. The identification of robust safety features (SAHs) offers a new avenue for monitoring and enhancing LLM safety without retraining. However, the dual-use nature of this research is acknowledged; while the paper focuses on defense, the mechanistic understanding could theoretically be used to craft more sophisticated attacks that specifically evade these identified safety mechanisms. The impact statement correctly balances these concerns, emphasizing the defensive contributions.
Understanding how performance scales jointly with model size and data is a central problem in modern machine learning. Existing theoretical works on scaling laws typically describe generalization as a function of data or compute, often in fixed-feature or infinite-width regimes and for online SGD. Here, we instead study how generalization scales with the number of trainable parameters and the number of samples in a feature-learning model. We analyze $\ell_2$-regularized empirical risk minimization in a quadratic two-layer network in a finite-sample setting with structured data. This setting allows for an explicit characterization of the generalization error as a function of the number of samples, model width, and regularization. Our results reveal a phase diagram with distinct scaling regimes as the number of parameters varies. In particular, the generalization error follows data-dependent power laws controlled by the spectral structure of the target. We further characterize the transitions between regimes, including the onset of interpolation, and their impact on generalization.
Primary: École Polytechnique Fédérale de Lausanne (EPFL)
All Institutions: École Polytechnique Fédérale de Lausanne (EPFL), University of Zurich
This paper presents a rigorous theoretical analysis of generalization scaling laws in quadratic neural networks, revealing how model width acts as an implicit regularizer and characterizing distinct scaling regimes through state-evolution analysis. The work makes a significant contribution to the theoretical foundations of deep learning by providing explicit, data-dependent power laws for generalization error in a feature-learning setting, offering deep insights into the interplay between model size, data quantity, and regularization that are likely to influence future theoretical and practical approaches to model scaling.
The paper employs a sophisticated theoretical framework combining Approximate Message Passing (AMP) and statistical physics techniques (replica method/state evolution) to analyze the generalization scaling laws of quadratic two-layer neural networks. The methodology is rigorous within its domain, deriving explicit analytical characterizations of the excess test error as a function of model width, sample size, and regularization. It successfully maps out a phase diagram identifying distinct regimes (under-regularized, over-regularized, rank-collapse) and transitions such as the onset of interpolation. The approach isolates the role of width as an implicit regularizer, providing a closed-form description of the learned predictor's spectral structure.
The theoretical predictions are validated against numerical optimization of the quadratic network. The authors demonstrate excellent agreement between the state-evolution predictions and empirical test errors for moderate dimensions ($d=400$). While the experiments are limited to this specific stylized model, they are sufficient to support the theoretical claims within the defined setting. The paper does not claim empirical validity on large-scale real-world datasets, which is consistent with its theoretical focus.
The paper provides detailed derivations in the appendix, including the specific equations for the state evolution and the conditions for each phase. The numerical validation is straightforward to reproduce given the defined model and optimization setup. The reliance on AMP/heuristic extensions means that rigorous proofs for the non-asymptotic regimes are acknowledged as open problems, but the computational reproducibility of the claims is high.
The primary limitation is the stylized nature of the model: a shallow quadratic network with Gaussian inputs and a specific power-law spectral teacher. The authors explicitly state that precise exponents may not transfer directly to realistic deep architectures. Furthermore, the derivation relies on the replica-symmetric assumption and non-rigorous extensions of AMP to non-asymptotic regimes, which, while numerically accurate, lack full mathematical rigor in the finite-size setting.
This work provides fundamental insights into the mechanisms of feature learning and the role of model width in generalization. By characterizing width as an implicit regularizer and deriving optimal scaling laws, it offers theoretical guidance for understanding why over-parameterization can be beneficial and how to balance model capacity with data availability. It bridges the gap between fixed-feature models and full feature-learning regimes, contributing to the broader understanding of scaling laws in modern ML.
Low-Rank Adaptation (LoRA) has become the standard tool for parameter-efficient fine-tuning of large pretrained models. When applied sequentially across tasks in Continual Learning (CL), the standard assumption is that each new task requires a dedicated low-rank adapter. In this work, we challenge this assumption empirically and structurally. We show that task-specific LoRA adapters in CL exhibit significant low-rank redundancy: the subspaces spanned by adapters trained on different tasks substantially overlap, and in many cases earlier adapters can faithfully represent later tasks. Building on this observation, we propose LiteLoRA, a plug-and-play gating mechanism that learns at train time whether to recruit a new adapter or reuse existing low-rank representations. Our method reduces the number of active adapters by 20-70% while matching or exceeding state-of-the-art performance on standard CL benchmarks, revealing that structural redundancy is pervasive and that selective learning is sufficient to achieve stability without sacrificing plasticity.
Primary: ETH Zurich
All Institutions: ETH Zurich
LiteLoRA effectively reduces the parameter footprint of Continual Learning with LoRA by discovering and exploiting low-rank redundancy across tasks, achieving state-of-the-art performance with significantly fewer active adapters. The paper makes a compelling empirical case for structural efficiency in PEFT, offering a practical solution to the stability-plasticity dilemma without sacrificing accuracy.
The paper proposes LiteLoRA, a method that challenges the standard "one-adapter-per-task" paradigm in Continual Learning (CL) with LoRA. The core insight is that task-specific low-rank adapters exhibit significant subspace redundancy. To exploit this, the authors introduce a differentiable gating mechanism (using Gumbel-Sigmoid and Straight-Through Estimators) that learns to prune adapters at the task level. The training is decoupled into two phases: feature acquisition and structural pruning. This approach is built on top of SD-LoRA, leveraging its magnitude-direction decomposition. The methodology is technically sound, leveraging existing PEFT and CL techniques in a novel structural way. The two-phase training is a clever heuristic to stabilize the discrete selection process.
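The gating idea described above can be illustrated with a minimal NumPy sketch of a Gumbel-Sigmoid gate with a straight-through surrogate gradient. This is not the authors' implementation: the function name `gumbel_sigmoid_gate`, the temperature value, and the example logits are hypothetical, and a real training loop would rely on an autograd framework rather than a hand-computed gradient.

```python
import numpy as np

def gumbel_sigmoid_gate(logits, tau=1.0, rng=None):
    """Sample relaxed and hard binary gates from per-adapter logits.

    Returns (hard, soft, surrogate_grad):
      hard           - thresholded gate in {0, 1}, used in the forward pass
      soft           - relaxed gate, differentiable w.r.t. the logits
      surrogate_grad - d(soft)/d(logits), which a straight-through
                       estimator uses in place of d(hard)/d(logits)
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)
    # Logistic(0, 1) noise, equal in distribution to the difference
    # of two independent Gumbel(0, 1) samples.
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    noise = np.log(u) - np.log1p(-u)
    soft = 1.0 / (1.0 + np.exp(-(logits + noise) / tau))
    hard = (soft > 0.5).astype(float)
    surrogate_grad = soft * (1.0 - soft) / tau
    return hard, soft, surrogate_grad

rng = np.random.default_rng(0)
gate_logits = np.array([4.0, -4.0, 0.0])  # hypothetical: one logit per task adapter
hard, soft, grad = gumbel_sigmoid_gate(gate_logits, tau=0.5, rng=rng)
n_active = int(hard.sum())  # adapters kept after thresholding
```

A sparsity penalty on the soft gates would push logits of redundant adapters negative, so that the hard gate switches them off. The review notes that the sparsity weight and the temperature are grid-searched, so the values used here are placeholders.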
The evaluation covers standard CL benchmarks: CIFAR-100, ImageNet-A, and ImageNet-R. The results demonstrate that LiteLoRA matches or exceeds the performance of SD-LoRA while reducing the number of active adapters by 20-70%. The paper provides a detailed analysis of the sparsity-accuracy frontier, showing that accuracy saturates quickly with fewer adapters. The robustness across different task orderings is a strong point, highlighting the method's ability to adapt to the curriculum. The reduction in parameter count is significant and practically relevant for memory-constrained deployment.
The paper provides sufficient implementation details, including backbone (ViT-B/16), LoRA rank (10), and dataset splits. The two-phase training procedure is clearly defined. However, the specific hyperparameters for the sparsity penalty and gating temperature are mentioned as being grid-searched, which is standard but requires careful reporting for exact reproduction. The code is not explicitly linked in the text provided, but the description is detailed enough for a competent practitioner to implement.
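To give a rough sense of scale for the reported setup, the arithmetic below counts adapter parameters for a ViT-B/16 backbone with LoRA rank 10. The hidden width (768) and depth (12 blocks) are standard for ViT-B/16. Which projections are adapted and how many tasks are in the sequence are assumptions made here for illustration, not details taken from the paper.

```python
def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA update B @ A, with A: rank x d_in and B: d_out x rank."""
    return rank * (d_in + d_out)

HIDDEN = 768           # ViT-B/16 hidden width
LAYERS = 12            # ViT-B/16 transformer blocks
RANK = 10              # LoRA rank reported in the review
ADAPTED_PER_LAYER = 2  # assumption: query and value projections only
NUM_TASKS = 10         # assumption: a 10-task split

per_task = LAYERS * ADAPTED_PER_LAYER * lora_params(HIDDEN, HIDDEN, RANK)
full = NUM_TASKS * per_task

# Adapter parameters that remain at the two ends of the reported 20-70% range.
for reduction in (0.2, 0.7):
    kept = round(NUM_TASKS * (1 - reduction))
    print(f"{int(reduction * 100)}% fewer adapters: "
          f"{kept * per_task:,} of {full:,} adapter parameters")
```

Under these assumptions each task adapter costs 368,640 parameters, so the savings scale linearly with the number of adapters pruned.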
The authors acknowledge that the final pruning decision depends on hyperparameters (sparsity weight, temperature). The method assumes that redundant adapters are not uniquely required for future tasks, which might not hold for highly compositional tasks. The evaluation is limited to image classification tasks; generalization to other modalities or more complex CL settings (e.g., object detection, segmentation) is not explored. The "plug-and-play" claim is somewhat limited by the dependency on SD-LoRA's specific structure, though the gating mechanism itself is modular.
This work contributes to more sustainable and efficient machine learning by reducing the computational and memory overhead of continual adaptation. It challenges the assumption that linear parameter growth is necessary for CL, potentially lowering the barrier for deploying large models in resource-constrained environments. The findings on low-rank redundancy may influence future research in PEFT and CL, encouraging more efficient model architectures.
Therapy-induced cardiotoxicity is the leading non-oncological cause of treatment interruption in breast cancer patients, yet early, automated risk stratification from routine cardiac imaging remains an unsolved problem. We present EchoRisk, the first curated, multicentre, longitudinal echocardiography dataset with explicit cardiotoxicity labels, released as the primary technical reference for the EchoRisk-MICCAI 2026 challenge. The dataset comprises 422 patients enrolled in the EU-funded CARDIOCARE prospective study across five European sites, yielding 2,159 echocardiography videos across 1,123 clinical exams acquired at up to five longitudinal timepoints, alongside a dedicated cohort of 280 patients with baseline imaging for early cardiotoxicity prediction. Three clinically grounded tasks are defined: automated estimation of left ventricular ejection fraction from cine video (Task 1), classification of LV dysfunction from longitudinal imaging (Task 2), and early prediction of therapy-induced cardiotoxicity from pre-therapy baseline echocardiography alone (Task 3). For each task we specify the evaluation protocol, primary and secondary metrics, and ranking procedure. We establish baseline performance using an R(2+1)D video backbone with LSTM aggregation trained from Kinetics-400 pretrained weights, demonstrating strong discriminative performance for cardiac functional assessment and LV dysfunction classification, while early cardiotoxicity prediction from a single pre-therapy video remains a significant open problem for the community. The dataset, evaluation code, and baseline implementations are publicly available to serve as a benchmark for further collaboration, comparison, and the creation of task-specific architectures in cardio-oncology.
Primary: Foundation for Research and Technology Hellas
All Institutions: Foundation for Research and Technology Hellas, University of Ioannina, Hellenic Mediterranean University, National and Kapodistrian University of Athens, Karolinska University Hospital, Bank of Cyprus Oncology Centre
This paper introduces EchoRisk, the first curated, multicentre, longitudinal echocardiography dataset with explicit cardiotoxicity labels, along with a comprehensive benchmark for cardio-oncology, highlighting early cardiotoxicity prediction as a significant open problem. The meticulous curation of a high-quality, clinically relevant dataset from a prospective study, coupled with well-defined tasks and robust baselines, provides an invaluable resource that will drive significant research in medical AI, particularly in addressing the critical challenge of therapy-induced cardiotoxicity.
The paper introduces EchoRisk, a multicentre, longitudinal echocardiography dataset for cardio-oncology, derived from the EU-funded CARDIOCARE prospective study across five European sites. A key methodological strength is the expert-adjudicated cardiotoxicity labels, which integrate longitudinal echocardiography findings with biomarkers following ESC 2022 guidelines, representing a deliberate and rigorous curation process. This ensures high-quality ground truth, superior to automated EHR extraction. Three clinically grounded tasks are defined: Task 1 (LVEF estimation), Task 2 (LV dysfunction classification using GLS), and Task 3 (early cardiotoxicity prediction from baseline imaging). The baseline models employ a robust R(2+1)D ResNet-18 backbone, pretrained on Kinetics-400, combined with an LSTM for temporal aggregation, a standard yet powerful architecture for video analysis. Detailed preprocessing steps (greyscale conversion, fractional index sampling, resizing) and training specifics (AdamW, learning rate scheduling, specific loss functions like Focal Loss for imbalanced tasks) are provided. A dual-view strategy for Task 3 and a clinical reference baseline (logistic regression on age and LVEF) further enhance the benchmark's comprehensiveness and clinical relevance. The overall methodology for dataset construction and task definition is exceptionally strong and clinically well-aligned.
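For readers unfamiliar with Focal Loss, mentioned above for the imbalanced tasks, the sketch below shows the standard binary form. The `gamma` and `alpha` defaults are the commonly used values from the original focal loss formulation, not settings confirmed for the EchoRisk baselines.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of the positive class, y is 0 or 1.
    """
    p = min(max(p, eps), 1.0 - eps)          # avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = binary_focal_loss(0.95, 1)   # confident, correct prediction
hard = binary_focal_loss(0.10, 1)   # badly misclassified positive
# With gamma=0 and alpha=1 the positive-class loss reduces to cross-entropy.
plain_ce = binary_focal_loss(0.10, 1, gamma=0.0, alpha=1.0)
```

The `(1 - p_t)**gamma` factor shrinks the contribution of well-classified examples, which keeps the abundant majority class from dominating the gradient.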
The experimental evaluation is comprehensive and rigorously conducted. Baselines are established across all three tasks, with results averaged over eight independent random seeds and ensemble predictions for robustness. For Task 1 (LVEF estimation), a test MAE of 4.98 pp is achieved, aligning with established benchmarks like EchoNet-Dynamic and validating the dataset's utility for functional assessment. Task 2 (LV dysfunction classification) demonstrates strong performance with a test AUC of 0.849, indicating effective discrimination of GLS-defined dysfunction. The most impactful finding emerges from Task 3 (early cardiotoxicity prediction): the best video baseline achieves an AUC of 0.541, which is statistically indistinguishable from the clinical reference floor (AUC 0.525). This crucial result, consistent across internal pilot experiments, highlights that early cardiotoxicity prediction from baseline echocardiography remains a significant open problem, even with advanced deep learning architectures. The detailed statistical analysis, including 95% confidence intervals via non-parametric bootstrap resampling and Wilcoxon signed-rank tests with Holm-Bonferroni correction, adds significant rigor. Calibration is also assessed via Expected Calibration Error (ECE). The experiments effectively map the current performance landscape and clearly identify a challenging frontier for future research.
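The confidence-interval procedure mentioned above can be sketched as a percentile bootstrap over patients. This is an illustrative version on synthetic data. The number of resamples, the seed, and the handling of single-class resamples are assumptions, and the paper's evaluation code may differ.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def bootstrap_auc_ci(y_true, scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling cases with replacement."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # AUC is undefined when a resample has only one class
        stats.append(auc(y_true[idx], scores[idx]))
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)

# Synthetic labels and weakly informative scores, for illustration only.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
s = rng.normal(size=200) + 0.1 * y
point = auc(y, s)
ci_lo, ci_hi = bootstrap_auc_ci(y, s, n_boot=500)
```

When the interval for a model overlaps heavily with that of a simple clinical reference, as reported for Task 3, the two cannot be distinguished on the available data.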
The paper demonstrates an outstanding commitment to reproducibility. It explicitly states that the EchoRisk dataset, evaluation code, and baseline implementations are publicly available via a dedicated GitHub repository. The methodology section provides extensive details on the model architecture, preprocessing steps, training hyperparameters (optimizers, learning rates, weight decay, early stopping), and loss functions. The use of multiple random seeds (42-49) for all experiments, along with the procedure for ensemble predictions and handling of degenerate runs, ensures that the reported results are robust and verifiable. The detailed statistical analysis methods, including confidence interval calculation and hypothesis testing, further contribute to the transparency and reproducibility of the benchmark. This level of detail and open-source commitment is exemplary for a benchmark paper.
Although the dataset is multicentre and longitudinal, it is relatively modest in size (422 patients overall, 280 for Task 3) compared to some large-scale single-centre datasets. This might limit the ability of current deep learning models to extract extremely subtle prognostic signals for Task 3. The follow-up window for cardiotoxicity labels in Task 3 is variable; this reflects real-world data collection, but it means a positive label indicates cardiotoxicity within the *available* window, not a fixed 12-month horizon, which could introduce some variability in interpretation. The baselines, while robust, are standard video architectures; the paper's novelty lies in the benchmark itself rather than new architectural contributions. The reliance on Kinetics-400 pretraining, while common, might not be optimally suited for medical ultrasound, suggesting future work could explore domain-specific pretraining.
EchoRisk has profound broader impact potential. It addresses a critical and growing clinical challenge in cardio-oncology: the early detection and risk stratification of therapy-induced cardiotoxicity in breast cancer patients. By providing the first multicentre, longitudinal echocardiography dataset with expert-adjudicated cardiotoxicity labels, it establishes a foundational resource for the machine learning community. Its role as the primary technical reference for the EchoRisk-MICCAI 2026 challenge ensures widespread adoption and will catalyze significant research into novel AI methods for cardiac ultrasound. Success in tasks like early cardiotoxicity prediction could lead to personalized treatment strategies, timely cardioprotective interventions, reduced treatment interruptions, and ultimately improved long-term cardiovascular outcomes for cancer patients. The open-source nature of the dataset and tools will foster collaborative research, accelerating progress in this vital area of medical AI and serving as a model for future clinically relevant benchmarks.
Retrieval-augmented generation (RAG) enhances LLMs by incorporating external knowledge to support response generation. However, conflicts between retrieved context and parametric knowledge have emerged as a critical challenge in RAG systems. To mitigate such conflicts, numerous studies have attempted to identify and edit knowledge-related internal neurons, aiming to improve the ability of LLMs to rely on contextual evidence during generation. However, these neuron-level approaches may introduce unintended cascading effects that compromise the general capabilities of LLMs, as the modified neurons are often entangled with broader model behaviors and functionalities. In this paper, we introduce SHIFT, a novel framework that reformulates neuron-level modification as learnable gate modulation, allowing LLMs to adaptively regulate internal activations for knowledge conflict resolution. Technically, our SHIFT equips LLMs with a lightweight gate module and optimizes fewer than 0.01% trainable parameters while keeping the backbone model frozen. During generation, the gate module adjusts the model's internal representations to adaptively leverage contextual and parametric knowledge. Extensive experiments on six datasets validate the effectiveness of our SHIFT in comparison with various competing baselines. All datasets and code are available at https://github.com/OpenBMB/SHIFT.
Primary: Tsinghua University
All Institutions: Tsinghua University, Beijing Academy of Artificial Intelligence (BAAI), Shanghai AI Laboratory
SHIFT introduces a novel gate-modulated activation steering framework to effectively mitigate knowledge conflicts in Retrieval-Augmented Generation (RAG) systems. This approach offers a parameter-efficient and less intrusive alternative to traditional neuron-level knowledge editing, demonstrating superior performance across diverse factual and non-factual conflict datasets while preserving the LLM's general capabilities. The paper presents a well-designed methodology, comprehensive experimental validation, and a strong commitment to reproducibility, making a valuable contribution to the robustness and reliability of RAG systems.
The paper introduces SHIFT, a novel framework for mitigating knowledge conflicts in Retrieval-Augmented Generation (RAG) by employing gate-modulated activation steering. The core idea is to replace explicit neuron-level knowledge editing, which can lead to cascading effects and compromise general LLM capabilities, with a lightweight, learnable gate module. This gate module adaptively regulates internal activations to balance between retrieved contextual knowledge and the LLM's parametric knowledge.

Technically, SHIFT operates by inserting a small, trainable gate module into the feed-forward network (FFN) layers of a frozen backbone LLM. Specifically, for each FFN output $h_{ffn}$, SHIFT computes a gate value $g$ based on the input representation $x$ and the retrieved context $c$. This gate $g$ then modulates the FFN output, effectively scaling it: $h'_{ffn} = g \odot h_{ffn}$. The gate $g$ is learned via a small neural network (e.g., a two-layer MLP) that takes the concatenation of the input hidden state and a contextual representation (derived from the retrieved documents) as input. This design is inspired by LoRA-like parameter-efficient fine-tuning methods but is applied specifically to control activation flow for knowledge conflict resolution.

The training objective for SHIFT is multi-faceted, aiming to achieve three goals:

1. **Conflict Resolution:** Maximize the likelihood of generating correct answers when conflicts exist.
2. **Factual Consistency:** Ensure the model relies on the retrieved context when it is accurate.
3. **General Capability Preservation:** Minimize the impact on the LLM's original knowledge and capabilities when no conflict or retrieval is present.

This is achieved through a weighted sum of three loss components: a standard cross-entropy loss on conflict-resolved data, a consistency loss that encourages alignment with retrieved facts, and a knowledge preservation loss (e.g., KL divergence) to maintain original model behavior. The adaptive nature comes from the gate's dependency on both input and context, allowing dynamic adjustment during inference.

The methodology is sound and addresses a critical problem in RAG. The parameter-efficient nature (<0.01% trainable parameters) is a significant advantage, making it practical for large LLMs and reducing the risk of catastrophic forgetting associated with full fine-tuning or extensive knowledge editing. The reformulation of neuron-level modification as learnable gate modulation is a clever abstraction that offers more flexibility and less intrusiveness.
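The gating step $h'_{ffn} = g \odot h_{ffn}$ described in this review can be sketched in a few lines of NumPy. This is a hedged illustration, not the released SHIFT code: the class name `GateModule`, the ReLU hidden layer, the sigmoid output, the layer sizes, and the way the context vector `c` is obtained are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GateModule:
    """Two-layer MLP producing an elementwise gate for an FFN output (hypothetical)."""

    def __init__(self, d_model, d_ctx, d_hidden, rng):
        self.W1 = rng.normal(0.0, 0.02, size=(d_model + d_ctx, d_hidden))
        self.b1 = np.zeros(d_hidden)
        self.W2 = rng.normal(0.0, 0.02, size=(d_hidden, d_model))
        self.b2 = np.zeros(d_model)

    def __call__(self, x, c):
        # Concatenate the hidden state with a context representation.
        z = np.concatenate([x, c], axis=-1)
        hidden = np.maximum(z @ self.W1 + self.b1, 0.0)  # ReLU (assumed)
        return sigmoid(hidden @ self.W2 + self.b2)       # gate values in (0, 1)

def gated_ffn_output(h_ffn, x, c, gate):
    """Scale the frozen FFN output elementwise: h' = g * h_ffn."""
    return gate(x, c) * h_ffn

rng = np.random.default_rng(0)
d_model, d_ctx = 16, 8
gate = GateModule(d_model, d_ctx, d_hidden=4, rng=rng)
x = rng.normal(size=(3, d_model))      # three token hidden states
c = rng.normal(size=(3, d_ctx))        # assumed pooled context vector per token
h_ffn = rng.normal(size=(3, d_model))  # stand-in for the frozen FFN output
h_gated = gated_ffn_output(h_ffn, x, c, gate)
```

Only the gate weights would be trained while the backbone stays frozen. Note that a plain sigmoid gate can only attenuate activations; a gate meant to amplify as well as suppress would need a different output range, and the review does not specify which form the paper uses.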
The experiments are conducted on six datasets, covering both factual and non-factual knowledge conflicts, which is a good breadth. The datasets include:

* **Factual Conflict:** PopQA, CounterFact, WikiBio
* **Non-Factual Conflict:** Self-Correction, Hallucination-Correction, TruthfulQA

This diverse set of benchmarks allows for a comprehensive evaluation of SHIFT's ability to handle different types of knowledge conflicts. The baselines chosen are appropriate and representative of existing approaches, including:

* **Vanilla RAG:** Standard RAG without conflict mitigation.
* **Knowledge Editing methods:** MEMIT, ROME (neuron-level editing).
* **Context-aware methods:** Self-Correction (prompting-based), Hallucination-Correction (fine-tuning based).
* **Parameter-Efficient Fine-tuning (PEFT) methods:** LoRA (as a general adapter).

The evaluation metrics include accuracy, F1 score, and generation quality metrics (e.g., perplexity, faithfulness). The results consistently demonstrate SHIFT's superiority across most benchmarks, achieving higher accuracy and F1 scores compared to baselines, particularly in factual conflict scenarios. Crucially, SHIFT also shows better preservation of general capabilities and less degradation on non-conflict data compared to neuron-editing methods.

Ablation studies are performed to analyze the contribution of different loss components (conflict resolution, factual consistency, knowledge preservation) and the placement of the gate module. These studies provide valuable insights into the design choices and confirm the importance of each component. The analysis of gate values and their correlation with conflict intensity further supports the adaptive nature of SHIFT. Qualitative examples also illustrate how SHIFT helps the model prioritize contextual information over conflicting parametric knowledge.

The experiments are extensive and well-designed, providing strong empirical evidence for SHIFT's effectiveness and efficiency. The comparison with both knowledge editing and other RAG enhancement techniques highlights its unique position and advantages.
The paper states that "All datasets and code are available at https://github.com/OpenBMB/SHIFT." This commitment to open-sourcing the code and datasets is excellent for reproducibility. The methodology section provides sufficient detail on the architecture of the gate module, the training objectives, and the overall framework. The experimental section details the datasets, baselines, and evaluation metrics. While specific hyperparameters for each model and dataset might be in the appendix or code, the overall setup seems well-documented, suggesting a high degree of reproducibility.
1. **Complexity of Gate Learning:** While the gate module is lightweight, learning to dynamically modulate activations for nuanced knowledge conflict resolution can still be complex. The paper doesn't deeply explore cases where the gate might misfire or over-correct, leading to new types of errors.
2. **Definition of "Conflict":** The paper relies on predefined datasets for "knowledge conflict." In real-world RAG systems, identifying and categorizing conflicts dynamically can be challenging. SHIFT's effectiveness might depend on the quality and clarity of conflict signals during training.
3. **Scalability to Larger Models:** While the parameter efficiency is a strong point, the computational overhead of the gate module during inference (even if small) might become noticeable for extremely large models or high-throughput scenarios. The paper mentions "fewer than 0.01% trainable parameters," but the inference-time computation cost of the gate itself is not explicitly quantified in terms of latency.
4. **Generalization to Unseen Conflicts:** The model is trained on specific conflict datasets. Its ability to generalize to novel or more subtle forms of knowledge conflicts not encountered during training is an open question.
5. **Interpretability of Gate Decisions:** While the analysis shows correlation between gate values and conflict, a deeper interpretability of *why* the gate chooses to amplify or suppress certain activations in specific contexts could provide further insights and build trust in the system.
SHIFT offers a significant step forward in making RAG systems more reliable and robust. By effectively mitigating knowledge conflicts without compromising the LLM's general capabilities, it can lead to:

* **More Trustworthy LLM Applications:** Reducing factual errors and hallucinations in RAG outputs is crucial for applications in sensitive domains like healthcare, finance, and legal research.
* **Improved User Experience:** Users will receive more consistent and accurate information, enhancing their trust and satisfaction with LLM-powered tools.
* **Efficient Model Deployment:** The parameter-efficient nature means SHIFT can be integrated into existing LLMs with minimal overhead, making it practical for widespread adoption.
* **Advancement in Knowledge Management:** This work contributes to the broader field of knowledge management in AI, offering a refined mechanism for integrating external information with internal model knowledge.
* **Reduced Development Costs:** By avoiding the need for extensive re-training or complex neuron-level interventions, SHIFT can lower the development and maintenance costs of RAG systems.

The work primarily has positive societal impacts by improving the factual grounding of AI systems. No significant negative ethical concerns are immediately apparent, beyond the general risks associated with powerful AI systems if misused.