Last 7 Days (June 28 – July 04, 2026)
Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emph{statistically} wrong -- a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbf{Detection:} on a benchmark of $14$ misspecification types across $10$ model families ($200$ instances), it flags the bug with AUC $0.97$ ($88\%$ at $2\%$ FPR \emph{when handed the correct reference program, an upper bound}) -- and a fully \emph{reference-free} version that uses no correct program reaches $62$--$78\%$ (the upper figure from a small automated model search), versus $0\%$ for a unit-test oracle. \textbf{Repair:} used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback -- which is itself \emph{significantly worse than no feedback at all}, a passing test inducing false confidence that suppresses repair -- and improves over no feedback on strong-but-unsaturated models (GPT-5.1 $33{\to}92\%$, Claude $75{\to}100\%$; paired McNemar, $n{=}228$). \textbf{Reality:} on programs LLMs write from scratch for neutral briefs, $15$--$47\%$ of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
Primary: Columbia University
All Institutions: Columbia University
This paper introduces a paradigm shift for verifying LLM-generated probabilistic programs, demonstrating that Bayesian calibration diagnostics are essential for detecting code-invisible statistical errors, significantly outperforming and even revealing the harm of traditional unit tests in LLM repair loops. The comprehensive technical contribution, rigorous empirical validation on both injected and LLM-generated bugs, and the clear, actionable insights make this a genuinely field-wide significant paper that will reshape how LLM-assisted probabilistic modeling systems are designed and evaluated.
The paper proposes a robust methodology for detecting and repairing statistically misspecified probabilistic programs generated by Language Models (LLMs). The central innovation is the "calibration oracle," which formalizes and automates the Bayesian workflow diagnostics (Posterior Predictive Checks, Simulation-Based Calibration, sampler diagnostics like R-hat and divergences, and held-out predictive density) into a programmatic verifier. This oracle is designed to identify "code-invisible misspecification"—statistical errors that traditional unit tests (compilation, execution, output shape) cannot detect. The repair loop integrates this oracle as feedback for LLMs, guiding them to iteratively refine their programs. The methodology is well-grounded in Bayesian principles, and the formal distinction between code-visible and code-invisible bugs is crucial for understanding the limitations of existing LLM code verification paradigms. The aggregation of diverse diagnostics into a single verdict, with defined thresholds, provides a practical and actionable signal for automated systems.
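To make the idea of a posterior predictive check as a programmatic verifier concrete, here is a minimal sketch for one of the misspecifications named in the abstract (a Poisson likelihood for over-dispersed counts). It is not the paper's oracle: the conjugate Gamma prior, the variance-to-mean test statistic, the function names, and the `alpha=0.02` verdict threshold are all assumptions chosen for illustration.

```python
import numpy as np

def ppc_dispersion_pvalue(y, n_rep=2000, prior_shape=1.0, prior_rate=1.0, seed=0):
    """Posterior predictive p-value for the variance/mean ratio under a Poisson model.

    Assumes a conjugate Gamma(prior_shape, prior_rate) prior on the Poisson rate,
    so the posterior is Gamma(prior_shape + sum(y), prior_rate + n) in closed form.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    n = y.size
    # Exact posterior draws of the rate (no MCMC needed for this toy model).
    lam = rng.gamma(prior_shape + y.sum(), 1.0 / (prior_rate + n), size=n_rep)
    # One replicated dataset per posterior draw.
    y_rep = rng.poisson(lam[:, None], size=(n_rep, n))
    stat_obs = y.var() / y.mean()
    stat_rep = y_rep.var(axis=1) / y_rep.mean(axis=1)
    # Fraction of replicates at least as dispersed as the observed data.
    return float(np.mean(stat_rep >= stat_obs))

def ppc_flags_misspecification(y, alpha=0.02):
    """Hypothetical verdict rule: flag the model when the p-value falls below alpha."""
    return ppc_dispersion_pvalue(y) < alpha

rng = np.random.default_rng(1)
# Over-dispersed counts: negative binomial with mean 5 and variance 17.5.
overdispersed = rng.negative_binomial(n=2, p=2 / 7, size=300)
# Counts that really are Poisson, for comparison.
well_specified = rng.poisson(5.0, size=300)

print("over-dispersed data flagged:", ppc_flags_misspecification(overdispersed))
print("Poisson data p-value:", ppc_dispersion_pvalue(well_specified))
```

A unit test that only checks that the program runs and returns arrays of the right shape would pass on both datasets; the check above separates them because it compares replicated data against the observations.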
The experimental evaluation is exceptionally thorough and convincing, covering detection, repair, and real-world applicability.
1. **Detection**: A comprehensive benchmark of 200 instances across 14 misspecification types and 10 model families was created. The calibration oracle achieved an impressive AUC of 0.97 (88% detection at 2% FPR when given the correct reference program, which the authors treat as an upper bound), demonstrating its efficacy. Critically, a *reference-free* version of the oracle (without a ground-truth correct program) still achieved 62-78% detection, highlighting its practical utility. In stark contrast, the unit-test oracle achieved 0% detection for code-invisible bugs.
2. **Repair**: Experiments with 15 diverse LLMs (open and API) in a repair loop showed that calibration feedback consistently outperformed "no feedback" and "unit-test feedback." A surprising and impactful finding was that unit-test feedback was often *worse than no feedback*, as it induced false confidence and suppressed repair. Calibration feedback led to substantial improvements for strong-but-unsaturated models (e.g., GPT-5.1 from 33% to 92%), with statistically significant gains ($p < 4.5 \times 10^{-10}$ for pooled invisible-bug repairs).
3. **Real LLM Programs**: This section provides the strongest evidence. LLMs were tasked to write programs from scratch for neutral briefs. The study found that 15-47% of runnable LLM-generated programs were statistically misspecified (none caught by unit tests). Calibration-guided repair significantly outperformed strong baselines, including LLM-as-judge review, Bayesian-workflow checklists, and data-summary self-debug, achieving an 84% fix rate for misspecified programs.

The results are robust, with detailed ablations and sensitivity analyses for oracle thresholds.
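The abstract reports the repair comparison with a paired McNemar test. As a rough sketch of how such a paired comparison works, the snippet below computes the exact two-sided McNemar p-value from the discordant pairs only. The function names and the toy outcome lists are illustrative assumptions, not the paper's data or analysis script.

```python
from math import comb

def discordant_counts(success_a, success_b):
    """Count pairs repaired under only one of the two feedback conditions."""
    only_a = sum(1 for a_ok, b_ok in zip(success_a, success_b) if a_ok and not b_ok)
    only_b = sum(1 for a_ok, b_ok in zip(success_a, success_b) if b_ok and not a_ok)
    return only_a, only_b

def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value.

    Under the null hypothesis, each discordant pair is equally likely to favor
    either condition, so the smaller count follows Binomial(n, 1/2).
    """
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Toy paired outcomes (made up for illustration): the same 12 buggy programs
# repaired under two hypothetical feedback conditions.
calibration_fixed = [True] * 10 + [False] * 2
unit_test_fixed = [True] * 2 + [False] * 10
only_cal, only_unit = discordant_counts(calibration_fixed, unit_test_fixed)
print(only_cal, only_unit, mcnemar_exact(only_cal, only_unit))
```

Pairs where both conditions succeed or both fail carry no information about which feedback is better, which is why only the discordant counts enter the test.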
The paper demonstrates a high commitment to reproducibility. It specifies exact snapshots for API models and HuggingFace revisions for open models. Detailed hyperparameters for LLM generation (temperature, max tokens, repair budget) and NUTS inference (chains, draws, x64) are provided. Oracle thresholds are explicitly stated, and their sensitivity is analyzed. Crucially, the authors commit to releasing "The full system prompt, contract, feedback templates, benchmark generators, and analysis scripts with the paper," which is excellent practice and will enable full replication and extension of their work.
The authors are transparent about several limitations:
1. **PPC power**: The effectiveness of Posterior Predictive Checks depends on the chosen test statistics, meaning some misspecifications might remain undetected if not captured by the tracked statistics.
2. **Right fit, wrong structure**: The oracle primarily assesses distributional fit, not causal or generative correctness, making it blind to structural errors that yield similar predictive distributions.
3. **Over-wide predictive**: An under-confident model with an overly wide predictive distribution might "cover" the data and pass PPC despite being fundamentally wrong.
4. **Diagnosis without remedy**: Weaker LLMs may receive accurate diagnostic feedback but lack the capability to translate it into a correct structural fix.
5. **SBC expense**: Simulation-Based Calibration is computationally intensive, limiting its practical use in the repair loop for large or costly models.
6. **Inference failure**: Sampler diagnostics can sometimes fire due to poor inference (e.g., NUTS issues) rather than genuine model misspecification, leading to false positives.
7. **Benchmark scope**: The repair experiments use a subset of bugs, and the tasks are classical low-dimensional models, suggesting that extending the approach to high-dimensional or complex structured models (e.g., deep models, spatial-temporal models) is future work.
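The SBC expense noted above follows from the procedure itself: every simulated dataset needs its own posterior fit. The sketch below uses a conjugate normal model, where the posterior is available in closed form, so the loop is cheap here; with MCMC each iteration would be a full sampler run. The model, the `post_scale` knob for mimicking an over-confident inference, and the chi-square summary are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def sbc_ranks(n_sims=1000, n_draws=99, n_obs=10, sigma=1.0, post_scale=1.0, seed=0):
    """SBC ranks for the conjugate model mu ~ N(0, 1), y_i ~ N(mu, sigma^2).

    post_scale multiplies the posterior standard deviation: 1.0 is the exact
    posterior, values below 1.0 mimic an over-confident (miscalibrated) inference.
    """
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        mu_true = rng.normal(0.0, 1.0)                 # draw from the prior
        y = rng.normal(mu_true, sigma, size=n_obs)     # simulate data
        post_var = 1.0 / (1.0 + n_obs / sigma**2)      # closed-form posterior
        post_mean = post_var * y.sum() / sigma**2
        draws = rng.normal(post_mean, post_scale * np.sqrt(post_var), size=n_draws)
        ranks[s] = int(np.sum(draws < mu_true))        # rank of the true value
    return ranks

def rank_chi2(ranks, n_draws, n_bins=10):
    """Chi-square distance of the binned rank histogram from uniform."""
    counts, _ = np.histogram(ranks, bins=n_bins, range=(0, n_draws + 1))
    expected = ranks.size / n_bins
    return float(np.sum((counts - expected) ** 2) / expected)

calibrated = sbc_ranks()
overconfident = sbc_ranks(post_scale=0.5)
print("chi2 calibrated:   ", rank_chi2(calibrated, 99))
print("chi2 overconfident:", rank_chi2(overconfident, 99))
```

With a correct posterior the ranks are uniform on 0..`n_draws`; an over-confident posterior piles ranks up at both ends, which the chi-square summary picks up.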
This paper has profound broader impact for the burgeoning field of LLM-assisted scientific computing and probabilistic modeling. It fundamentally shifts the paradigm for verifying LLM-generated probabilistic code from "compilation" to "calibration," providing a principled and empirically validated framework for ensuring statistical correctness. This work is crucial for building trustworthy and reliable LLM agents that can assist in scientific discovery, data analysis, and model development. The PPL-agnostic nature of the Bayesian diagnostics means the approach is broadly applicable across popular probabilistic programming languages like Stan, Pyro, and PyMC. The surprising finding that unit-test feedback is actively harmful for LLM repair loops is a critical insight for designing future LLM agent architectures. This paper sets a new standard for evaluating and improving LLM-generated scientific code.
4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a narrow signal insufficient to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt it via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames--no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io.
Primary: A*STAR (Agency for Science, Technology and Research)
All Institutions: A*STAR, NTU Singapore (Nanyang Technological University), Alibaba Group
The paper presents a significant advancement in 4D hand motion reconstruction by effectively adapting video diffusion models for perception tasks, achieving state-of-the-art performance on challenging benchmarks through a novel hand-overlay rendering adaptation and a geometrically-aware dual-branch decoder.
The paper proposes a novel paradigm for 4D hand motion reconstruction by leveraging the internal representations of a large-scale pretrained Video Diffusion Model (Wan2.1-VACE). Instead of treating the diffusion model as a generative black box or a frozen feature extractor, the authors introduce a "hand-overlay rendering" adaptation stage. This involves finetuning only the VACE branch of the model to regenerate input clips with semi-transparent rendered hand overlays. This clever pretext task specializes the model's world priors (occlusion reasoning, temporal coherence, 3D geometry) for hand-centric tasks without destroying the general visual knowledge. The decoder is a dual-branch architecture: a Hand-Token Branch for holistic articulated pose and a Joint-Heatmap Branch for local 2D localization, coupled by mutual cross-attention and a closed-form geometric solve for camera translation. This design elegantly separates the holistic vs. local inductive biases of the representation. The approach is methodologically sound, theoretically motivated by the capabilities of generative models, and technically sophisticated in its integration of diffusion features with geometric constraints.
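One common way to obtain a closed-form camera translation from root-relative 3D joints and 2D keypoints is a linear least-squares solve under a pinhole model. The sketch below shows that generic construction. It is an assumption about what such a solve can look like, not the paper's exact formulation, and the function name, intrinsics, and synthetic joints are hypothetical.

```python
import numpy as np

def solve_translation(joints_3d, keypoints_2d, fx, fy, cx, cy):
    """Least-squares translation t so that joints_3d + t projects onto keypoints_2d.

    joints_3d: (J, 3) root-relative joints in camera orientation.
    keypoints_2d: (J, 2) pixel coordinates of the same joints.
    """
    X, Y, Z = joints_3d.T
    u = keypoints_2d[:, 0] - cx
    v = keypoints_2d[:, 1] - cy
    zeros = np.zeros_like(X)
    # From u = fx (X + tx) / (Z + tz):  fx*tx - u*tz = u*Z - fx*X  (likewise for v).
    A = np.concatenate([
        np.stack([np.full_like(X, fx), zeros, -u], axis=1),
        np.stack([zeros, np.full_like(Y, fy), -v], axis=1),
    ])
    b = np.concatenate([u * Z - fx * X, v * Z - fy * Y])
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t

# Synthetic check with hypothetical intrinsics and 21 random joints.
rng = np.random.default_rng(0)
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0
joints = rng.normal(0.0, 0.05, size=(21, 3))
t_true = np.array([0.10, -0.05, 0.50])
cam = joints + t_true
keypoints = np.stack([fx * cam[:, 0] / cam[:, 2] + cx,
                      fy * cam[:, 1] / cam[:, 2] + cy], axis=1)
print(solve_translation(joints, keypoints, fx, fy, cx, cy))
```

Because the projection equations are linear in the translation once the articulated pose and 2D locations are fixed, no iterative test-time optimization is needed for this step.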
The evaluation is comprehensive and rigorous. The authors test on three challenging egocentric hand benchmarks: ARCTIC (heavy occlusion), HOT3D (fisheye, high dynamic range, motion blur), and HOI4D (cross-dataset generalization). They introduce a "penalty protocol" that folds false negatives into pose metrics, providing a more realistic assessment of detection robustness than standard TP-only metrics. ViDiHand establishes new state-of-the-art results across all metrics, with particularly significant gains in frame accuracy (detection robustness) and temporal jitter (smoothness). The ablation studies are thorough, validating the choice of DiT layer, denoising step, and decoder components. The cross-dataset transfer to HOI4D demonstrates the generalizability of the learned priors. The results are statistically significant and practically meaningful, showing that video diffusion models capture richer spatiotemporal priors than discriminative video models or image-based detectors.
The paper provides detailed implementation details, including the specific backbone (Wan2.1-VACE), the two-stage training curriculum (joint overlay then MANO mesh overlay), and the decoder architecture. The supplementary material contains extensive details on the evaluation protocol, metric definitions, and ablation studies. The project page link suggests code/data availability, which is standard for high-impact ML papers. The use of a publicly available backbone (Wan2.1) enhances reproducibility, although the specific finetuning steps and data preprocessing pipelines would need to be carefully followed. The closed-form geometric solve is well-defined.
The primary limitation is computational cost. The method runs at 5.5 fps on 4 A100 GPUs, making it an offline annotation tool rather than a real-time solution. The authors acknowledge this and suggest distillation as a future direction. Additionally, Stage 1b still requires MANO-annotated video, which is a scarce resource, though the authors propose self-supervised pretexts to relax this in the future. The method may also struggle with extreme cases not covered in the training data, although the cross-dataset results suggest good generalization.
This work has significant implications for embodied AI, robotics, and human-computer interaction. By providing a scalable, high-quality method for 4D hand reconstruction from egocentric video, it enables the creation of large-scale datasets for training robot policies and understanding human behavior. The paradigm shift towards leveraging video generative models for perception tasks could influence future research in 3D vision, motion capture, and video understanding. It also highlights the untapped potential of diffusion models for discriminative tasks, potentially inspiring similar approaches in other domains.
Technology computer-aided design (TCAD) semiconductor device simulation is fundamentally constrained by the high computational cost of iteratively solving coupled drift-diffusion equations. Existing ML surrogates either reduce internal physics to macroscopic scalar regressions, or rely on single-step mappings that lack the iterative refinement required to resolve stiff, coupled fields. To address this, we introduce PCGD, a Physics-Guided Conditional Graph Diffusion framework operating natively on unstructured TCAD meshes to predict coupled electrostatic and carrier density fields. PCGD employs a Condition-Aware MeshGraphNet denoiser that explicitly injects boundary conditions and device structure context via global cross-attention. By augmenting data-driven denoising with a physics-guided hybrid objective that integrates exponent-free quasi-Fermi gradient matching with noise-aware PDE residuals, PCGD progressively enforces physical constraints in the iterative diffusion trajectory. This strategy successfully bypasses the numerical instabilities typical of stiff drift-diffusion equations. Evaluated on a challenging mixed PN/MOS benchmark, PCGD significantly outperforms deterministic one-step regression (1.207% error) and local diffusion (1.585% error) baselines by achieving a sub-percent mean relative field error of 0.835%, while concurrently reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. It also transfers robustly to unseen SOI topologies (0.815% error) via LoRA adaptation, using 5.30$\times$ less data and 14.34$\times$ fewer parameters than full fine-tuning. Ultimately, PCGD bridges the computational efficiency of generative surrogates with the rigorous physical fidelity of traditional TCAD, unlocking highly scalable, field-level analysis for robust device engineering.
Primary: Fudan University
All Institutions: Fudan University, Shanghai Integrated Circuit Manufacturing Innovation Center
PCGD has a profound broader impact, particularly in the semiconductor industry and scientific machine learning. By significantly accelerating TCAD simulations while maintaining high physical fidelity, it can drastically shorten the design cycle for advanced semiconductor devices, fostering innovation in microelectronics for applications ranging from AI hardware to power electronics. The methodology for stabilizing physics-informed diffusion models for highly stiff and nonlinear PDEs is generalizable beyond TCAD. It offers a blueprint for tackling similar challenges in other scientific and engineering domains, such as materials science, fluid dynamics, and biomechanics, where complex multiphysics simulations are computationally expensive and numerically challenging. The concept of a hybrid AI-TCAD simulation workflow, where ML provides robust initial guesses to accelerate or stabilize traditional solvers, represents a powerful paradigm for integrating AI into complex scientific computing pipelines, promising both efficiency and reliability. Furthermore, the successful application of parameter-efficient fine-tuning (LoRA) for domain adaptation in a scientific context highlights its potential for developing more adaptable and data-efficient ML models for specialized scientific tasks. PCGD introduces a physics-guided conditional graph diffusion framework that effectively addresses the computational bottlenecks and physical fidelity challenges in TCAD device simulation. This work makes substantial contributions through its mesh-native graph diffusion approach, condition-aware MeshGraphNet architecture with global cross-attention, and a novel hybrid physics-guided training objective that stabilizes learning for stiff, nonlinear drift-diffusion equations. 
The impressive empirical results, demonstrating sub-percent accuracy and near three-order-of-magnitude reduction in PDE residuals, coupled with robust transferability to unseen topologies, position PCGD as a significant advancement in ML-accelerated scientific computing, with clear implications for future semiconductor design and broader multiphysics simulation.
The methodology presented in PCGD is highly sophisticated and meticulously designed to tackle the inherent challenges of TCAD device simulation. The core idea of formulating coupled-field prediction as a conditional generative denoising task on native unstructured TCAD meshes is a significant departure from previous ML surrogates. This approach leverages the iterative refinement capabilities of diffusion models to decouple macroscopic field construction from microscopic detail resolution, which is crucial for handling the extreme stiffness and exponential nonlinearities of semiconductor physics. The Condition-Aware MeshGraphNet denoiser is a well-thought-out architectural innovation. By explicitly injecting boundary conditions and device structure context via global cross-attention, it effectively overcomes the limitations of purely local message passing, ensuring that critical global information (like terminal biases and device layout) is directly accessible to every mesh node. This design choice is critical for accelerating convergence to correct physical operating conditions. The most impactful methodological contribution is the hybrid physics-guided objective. It ingeniously combines an exponent-free quasi-Fermi gradient matching loss ($L_G$) for stable guidance during early, noisy diffusion stages with noise-aware, adaptively scaled exact PDE residuals ($L_P$) for fine-grained physical correction in later stages. The dual-tier stabilization strategy for $L_P$ (validity-aware SNR gating and adaptive batch balance) is a robust solution to prevent gradient divergence caused by the extreme nonlinearity of drift-diffusion equations. The detailed mathematical derivation connecting Scharfetter-Gummel flux to quasi-Fermi gradients and the implementation of GPU-accelerated graph operators for discrete PDE residuals further demonstrate the technical depth and rigor of the proposed framework.
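For readers unfamiliar with the Scharfetter-Gummel flux mentioned above, the following sketch shows the textbook edge-wise electron flux with a numerically stable Bernoulli function, together with the quasi-Fermi potential computed from a logarithm and not an exponential. It illustrates the standard discretization only. The function names, the omission of the charge factor, the thermal voltage value, and the intrinsic density are assumptions, and the paper's $L_G$ and $L_P$ losses are not reproduced here.

```python
import numpy as np

def bernoulli(x):
    """B(x) = x / (exp(x) - 1), evaluated stably near x = 0 and for large |x|."""
    x = np.asarray(x, dtype=float)
    small = np.abs(x) < 1e-6
    safe = np.where(small, 1.0, x)          # placeholder avoids 0/0 in the unused branch
    with np.errstate(over="ignore"):
        general = safe / np.expm1(safe)
    series = 1.0 - x / 2.0 + x**2 / 12.0    # Taylor expansion around zero
    return np.where(small, series, general)

def sg_electron_flux(n_i, n_j, psi_i, psi_j, h, diffusivity, v_t=0.02585):
    """Scharfetter-Gummel electron flux along an edge i -> j (charge factor omitted)."""
    delta = (psi_j - psi_i) / v_t
    return diffusivity / h * (n_j * bernoulli(delta) - n_i * bernoulli(-delta))

def quasi_fermi_potential(psi, n, n_int=1e10, v_t=0.02585):
    """Electron quasi-Fermi potential phi_n = psi - V_T * ln(n / n_int)."""
    return psi - v_t * np.log(n / n_int)

# Equilibrium check: n = n_int * exp(psi / V_T) gives a flat quasi-Fermi potential,
# and the Scharfetter-Gummel flux between the two nodes cancels.
v_t, n_int = 0.02585, 1e10
psi_i, psi_j = 0.0, 0.3
n_i, n_j = n_int * np.exp(psi_i / v_t), n_int * np.exp(psi_j / v_t)
flux = sg_electron_flux(n_i, n_j, psi_i, psi_j, h=1e-6, diffusivity=36.0)
scale = 36.0 / 1e-6 * n_j * bernoulli((psi_j - psi_i) / v_t)
print("relative equilibrium flux:", abs(flux) / scale)
print("quasi-Fermi difference:", quasi_fermi_potential(psi_j, n_j) - quasi_fermi_potential(psi_i, n_i))
```

The two terms of the flux are each large and nearly cancel, which hints at why residuals built directly on exponentials of the potential are hard to train against, while the quasi-Fermi difference stays on the scale of volts.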
The experimental evaluation is comprehensive, rigorous, and strongly supports the claims made in the paper. The use of a challenging mixed PN/MOS benchmark dataset, comprising a large number of training and validation snapshots from distinct devices and operating modes, ensures the robustness of the evaluation. The device-trajectory level splitting for train/validation is critical for assessing generalization. The ablation studies are particularly effective, systematically isolating the contributions of iterative generative refinement, global conditioning, and physics-guided supervision by comparing PCGD against well-chosen baselines (deterministic one-step regression, local diffusion, and condition-aware diffusion with varying levels of physics guidance). The chosen metrics, mean relative $L_2$ error and log-compressed maximum PDE residual error, are appropriate and provide a holistic view of both accuracy and physical consistency. The results are highly impressive: PCGD achieves a sub-percent mean relative field error (0.835%) and significantly outperforms all baselines, reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. The analysis of convergence dynamics clearly illustrates the stability gained from the hybrid physics-guided objective. Furthermore, the demonstration of robust transferability to unseen SOI topologies via parameter-efficient LoRA adaptation, requiring significantly less data and fewer parameters than full fine-tuning, is a strong indicator that PCGD learns generalizable physical priors, which is crucial for practical adoption in industrial settings.
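The parameter savings from LoRA come from training only a low-rank update while the pretrained weight stays frozen. A minimal NumPy sketch of a single adapted linear layer follows. The class name, layer sizes, rank, and initialization are hypothetical and are not meant to reproduce the 14.34$\times$ figure reported for PCGD.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T plus a trainable low-rank update (alpha / r) * x A^T B^T."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable down-projection
        self.B = np.zeros((d_out, rank))                   # trainable, zero init: adapter starts as a no-op
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.normal(size=(128, 128))
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(8, 128))
print("full fine-tuning params:", W.size)
print("LoRA trainable params:  ", layer.trainable_params())
```

For a weight of shape `(d_out, d_in)` the adapter trains `rank * (d_in + d_out)` values, so the saving grows with layer width and shrinks with rank.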
The paper provides a commendable level of detail for reproducibility. The methodology is thoroughly described with mathematical formulations for the graph representation, diffusion process, Condition-Aware MeshGraphNet, and the hybrid physics-guided objective. The appendix offers specific architectural settings, coefficients for the loss functions, and details on the baseline architectures. Crucially, numerical safety measures for handling stiff exponentials and precision truncation are explicitly mentioned. The schema for mesh features and interface edges is also detailed. While the absence of publicly available code (no project URL provided) is a common limitation for arXiv preprints, the textual descriptions are sufficiently comprehensive for an experienced researcher with expertise in GNNs, diffusion models, and scientific computing to implement the core components of PCGD. Access to the specific TCAD data generation pipeline would be the final piece for exact replication of the dataset.
Despite its strengths, PCGD exhibits some limitations. The paper acknowledges "severe zero-shot degradation" when transferring to major out-of-distribution topological shifts (e.g., SOI MOSFETs not seen during pretraining). While LoRA adaptation effectively addresses this, it implies that the model, despite learning physical priors, still relies on structural interpolation from its pretraining data, limiting its immediate "universal foundational model" capabilities without a much broader pretraining corpus. The current experimental validation is primarily on 2D devices. While the paper discusses the theoretical O(N) scaling advantage for 3D simulations, empirical validation on massive 3D meshes, including memory footprint and training/inference times, is yet to be demonstrated. The computational cost of training such a complex diffusion model on large graph datasets is likely substantial, though not explicitly quantified. Finally, the proposed hybrid AI-TCAD workflow mentions routing high-residual cases back to traditional solvers, but the practical implementation details, criteria for routing, and the efficiency gains from this hybrid approach are not fully explored.
Although neural networks are remarkably effective, their underlying optimization principles remain theoretically elusive, often characterized by non-convex landscapes and stochastic heuristics. In this work, we propose a paradigm shift by replacing the discrete training problem of shallow neural networks with a well-posed continuum variational surrogate. We identify a family of $\lambda$-convex functionals over parameter densities in weighted Sobolev spaces and prove that these variational problems are globally well-posed, stable, and exhibit unexpected almost $C^3$ regularity. Unlike existing Wasserstein-based or Mean-Field approaches, which often face limited regularity and discretization challenges, our formulation provides direct access to elliptic regularity and convex analysis. This allows us to prove that the optimal parameter density can be obtained by solving a single linear system, bypassing iterative optimization entirely. We establish explicit generalization error controls at a rate of $1/\alpha$ relative to the regularization parameter, and prove that finite-width networks of size $N$ achieve the continuum optimum at an $O(1/N)$ rate. This perspective bridges the gap between the Neural Tangent Kernel (NTK) and feature-learning regimes, providing a principled framework for understanding over-parameterization through the lens of variational calculus.
Primary: University of Warsaw
All Institutions: University of Warsaw, Brno University of Technology, Université de Toulouse, INSA Toulouse
This paper offers a profound theoretical contribution that could have a significant broader impact on the field of machine learning, particularly in theoretical understanding and the development of new research directions.
* **New Theoretical Paradigm**: It introduces a genuinely novel paradigm for analyzing neural networks, distinct from existing mean-field/Wasserstein and NTK frameworks. This opens up a new avenue for applying advanced mathematical tools (variational calculus, elliptic PDE theory) to ML problems.
* **Understanding Implicit Bias**: The discovery of near-$C^3$ regularity for optimal parameter densities provides a concrete and quantitative form of implicit bias towards smooth, well-generalizing solutions. This offers a deeper understanding of why overparameterized networks avoid overfitting.
* **Bridging Regimes**: By offering a globally convex, nonlinear model that is exactly tied to finite-width networks, it bridges the gap between the lazy-training NTK regime and the feature-learning capabilities of mean-field approaches.
* **Novel Regularization Principles**: The framework could inspire new regularization techniques grounded in variational principles and Sobolev spaces, potentially leading to more robust and interpretable models.
* **Alternative Optimization**: The demonstration that the optimal density can be found by solving a linear system, even if currently limited to shallow networks, is a conceptual breakthrough that might inspire new hybrid optimization strategies or analytical solutions for specific network architectures.
* **Foundational Research**: While not immediately applicable to state-of-the-art deep learning, this work lays a strong theoretical foundation that could influence future research on multi-layer networks, potentially leading to new insights into their complex optimization landscapes and generalization properties.
This paper proposes a paradigm shift by formulating shallow neural network training as a globally well-posed continuum variational problem in weighted Sobolev spaces, yielding a unique, almost $C^3$ regular minimizer obtainable by solving a single linear system. This work provides profound theoretical insights into the implicit bias and generalization of neural networks, bridging existing frameworks with a novel mathematical approach based on convex analysis and elliptic PDE theory, despite its current limitation to shallow architectures.
This paper proposes a highly novel and mathematically rigorous variational formulation for shallow neural networks, departing significantly from existing mean-field/Wasserstein and Neural Tangent Kernel (NTK) approaches. The core methodology involves replacing the discrete training problem with a continuum variational surrogate defined over parameter densities in weighted Sobolev spaces ($W^{1,2}(\Omega) \cap L^2_\omega(\Omega)$). The authors identify a family of $\lambda$-convex functionals, which is a key innovation enabling global well-posedness, stability, and high regularity of the solutions. A major strength of this approach is its direct access to convex analysis and elliptic PDE theory, which allows for several profound theoretical results:
1. **Global Convexity**: The proposed functional is proven to be globally $2(\lambda, \mu)$-convex, ensuring the existence and uniqueness of a minimizer without linearization assumptions.
2. **High Regularity**: The optimal parameter density is shown to possess unexpected almost $C^3$ regularity, a level of smoothness not typically accessible in other infinite-width analyses. This is derived from the Euler-Lagrange equation, which turns out to be a linear elliptic PDE.
3. **Direct Solution**: Crucially, the optimal parameter density (or its projection onto a finite-dimensional basis) can be obtained by solving a single linear system, completely bypassing iterative optimization methods like gradient descent. This is a remarkable theoretical achievement for this specific problem formulation.
4. **Consistency with Discrete Networks**: The paper proves the absence of a Lavrentiev gap, meaning the infimum of the risk is the same whether optimizing over atomic measures (finite-width networks), Sobolev densities, or smooth functions. Furthermore, finite-width networks of size $N$ are shown to achieve the continuum optimum at an $O(1/N)$ rate.
5. **Gradient Flow Analysis**: The associated $L^2_\omega$-gradient flow is shown to converge exponentially fast to the unique minimizer, providing insights into the continuous-time dynamics.

The methodology is deeply rooted in advanced mathematics (functional analysis, PDE theory, calculus of variations) and provides a fresh perspective on understanding the implicit bias and generalization properties of neural networks. The use of Sobolev regularization to promote smoothness is well-justified within this framework.
The experimental evaluation is primarily illustrative, serving to validate the theoretical claims rather than to achieve state-of-the-art performance on large-scale benchmarks. The authors demonstrate how the theoretical framework translates into a practical computational method: approximating the parameter density with a polynomial ansatz and solving the resulting ridge regression-like linear system. The experiments include:

1. **1D Sine Function**: Shows that the regularized solution accurately and smoothly tracks the target, outperforming an unregularized ansatz (which overfits) and a single-layer neural network baseline (which is noisier). This highlights the non-overfitting and smoothness properties predicted by the theory.
2. **1D Discontinuous Sign Function**: Illustrates the stability of the minimizer to noise and outliers, consistent with the theoretical stability results.
3. **Benchmark Datasets (Diabetes, California Housing)**: These are small-scale regression tasks. The proposed method, using polynomial basis functions, is compared against a single-hidden-layer network with 10,000 ReLU neurons trained by Adam/SGD. The results claim competitive or superior accuracy, demonstrating strong finite-sample performance.

While the experiments effectively showcase the properties of the proposed variational formulation (smoothness, stability, non-overfitting, and the ability to find a solution via a linear system), they are limited in scope. The "neural network baseline" is a shallow network, not representative of the deep architectures prevalent in modern ML. The datasets are small, and the focus is on demonstrating the *feasibility* and *characteristics* of the method rather than its competitive performance against complex, deep learning models.
The paper provides a detailed mathematical formulation of the variational problem, the regularization terms, and the derivation of the Euler-Lagrange equation. It also explains how the problem reduces to solving a linear system (ridge regression) when using a finite basis approximation. Specific details regarding the weight function $\omega(\theta)$, basis functions (polynomials, cosine, Legendre), and regularization parameters used in the numerical examples are mentioned. However, the paper does not provide a link to a code repository or supplementary material in the main text. While the theoretical framework is precisely defined, reproducing the exact numerical results would require careful implementation of the basis functions, the construction of the matrices $U, V, W$, and the solution of the linear system, which could be non-trivial without provided code. The mention of "supplementary material" suggests that more details might exist, but they are not readily accessible from the paper itself.
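To make the "single linear system" reduction concrete, the following is a minimal sketch under simplifying assumptions that are not taken from the paper: a one-dimensional parameter domain $\Omega = [-1, 1]$, ReLU units of the form $\max(x - \theta, 0)$, a Legendre basis for the density, and an unweighted $H^1$-type penalty. The function name `fit_density` and all hyperparameters are hypothetical, and the paper's matrices $U, V, W$ are not reproduced here.

```python
import numpy as np
from numpy.polynomial import legendre as leg

def fit_density(x, y, n_basis=6, lam=1e-6, n_quad=64):
    """Fit coefficients c of rho(theta) = sum_k c_k P_k(theta) on [-1, 1] (toy setup)."""
    theta, w = leg.leggauss(n_quad)              # Gauss-Legendre nodes and weights
    phi = leg.legvander(theta, n_basis - 1)      # P_k(theta_j), shape (n_quad, n_basis)
    eye = np.eye(n_basis)
    # Derivatives P_k'(theta_j), needed for the H^1-type penalty
    dphi = np.stack(
        [leg.legval(theta, leg.legder(eye[k])) for k in range(n_basis)], axis=1
    )
    # Features: psi[i, k] approximates the integral of P_k(theta) * relu(x_i - theta)
    relu = np.maximum(x[:, None] - theta[None, :], 0.0)
    psi = relu @ (w[:, None] * phi)
    # Gram matrix of the penalty ||rho||^2 + ||rho'||^2 in the chosen basis
    gram = phi.T @ (w[:, None] * phi) + dphi.T @ (w[:, None] * dphi)
    n = len(x)
    # Normal equations of the penalized least-squares problem: one linear solve
    c = np.linalg.solve(psi.T @ psi / n + lam * gram, psi.T @ y / n)
    return c, psi

x = np.linspace(-1.0, 1.0, 50)
y = 0.5 * (x + 1.0) ** 2          # network output for the constant density rho = 1
c, psi = fit_density(x, y)
rms = np.sqrt(np.mean((psi @ c - y) ** 2))
print("coefficients:", np.round(c, 3), "rms residual:", rms)
```

The point of the sketch is only that, once the density is expanded in a finite basis, the fit is a ridge-type solve with no iterative training; the cost of that solve grows with the number of basis functions.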
The most significant limitation, explicitly acknowledged by the authors, is that the entire formulation and its strong theoretical guarantees are currently restricted to **shallow (one-hidden-layer) neural networks**. Extending this approach to multi-layer architectures is stated to be "analytically infeasible" with the current framework, as the problem becomes strongly nonlinear, the parameter density lives on a higher-dimensional product space, and the Euler-Lagrange system becomes a coupled nonlinear PDE, making existence of smooth minimizers and convergence of gradient flows elusive. This limits the immediate practical applicability to the dominant deep learning paradigm. Other limitations include:

* **Indirect modeling of SGD**: The analysis focuses on the $L^2_\omega$-gradient flow, which is a continuous-time analogue of gradient descent, but does not directly model the stochastic nature of SGD, which is crucial for training large neural networks.
* **Computational scalability for high-dimensional $\Omega$**: While solving a linear system is efficient, the size of the system ($M \times M$) depends on the number of basis functions $M$. For very high-dimensional parameter spaces $\Omega$ or complex functions requiring a very large $M$, solving the linear system could become computationally intensive ($O(M^3)$).
This paper offers a profound theoretical contribution that could have a significant broader impact on the field of machine learning, particularly in theoretical understanding and the development of new research directions.

* **New Theoretical Paradigm**: It introduces a genuinely novel paradigm for analyzing neural networks, distinct from existing mean-field/Wasserstein and NTK frameworks. This opens up a new avenue for applying advanced mathematical tools (variational calculus, elliptic PDE theory) to ML problems.
* **Understanding Implicit Bias**: The discovery of near-$C^3$ regularity for optimal parameter densities provides a concrete and quantitative form of implicit bias towards smooth, well-generalizing solutions. This offers a deeper understanding of why overparameterized networks avoid overfitting.
* **Bridging Regimes**: By offering a globally convex, nonlinear model that is exactly tied to finite-width networks, it bridges the gap between the lazy-training NTK regime and the feature-learning capabilities of mean-field approaches.
* **Novel Regularization Principles**: The framework could inspire new regularization techniques grounded in variational principles and Sobolev spaces, potentially leading to more robust and interpretable models.
* **Alternative Optimization**: The demonstration that the optimal density can be found by solving a linear system, even if currently limited to shallow networks, is a conceptual breakthrough that might inspire new hybrid optimization strategies or analytical solutions for specific network architectures.
* **Foundational Research**: While not immediately applicable to state-of-the-art deep learning, this work lays a strong theoretical foundation that could influence future research on multi-layer networks, potentially leading to new insights into their complex optimization landscapes and generalization properties.
Physics-informed neural networks (PINNs) have emerged as a promising route to solve partial differential equations, yet they have struggled to reach the precision of classical solvers. The obstacle is increasingly understood to be one of optimisation, owing to the severely ill-conditioned loss landscape. We present $\textbf{DSGNAR}$: Doubly-Sketched Gauss-Newton with Adaptive Ratio, a scalable second-order optimisation framework that confronts this ill-conditioning and, in doing so, obtains unprecedented accuracy and speed. $\textbf{DSGNAR}$ couples a doubly-sketched Gauss-Newton model with a novel strategy that carefully controls both regularisation and step length. Across a suite of problems spanning nonlinear, chaotic, multi-scale, high-dimensional, and Navier-Stokes, the framework greatly improves on the state of the art: able to attain relative $\ell_2$ errors as low as $3\times10^{-16}$ in double precision, improve contemporary results by five orders of magnitude on the canonical Burgers' equation, and as much as eight orders on a high-dimensional Poisson problem, while remaining markedly faster. We further show that, in single precision, solutions at the limit of round-off error can be obtained very quickly: Burgers' equation to $\ell_2^{\text{rel}} = 4.75 \times 10^{-7}$ in under ten seconds. The framework is also robust to the choice of architecture, arithmetic precision, and initial hyperparameters. The code is available at https://www.github.com/wephy/physics-informed-neural-networks
Primary: University of Oxford
All Institutions: University of Oxford
This paper presents a significant advancement in the optimization of Physics-Informed Neural Networks, enabling unprecedented accuracy and speed through a novel doubly-sketched Gauss-Newton framework, thereby addressing a fundamental limitation in the field and expanding the practical applicability of PINNs to high-precision scientific computing tasks.
The paper addresses the critical bottleneck in Physics-Informed Neural Networks (PINNs): the ill-conditioned loss landscape that prevents convergence to high-precision solutions. The proposed method, DSGNAR (Doubly-Sketched Gauss-Newton with Adaptive Ratio), is a sophisticated optimization framework. It combines second-order optimization (Gauss-Newton) with randomized linear algebra techniques (doubly-sketching) to make the Hessian approximation tractable for large-scale problems. Crucially, it introduces an adaptive ratio strategy to control regularization and step length, which stabilizes the training process. This is a significant methodological contribution to the intersection of numerical linear algebra, optimization, and scientific machine learning. The approach is theoretically grounded and practically scalable.
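As a rough illustration of the ingredients named above, the sketch below combines a randomized row sketch of the Gauss-Newton system with a damping parameter adapted from the ratio of actual to predicted loss reduction. It is not the DSGNAR algorithm: it uses a single Gaussian sketch rather than the paper's double sketching, a textbook Levenberg-Marquardt-style ratio rule, and a toy curve-fitting residual instead of a PINN loss. The function name `sketched_gauss_newton` and all constants are assumptions made for illustration.

```python
import numpy as np

def sketched_gauss_newton(residual, jacobian, theta, sketch_rows=40,
                          mu=1.0, iters=50, seed=0):
    """Damped Gauss-Newton on a row-sketched least-squares model (illustrative only)."""
    rng = np.random.default_rng(seed)
    loss = lambda th: 0.5 * np.sum(residual(th) ** 2)
    for _ in range(iters):
        r, J = residual(theta), jacobian(theta)
        # Gaussian row sketch: compress m residuals into sketch_rows random combinations
        S = rng.standard_normal((sketch_rows, r.size)) / np.sqrt(sketch_rows)
        Js, rs = S @ J, S @ r
        g, H = Js.T @ rs, Js.T @ Js
        step = -np.linalg.solve(H + mu * np.eye(theta.size), g)
        # Reduction predicted by the sketched quadratic model vs. the true reduction
        predicted = -(g @ step) - 0.5 * step @ H @ step
        actual = loss(theta) - loss(theta + step)
        ratio = actual / predicted if predicted > 0 else -1.0
        if ratio > 0:                 # accept only steps that lower the true loss
            theta = theta + step
        if ratio > 0.75:              # model is trustworthy: relax the damping
            mu = max(mu / 3.0, 1e-12)
        elif ratio < 0.25:            # model is poor: increase the damping
            mu = mu * 3.0
    return theta

# Toy problem (not from the paper): fit y = a * exp(b * x)
x = np.linspace(0.0, 1.0, 200)
y = 2.0 * np.exp(-1.0 * x)
residual = lambda th: th[0] * np.exp(th[1] * x) - y
jacobian = lambda th: np.stack([np.exp(th[1] * x), th[0] * x * np.exp(th[1] * x)], axis=1)

theta0 = np.array([1.0, 0.0])
theta_hat = sketched_gauss_newton(residual, jacobian, theta0)
loss0 = 0.5 * np.sum(residual(theta0) ** 2)
loss1 = 0.5 * np.sum(residual(theta_hat) ** 2)
print("theta:", theta_hat, "loss:", loss0, "->", loss1)
```

Because a step is accepted only when it lowers the true (unsketched) loss, the loss never increases in this sketch; the ratio test only governs how aggressively the damping changes.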
The experimental evaluation is extensive and compelling. The authors test DSGNAR across a diverse suite of problems including nonlinear PDEs, chaotic systems, multi-scale problems, high-dimensional Poisson equations, and the Navier-Stokes equations. The results are striking: relative $\ell_2$ errors as low as $3 \times 10^{-16}$ in double precision (near machine epsilon) and significant improvements (5-8 orders of magnitude) over state-of-the-art PINN methods on canonical benchmarks like Burgers' equation. The claim of solving high-dimensional problems that were previously intractable for PINNs is a major empirical achievement. The inclusion of single-precision results further demonstrates the robustness and speed of the framework.
The paper provides a GitHub repository link for the code. The authors are from a reputable institution (Oxford) with strong ties to numerical analysis, suggesting rigorous implementation. The detailed description of the doubly-sketching technique and the adaptive ratio strategy provides sufficient detail for replication, assuming access to the code. The use of standard benchmarks (Burgers', Poisson, Navier-Stokes) facilitates direct comparison with existing literature.
The primary limitation is the computational overhead of second-order methods, even with sketching. While the paper claims speed improvements, the constant factors associated with Gauss-Newton iterations and sketching operations may still be higher than first-order methods (like Adam) for very simple problems or small networks. The scalability to extremely large-scale neural networks (e.g., those used in modern foundation models) is not explicitly tested, as the focus is on PDE solutions where the network size is moderate but the domain is complex. Additionally, the method requires careful tuning of the sketching parameters, although the paper claims robustness.
This work has the potential to transform the application of PINNs in scientific computing. By enabling high-precision solutions, PINNs can become viable alternatives to classical solvers for complex, high-dimensional, or irregularly shaped domains where traditional methods struggle. This could accelerate research in fluid dynamics, quantum mechanics, and other fields relying on PDEs. The method also contributes to the broader field of optimization by demonstrating the efficacy of second-order methods with randomized linear algebra for ill-conditioned loss landscapes.
Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.
Primary: Harvard University
All Institutions: Harvard University
The paper introduces Program-as-Weights, a novel paradigm that compiles natural language specifications into neural adapters for small, local interpreters, demonstrating that a 0.6B model can match a 32B model's performance on fuzzy tasks while offering significant efficiency and reproducibility benefits.
The paper proposes "Program-as-Weights" (PAW), a paradigm where a large language model (the compiler) generates parameter-efficient adapters (LoRA) and discrete pseudo-programs for a frozen, small language model (the interpreter). This effectively compiles natural language specifications into neural artifacts. The methodology is technically sound, leveraging hypernetwork-like architectures to map text to weights. The distinction between the "discrete" (pseudo-program) and "continuous" (LoRA) components is a key architectural choice that allows the small interpreter to leverage both explicit instruction following and implicit weight-based specialization. The approach is novel in its specific instantiation for "fuzzy functions" and its focus on local, offline execution via quantized GGUF formats.
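For readers unfamiliar with the "continuous" component, the following minimal sketch shows how a standard LoRA adapter modifies a frozen linear layer. It illustrates generic LoRA only; how the PAW compiler produces the adapter matrices is not shown, and the shapes, the `alpha` scaling value, and the name `lora_forward` are assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Frozen weight W plus a low-rank update (alpha / r) * B @ A, applied to a batch x."""
    r = A.shape[0]                       # adapter rank
    base = x @ W.T                       # frozen interpreter weights, never updated
    update = (x @ A.T) @ B.T             # low-rank path: d_in -> r -> d_out
    return base + (alpha / r) * update

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.standard_normal((d_out, d_in))   # stands in for one frozen layer
A = rng.standard_normal((r, d_in))       # adapter matrices: in PAW these would be
B = rng.standard_normal((d_out, r))      # emitted by the compiler (random here)
x = rng.standard_normal((3, d_in))

y_adapted = lora_forward(x, W, A, B)
y_merged = x @ (W + (16.0 / r) * B @ A).T   # same result with the update merged into W
y_identity = lora_forward(x, W, A, np.zeros_like(B))
```

The adapter stores only `r * (d_in + d_out)` numbers per layer, which is what makes a per-function artifact small relative to the interpreter itself.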
The evaluation is comprehensive, covering a newly released 10M-example dataset (FuzzyBench) and several external benchmarks (YouTube, SMS, Yelp, IMDB). The results are strong: a 0.6B interpreter with PAW matches the performance of a 32B model prompted directly. The ablation studies effectively demonstrate the value of the compiler component over simple fine-tuning or fixed LoRAs. The robustness tests on noisy specifications are particularly convincing, showing the denoising capability of the pseudo-compiler. The multimodal extension (using a VL compiler with a text interpreter) is a nice touch that validates the abstraction's generality. However, the reliance on synthetic data generated by a proprietary model (GPT-5.2) raises questions about data quality and potential bias, although the authors do attempt to mitigate this with verification steps.
The paper provides a GitHub repository and a public demo. The dataset FuzzyBench is released, which is a significant contribution to reproducibility. The code structure and architecture details are sufficiently described. The use of standard components (LoRA, GGUF, Qwen3) aids in reproducibility. Because the data generation pipeline depends on a proprietary model (GPT-5.2), exact replication of that pipeline may be difficult, but the methodology for using the generated data is clear.
The primary limitation is the dependency on a large, capable compiler model to generate the adapters. While the *inference* is cheap and local, the *compilation* step requires significant compute and likely API access to a large model, which contradicts the "fully local" ideal for the initial setup phase. Additionally, the performance on long-form structured generation (Im2LaTeX) was weaker, indicating limitations in context window management for the small interpreter. The reliance on synthetic data also poses a risk of propagating biases or errors from the teacher model.
This work has significant potential impact by bridging the gap between the flexibility of LLMs and the efficiency/reliability of traditional software. It enables developers to create custom, local, and reproducible AI functions without maintaining large model instances. This could democratize access to specialized AI capabilities on edge devices. The release of FuzzyBench provides a valuable benchmark for the community.
We study timestep allocation for score-based diffusion sampling, where a learned reverse-time dynamics is discretized on a finite grid. Uniform and hand-crafted schedules are standard choices, but they rely on fixed prescriptions and can therefore be suboptimal. To address this limitation, we propose Adaptive Reparameterized Time (ART), a continuous-time control formulation that learns a time change by treating the speed of the sampling clock as the control, so that a uniform grid on the learned clock induces adaptive timesteps in the original diffusion time. Based on a leading-order Euler error surrogate, ART provides a principled objective for allocating timesteps along the sampling trajectory. To solve this deterministic control problem, we introduce ART-RL, an auxiliary randomized formulation with Gaussian policies that turns schedule learning into a continuous-time reinforcement learning problem. We prove that the randomized ART-RL formulation is equivalent to ART at the optimizer level, in the sense that its optimal Gaussian policy recovers the optimal ART time-warping rate through its mean. We further establish policy evaluation and policy improvement characterizations and derive trajectory-based moment identities that yield implementable actor--critic updates for learning the schedule. Across experiments ranging from controlled low-dimensional settings to image generation, ART-RL can be plugged into existing diffusion samplers by changing only the timestep grid, consistently improving sample quality over strong baseline schedules at matched budgets while leaving the rest of the sampling pipeline unchanged. The learned schedules also exhibit broad generalization, transferring without retraining across sampling budgets, datasets, solvers, pipelines, and representation spaces.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University, Columbia University
ART for Diffusion Sampling: Continuous-Time Control and Actor-Critic Learning introduces a rigorous control-theoretic framework for learning adaptive timestep schedules in diffusion models, demonstrating that reinforcement learning can effectively optimize the discretization of reverse-time dynamics to improve sample quality and efficiency. The paper makes a substantial technical contribution by bridging continuous-time optimal control and reinforcement learning to solve a critical bottleneck in diffusion model inference, offering a generalizable and theoretically grounded solution that outperforms state-of-the-art heuristic schedules.
The paper proposes Adaptive Reparameterized Time (ART), a novel framework for learning timestep schedules in diffusion sampling. The core methodological innovation lies in formulating timestep allocation as a continuous-time optimal control problem, where the "speed" of the sampling clock is the control variable. To solve this, the authors introduce ART-RL, a randomized reinforcement learning formulation using Gaussian policies. They provide rigorous theoretical links between the deterministic control problem and its randomized counterpart, proving that the mean of the optimal Gaussian policy recovers the optimal deterministic control. The actor-critic algorithm is derived from continuous-time RL theory, specifically leveraging martingale-based policy evaluation and improvement. This is a sophisticated theoretical contribution that bridges optimal control, continuous-time RL, and diffusion model inference.
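The time-change idea can be illustrated with a short sketch: given a positive clock speed as a function of the reparameterized time, a uniform grid on that clock maps to a non-uniform grid in the original diffusion time. The speed function below is hand-picked for illustration, whereas ART learns it; the name `warped_grid`, the horizon `T`, and the fine-grid resolution are assumptions.

```python
import numpy as np

def warped_grid(speed, n_steps, T=1.0, n_fine=2001):
    """Map a uniform grid on the warped clock tau in [0, 1] to timesteps in [0, T]."""
    tau = np.linspace(0.0, 1.0, n_fine)
    a = np.asarray(speed(tau), dtype=float)      # clock speed dt/dtau, must be positive
    # Cumulative trapezoid rule gives t(tau) up to a normalizing constant
    cum = np.concatenate([[0.0], np.cumsum(0.5 * (a[1:] + a[:-1]) * np.diff(tau))])
    t_of_tau = T * cum / cum[-1]
    uniform_tau = np.linspace(0.0, 1.0, n_steps + 1)
    return np.interp(uniform_tau, tau, t_of_tau)

uniform = warped_grid(lambda s: np.ones_like(s), n_steps=10)
adaptive = warped_grid(lambda s: 0.1 + s ** 2, n_steps=10)   # slow clock near t = 0
print(np.round(adaptive, 4))
```

With the second speed function the steps are small near $t = 0$ and large near $t = T$; swapping the resulting grid into an existing sampler is the only change the sampler sees.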
The experimental evaluation is comprehensive and convincing. It spans from a 1D analytical setting (isolating discretization error) to standard image benchmarks (CIFAR-10, AFHQv2, FFHQ, ImageNet-64/512). The authors demonstrate that ART-RL consistently outperforms strong baselines (Uniform, EDM, DPM) across various sampling budgets (NFEs). Crucially, they show that the learned schedules generalize well, transferring across datasets, pipelines (pixel vs. latent space), and solver types without retraining. The distillation of the learned policy into a fixed grid ensures zero inference overhead, making the method practically viable. The results are statistically significant and robust.
The paper provides detailed algorithmic descriptions, including the specific update rules for the actor-critic scheme. The experimental setup is well-described, specifying datasets, metrics (FID, Wasserstein), and baseline configurations. The claim of distilling the policy into a precomputed grid enhances reproducibility for end-users. However, no code link is provided, which is a minor hurdle for immediate reproduction, though the theoretical derivations are sufficiently detailed for a competent researcher to implement.
The primary limitation is the offline training cost required to learn the schedule (1-2 hours on a T4 GPU), although this is amortized. The method assumes access to a trained score model and relies on the accuracy of the Euler error surrogate $Q$. While the paper argues $Q$ is a good proxy, its estimation might be sensitive in very high-dimensional or complex manifolds. Additionally, the generalization, while impressive, is empirical; a theoretical bound on the generalization error across domains is not provided.
This work has significant implications for the efficiency and quality of diffusion-based generative models. By providing a principled, learnable alternative to hand-crafted schedules, it reduces the engineering effort required to tune sampling parameters and potentially unlocks higher quality samples at lower computational costs. The theoretical framework may also inspire similar control-theoretic approaches in other sequential decision-making or stochastic process discretization problems.
This paper studies additive regret in the multi-secretary problem, defined as the gap between the expected offline prophet reward and the reward of the best online policy. Prior work established \(O(\log T)\) regret for bounded-density distributions with connected support and \(O((\log T)^2)\) upper bounds for bounded-density distributions with support gaps. It was unknown whether the extra logarithmic factor is necessary even in the one-resource model. We prove that it is necessary. For a mixture of two separated uniform distributions at the critical capacity, the optimal regret grows at least on the order of \((\log T)^2\). Thus the existing \(O((\log T)^2)\) upper bounds for bounded-density gapped instances, including those implied by network revenue management models with continuous rewards, are tight in this simplest specialization. The same framework also yields a matching lower bound for gapped distributions whose gap-facing densities vanish near the support edges; this companion result is given in the appendix. The proofs use Bellman certificates: feasible solutions to a relaxation of the exact Bellman recursion. This framework converts lower bounds into explicit certificate constructions and identifies why support gaps permit larger regret.
Primary: New York University
All Institutions: New York University, Stern School of Business, Department of Statistics
This paper provides a rigorous proof of tight $(\log T)^2$ lower bounds for the multi-secretary problem with gapped distributions, introducing a novel "Bellman certificate" method that converts lower-bound proofs into the construction of feasible solutions to Bellman recursion relaxations, thereby resolving the optimality of existing upper bounds in this setting.
The paper introduces a novel "Bellman certificate" framework for proving lower bounds in online resource allocation problems. Instead of analyzing specific policies, the authors construct feasible solutions to a relaxation of the exact Bellman recursion for the regret gap. This methodological shift allows for a clean decomposition of the regret into deterministic drift, order-statistic Jensen slack, and finite-difference perturbation. The application of this framework to the multi-secretary problem with gapped distributions is rigorous and mathematically sophisticated, leveraging order statistics and moderate deviation theory to construct explicit certificates that achieve the $(\log T)^2$ lower bound.
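For orientation, the exact Bellman recursion that the certificates relax can be evaluated numerically on a small gapped instance. The sketch below assumes an equal-weight mixture of $U[0,1]$ and $U[2,3]$ with capacity $B = T/2$, so the acceptance quantile falls in the support gap; these specific numbers, the function names, and the Monte Carlo estimate of the prophet reward are illustrative choices, not the paper's construction.

```python
import numpy as np

COMPONENTS = [(0.5, 0.0, 1.0), (0.5, 2.0, 3.0)]   # (weight, low, high), assumed instance

def expected_excess(delta):
    """E[(X - delta)^+] for the uniform mixture, vectorized over delta >= 0."""
    delta = np.asarray(delta, dtype=float)
    total = np.zeros_like(delta)
    for weight, lo, hi in COMPONENTS:
        inside = (hi - delta) ** 2 / (2.0 * (hi - lo))          # lo < delta < hi
        part = np.where(delta <= lo, 0.5 * (lo + hi) - delta,
                        np.where(delta >= hi, 0.0, inside))
        total += weight * part
    return total

def online_value(T, B):
    """Optimal online reward via V(t, b) = V(t-1, b) + E[(X - (V(t-1, b) - V(t-1, b-1)))^+]."""
    V = np.zeros(B + 1)                    # V[b] with zero periods remaining
    for _ in range(T):
        delta = V[1:] - V[:-1]             # marginal value of one extra unit of capacity
        V = np.concatenate([[0.0], V[1:] + expected_excess(delta)])
    return V[B]

def prophet_value(T, B, n_sims=2000, seed=0):
    """Monte Carlo estimate of the offline reward: sum of the B largest of T draws."""
    rng = np.random.default_rng(seed)
    upper = rng.random((n_sims, T)) < 0.5
    X = rng.random((n_sims, T)) + 2.0 * upper
    return np.sort(X, axis=1)[:, T - B:].sum(axis=1).mean()

T, B = 200, 100
regret_estimate = prophet_value(T, B) - online_value(T, B)
print("estimated additive regret:", regret_estimate)
```

The online value is exact for this instance, while the prophet term carries simulation noise, so the printed regret is an estimate rather than a certified bound.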
This is a theoretical paper; there are no empirical experiments, datasets, or code implementations. The "evaluation" consists of rigorous mathematical proofs establishing tight lower bounds that match existing upper bounds. The correctness of the proofs relies on careful asymptotic analysis and inequality bounding, which is standard for this subfield.
As a theoretical work, reproducibility is assessed by the clarity and rigor of the mathematical derivations. The paper provides detailed definitions of the Bellman certificates, the drift terms, and the source terms. The appendix contains the necessary technical lemmas (binomial estimates, order-statistic bounds) to verify the main results. The logic is self-contained and verifiable by experts in the field.
The primary limitation is the narrow scope of the problem class (single-resource, bounded-density, gapped distributions). While the Bellman certificate method is general in principle, its application here is specific to the structure of the multi-secretary problem. It does not immediately extend to multi-resource network revenue management without significant additional complexity, although it provides the tight bound for the single-resource case which is a component of the larger problem. The results are asymptotic ($T \to \infty$), so finite-sample behavior is not addressed.
This paper resolves a long-standing open question regarding the necessity of the $(\log T)^2$ regret factor in the presence of support gaps in the multi-secretary problem. By proving this lower bound, it establishes the optimality of existing algorithms for this class of problems. The Bellman certificate method offers a new tool for lower-bound analysis in online decision-making, potentially applicable to other stochastic control and resource allocation problems where traditional indistinguishability arguments are difficult to apply. It clarifies the fundamental limits of online learning in environments with discontinuous reward distributions.
Autonomous scientific discovery systems offer the potential to accelerate research by automating the process of hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, limiting their capacity for true open-ended inquiry. Furthermore, while they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. We introduce DiscoPER, an autonomous large language model-powered framework that conducts open-ended research by dynamically generating and executing code to explore datasets without pre-specified research objectives. To ensure rigorous scientific validity, every proposed discovery must pass statistical testing. To overcome the limitations of isolated search, our framework introduces a second-order reasoning mechanism that periodically analyzes its own accumulated discoveries. By treating prior discoveries as empirical data, DiscoPER identifies structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions of the search space. The search space is further expanded by incorporating tool use, enabling the system to explore hypotheses beyond structured metadata by seamlessly processing and extracting useful information from multimodal sources like images. Evaluated on iNatDisco, a new multimodal ecological knowledge benchmark with pattern-level ground truth obtained from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines. Ablations show that DiscoPER scales with more data, and confirms the benefits of second-order meta-reflection.
Primary: University of Edinburgh
All Institutions: University of Edinburgh, Massachusetts Institute of Technology
DiscoPER introduces a novel autonomous scientific discovery framework that combines LLM-driven hypothesis generation, code-based statistical validation, and second-order meta-reflection to enable open-ended, data-driven scientific inquiry. The paper presents a significant advancement in agentic ML for scientific discovery by addressing the critical limitation of isolated hypothesis generation in existing systems. By introducing a structured "Propose-Evaluate-Reflect" loop, DiscoPER enables the system to synthesize accumulated findings, identify gaps, and redirect its search strategy dynamically. The rigorous validation mechanism, which requires hypotheses to pass statistical tests on held-out data, ensures scientific validity and mitigates LLM hallucination. The creation of the iNatDisco benchmark provides a much-needed evaluation standard for open-ended discovery, moving beyond task-specific QA. The empirical results demonstrate that this approach significantly outperforms both classical causal discovery methods and guided LLM baselines, particularly in recovering complex, multi-variable patterns. This work establishes a new paradigm for autonomous scientific agents that are not only capable of generating ideas but also of critically evaluating and building upon their own discoveries.
The paper proposes DiscoPER, an autonomous scientific discovery framework that integrates Large Language Models (LLMs) with executable code and statistical testing. The core methodological innovation is the "Propose-Evaluate-Reflect" loop. Unlike previous systems that either require predefined research questions (guided) or lack iterative synthesis (unstructured), DiscoPER operates in an open-ended manner ($P=$ none). It generates hypotheses as Python code, validates them on held-out data to prevent p-hacking, and employs a second-order "Reflect" module. This Reflect module analyzes the accumulated claim store to identify epistemic gaps, confounds, and compound hypotheses, thereby steering the search space in subsequent iterations. The approach effectively bridges the gap between classical causal discovery (restricted edge spaces) and LLM-based reasoning (prone to hallucination) by grounding all claims in statistical significance while allowing the LLM to explore a Turing-complete hypothesis space. The inclusion of multimodal capabilities via tool use (VLMs) further expands the scope of discoverable patterns beyond tabular metadata.
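The held-out validation step can be illustrated with a minimal sketch. It assumes a two-group comparison, Cohen's d as the effect-size measure, and a permutation test for the p-value; the review only reports the thresholds (effect size > 0.2, p < 0.05), so the specific statistics, the 50/50 split, and the function names below are hypothetical choices, not DiscoPER's actual code.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        if abs(perm[:len(a)].mean() - perm[len(a):].mean()) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

def held_out_gate(values, groups, min_effect=0.2, alpha=0.05, seed=0):
    """Test a group-difference claim only on a held-out half of the rows."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(values))
    held_out = idx[len(idx) // 2:]  # rows reserved for confirmation (assumed 50/50 split)
    v, g = values[held_out], groups[held_out]
    a, b = v[g == 1], v[g == 0]
    d = cohens_d(a, b)
    p = permutation_p_value(a, b, seed=seed)
    return {"effect_size": float(d), "p_value": float(p),
            "supported": bool(abs(d) > min_effect and p < alpha)}
```

Requiring both an effect-size floor and a significance threshold means that a large dataset cannot promote a negligible difference to a "discovery" on the strength of a small p-value alone.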
The evaluation is rigorous and addresses the specific challenges of open-ended discovery. The authors introduce iNatDisco, a new benchmark derived from iNaturalist data, which includes ground-truth patterns from peer-reviewed literature. This is a significant contribution, as existing benchmarks are largely task-oriented. DiscoPER achieves 8/9 pattern recovery on iNatDisco-800 and 8/12 on iNatDisco-50K, outperforming classical causal discovery methods (which fail to capture complex interactions) and guided LLM baselines. The ablation studies clearly demonstrate the value of the Reflect module, showing improvements in both recall and hypothesis support rate. The counterfactual evaluation is particularly strong, indicating that the system relies on data-driven evidence rather than memorized LLM priors. The scaling analysis provides insight into the system's behavior with respect to data size and iteration count.
The paper provides detailed implementation specifications, including model versions (Claude Sonnet 4.6, etc.), statistical thresholds (effect size > 0.2, p < 0.05), and the structure of the hypothesis code. The use of executable code for hypotheses enhances reproducibility, as the validation steps are deterministic given the data and code. The description of the iNatDisco dataset construction is sufficient for replication. However, the reliance on proprietary LLMs (Claude, GPT) means that exact results may not be reproducible after model updates, though the methodology itself is open.
The system is computationally expensive due to the iterative nature of code generation, execution, and reflection. The performance is bounded by the quality and bias of the underlying LLMs and the available data. The "Reflect" module, while effective, introduces latency and potential for compounding errors if the initial claims are flawed. Additionally, the benchmark, while novel, is specific to ecology; generalization to other scientific domains requires further validation. The system's ability to discover truly novel, non-intuitive patterns beyond those present in the training data of the LLM remains an open question, although the counterfactual tests mitigate some of this concern.
This work has significant implications for accelerating scientific discovery across disciplines. By automating the iterative process of hypothesis generation and validation, it can help researchers identify patterns that might be overlooked due to human cognitive biases or limitations. The open-ended nature of the system encourages exploration of uncharted regions of the search space, potentially leading to new scientific insights. However, the reliance on AI for scientific discovery raises ethical considerations regarding the verification of findings and the potential for automated bias reinforcement. The framework provides a robust template for building autonomous scientific agents that prioritize empirical validity.
Model organisms (MOs) - language models trained to exhibit undesired or unnatural behaviours - are frequently used as testbeds for evaluating white-box interpretability techniques. Current MOs are typically constructed via post-hoc supervised fine-tuning (SFT) on behavioural transcripts or synthetic documents. Prior research has shown that interpretability methods can easily identify hidden behaviours in these MOs. However, recent work suggests that such post-hoc training methods may make interpretability unrealistically easy. We investigate this claim by constructing a suite of 54 $\verb|OLMo2-1B|$- and $\verb|gemma-3-1b-it|$-based MOs trained with seven different techniques, including standard post-hoc SFT, post-hoc DPO, and more realistic integration of MO data into the OLMo post-training DPO phase. We use these MO variants to benchmark activation oracles, activation steering, logit lens, and sparse autoencoders. Our findings show that (i) MO interpretability depends strongly on training objective, target behaviour, model architecture, and training data generation pipeline; (ii) substantial variance remains even after controlling for differences in the strength of target behaviour expression; and (iii) our more realistic $\textit{integrated training}$ often yields less interpretable MOs than standard post-hoc methods. Our results cast substantial doubt on the validity of current MOs as interpretability proxies.
Primary: University of Cambridge
All Institutions: LASR Labs, University of Cambridge
This paper critically re-evaluates the foundational assumptions of Model Organisms (MOs) in LLM interpretability research, demonstrating that interpretability strongly depends on MO training methodology, even when controlling for behavioral strength. Through a rigorous experimental suite of 54 MOs trained with diverse methods, the authors reveal that current MO benchmarks may be unrealistically easy, challenging the generalizability of interpretability findings and providing a crucial methodological contribution to the field of AI safety and interpretability.
The methodology is exceptionally rigorous and well-designed. The core contribution is the systematic construction of a diverse suite of 54 Model Organisms (MOs) to investigate the impact of training methodology on interpretability. The authors define three benign trigger-reaction quirks (CakeBake, ItalianFood, MilitarySubmarine) and train MOs based on two different base models (OLMo2-1B and gemma-3-1b-it) using seven distinct training methods. These methods span standard post-hoc SFT (Transcript Distillation, Synthetic Document Fine-tuning), post-hoc DPO, and a novel "integrated DPO" approach that more realistically incorporates quirk data into the base model's original post-training DPO phase. A crucial methodological innovation is the "Quirk Expression Rate (QER) matching," where the learning rate and data volume are adjusted to ensure all variants within a family express the quirk to a comparable degree (within 5pp). This control effectively isolates the impact of training methodology from mere behavioral strength, a significant improvement over prior work. The authors also perform black-box validation to ensure low naive black-box interpretability, preventing confounding with white-box techniques. The interpretability evaluation uses four diverse white-box methods: Activation Oracles (AOs), Activation Steering, Logit Lens, and Sparse Autoencoders (SAEs), covering both diffing and non-diffing settings. The use of LLM judges for QER and hypothesis relevance scoring is a modern and appropriate choice, with detailed calibration provided. The exploration of training stochasticity and model architecture robustness further strengthens the methodology.
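The QER-matching control can be made concrete with a small sketch. It assumes QER is the percentage of judged responses that express the quirk and reads "within 5pp" as a maximum-minus-minimum spread across a family; both readings, the function names, and the example numbers are illustrative assumptions, not values from the paper.

```python
def quirk_expression_rate(judge_labels):
    """Percentage of responses that an LLM judge labelled as expressing the quirk (1) vs not (0)."""
    return 100.0 * sum(judge_labels) / len(judge_labels)

def qer_matched(qer_by_variant, tolerance_pp=5.0):
    """Return (matched, spread), where spread is max minus min QER in percentage points."""
    rates = list(qer_by_variant.values())
    spread = max(rates) - min(rates)
    return spread <= tolerance_pp, spread

# Hypothetical family of three variants; the rates are made up for illustration.
family = {"post_hoc_sft": 82.0, "post_hoc_dpo": 79.5, "integrated_dpo": 84.0}
matched, spread = qer_matched(family)
```

A family that fails this check would need its learning rate or data volume adjusted before its interpretability scores could be compared fairly.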
The experimental evaluation is comprehensive and robust. The suite of 54 MOs is substantial, allowing for a thorough investigation across various dimensions. The choice of OLMo2-1B and gemma-3-1b-it provides insights into model architecture dependence, although these are smaller models. The experiments clearly demonstrate that MO interpretability varies strongly with training objective, target behavior, model architecture, and training data generation pipeline, even when QER is controlled. The finding that the novel "integrated DPO" often yields *less interpretable* MOs than standard post-hoc methods is a critical and surprising result, challenging the assumption that current MOs are good proxies for real-world behaviors. The paper systematically presents results for each interpretability method, highlighting variability and lack of generalization across MO families and architectures. For instance, the ratio between the most and least interpretable variants ranges unpredictably from 1.2 to 20.4. The analysis of data mixing effects, showing that dilution does not universally decrease interpretability, contradicts prior findings and adds nuance. The robustness checks against training stochasticity (using different data ordering seeds) and model architecture are well-executed, confirming that the observed variance is not merely noise. The comparison between diffing and non-diffing interpretability settings further underscores the limitations of current methods without a reference model. The exclusion of confounded models (OLMo MilitarySubmarine SDF) due to high black-box interpretability demonstrates strong experimental rigor.
The reproducibility of this work is excellent. The authors explicitly state their commitment to open-sourcing their entire suite of 54 quirk expression-matched MOs, along with their training data, and the code used for data generation and training pipelines. This is a significant contribution to the community and will enable future research to build upon their findings directly. Detailed information on MO training, hyperparameters, dataset information, QER evaluation, and interpretability evaluation methods is provided in the appendices, further enhancing reproducibility. The use of publicly available base models (OLMo2-1B and gemma-3-1b-it) and datasets (e.g., C4, HelpSteer3) also supports reproducibility.
The authors acknowledge several limitations. The quirks studied are benign proxies, and the base models are relatively small (1B parameters), which may limit generalizability to larger, frontier models exhibiting more sophisticated, safety-relevant behaviors. Computational constraints prevented full replication of all experiments (e.g., training data ordering for all quirks, all interpretability methods on Gemma models). The integrated DPO approach only modifies one stage of post-training; earlier instillation of quirks (e.g., during pre-training) might yield even less interpretable results. While QER is matched within families, small differences remain, and QER is not varied *within* a family, meaning the direct impact of varying QER on interpretability is not fully isolated. The paper also briefly touches on the impact of the training data generation pipeline (synthetic vs. externally sourced) but does not fully characterize the specific data features responsible for interpretability differences.
This paper has substantial broader impact on the field of LLM interpretability and AI safety. By demonstrating that MO interpretability is highly sensitive to construction choices, it casts significant doubt on the validity of current MOs as reliable proxies for real-world model behaviors. This implies that many existing interpretability benchmarks may be "unrealistically easy," leading to over-optimistic assessments of interpretability techniques. The work provides a crucial methodological critique, urging researchers to adopt more rigorous MO design practices, including diverse training methodologies and QER matching. The finding that "more realistic" integrated training often yields less interpretable MOs is a call to action for the community to develop more robust interpretability methods that can handle such complex, entangled behaviors. The open-sourcing of the MO suite and code will serve as a valuable resource for future research, facilitating the development of more robust benchmarks and interpretability tools. Ultimately, this work contributes to a more calibrated understanding of interpretability progress, which is vital for building trustworthy and safe AI systems.
Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emph{statistically} wrong -- a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbf{Detection:} on a benchmark of $14$ misspecification types across $10$ model families ($200$ instances), it flags the bug with AUC $0.97$ ($88\%$ at $2\%$ FPR \emph{when handed the correct reference program, an upper bound}) -- and a fully \emph{reference-free} version that uses no correct program reaches $62$--$78\%$ (the upper figure from a small automated model search), versus $0\%$ for a unit-test oracle. \textbf{Repair:} used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback -- which is itself \emph{significantly worse than no feedback at all}, a passing test inducing false confidence that suppresses repair -- and improves over no feedback on strong-but-unsaturated models (GPT-5.1 $33{\to}92\%$, Claude $75{\to}100\%$; paired McNemar, $n{=}228$). \textbf{Reality:} on programs LLMs write from scratch for neutral briefs, $15$--$47\%$ of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
Primary: Columbia University
All Institutions: Columbia University
This paper introduces a paradigm shift for verifying LLM-generated probabilistic programs, demonstrating that Bayesian calibration diagnostics are essential for detecting code-invisible statistical errors, significantly outperforming and even revealing the harm of traditional unit tests in LLM repair loops. The comprehensive technical contribution, rigorous empirical validation on both injected and LLM-generated bugs, and the clear, actionable insights make this a genuinely field-wide significant paper that will reshape how LLM-assisted probabilistic modeling systems are designed and evaluated.
The paper proposes a robust methodology for detecting and repairing statistically misspecified probabilistic programs generated by Language Models (LLMs). The central innovation is the "calibration oracle," which formalizes and automates the Bayesian workflow diagnostics (Posterior Predictive Checks, Simulation-Based Calibration, sampler diagnostics like R-hat and divergences, and held-out predictive density) into a programmatic verifier. This oracle is designed to identify "code-invisible misspecification"—statistical errors that traditional unit tests (compilation, execution, output shape) cannot detect. The repair loop integrates this oracle as feedback for LLMs, guiding them to iteratively refine their programs. The methodology is well-grounded in Bayesian principles, and the formal distinction between code-visible and code-invisible bugs is crucial for understanding the limitations of existing LLM code verification paradigms. The aggregation of diverse diagnostics into a single verdict, with defined thresholds, provides a practical and actionable signal for automated systems.
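One of the diagnostics named above, the posterior predictive check, can be sketched for the "Poisson for over-dispersed counts" case. To stay self-contained, the sketch assumes a conjugate Gamma prior on the Poisson rate so the posterior is available in closed form without MCMC, and it uses the variance-to-mean ratio as the test statistic; the prior, the statistic, and the function names are assumptions for illustration, not the paper's oracle.

```python
import numpy as np

def dispersion(y):
    """Variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    return y.var(ddof=1) / y.mean()

def poisson_ppc_p_value(y, n_rep=2000, a0=1.0, b0=1.0, seed=0):
    """Posterior predictive p-value P(T(y_rep) >= T(y)) under a Gamma-Poisson model."""
    rng = np.random.default_rng(seed)
    # Conjugate update: Gamma(a0, b0) prior (shape, rate) -> Gamma(a0 + sum(y), b0 + n).
    shape, rate = a0 + y.sum(), b0 + len(y)
    lam = rng.gamma(shape, 1.0 / rate, size=n_rep)           # posterior draws of the rate
    y_rep = rng.poisson(lam[:, None], size=(n_rep, len(y)))  # one replicated dataset per draw
    t_rep = y_rep.var(axis=1, ddof=1) / y_rep.mean(axis=1)
    return float(np.mean(t_rep >= dispersion(y)))

# Over-dispersed counts (negative binomial) fitted with the Poisson model above.
rng = np.random.default_rng(42)
y_over = rng.negative_binomial(2, 0.2, size=300)
p_over = poisson_ppc_p_value(y_over)
```

A p-value near 0 or 1 says the replicated data rarely look like the observed data on the chosen statistic. A program with this defect still compiles, samples, and returns correctly shaped output, which is why shape-and-execution unit tests cannot see it.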
The experimental evaluation is exceptionally thorough and convincing, covering detection, repair, and real-world applicability.
1. **Detection**: A comprehensive benchmark of 200 instances across 14 misspecification types and 10 model families was created. With access to the correct reference program (an upper bound), the calibration oracle achieved an AUC of 0.97 (88% detection at 2% FPR). Critically, a *reference-free* version of the oracle (without a ground-truth correct program) still achieved 62-78% detection, highlighting its practical utility. In stark contrast, the unit-test oracle achieved 0% detection for code-invisible bugs.
2. **Repair**: Experiments with 15 diverse LLMs (open and API) in a repair loop showed that calibration feedback significantly outperformed unit-test feedback and, on strong-but-unsaturated models, no feedback. A surprising and impactful finding was that unit-test feedback was itself *significantly worse than no feedback*, as a passing test induced false confidence and suppressed repair. Calibration feedback led to substantial improvements for strong-but-unsaturated models (e.g., GPT-5.1 from 33% to 92%), with statistically significant gains ($p < 4.5 \times 10^{-10}$ for pooled invisible-bug repairs).
3. **Real LLM Programs**: This section provides the strongest evidence. LLMs were tasked to write programs from scratch for neutral briefs. The study found that 15-47% of runnable LLM-generated programs were statistically misspecified (none caught by unit tests). Calibration-guided repair significantly outperformed strong baselines, including LLM-as-judge review, Bayesian-workflow checklists, and data-summary self-debug, achieving an 84% fix rate for misspecified programs.
The results are robust, with detailed ablations and sensitivity analyses for oracle thresholds.
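The repair comparison is reported with a paired McNemar test, which looks only at tasks where the two feedback conditions disagree. A minimal sketch of the exact (binomial) form follows; the review does not say whether the paper uses the exact or the chi-square variant, and the counts in the usage line are hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: tasks repaired only under condition A (e.g., calibration feedback)
    c: tasks repaired only under condition B (e.g., unit-test feedback)
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair favours A or B with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 10 tasks fixed only by A, none fixed only by B.
p = mcnemar_exact(10, 0)
```

Tasks that both conditions fix, or both fail to fix, do not enter the statistic, which is what makes the test appropriate for the same benchmark instances evaluated under two feedback types.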
The paper demonstrates a high commitment to reproducibility. It specifies exact snapshots for API models and HuggingFace revisions for open models. Detailed hyperparameters for LLM generation (temperature, max tokens, repair budget) and NUTS inference (chains, draws, x64) are provided. Oracle thresholds are explicitly stated, and their sensitivity is analyzed. Crucially, the authors commit to releasing "The full system prompt, contract, feedback templates, benchmark generators, and analysis scripts with the paper," which is excellent practice and will enable full replication and extension of their work.
The authors are transparent about several limitations:
1. **PPC power**: The effectiveness of Posterior Predictive Checks depends on the chosen test statistics, meaning some misspecifications might remain undetected if not captured by the tracked statistics.
2. **Right fit, wrong structure**: The oracle primarily assesses distributional fit, not causal or generative correctness, making it blind to structural errors that yield similar predictive distributions.
3. **Over-wide predictive**: An under-confident model whose predictive distribution is too wide might "cover" the data and pass PPC despite being fundamentally wrong.
4. **Diagnosis without remedy**: Weaker LLMs may receive accurate diagnostic feedback but lack the capability to translate it into a correct structural fix.
5. **SBC expense**: Simulation-Based Calibration is computationally intensive, limiting its practical use in the repair loop for large or costly models.
6. **Inference failure**: Sampler diagnostics can sometimes fire due to poor inference (e.g., NUTS issues) rather than genuine model misspecification, leading to false positives.
7. **Benchmark scope**: The repair experiments use a subset of bugs, and the tasks are classical low-dimensional models, suggesting that extending the approach to high-dimensional or complex structured models (e.g., deep models, spatial-temporal models) is future work.
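The cost noted under "SBC expense" follows from the structure of simulation-based calibration: every simulated dataset needs its own posterior fit. The sketch below uses a conjugate normal model (prior N(0, 1), unit-variance likelihood) so each "fit" is a closed-form draw; in a general probabilistic program each iteration of the loop would be a full MCMC run. The model, the `sd_scale` knob that deliberately breaks the posterior, and the chi-square summary are illustrative assumptions.

```python
import numpy as np

def sbc_ranks(n_sims=1000, n_obs=10, n_draws=99, sd_scale=1.0, seed=0):
    """Rank of the true parameter among posterior draws, once per simulated dataset."""
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        theta = rng.normal(0.0, 1.0)                  # draw the parameter from the prior
        y = rng.normal(theta, 1.0, size=n_obs)        # simulate data given that parameter
        post_mean = y.sum() / (n_obs + 1)             # exact posterior mean
        post_sd = sd_scale * np.sqrt(1.0 / (n_obs + 1))  # sd_scale != 1 miscalibrates it
        draws = rng.normal(post_mean, post_sd, size=n_draws)
        ranks[s] = int(np.sum(draws < theta))         # value in 0..n_draws
    return ranks

def uniformity_chi2(ranks, n_draws=99, n_bins=10):
    """Chi-square distance of the rank histogram from uniform (n_bins - 1 degrees of freedom)."""
    counts, _ = np.histogram(ranks, bins=n_bins, range=(0, n_draws + 1))
    expected = len(ranks) / n_bins
    return float(np.sum((counts - expected) ** 2 / expected))

chi2_calibrated = uniformity_chi2(sbc_ranks(sd_scale=1.0))
chi2_overconfident = uniformity_chi2(sbc_ranks(sd_scale=0.5))
```

With a correct posterior the ranks are uniform; an over-confident posterior pushes ranks toward both extremes and inflates the statistic. The same loop with a sampler inside is what makes SBC expensive to run at every repair step.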
This paper has profound broader impact for the burgeoning field of LLM-assisted scientific computing and probabilistic modeling. It fundamentally shifts the paradigm for verifying LLM-generated probabilistic code from "compilation" to "calibration," providing a principled and empirically validated framework for ensuring statistical correctness. This work is crucial for building trustworthy and reliable LLM agents that can assist in scientific discovery, data analysis, and model development. The PPL-agnostic nature of the Bayesian diagnostics means the approach is broadly applicable across popular probabilistic programming languages like Stan, Pyro, and PyMC. The surprising finding that unit-test feedback is actively harmful for LLM repair loops is a critical insight for designing future LLM agent architectures. This paper sets a new standard for evaluating and improving LLM-generated scientific code.
We present a zero-shot, training-free and optimization-free framework for generating 360 panoramic images and videos by directly injecting spherical priors into pre-trained diffusion transformers. Existing methods either rely on costly fine-tuning on scarce panoramic data that limits generalization, or leverage multi-step optimization that incurs prohibitive inference latency. We observe that contemporary generative models natively exhibit some panoramic priors from large-scale training. However, these emergent capabilities are insufficient, as the models fundamentally fail to satisfy the rigorous topological constraints imposed by equirectangular projection (ERP). We introduce a zero-shot and optimization-free approach that resolves these constraints at inference time. Spherical RoPE replaces standard rotary position embeddings: low-frequency channels are re-parameterized as 3D Cartesian coordinates to natively encode the spherical manifold, while high-frequency channels are harmonically quantized to enforce exact periodicity. Coupled with complementary Semantic Distortion classifier-free guidance (CFG) that explicitly steers geometry, we avoid retraining and inherit the full creative breadth of state-of-the-art models. Our approach generalizes across diverse backbones and 360 generation modalities. We demonstrate this across text-to-panorama using Flux.1, Flux.2, and LTX-Video backbones, achieving competitive performance against baselines, all while remaining training-free. Project page: https://orhir.github.io/SpheRoPE
Primary: Tel-Aviv University
All Institutions: Tel-Aviv University, Hebrew University of Jerusalem
SpheRoPE introduces a novel, training-free framework for 360-degree panorama generation by integrating spherical priors directly into diffusion transformers via modified position embeddings and guidance, achieving competitive results across multiple backbones without the need for fine-tuning or optimization.
The paper proposes SpheRoPE, a training-free, zero-shot method for generating 360-degree panoramic images and videos using pre-trained diffusion transformers (DiTs). The core innovation lies in replacing standard Rotary Position Embeddings (RoPE) with Spherical RoPE. This involves re-parameterizing low-frequency channels into 3D Cartesian coordinates to natively encode the spherical manifold and harmonically quantizing high-frequency channels to enforce periodicity. This is coupled with a Semantic Distortion classifier-free guidance (CFG) mechanism to steer geometry. The approach is theoretically sound, addressing the mismatch between the planar position encoding of the pre-trained backbone and the spherical topology that an equirectangular projection (ERP) represents, without retraining. It leverages the emergent capabilities of large models while correcting their fundamental geometric flaws.
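The two ingredients of Spherical RoPE described above can be sketched geometrically. The review gives no formulas, so everything below is an assumption for illustration: pixel centres are mapped to longitude and latitude in the usual ERP convention, and "harmonic quantization" is read as snapping each horizontal angular frequency to an integer multiple of 2π/width so the rotation angle wraps exactly at the seam. How the real method splits channels or injects these values into attention is not shown.

```python
import numpy as np

def erp_to_unit_sphere(height, width):
    """Map each ERP pixel centre to a 3D point on the unit sphere, shape (H, W, 3)."""
    lon = (np.arange(width) + 0.5) / width * 2 * np.pi - np.pi      # longitude in [-pi, pi)
    lat = np.pi / 2 - (np.arange(height) + 0.5) / height * np.pi    # latitude from +pi/2 down
    lon, lat = np.meshgrid(lon, lat)
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)

def quantize_frequencies(freqs, width):
    """Snap angular frequencies (radians per pixel) to nonzero integer harmonics of 2*pi/width."""
    base = 2 * np.pi / width
    harmonics = np.maximum(1, np.round(freqs / base))
    return harmonics * base

def rope_angles(freqs, width):
    """Rotation angle per (column, channel); column `width` is included to inspect the wrap-around."""
    cols = np.arange(width + 1)
    return np.outer(cols, freqs)
```

In Cartesian coordinates the first and last columns of a row are as close as any two neighbouring columns, so the left-right seam disappears from the position signal. With quantized frequencies, the angle at column `width` is a whole number of turns away from column 0, which is the exact-periodicity property the abstract refers to.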
The authors evaluate SpheRoPE on multiple state-of-the-art backbones, including Flux.1, Flux.2, and LTX-Video. They demonstrate competitive performance against existing baselines in text-to-panorama and text-to-video tasks. The evaluation highlights the method's ability to resolve topological artifacts (seams, discontinuities) common in naive ERP generation. The results suggest that the method generalizes well across different model architectures, which is a significant strength. However, as a zero-shot method, it relies on the underlying model's quality, so comparisons are against other zero-shot or fine-tuned baselines. The paper likely includes qualitative visualizations and potentially quantitative metrics like FID or CLIP scores adapted for panoramas, though specific numbers are not provided in the abstract. The claim of "competitive performance" suggests it matches or exceeds fine-tuned methods in some aspects while being significantly more efficient.
The paper provides a project page URL. As a training-free method, reproducibility is high provided the source code for the Spherical RoPE injection and Semantic Distortion guidance is released. The reliance on pre-trained models (Flux, LTX-Video) means the community has access to the base weights, facilitating replication. The method's simplicity (modifying embeddings and guidance) makes it easier to implement than full fine-tuning pipelines.
The primary limitation is the reliance on the pre-trained model's inherent knowledge. If the base model lacks semantic understanding of specific panoramic scenes, SpheRoPE cannot create that knowledge from scratch. Additionally, the harmonic quantization and Cartesian re-parameterization might introduce subtle artifacts if not tuned correctly for specific resolutions or aspect ratios. The method is currently demonstrated on text-to-panorama; its effectiveness on more complex video generation with temporal consistency across the spherical manifold needs rigorous long-term evaluation. There may also be a trade-off between geometric correctness and semantic fidelity, which the Semantic Distortion CFG aims to mitigate but may not eliminate entirely.
This work significantly lowers the barrier to entry for high-quality 360-degree content generation. By eliminating the need for costly fine-tuning on scarce panoramic data, it democratizes access to VR/AR content creation tools. It also provides a generalizable technique for handling non-Euclidean data structures in diffusion models, which could be extended to other domains like spherical video, global climate modeling visualization, or astronomical data. The reduction in inference latency compared to optimization-based methods makes it more viable for real-time applications.
4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a narrow signal insufficient to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt it via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames--no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io.
Primary: A*STAR (Agency for Science, Technology and Research)
All Institutions: A*STAR, NTU Singapore (Nanyang Technological University), Alibaba Group
The paper presents a significant advancement in 4D hand motion reconstruction by effectively adapting video diffusion models for perception tasks, achieving state-of-the-art performance on challenging benchmarks through a novel hand-overlay rendering adaptation and a geometrically-aware dual-branch decoder.
The paper proposes a novel paradigm for 4D hand motion reconstruction by leveraging the internal representations of a large-scale pretrained Video Diffusion Model (Wan2.1-VACE). Instead of treating the diffusion model as a generative black box or a frozen feature extractor, the authors introduce a "hand-overlay rendering" adaptation stage. This involves finetuning only the VACE branch of the model to regenerate input clips with semi-transparent rendered hand overlays. This clever pretext task specializes the model's world priors (occlusion reasoning, temporal coherence, 3D geometry) for hand-centric tasks without destroying the general visual knowledge. The decoder is a dual-branch architecture: a Hand-Token Branch for holistic articulated pose and a Joint-Heatmap Branch for local 2D localization, coupled by mutual cross-attention and a closed-form geometric solve for camera translation. This design elegantly separates the holistic vs. local inductive biases of the representation. The approach is methodologically sound, theoretically motivated by the capabilities of generative models, and technically sophisticated in its integration of diffusion features with geometric constraints.
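The closed-form geometric solve is not spelled out above, so the following is only a minimal sketch of one standard way to recover camera translation in closed form: linear least squares on the pinhole projection equations, given root-relative 3D joints and their 2D image locations. The function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the idea that the 2D inputs come from heatmap peaks are assumptions for illustration, not the authors' implementation.

```python
def project(point, t, fx, fy, cx, cy):
    """Pinhole projection of a root-relative 3D point shifted by translation t."""
    x, y, z = point
    tx, ty, tz = t
    return (fx * (x + tx) / (z + tz) + cx, fy * (y + ty) / (z + tz) + cy)


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def _solve3(m, r):
    """Solve the 3x3 system m @ t = r with Cramer's rule."""
    d = _det3(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate configuration: translation is not identifiable")
    solution = []
    for k in range(3):
        mk = [[r[i] if j == k else m[i][j] for j in range(3)] for i in range(3)]
        solution.append(_det3(mk) / d)
    return solution


def solve_translation(joints_3d, joints_2d, fx, fy, cx, cy):
    """Least-squares camera translation from 3D joints and 2D detections.

    With a = (u - cx) / fx and b = (v - cy) / fy, each joint gives two
    equations that are linear in (tx, ty, tz):
        tx - a * tz = a * z - x
        ty - b * tz = b * z - y
    The normal equations are accumulated and solved as a 3x3 system.
    """
    m = [[0.0] * 3 for _ in range(3)]
    r = [0.0] * 3
    for (x, y, z), (u, v) in zip(joints_3d, joints_2d):
        a = (u - cx) / fx
        b = (v - cy) / fy
        rows = (((1.0, 0.0, -a), a * z - x), ((0.0, 1.0, -b), b * z - y))
        for row, rhs in rows:
            for i in range(3):
                r[i] += row[i] * rhs
                for j in range(3):
                    m[i][j] += row[i] * row[j]
    return _solve3(m, r)
```

Because the system is linear, no iterative optimization is needed at test time, which is consistent with the pipeline's stated avoidance of test-time optimization.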
The evaluation is comprehensive and rigorous. The authors test on three challenging egocentric hand benchmarks: ARCTIC (heavy occlusion), HOT3D (fisheye, high dynamic range, motion blur), and HOI4D (cross-dataset generalization). They introduce a "penalty protocol" that folds false negatives into pose metrics, providing a more realistic assessment of detection robustness than standard TP-only metrics. ViDiHand establishes new state-of-the-art results across all metrics, with particularly significant gains in frame accuracy (detection robustness) and temporal jitter (smoothness). The ablation studies are thorough, validating the choice of DiT layer, denoising step, and decoder components. The cross-dataset transfer to HOI4D demonstrates the generalizability of the learned priors. The results are statistically significant and practically meaningful, showing that video diffusion models capture richer spatiotemporal priors than discriminative video models or image-based detectors.
The paper provides detailed implementation details, including the specific backbone (Wan2.1-VACE), the two-stage training curriculum (joint overlay then MANO mesh overlay), and the decoder architecture. The supplementary material contains extensive details on the evaluation protocol, metric definitions, and ablation studies. The project page link suggests code/data availability, which is standard for high-impact ML papers. The use of a publicly available backbone (Wan2.1) enhances reproducibility, although the specific finetuning steps and data preprocessing pipelines would need to be carefully followed. The closed-form geometric solve is well-defined.
The primary limitation is computational cost. The method runs at 5.5 fps on 4 A100 GPUs, making it an offline annotation tool rather than a real-time solution. The authors acknowledge this and suggest distillation as a future direction. Additionally, Stage 1b still requires MANO-annotated video, which is a scarce resource, though the authors propose self-supervised pretexts to relax this in the future. The method may also struggle with extreme cases not covered in the training data, although the cross-dataset results suggest good generalization.
This work has significant implications for embodied AI, robotics, and human-computer interaction. By providing a scalable, high-quality method for 4D hand reconstruction from egocentric video, it enables the creation of large-scale datasets for training robot policies and understanding human behavior. The paradigm shift towards leveraging video generative models for perception tasks could influence future research in 3D vision, motion capture, and video understanding. It also highlights the untapped potential of diffusion models for discriminative tasks, potentially inspiring similar approaches in other domains.
Data, as the fundamental substrate of modern intelligence, has greatly driven the development of current foundation models. Naturally, researchers aim to extend this paradigm to GUI agents, hoping to build strong agents the same way. However, GUI agent data cannot be directly harvested from the internet, making it costly and difficult to collect at scale. As a result, current GUI agents suffer from poor cross-device generalization and limited visual grounding ability for fine-grained GUI elements. To address this data challenge, we propose GUICrafter, a weakly-supervised GUI agent that leverages massive unannotated screenshots to substantially reduce the reliance on expensive human annotations. GUICrafter uses a curriculum learning framework that trains GUI agents in two progressive stages. In Stage 1, the model learns visual grounding from large-scale unannotated screenshots and webpages, leveraging the rich contextual signals inherent in GUI interactions without human annotations. In Stage 2, a small amount of high-quality data calibrates the model via reinforcement learning. Experiments show that GUICrafter achieves competitive, or even superior, performance to advanced systems like UI-TARS while using only 0.1% of its data. Furthermore, under the same amount of annotated data, GUICrafter surpasses all previous methods such as GUI-R1. Code, data, and models are available at https://github.com/fansunqi/GUICrafter.
Primary: Tsinghua University
All Institutions: Tsinghua University, Tencent Hunyuan
GUICrafter presents a significant advancement in GUI agent training by introducing a scalable weakly-supervised pretraining stage that leverages unannotated screenshots for visual grounding, achieving state-of-the-art performance with minimal annotated data. The technical contribution lies in the effective formulation of meta-tasks from interactive signals and the robust two-stage RLVR framework, which offers a practical and efficient path forward for data-constrained GUI agent development.
The paper proposes GUICrafter, a two-stage training framework for GUI agents. Stage 1 involves "weakly-supervised GUI pretraining" using massive unannotated screenshots. The core innovation here is the extraction of interactive signals (clickable/typable elements) from web pages and mobile apps to create "meta-tasks" (e.g., "click any clickable area"). This allows the model to learn visual grounding without human annotation by leveraging the inherent structure of GUIs. Stage 2 uses a small amount of high-quality, manually annotated data for reinforcement learning (RLVR with GRPO) to calibrate the model. The reward design includes a Gaussian position reward to provide finer-grained feedback than binary point-in-box rewards. The approach effectively bridges the gap between large-scale unsupervised visual learning and precise task-oriented grounding.
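To make the contrast between the two reward shapes concrete, here is a minimal sketch of a binary point-in-box reward next to a Gaussian position reward. The exact formula used in the paper is not given above, so the Gaussian centered on the box with a standard deviation proportional to box size (`scale`) is an assumption chosen for illustration.

```python
import math


def binary_reward(point, box):
    """1.0 if the predicted click lies inside the box, else 0.0."""
    x, y = point
    x0, y0, x1, y1 = box
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0


def gaussian_reward(point, box, scale=0.5):
    """Smooth reward that peaks at the box center and decays with distance.

    `scale` (hypothetical parameter) sets the standard deviation as a
    fraction of the box width and height.
    """
    x, y = point
    x0, y0, x1, y1 = box
    center_x, center_y = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sigma_x = max(scale * (x1 - x0), 1e-6)  # guard against zero-size boxes
    sigma_y = max(scale * (y1 - y0), 1e-6)
    dx = (x - center_x) / sigma_x
    dy = (y - center_y) / sigma_y
    return math.exp(-0.5 * (dx * dx + dy * dy))
```

Under the binary reward, a click at the box center and a click just inside the edge score the same, and a near miss scores zero. The Gaussian variant ranks all three, which is the finer-grained feedback the paragraph above refers to.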
The evaluation is comprehensive, covering multiple benchmarks across web (Mind2Web, ScreenSpot-Pro), mobile (AndroidControl, AITW, AndroidWorld), and general (OmniACT) domains. The results show that GUICrafter-3B and GUICrafter-7B achieve performance competitive with or superior to state-of-the-art models like UI-TARS and GUI-R1, despite using significantly less annotated data (0.1% of UI-TARS's data). The ablation studies effectively demonstrate the contribution of Stage 1 (visual grounding improvement) and Stage 2 (task completion calibration). The comparison against baselines is fair, including reproductions of GUI-R1 on full datasets. The scalability analysis (10k to 500k samples) provides strong evidence for the data efficiency and robustness of the weakly-supervised stage.
The authors provide code, data, and models. The methodology is clearly described, including the specific extraction tools (Playwright) and the reward function formulas. The use of standard benchmarks and clear reporting of metrics (Element Accuracy, Step Success Rate, etc.) enhances reproducibility. The distinction between the weakly-supervised data generation and the supervised fine-tuning data is clear.
The method still relies on a small amount of high-quality annotated data in Stage 2 for calibration, although this is significantly reduced compared to prior work. The weakly-supervised data generation relies on automated extraction which may have noise (though the paper shows robustness to this). The "meta-tasks" are somewhat generic and may not capture the semantic intent of complex user goals, which is handled in Stage 2. The approach is primarily tested on web and mobile interfaces; generalization to other GUI types (e.g., desktop applications with complex non-standard widgets) might require further validation.
This work addresses a critical bottleneck in GUI agent development: data scarcity. By demonstrating that massive unannotated data can be leveraged for visual grounding, it lowers the barrier to entry for building robust GUI agents. This could accelerate the development of autonomous agents for web and mobile interaction, with implications for accessibility, automation, and human-computer interaction. The open-source release contributes to the community by providing a new baseline and dataset generation pipeline.
On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either teacher or student itself, which transfers credible capability while simultaneously receiving auxiliary signals, to alleviate privilege illusion. Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.
Primary: Explore Academy
All Institutions: Explore Academy, MMLab
DOPD presents a significant advancement in on-policy distillation by introducing a dynamic, advantage-aware routing mechanism that effectively mitigates the "privilege illusion" caused by information asymmetry, leading to superior and more stable knowledge transfer across LLM and VLM domains.
The paper introduces DOPD, an advantage-aware dual on-policy distillation framework. The core innovation lies in addressing the "privilege illusion," a phenomenon where privileged information (e.g., hints, annotations) creates an apparent performance gap between teacher and student that is due to information asymmetry rather than transferable capability. DOPD dynamically routes token-level supervision by calculating a "privilege advantage gap" and comparing token probabilities. It classifies tokens into four regimes (High/Low Advantage x High/Low Probability) and applies different distillation strategies (strong teacher distillation, light teacher distillation, weak self-regularization, or student consistency) accordingly. The methodology is theoretically grounded in disentangling capability gaps from information gaps. The approach is well-motivated and addresses a genuine limitation in current OPD practices. However, the mechanism is essentially heuristic routing based on probability and advantage metrics. While effective, it is not radically new relative to adaptive weighting or curriculum learning; the novelty lies in its specific application to privilege illusion.
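The four-regime routing can be pictured as a two-threshold lookup. The sketch below is illustrative only: the thresholds, the choice of which probability is compared, and above all the pairing of each regime with a strategy are assumptions, since the text above names the regimes and the strategies but not how they map onto each other.

```python
from enum import Enum


class Strategy(Enum):
    STRONG_TEACHER = "strong teacher distillation"
    LIGHT_TEACHER = "light teacher distillation"
    WEAK_SELF_REG = "weak self-regularization"
    STUDENT_CONSISTENCY = "student consistency"


def route_token(advantage_gap, token_prob, adv_threshold=0.0, prob_threshold=0.5):
    """Pick a supervision strategy for one token.

    advantage_gap: privilege advantage gap for the token (assumed scalar).
    token_prob: probability assigned to the token (assumed in [0, 1]).
    The regime-to-strategy pairing below is a hypothetical example.
    """
    high_advantage = advantage_gap > adv_threshold
    high_probability = token_prob > prob_threshold
    if high_advantage and high_probability:
        return Strategy.STRONG_TEACHER
    if high_advantage:
        return Strategy.LIGHT_TEACHER
    if high_probability:
        return Strategy.WEAK_SELF_REG
    return Strategy.STUDENT_CONSISTENCY
```

A hard two-threshold split like this is the simplest form such routing could take; it also makes visible the concern raised later in the review that unstable probabilities near a threshold could flip a token between regimes.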
The experimental evaluation is extensive, covering both LLMs (Qwen3 series) and VLMs (Qwen3-VL series). The authors compare DOPD against a wide range of baselines, including standard OPD, self-distillation, and adaptive distillation methods. Results show consistent improvements across 8 benchmarks for LLMs and 8 for VLMs. The paper also includes ablation studies on token types, divergence objectives, and privileged information modalities. Scalability is tested across different teacher-student size ratios, demonstrating robustness. The results are statistically significant and convincing. The inclusion of continual learning and OOD generalization adds depth. The use of "Qwen3" and "GPT-5.4" suggests this is a very recent preprint, so the models and benchmarks likely reflect the current state of the art. The performance gains are substantial (e.g., +7.5 points on LLM average).
The paper provides detailed implementation settings, including model sizes, optimizer parameters, batch sizes, and specific hyperparameters for the distillation intensities ($w=0.3$, $l=0.6$). The dataset sources are named (RaR-Science-20K, DAPO-Math-17K, etc.). However, the reliance on "GPT-5.4" for generating privileged hints and on the specific "Qwen3" models (which may not be publicly released under exactly these names) could pose reproducibility challenges if the underlying models or data generation pipelines are not open. No code link appears in the text provided, and the project URL is listed as "none".
The paper acknowledges that the method relies on the quality of privileged information. If the privileged hints are noisy or misleading, the "privilege advantage gap" might be misinterpreted. The method also introduces additional computational overhead due to the forward passes of both privileged teacher and student policies for every token to calculate the advantage gap. The analysis of "privilege illusion" is insightful but relies on the assumption that the advantage gap is a perfect proxy for capability vs. information, which might not always hold in complex, multi-modal settings. The paper does not extensively discuss the failure modes of the routing mechanism itself (e.g., what happens if probabilities are unstable).
DOPD provides a more robust framework for distilling large models, which is crucial for deploying capable AI in resource-constrained environments. By mitigating privilege illusion, it ensures that students learn genuine capabilities rather than shortcuts, leading to better generalization and safety. This has implications for the entire field of model compression and post-training alignment.
Technology computer-aided design (TCAD) semiconductor device simulation is fundamentally constrained by the high computational cost of iteratively solving coupled drift-diffusion equations. Existing ML surrogates either reduce internal physics to macroscopic scalar regressions, or rely on single-step mappings that lack the iterative refinement required to resolve stiff, coupled fields. To address this, we introduce PCGD, a Physics-Guided Conditional Graph Diffusion framework operating natively on unstructured TCAD meshes to predict coupled electrostatic and carrier density fields. PCGD employs a Condition-Aware MeshGraphNet denoiser that explicitly injects boundary conditions and device structure context via global cross-attention. By augmenting data-driven denoising with a physics-guided hybrid objective that integrates exponent-free quasi-Fermi gradient matching with noise-aware PDE residuals, PCGD progressively enforces physical constraints in the iterative diffusion trajectory. This strategy successfully bypasses the numerical instabilities typical of stiff drift-diffusion equations. Evaluated on a challenging mixed PN/MOS benchmark, PCGD significantly outperforms deterministic one-step regression (1.207% error) and local diffusion (1.585% error) baselines by achieving a sub-percent mean relative field error of 0.835%, while concurrently reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. It also transfers robustly to unseen SOI topologies (0.815% error) via LoRA adaptation, using 5.30$\times$ less data and 14.34$\times$ fewer parameters than full fine-tuning. Ultimately, PCGD bridges the computational efficiency of generative surrogates with the rigorous physical fidelity of traditional TCAD, unlocking highly scalable, field-level analysis for robust device engineering.
Primary: Fudan University
All Institutions: Fudan University, Shanghai Integrated Circuit Manufacturing Innovation Center
PCGD has a profound broader impact, particularly in the semiconductor industry and scientific machine learning. By significantly accelerating TCAD simulations while maintaining high physical fidelity, it can drastically shorten the design cycle for advanced semiconductor devices, fostering innovation in microelectronics for applications ranging from AI hardware to power electronics. The methodology for stabilizing physics-informed diffusion models for highly stiff and nonlinear PDEs is generalizable beyond TCAD. It offers a blueprint for tackling similar challenges in other scientific and engineering domains, such as materials science, fluid dynamics, and biomechanics, where complex multiphysics simulations are computationally expensive and numerically challenging. The concept of a hybrid AI-TCAD simulation workflow, where ML provides robust initial guesses to accelerate or stabilize traditional solvers, represents a powerful paradigm for integrating AI into complex scientific computing pipelines, promising both efficiency and reliability. Furthermore, the successful application of parameter-efficient fine-tuning (LoRA) for domain adaptation in a scientific context highlights its potential for developing more adaptable and data-efficient ML models for specialized scientific tasks. PCGD introduces a physics-guided conditional graph diffusion framework that effectively addresses the computational bottlenecks and physical fidelity challenges in TCAD device simulation. This work makes substantial contributions through its mesh-native graph diffusion approach, condition-aware MeshGraphNet architecture with global cross-attention, and a novel hybrid physics-guided training objective that stabilizes learning for stiff, nonlinear drift-diffusion equations. 
The impressive empirical results, demonstrating sub-percent accuracy and near three-order-of-magnitude reduction in PDE residuals, coupled with robust transferability to unseen topologies, position PCGD as a significant advancement in ML-accelerated scientific computing, with clear implications for future semiconductor design and broader multiphysics simulation.
The methodology presented in PCGD is highly sophisticated and meticulously designed to tackle the inherent challenges of TCAD device simulation. The core idea of formulating coupled-field prediction as a conditional generative denoising task on native unstructured TCAD meshes is a significant departure from previous ML surrogates. This approach leverages the iterative refinement capabilities of diffusion models to decouple macroscopic field construction from microscopic detail resolution, which is crucial for handling the extreme stiffness and exponential nonlinearities of semiconductor physics. The Condition-Aware MeshGraphNet denoiser is a well-thought-out architectural innovation. By explicitly injecting boundary conditions and device structure context via global cross-attention, it effectively overcomes the limitations of purely local message passing, ensuring that critical global information (like terminal biases and device layout) is directly accessible to every mesh node. This design choice is critical for accelerating convergence to correct physical operating conditions. The most impactful methodological contribution is the hybrid physics-guided objective. It ingeniously combines an exponent-free quasi-Fermi gradient matching loss ($L_G$) for stable guidance during early, noisy diffusion stages with noise-aware, adaptively scaled exact PDE residuals ($L_P$) for fine-grained physical correction in later stages. The dual-tier stabilization strategy for $L_P$ (validity-aware SNR gating and adaptive batch balance) is a robust solution to prevent gradient divergence caused by the extreme nonlinearity of drift-diffusion equations. The detailed mathematical derivation connecting Scharfetter-Gummel flux to quasi-Fermi gradients and the implementation of GPU-accelerated graph operators for discrete PDE residuals further demonstrate the technical depth and rigor of the proposed framework.
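For readers unfamiliar with the discretization mentioned above, the following is a minimal sketch of the textbook Scharfetter-Gummel electron flux on a single mesh edge, together with the electron quasi-Fermi potential whose gradient drives that flux. It is not the paper's loss: the function names, the default thermal voltage, and the unit diffusion coefficient are assumptions, and the charge prefactor is omitted. The quasi-Fermi form works with a logarithm of the density rather than exponentials of the potential, which is one plausible reading of "exponent-free".

```python
import math


def bernoulli(x):
    """Bernoulli function B(x) = x / (exp(x) - 1), evaluated stably."""
    if abs(x) < 1e-6:
        return 1.0 - x / 2.0 + x * x / 12.0  # series expansion near zero
    if x > 50.0:
        return x * math.exp(-x)  # avoids overflow in expm1 for large x
    return x / math.expm1(x)


def sg_electron_flux(n_i, n_j, psi_i, psi_j, h, d_n=1.0, v_t=0.02585):
    """Scharfetter-Gummel electron flux along the edge from node i to node j.

    n_*: electron densities, psi_*: electrostatic potentials (volts),
    h: edge length, d_n: diffusion coefficient, v_t: thermal voltage.
    """
    delta = (psi_j - psi_i) / v_t
    return d_n / h * (n_j * bernoulli(delta) - n_i * bernoulli(-delta))


def quasi_fermi(n, psi, n_ref=1.0, v_t=0.02585):
    """Electron quasi-Fermi potential phi_n = psi - v_t * ln(n / n_ref)."""
    return psi - v_t * math.log(n / n_ref)
```

Two limiting cases are useful sanity checks. When the potential is constant along the edge, the flux reduces to plain diffusion, `d_n * (n_j - n_i) / h`. When both nodes share the same quasi-Fermi potential, the flux vanishes even though the densities may differ by orders of magnitude, which illustrates why matching quasi-Fermi gradients can give a better-conditioned training signal than raw residuals.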
The experimental evaluation is comprehensive, rigorous, and strongly supports the claims made in the paper. The use of a challenging mixed PN/MOS benchmark dataset, comprising a large number of training and validation snapshots from distinct devices and operating modes, ensures the robustness of the evaluation. The device-trajectory level splitting for train/validation is critical for assessing generalization. The ablation studies are particularly effective, systematically isolating the contributions of iterative generative refinement, global conditioning, and physics-guided supervision by comparing PCGD against well-chosen baselines (deterministic one-step regression, local diffusion, and condition-aware diffusion with varying levels of physics guidance). The chosen metrics, mean relative $L_2$ error and log-compressed maximum PDE residual error, are appropriate and provide a holistic view of both accuracy and physical consistency. The results are highly impressive: PCGD achieves a sub-percent mean relative field error (0.835%) and significantly outperforms all baselines, reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. The analysis of convergence dynamics clearly illustrates the stability gained from the hybrid physics-guided objective. Furthermore, the demonstration of robust transferability to unseen SOI topologies via parameter-efficient LoRA adaptation, requiring significantly less data and fewer parameters than full fine-tuning, is a strong indicator that PCGD learns generalizable physical priors, which is crucial for practical adoption in industrial settings.
The paper provides a commendable level of detail for reproducibility. The methodology is thoroughly described with mathematical formulations for the graph representation, diffusion process, Condition-Aware MeshGraphNet, and the hybrid physics-guided objective. The appendix offers specific architectural settings, coefficients for the loss functions, and details on the baseline architectures. Crucially, numerical safety measures for handling stiff exponentials and precision truncation are explicitly mentioned. The schema for mesh features and interface edges is also detailed. While the absence of publicly available code (no project URL provided) is a common limitation for arXiv preprints, the textual descriptions are sufficiently comprehensive for an experienced researcher with expertise in GNNs, diffusion models, and scientific computing to implement the core components of PCGD. Access to the specific TCAD data generation pipeline would be the final piece for exact replication of the dataset.
Despite its strengths, PCGD exhibits some limitations. The paper acknowledges "severe zero-shot degradation" when transferring to major out-of-distribution topological shifts (e.g., SOI MOSFETs not seen during pretraining). While LoRA adaptation effectively addresses this, it implies that the model, despite learning physical priors, still relies on structural interpolation from its pretraining data, limiting its immediate "universal foundational model" capabilities without a much broader pretraining corpus. The current experimental validation is primarily on 2D devices. While the paper discusses the theoretical O(N) scaling advantage for 3D simulations, empirical validation on massive 3D meshes, including memory footprint and training/inference times, is yet to be demonstrated. The computational cost of training such a complex diffusion model on large graph datasets is likely substantial, though not explicitly quantified. Finally, the proposed hybrid AI-TCAD workflow mentions routing high-residual cases back to traditional solvers, but the practical implementation details, criteria for routing, and the efficiency gains from this hybrid approach are not fully explored.
LLM agents carry conclusions across steps and sessions in compressed memory, and memory products (e.g., mem0, LangMem) rewrite conversation into stored "facts" that later steps trust. We show this rewriting manufactures confidence: across our constructed agent settings, a casual, hedged remark becomes a confident, dated assertion the agent then obeys like a verified fact, granting every above-clearance request it faces. No attacker is needed: a role that was true once and never corrected is stored as a flat fact and acted on like a deliberate injection. We then isolate what the agent responds to. It is not the source: attributed, unattributed, and even forged "system of record" claims all grant alike. It is the confidence of the phrasing. A hedge is discounted, a flat assertion is obeyed, and this holds with no special keyword. Not all hedges are equal, though: the evidential register is the least-discounted, with "reportedly" obeyed like a flat assertion on most models. The obvious fixes fail. A passive "unverified" tag is ignored, and an active "do not trust this" instruction escalates even correct memory, so it is safe only by refusing to decide. The real fix lives in the store: keep the tentative phrasing rather than upgrade it. But that is hygiene, not a defense against an attacker who can simply write a confident lie. The deployable lesson is narrower and constructive: a single load-bearing memory is the hazard, and one redundant source restores correct decisions. We release the harness and demonstrations.
Primary: unknown
All Institutions: unknown
This paper has significant broader impact for the development and deployment of LLM agents. It identifies a fundamental, pervasive, and under-defended failure mode ("manufactured confidence") in how LLM agents process and store information in consolidated memory. This phenomenon can lead to agents being confidently wrong, granting unauthorized access, or making incorrect financial decisions, even without an explicit attacker. The findings challenge current memory consolidation practices and highlight the critical need for memory architectures that preserve epistemic status. The "hearsay blind spot" is a particularly alarming discovery, as it means agents may treat unverified reports as established facts. The paper provides crucial, actionable lessons for practitioners: avoid single load-bearing memories for consequential decisions, and implement redundant verification sources. While the proposed store-side fix is not a complete defense against adaptive attackers, it significantly raises the bar and improves hygiene. This work is essential for enhancing the safety, reliability, and trustworthiness of LLM agents in real-world applications. This paper identifies and rigorously diagnoses "manufactured confidence," a critical and pervasive failure mode in LLM agent memory consolidation where hedged remarks are rewritten into confident facts, leading to agents making confidently wrong decisions across diverse models and memory products. The comprehensive analysis reveals that agents obey the confidence of phrasing rather than its source, are blind to hearsay markers, and that common mitigations like passive tags or active distrust instructions are ineffective or lead to abdication, underscoring the necessity of preserving epistemic status in memory stores and employing redundant information sources for consequential decisions.
The methodology is exceptionally strong and well-designed for diagnosing the phenomenon of "manufactured confidence." The authors construct multi-step agent settings (access control, budget approval, running total) where memory is load-bearing, allowing for clear ground truth and legible impact. A crucial aspect is the use of real, shipped memory products (mem0, LangMem) alongside a verbatim control, which grounds the findings in practical agent deployments. The systematic probing involves varying how memory is presented (confident, passive tag, active instruction), dissecting the cues agents respond to (modality, hearsay, explicit non-verification), and testing the impact of source attribution (bare, attributed, forged authority). The inclusion of a "natural case" (staleness without injection) alongside adversarial injection strengthens the generalizability of the problem. The use of five diverse, state-of-the-art LLMs from four providers (Anthropic, Meta, OpenAI, Qwen) ensures the findings are not model-specific. The methodology also includes a symmetry test (over-denial) to rule out simple grant bias and a detailed analysis of the "laundering" process within memory products. The approach is comprehensive, rigorous, and effectively isolates the mechanisms behind manufactured confidence.
The experimental evaluation is thorough and provides compelling evidence for the paper's claims. Key findings include:

1. **Manufactured Confidence**: Memory consolidation rewrites hedged remarks into confident assertions, leading to high confident-wrong rates (0.50-1.00) across all models in consequential decisions.
2. **Source Invariance**: Agents obey the confidence of phrasing, not its source. Attributed, unattributed, and even forged "system of record" claims grant alike, demonstrating a critical blindness to provenance.
3. **Failure of Obvious Fixes**: Passive "unverified" tags are largely ignored, especially by non-Anthropic models. Active "do not trust this" instructions lead to abdication (escalating everything), not discrimination, costing all utility.
4. **Redundancy as a Fix**: A second, authoritative source allows agents to discriminate, turning distrust into selective caution rather than blanket abdication.
5. **Hearsay Blind Spot**: Evidential registers, particularly "reportedly," are the least-discounted hedges, often obeyed like flat assertions on most models. This is a critical, pervasive vulnerability.
6. **Symmetry**: The effect is symmetric, causing both over-granting and over-denial based on manufactured confidence, ruling out a simple grant bias.
7. **Consolidation, Not Vendor**: The laundering of hedges into confident facts is a property of LLM consolidation itself, not specific memory products or extraction LLMs.

The experiments are quantitatively presented with clear rates, using temperature 0 for deterministic behavior per scenario. The results are consistent across models, highlighting a systemic issue. The distinction between "belief" and "low threshold" based on rationale analysis adds a qualitative layer to the findings.
The paper demonstrates a high commitment to reproducibility. The authors explicitly state, "We release the harness, data, and demonstrations at https://github.com/collapseindex/manufactured-confidence." They provide detailed information on the models used (exact API identifiers, providers, access dates), temperature settings, agent system prompts, memory poisoning setup, and memory backend configurations. Specific scripts (e.g., `cues.py`, `forged.py`) are mentioned, indicating a well-structured codebase. This level of detail and code release makes the experiments highly reproducible.
The authors are commendably transparent about the limitations:

1. **Constructed Scenarios**: The tasks are decision-shaped but not live deployments, and even "natural staleness" sessions are constructed, meaning the base rate of this failure mode in the wild is not measured.
2. **Scope**: The study focuses on two memory products, four extractors, and five phrasings, with deep probes primarily in access control. While robust, it's not exhaustive. The Zep probe is limited.
3. **Belief vs. Threshold**: The distinction relies on verbalized rationales, which are not ground-truth processing.
4. **Non-Adaptive Threat Model**: The proposed store-side defense is not robust against an adaptive attacker who can directly supply confident, forged authority.
5. **Sample Sizes**: While effects are large and consistent, $n$ values (e.g., 15 for decisions, 10 for poisonings) are relatively small for statistical generalization, though the deterministic nature at temperature 0 mitigates this for the constructed scenarios.
6. **Fix is a Prompt**: The hedge-preserving extraction is demonstrated via a prompt, not a fully engineered production store.
This paper has significant broader impact for the development and deployment of LLM agents. It identifies a fundamental, pervasive, and under-defended failure mode ("manufactured confidence") in how LLM agents process and store information in consolidated memory. This phenomenon can lead to agents being confidently wrong, granting unauthorized access, or making incorrect financial decisions, even without an explicit attacker. The findings challenge current memory consolidation practices and highlight the critical need for memory architectures that preserve epistemic status. The "hearsay blind spot" is a particularly alarming discovery, as it means agents may treat unverified reports as established facts. The paper provides crucial, actionable lessons for practitioners: avoid single load-bearing memories for consequential decisions, and implement redundant verification sources. While the proposed store-side fix is not a complete defense against adaptive attackers, it significantly raises the bar and improves hygiene. This work is essential for enhancing the safety, reliability, and trustworthiness of LLM agents in real-world applications.
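The two practitioner lessons (keep the epistemic status of a memory, and require redundant sources before a consequential decision) could be expressed roughly as follows. This is a minimal sketch, not the paper's released harness: `Status`, `MemoryRecord`, and `decide` are hypothetical names, and the two-source threshold is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    VERIFIED = "verified"   # confirmed against a system of record
    REPORTED = "reported"   # hearsay: someone said it
    HEDGED = "hedged"       # the speaker expressed uncertainty


@dataclass(frozen=True)
class MemoryRecord:
    claim: str
    source: str
    status: Status          # stored alongside the claim, never dropped


def decide(records, min_verified_sources=2):
    """Grant only when enough distinct sources verify the claim (assumed rule)."""
    verified = {r.source for r in records if r.status is Status.VERIFIED}
    return "grant" if len(verified) >= min_verified_sources else "escalate"
```

The design choice here is that the decision reads the stored status field instead of the wording of the claim, so a confidently phrased but unverified memory cannot be load-bearing on its own.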
Diffusion Language Models (DLMs) are typically trained under fixed context structures, restricting denoising to predetermined token subsets. This creates a mismatch between training and inference, where models must operate over arbitrary configurations, leading to degradation off the training grid. We propose Adaptive Block Diffusion (ABD), which resolves this mismatch by optimizing denoising risk over a distribution of prefix-window configurations. By treating the configuration as a stochastic variable, ABD trains a single model over the full configuration space without architectural changes. We show that generalization across decoding strategies is governed by the support of the training distribution, and that ABD guarantees denoising optimality for any inference policy whose configurations are covered during training. Empirically, ABD exhibits structural invariance across decoding scales, avoiding off-grid collapse and recovering a monotonic relationship between block size and perplexity, while matching or outperforming fixed-block specialists at their target scales.
Primary: Microsoft AI
All Institutions: Microsoft AI
Adaptive Block Diffusion resolves the training-inference mismatch in Diffusion Language Models by optimizing denoising risk over a stochastic distribution of prefix-window configurations, leading to a single model that exhibits structural invariance and robust generalization across diverse decoding scales. This paper presents a theoretically grounded and empirically validated framework that significantly enhances the robustness and flexibility of Diffusion Language Models, addressing a critical limitation of prior fixed-structure approaches and paving the way for more adaptable and generalizable text generation.
The methodology is robust and elegantly addresses a core problem in Diffusion Language Models (DLMs): the training-inference mismatch caused by fixed context structures. Adaptive Block Diffusion (ABD) proposes a novel training objective that treats the denoising configuration (prefix length $k$ and window length $\ell$) as a stochastic variable, optimizing denoising risk over a distribution $\pi$ of these configurations. This approach is commendable for not requiring architectural changes, instead focusing on a principled modification to the training process. The theoretical analysis is a significant strength, formally defining conditional denoising risk and proving statistical consistency over the support of $\pi$. The "Training-Inference Alignment" theorem, leveraging the Radon-Nikodym theorem, rigorously demonstrates that if an inference policy's configuration distribution is covered by the training distribution's support, then denoising optimality is guaranteed. This provides a strong theoretical foundation for the empirical claims of structural invariance. The practical implementation details, particularly the attention mask construction and the `ABDBoundaryManager` for sampling block lengths, are clearly described in the appendix, showcasing a well-thought-out and implementable solution.
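To make the idea of a stochastic configuration concrete, the sketch below draws a prefix length $k$ and window length $\ell$ per training example and builds a matching attention mask. It is an illustration under stated assumptions, not the paper's `abd_attention_mask` or `ABDBoundaryManager`: the function names, the candidate window sizes, and the exact mask layout (prefix attends to prefix, window attends to prefix and window) are all hypothetical.

```python
import numpy as np


def sample_config(rng, seq_len, window_sizes=(1, 2, 4, 8, 16), probs=None):
    """Draw (k, l) with k + l <= seq_len; probs=None means uniform over sizes."""
    l = min(int(rng.choice(window_sizes, p=probs)), seq_len)
    k = int(rng.integers(0, seq_len - l + 1))  # upper bound is exclusive
    return k, l


def prefix_window_mask(seq_len, k, l):
    """Boolean mask[i, j]: True where query i may attend to key j (assumed layout)."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    mask[:k, :k] = True            # clean prefix tokens see the prefix
    mask[k:k + l, :k + l] = True   # window tokens see the prefix and the window
    return mask                    # positions past k + l stay fully masked


rng = np.random.default_rng(0)
k, l = sample_config(rng, seq_len=12)
mask = prefix_window_mask(12, k, l)
```

Changing `probs` changes the training distribution $\pi$ over window sizes, which is the knob the review's ablations (categorical exponential, uniform, lognormal) refer to.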
The experimental evaluation is comprehensive, well-designed, and provides strong empirical evidence supporting the theoretical claims. The authors use standard language modeling benchmarks (LM1B, OpenWebText) and ensure fair comparisons by using an identical transformer architecture to existing baselines (MDLM, BD3LM). The most compelling result is the demonstration of "structural invariance": ABD successfully recovers the monotonic relationship between block size and perplexity, a fundamental property for generative models, which fixed-block specialists fail to maintain off their training grid. This directly validates the core hypothesis that training over a broad configuration distribution leads to better generalization. Furthermore, ABD matches or outperforms fixed-block specialists at their target scales, indicating that multi-scale training acts as a regularizer rather than a compromise. The zero-shot generalization experiments on diverse datasets, including scientific text, show improved robustness and suggest that ABD learns a more configuration-invariant language representation. The ablations on configuration distribution types (categorical exponential, uniform, lognormal) and training budget allocation are particularly insightful, offering practical guidance on how to tune ABD for specific inference regimes and demonstrating the trade-offs involved.
The paper excels in reproducibility. The methodology is clearly articulated, and the appendix provides detailed pseudocode for the critical components, including the `abd_attention_mask` and `ABDBoundaryManager`. The authors explicitly state that they leverage the same codebase, datasets, architecture, likelihood evaluation, and inference setup as the previously published Block Diffusion work (Arriola et al., 2025), which significantly lowers the barrier to reproduction. Specific details regarding training budget allocation and configuration sampling strategies are also provided. This level of detail and reliance on a shared foundation is exemplary.
The authors openly acknowledge several limitations. A key one is the dependence on the choice of the configuration distribution $\pi$. While $\pi$ offers a principled way to balance performance across decoding regimes, a suboptimal choice can bias the model towards frequently sampled configurations, potentially leading to uneven performance across scales. This implies that careful tuning of $\pi$ is necessary for specific application scenarios. Additionally, ABD does not directly address inference efficiency; while it enables flexible decoding, the selection of optimal inference-time policies remains an open problem. Finally, the theoretical analysis provides optimality guarantees under support coverage but does not offer finite-sample guarantees, meaning practical performance might still be influenced by the quality and density of training coverage in finite data regimes.
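One way to write down the support-coverage argument and its limits is the change of measure below. The notation is an assumption made for illustration and may differ from the paper's: $R_{k,\ell}(\theta)$ denotes the conditional denoising risk at configuration $(k,\ell)$, $\pi$ the training distribution, and $q$ the configuration distribution induced by an inference policy.

```latex
\begin{align}
\mathcal{L}_{\pi}(\theta)
  &= \mathbb{E}_{(k,\ell)\sim\pi}\big[R_{k,\ell}(\theta)\big], \\
\mathbb{E}_{(k,\ell)\sim q}\big[R_{k,\ell}(\theta)\big]
  &= \mathbb{E}_{(k,\ell)\sim\pi}\!\left[\frac{\mathrm{d}q}{\mathrm{d}\pi}(k,\ell)\,
     R_{k,\ell}(\theta)\right]
  \quad \text{whenever } q \ll \pi .
\end{align}
```

If a single $\theta^\star$ minimizes $R_{k,\ell}$ for $\pi$-almost every configuration, the second line shows it also minimizes the risk under any covered $q$. The same identity hints at the finite-sample caveat: where the density ratio $\mathrm{d}q/\mathrm{d}\pi$ is large, inference leans on configurations that training sampled rarely.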
Adaptive Block Diffusion has significant broader impact for the field of Diffusion Language Models and potentially for structured generative models more generally. By providing a principled and effective solution to the training-inference mismatch, ABD makes DLMs more robust, generalizable, and practical for real-world deployment. The ability of a single model to perform well across various decoding scales and strategies simplifies model management and enables more flexible inference scenarios, such as adaptive generation speeds or streaming decoding. This could accelerate the adoption and development of DLMs in diverse applications. The theoretical framework also offers a valuable new perspective on understanding generalization in structured generative models, potentially inspiring similar training paradigms in other domains beyond text. The demonstrated improved zero-shot generalization to domain-shifted text further suggests that ABD could lead to more versatile and less domain-specific language models.
Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in prior benchmarks, spanning interaction-design challenges such as streaming interaction and dynamic environments, as well as agent-pattern challenges such as cross-source reasoning, implicit-state inference, and visual-spatial precision. Tasks are grounded in authentic input artifacts and cross-referenced against realistic stateful user profile data, and include separate safety reports auditing safety-sensitive execution. Under our primary binary-completion metric at 500 steps, Claude Opus 4.8 with maximum thinking and batched tool calls scores best but still completes only 20.6% of tasks at a 54.8% partial score; GPT-5.5 is far more token-efficient yet plateaus near 13%. These results show that current agents are still far from professional-level computer use: rather than stumbling on basic GUI control or coding, they lose track of constraints, miss information that arrives mid-task, guess rather than ask the user, and skip verification, struggling most when a task hinges on hidden state they must recover.
Primary: XLang
All Institutions: XLang
OSWorld 2.0 establishes a rigorous, long-horizon benchmark for computer-use agents, revealing that current frontier models struggle significantly with state tracking, verification, and dynamic environments, setting a new, more realistic standard for evaluating autonomous agent capabilities.
The paper introduces OSWorld 2.0, a benchmark designed to evaluate computer-use agents on long-horizon, real-world workflows. The methodology shifts focus from short, isolated GUI interactions to complex, multi-application tasks that mimic professional work (e.g., reimbursement, data analysis, creative editing). Key methodological innovations include: 1) The use of self-hosted, stateful web services (email, banking, chat) to simulate realistic environments without relying on volatile live websites. 2) A fine-grained, checkpoint-based evaluation system (averaging ~27 checkpoints per task) rather than binary pass/fail, allowing for partial credit and more nuanced analysis. 3) The annotation of tasks with specific "challenge phenomena" (e.g., cross-source reasoning, dynamic environments, implicit-state inference) to diagnose specific agent failures. 4) The inclusion of a simulated user channel and dynamic environment updates to test agent robustness to information arrival and state changes. This approach is rigorous and addresses a critical gap in current benchmarks which often overstate agent capabilities by using short, static tasks.
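The difference between a partial score and binary completion can be shown with a few lines of arithmetic. This is a sketch under assumptions the text does not confirm: checkpoints are weighted equally, and a task counts as complete only when every checkpoint passes. The function names are hypothetical.

```python
def score_task(checkpoints):
    """checkpoints: list of booleans, one per evaluation checkpoint."""
    partial = sum(checkpoints) / len(checkpoints)
    complete = all(checkpoints)  # assumed rule: every checkpoint must pass
    return partial, complete


def aggregate(tasks):
    """Return (mean partial score, binary completion rate) over all tasks."""
    scores = [score_task(c) for c in tasks]
    mean_partial = sum(p for p, _ in scores) / len(scores)
    completion = sum(done for _, done in scores) / len(scores)
    return mean_partial, completion


tasks = [[True, True], [True, False], [False, False]]
mean_partial, completion = aggregate(tasks)
```

In this toy input the mean partial score is 0.5 while only one task in three is complete, which is the same kind of gap as the reported 54.8% partial score against 20.6% binary completion.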
The authors evaluate seven frontier models (Claude Opus 4.7/4.8, GPT-5.5, Sonnet 4.6, Qwen 3.7-Plus, MiniMax M3, Kimi 2.6) under various constraints (step budgets, thinking levels, batching). The results are stark: even the best configuration (Claude Opus 4.8 with max thinking and batching) achieves only 20.6% binary completion and 54.8% partial score. The paper provides a detailed analysis of failure modes, highlighting that agents struggle with hidden state recovery, constraint tracking, and verification, rather than basic GUI control. The analysis of token efficiency vs. performance is particularly insightful, showing that GPT-5.5 is more efficient but plateaus earlier, while Opus models spend significantly more tokens for marginal gains. The breakdown of performance by challenge phenomena provides actionable insights for future research. The experiments are comprehensive, covering multiple models, configurations, and detailed error analysis.
The paper claims to release the environment, tasks, self-hosted websites, and agent rollout trajectories. The use of self-hosted services ensures that the environment is stable and reproducible, unlike benchmarks relying on live web services. The detailed description of the task construction pipeline, including the quality assurance steps (unit tests, human re-solving, adversarial audits), enhances reproducibility. The specific model configurations and hyperparameters are clearly stated. The primary limitation is the computational cost of running these long-horizon tasks, but the provided infrastructure should allow other researchers to reproduce the evaluations.
The benchmark is limited to 108 tasks, which, while diverse, may not cover all possible real-world scenarios. The self-hosted web services, while realistic, are simulations and may not capture all edge cases or security quirks of live production systems. The "simulated user" is a simplified model of human interaction and may not fully capture the nuance of human communication. The evaluation relies on model-based judges for some open-ended tasks, which may introduce bias or inaccuracies, although the authors attempt to mitigate this with objective checklists and validation. The focus on long-horizon tasks means that short, simple tasks are underrepresented, which might skew the perceived difficulty for simpler use cases.
This paper has significant implications for the development of autonomous agents. By demonstrating that current frontier models are far from solving realistic, long-horizon computer use tasks, it sets a realistic baseline for the field. It highlights the need for improvements in state management, reasoning over long horizons, and self-correction. The safety analysis, revealing that agents can cause harmful side effects (e.g., leaking API keys, exhausting disk space), underscores the risks of deploying such agents in real-world settings. The benchmark provides a valuable tool for researchers to track progress and identify specific weaknesses in agent architectures. It encourages a shift towards more robust, reliable, and safe agent systems.
Retrieval-augmented generation (RAG) typically treats context selection as ranking chunks against a single query embedding. This assumption breaks down for complex queries, such as multi-hop or ambiguous questions, where top-k selection tends to over-cover one semantic aspect while ignoring critical sub-questions. We propose GeoRAG, which recasts context selection as Information Demand Coverage Optimization. GeoRAG builds a multi-dimensional demand distribution through diverse sub-query generation and reverse-validation weighting, then selects context by minimizing the Sinkhorn-Wasserstein distance between this demand distribution and the coverage of the selected set. The resulting demand-weighted facility-location objective is monotone submodular, giving a $1-1/e$ greedy guarantee, which we approximate with a Sinkhorn-based marginal-gain surrogate. The method is unsupervised, training-free, and retrieval-agnostic. We further show that single-point, query-proximity scorers cannot cover multi-modal demands, exposing a structural limit of ranking-based selection. On six open-domain QA benchmarks, GeoRAG improves exact match (EM) by +6.5 to +7.5 points over top-k truncation (up to +9.7 on HotpotQA and ASQA) and outperforms strong baselines including MMR, DPP, BGE-Reranker, SMART-RAG, and AdaGReS, with stable gains across context budgets and sub-query generators.
Primary: Singapore Management University
All Institutions: University of Shanghai for Science and Technology, Singapore Management University
GeoRAG presents a theoretically grounded, empirically superior method for RAG context selection by modeling information demand as a multi-dimensional distribution and optimizing for coverage via submodular optimization, significantly outperforming existing ranking and diversity-based approaches on complex QA benchmarks.
The paper proposes GeoRAG, a novel context selection framework for Retrieval-Augmented Generation (RAG) that moves beyond single-point query embeddings. The core innovation is reformulating context selection as an Information Demand Coverage Optimization problem. It constructs a multi-dimensional "Information Demand Proxy" distribution using diverse sub-query generation and reverse-validation weighting. The selection process minimizes the Sinkhorn-Wasserstein distance between this demand distribution and the coverage of the selected set. The authors prove that the resulting facility-location objective is monotone submodular, providing a theoretical $(1-1/e)$ greedy guarantee. They further demonstrate a structural limitation of existing ranking-based methods (query-proximity-monotone selectors) in handling bimodal information needs, providing a rigorous theoretical foundation for their approach. The method is unsupervised and training-free, making it broadly applicable.
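The greedy selection behind the $(1-1/e)$ guarantee can be sketched for a demand-weighted facility-location objective $F(S)=\sum_i w_i \max_{j\in S}\mathrm{sim}(i,j)$. This is a minimal illustration with exact marginal gains, not GeoRAG's Sinkhorn-based surrogate; the function name, the similarity matrix, and the weights are hypothetical.

```python
import numpy as np


def greedy_coverage(sim, weights, budget):
    """sim[i, j] >= 0: relevance of chunk j to sub-query i; weights[i]: demand mass."""
    n_demands, n_chunks = sim.shape
    best = np.zeros(n_demands)  # coverage each demand already receives
    selected = []
    for _ in range(min(budget, n_chunks)):
        # Exact marginal gain of adding each chunk to the current set.
        gains = weights @ np.maximum(sim - best[:, None], 0.0)
        if selected:
            gains[selected] = -np.inf  # never pick a chunk twice
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[:, j])
    return selected


# Chunks 0 and 1 both serve sub-query 0; only chunk 2 serves sub-query 1.
sim = np.array([[0.9, 0.85, 0.1],
                [0.1, 0.10, 0.8]])
weights = np.array([0.5, 0.5])
picked = greedy_coverage(sim, weights, budget=2)  # -> [0, 2]
```

A ranking by total relevance would take chunks 0 and 1 and leave the second sub-query uncovered; the set-aware gain drops chunk 1 to zero once chunk 0 is in the set, so the second pick goes to chunk 2.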
The experimental evaluation is comprehensive and robust. The authors test GeoRAG across six open-domain QA benchmarks (NQ, TriviaQA, HotpotQA, 2WikiMHQA, ASQA, FEVER) and six different retrieval backends (Dense, BM25, Hybrid RRF, HyDE, MultiQuery, GraphRAG). GeoRAG consistently outperforms strong baselines, including MMR, DPP, BGE-Reranker, SMART-RAG, and AdaGReS, with significant gains on multi-hop datasets (up to +9.7 EM on HotpotQA). The paper includes extensive ablation studies isolating the contributions of the demand distribution (Axis A) and the set-aware coverage selection (Axis B). Crucially, they perform a "Full-Wikipedia" experiment without gold-injection to prove the method's effectiveness in realistic, harder retrieval settings. They also provide direct measurements of demand-dimension coverage, empirically validating that GeoRAG successfully covers multiple semantic peaks where baselines fail.
The paper provides detailed algorithmic descriptions, including the specific steps for sub-query generation, reverse-validation, and the Sinkhorn-based marginal gain calculation. Hyperparameters are clearly listed. The use of standard benchmarks and open-source models (Qwen3-Embedding-8B, Qwen3-4B) enhances reproducibility. The code is not explicitly linked in the text provided, but the methodological details are sufficient for implementation.
The method relies on LLM-generated sub-queries, which introduces a dependency on the quality and diversity of the generator. While the paper shows robustness across different generators, poor sub-query generation could degrade performance. The reverse-validation step adds computational overhead, though the latency analysis suggests it is manageable. The theoretical guarantee applies to the exact facility-location objective, while the deployed method uses a Sinkhorn surrogate; the paper acknowledges this but shows the surrogate performs well. The method is primarily evaluated on open-domain QA; its performance on more complex reasoning tasks or non-QA RAG applications is less clear.
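For readers unfamiliar with the Sinkhorn surrogate mentioned here, the sketch below computes an entropy-regularized transport cost between a demand distribution and a distribution over selected chunks. It shows the standard Sinkhorn iteration only; how GeoRAG defines the cost matrix, the coverage distribution, and the regularization strength is not stated in this review, so `cost`, the uniform `b`, and `eps=0.1` are assumptions.

```python
import numpy as np


def sinkhorn_cost(a, b, cost, eps=0.1, n_iters=200):
    """Entropic OT between histograms a (rows) and b (columns) under `cost`."""
    K = np.exp(-cost / eps)            # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)              # match column marginals
        u = a / (K @ v)                # match row marginals
    plan = u[:, None] * K * v[None, :]
    return float(np.sum(plan * cost)), plan


a = np.array([0.5, 0.5])               # demand mass on two sub-queries
b = np.array([0.5, 0.5])               # assumed uniform mass on two chunks
cost = np.array([[0.0, 1.0],
                 [1.0, 0.0]])          # e.g. 1 - similarity (assumed)
value, plan = sinkhorn_cost(a, b, cost)
```

A small transport cost means each unit of demand has a nearby selected chunk; a selection that ignores one sub-query forces mass to travel far and the cost rises.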
GeoRAG addresses a fundamental limitation in current RAG systems: the inability to handle complex, multi-faceted queries effectively. By providing a retrieval-agnostic, training-free solution that significantly improves answer quality, it has the potential to become a standard component in RAG pipelines. The theoretical insights into the limitations of single-point embeddings also contribute to a deeper understanding of information retrieval in the LLM era.
Therapy-induced cardiotoxicity is the leading non-oncological cause of treatment interruption in breast cancer patients, yet early, automated risk stratification from routine cardiac imaging remains an unsolved problem. We present EchoRisk, the first curated, multicentre, longitudinal echocardiography dataset with explicit cardiotoxicity labels, released as the primary technical reference for the EchoRisk-MICCAI 2026 challenge. The dataset comprises 422 patients enrolled in the EU-funded CARDIOCARE prospective study across five European sites, yielding 2,159 echocardiography videos across 1,123 clinical exams acquired at up to five longitudinal timepoints, alongside a dedicated cohort of 280 patients with baseline imaging for early cardiotoxicity prediction. Three clinically grounded tasks are defined: automated estimation of left ventricular ejection fraction from cine video (Task 1), classification of LV dysfunction from longitudinal imaging (Task 2), and early prediction of therapy-induced cardiotoxicity from pre-therapy baseline echocardiography alone (Task 3). For each task we specify the evaluation protocol, primary and secondary metrics, and ranking procedure. We establish baseline performance using an R(2+1)D video backbone with LSTM aggregation trained from Kinetics-400 pretrained weights, demonstrating strong discriminative performance for cardiac functional assessment and LV dysfunction classification, while early cardiotoxicity prediction from a single pre-therapy video remains a significant open problem for the community. The dataset, evaluation code, and baseline implementations are publicly available to serve as a benchmark for further collaboration, comparison, and the creation of task-specific architectures in cardio-oncology.
Primary: Foundation for Research and Technology Hellas
All Institutions: Foundation for Research and Technology Hellas, University of Ioannina, Hellenic Mediterranean University, National and Kapodistrian University of Athens, Karolinska University Hospital, Bank of Cyprus Oncology Centre
This paper introduces EchoRisk, the first curated, multicentre, longitudinal echocardiography dataset with explicit cardiotoxicity labels, along with a comprehensive benchmark for cardio-oncology, highlighting early cardiotoxicity prediction as a significant open problem. The meticulous curation of a high-quality, clinically relevant dataset from a prospective study, coupled with well-defined tasks and robust baselines, provides an invaluable resource that will drive significant research in medical AI, particularly in addressing the critical challenge of therapy-induced cardiotoxicity.
The paper introduces EchoRisk, a multicentre, longitudinal echocardiography dataset for cardio-oncology, derived from the EU-funded CARDIOCARE prospective study across five European sites. A key methodological strength is the expert-adjudicated cardiotoxicity labels, which integrate longitudinal echocardiography findings with biomarkers following ESC 2022 guidelines, representing a deliberate and rigorous curation process. This ensures high-quality ground truth, superior to automated EHR extraction. Three clinically grounded tasks are defined: Task 1 (LVEF estimation), Task 2 (LV dysfunction classification using GLS), and Task 3 (early cardiotoxicity prediction from baseline imaging). The baseline models employ a robust R(2+1)D ResNet-18 backbone, pretrained on Kinetics-400, combined with an LSTM for temporal aggregation, a standard yet powerful architecture for video analysis. Detailed preprocessing steps (greyscale conversion, fractional index sampling, resizing) and training specifics (AdamW, learning rate scheduling, specific loss functions like Focal Loss for imbalanced tasks) are provided. A dual-view strategy for Task 3 and a clinical reference baseline (logistic regression on age and LVEF) further enhance the benchmark's comprehensiveness and clinical relevance. The overall methodology for dataset construction and task definition is exceptionally strong and clinically well-aligned.
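Focal Loss, mentioned above for the imbalanced tasks, down-weights examples the model already classifies confidently. The sketch below is the standard binary form, $-\alpha_t (1-p_t)^{\gamma}\log p_t$, written in NumPy for clarity; the defaults `gamma=2.0` and `alpha=0.25` are common choices and an assumption here, since the review does not state the values used for EchoRisk.

```python
import numpy as np


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Mean binary focal loss; probs are predicted P(y=1), labels are 0/1."""
    probs = np.clip(np.asarray(probs, dtype=float), 1e-7, 1 - 1e-7)
    labels = np.asarray(labels)
    p_t = np.where(labels == 1, probs, 1 - probs)        # prob. of the true class
    alpha_t = np.where(labels == 1, alpha, 1 - alpha)    # class-balancing weight
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))


easy = binary_focal_loss([0.9], [1])   # confident and correct: tiny loss
hard = binary_focal_loss([0.1], [1])   # confident and wrong: large loss
```

With `gamma=0` the modulating factor disappears and the expression reduces to class-weighted cross-entropy, which is a quick sanity check on any implementation.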
The experimental evaluation is comprehensive and rigorously conducted. Baselines are established across all three tasks, with results averaged over eight independent random seeds and ensemble predictions for robustness. For Task 1 (LVEF estimation), a test MAE of 4.98 percentage points is achieved, aligning with established benchmarks like EchoNet-Dynamic and validating the dataset's utility for functional assessment. Task 2 (LV dysfunction classification) demonstrates strong performance with a test AUC of 0.849, indicating effective discrimination of GLS-defined dysfunction. The most impactful finding emerges from Task 3 (early cardiotoxicity prediction): the best video baseline achieves an AUC of 0.541, which is statistically indistinguishable from the clinical reference floor (AUC 0.525). This crucial result, consistent across internal pilot experiments, highlights that early cardiotoxicity prediction from baseline echocardiography remains a significant open problem, even with advanced deep learning architectures. The detailed statistical analysis, including 95% confidence intervals via non-parametric bootstrap resampling and Wilcoxon signed-rank tests with Holm-Bonferroni correction, adds significant rigor. Calibration is also assessed via Expected Calibration Error (ECE). The experiments effectively map the current performance landscape and clearly identify a challenging frontier for future research.
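A non-parametric bootstrap confidence interval for AUC, of the kind cited above, can be sketched as follows. This is a generic percentile bootstrap, not the benchmark's evaluation code: the function names, the number of resamples, and the choice to redraw resamples that contain a single class are assumptions.

```python
import numpy as np


def auc(y_true, y_score):
    """AUC as the probability a positive outscores a negative (ties count half)."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def bootstrap_auc_ci(y_true, y_score, n_boot=2000, level=0.95, seed=0):
    """Point estimate plus percentile bootstrap interval over patients."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)          # resample with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue                              # one class only: AUC undefined
        stats.append(auc(y_true[idx], y_score[idx]))
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return auc(y_true, y_score), float(lo), float(hi)


y = np.array([0, 0, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8])
point, lo, hi = bootstrap_auc_ci(y, s, n_boot=200)  # point estimate is 0.75
```

When two models' intervals overlap heavily, as with the Task 3 video baseline (0.541) and the clinical reference (0.525), the difference should not be read as a real improvement without a paired test.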
The paper demonstrates an outstanding commitment to reproducibility. It explicitly states that the EchoRisk dataset, evaluation code, and baseline implementations are publicly available via a dedicated GitHub repository. The methodology section provides extensive details on the model architecture, preprocessing steps, training hyperparameters (optimizers, learning rates, weight decay, early stopping), and loss functions. The use of multiple random seeds (42-49) for all experiments, along with the procedure for ensemble predictions and handling of degenerate runs, ensures that the reported results are robust and verifiable. The detailed statistical analysis methods, including confidence interval calculation and hypothesis testing, further contribute to the transparency and reproducibility of the benchmark. This level of detail and open-source commitment is exemplary for a benchmark paper.
While a highly valuable contribution, the dataset size, though multicentre and longitudinal, is relatively modest (422 patients overall, 280 for Task 3) compared to some large-scale single-center datasets. This might limit the ability of current deep learning models to extract extremely subtle prognostic signals for Task 3. The variable follow-up window for cardiotoxicity labels in Task 3, while reflecting real-world data collection, means the positive label indicates cardiotoxicity within the *available* window, not a fixed 12-month horizon, which could introduce some variability in interpretation. The baselines, while robust, are standard video architectures; the paper's novelty lies in the benchmark itself rather than new architectural contributions. The reliance on Kinetics-400 pretraining, while common, might not be optimally suited for medical ultrasound, suggesting future work could explore domain-specific pretraining.
EchoRisk has profound broader impact potential. It addresses a critical and growing clinical challenge in cardio-oncology: the early detection and risk stratification of therapy-induced cardiotoxicity in breast cancer patients. By providing the first multicentre, longitudinal echocardiography dataset with expert-adjudicated cardiotoxicity labels, it establishes a foundational resource for the machine learning community. Its role as the primary technical reference for the EchoRisk-MICCAI 2026 challenge ensures widespread adoption and will catalyze significant research into novel AI methods for cardiac ultrasound. Success in tasks like early cardiotoxicity prediction could lead to personalized treatment strategies, timely cardioprotective interventions, reduced treatment interruptions, and ultimately improved long-term cardiovascular outcomes for cancer patients. The open-source nature of the dataset and tools will foster collaborative research, accelerating progress in this vital area of medical AI and serving as a model for future clinically relevant benchmarks. This paper introduces EchoRisk, the first curated, multicentre, longitudinal echocardiography dataset with explicit cardiotoxicity labels, along with a comprehensive benchmark for cardio-oncology, highlighting early cardiotoxicity prediction as a significant open problem. The meticulous curation of a high-quality, clinically relevant dataset from a prospective study, coupled with well-defined tasks and robust baselines, provides an invaluable resource that will drive significant research in medical AI, particularly in addressing the critical challenge of therapy-induced cardiotoxicity.
Safe motion planning in dynamic environments requires reasoning about the uncertainty in predicted obstacle motion without sacrificing real-time performance. Existing conformal approaches conformalize a scalar score that aggregates per-obstacle prediction errors, losing spatial coherence and scaling poorly with scene density. We instead conformalize the entire predicted distance field at once. This functional conformal prediction (FCP) framework yields a distribution-free, field-level lower bound, from which safety follows uniformly: any trajectory satisfying the resulting constraint is certified safe, independent of how the control space is sampled. The key enabler is that the residual distance field is empirically low-rank and approximately time-invariant, which makes the bound decomposable in coefficient space. An envelope is fitted offline via functional PCA and a Gaussian-mixture inductive conformal procedure, then refined online by a lightweight adaptive functional conformal (AFCP) update on a low-dimensional vector. This keeps the per-step cost largely insensitive to obstacle count and retains long-run field coverage under distribution shift. We embed the envelope as a tightened safety constraint in a sampling-based model predictive controller, FCP-MPC. On the ETH--UCY pedestrian benchmarks and a dense 3D quadrotor task with up to 280 dynamic obstacles, FCP-MPC attains a favorable balance of safety, feasibility, and efficiency, reaching goals where pointwise and egocentric conformal baselines become too conservative or too expensive, while keeping per-step computation far below online uncertainty-reasoning baselines.
Primary: Seoul National University
All Institutions: Seoul National University
This paper introduces a novel Functional Conformal Prediction framework for safe motion planning, leveraging the low-rank structure of prediction errors to provide scalable, distribution-free safety guarantees in dynamic environments. The approach effectively addresses the computational and spatial coherence limitations of prior conformal methods, offering a significant advancement in the integration of statistical uncertainty quantification with real-time robotic control.
The paper proposes a Functional Conformal Prediction (FCP) framework to address the scalability and spatial coherence issues of existing conformal prediction (CP) methods in safe motion planning. Instead of conformalizing scalar scores per obstacle, the authors treat the prediction error of the distance field as a functional object in a Hilbert space. They leverage the empirical observation that residual distance fields are low-rank and approximately time-invariant. This allows them to perform Functional PCA (FPCA) to decompose the field into a few principal components. A Gaussian Mixture Model (GMM) is fitted to the coefficients of these components in an offline stage, and an inductive conformal procedure is used to create a distribution-free envelope. Online, an Adaptive Functional Conformal Prediction (AFCP) update adjusts a scalar multiplier to handle distribution shifts. This approach decouples the expensive statistical calibration from the real-time planning loop, allowing the safety constraint to be evaluated efficiently for any sampled trajectory in an MPC framework. The methodology is theoretically sound, providing asymptotic safety guarantees under both exchangeable and non-exchangeable (adaptive) settings.
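The offline-then-online structure described above can be sketched on scalar scores. This is a deliberately simplified stand-in, not the paper's FCP or AFCP procedure: it uses a plain split-conformal quantile in place of the FPCA and GMM envelope, and a generic quantile-tracking update in place of the functional one. All names, the step size, and the synthetic shift are assumptions.

```python
import math
import random

def split_conformal_quantile(scores, alpha):
    """Finite-sample (1 - alpha) conformal quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def adaptive_update(q, miscovered, alpha, step):
    """Quantile tracking: widen the threshold after a miss, shrink it slightly after a cover."""
    return q + step * ((1.0 if miscovered else 0.0) - alpha)

def run_demo(alpha=0.1, step=0.05, horizon=5000, seed=0):
    """Return the long-run miss rate when the score scale doubles halfway through."""
    rng = random.Random(seed)
    calibration = [abs(rng.gauss(0.0, 1.0)) for _ in range(500)]
    q = split_conformal_quantile(calibration, alpha)  # offline stage
    misses = 0
    for t in range(horizon):
        scale = 1.0 if t < horizon // 2 else 2.0  # synthetic distribution shift
        score = abs(rng.gauss(0.0, scale))
        miss = score > q
        misses += miss
        q = adaptive_update(q, miss, alpha, step)  # online stage
    return misses / horizon
```

Because the threshold moves by a fixed step per miss or cover, the long-run miss rate stays close to `alpha` even after the shift, which is the kind of long-run coverage the adaptive stage is meant to retain.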
The authors evaluate FCP-MPC on two benchmarks: the ETH-UCY pedestrian dataset (2D) and a dense 3D quadrotor simulation with up to 280 dynamic obstacles. They compare against pointwise and egocentric conformal baselines, as well as online uncertainty-reasoning methods. The results indicate that FCP-MPC achieves a favorable balance of safety, feasibility, and efficiency. It successfully reaches goals where pointwise methods are too conservative and egocentric methods are too expensive or lose coverage. The per-step computation remains largely insensitive to obstacle count, demonstrating the scalability of the functional approach. The experiments are comprehensive, covering both 2D and 3D scenarios and varying densities.
The paper provides a GitHub repository link (https://github.com/CORE-SNU/FCP-MPC), which significantly aids reproducibility. The methodology is described in detail, including the offline FPCA and GMM fitting, and the online AFCP update. The use of standard benchmarks (ETH-UCY) also facilitates comparison. However, the specific implementation details of the "dense 3D quadrotor task" (e.g., exact dynamics, sensor noise models, prediction model architecture) might require careful reading of the appendix or code to fully replicate.
The method relies on the assumption that the residual distance field is low-rank and approximately time-invariant. While verified empirically, this may not hold in all environments (e.g., highly dynamic, non-stationary scenes with complex occlusions). The offline calibration requires a sufficiently large and representative dataset of residual fields. The adaptive update (AFCP) provides long-run coverage but may take time to converge to the correct threshold under rapid distribution shifts. The soft-constraint variant degrades safety guarantees by a controllable slack, which might be unacceptable for some high-risk applications.
This work contributes to the field of safe autonomous systems by providing a scalable and theoretically grounded method for uncertainty-aware motion planning. By enabling real-time safety guarantees in dense, dynamic environments, it facilitates the deployment of robots in more complex real-world scenarios. The functional conformal prediction framework could also be applicable to other domains involving spatial or functional data uncertainty, such as medical imaging or environmental monitoring.
Video predictive models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, long-horizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences, inherently lacking the 3D geometric understanding necessary for precise spatial reasoning and physical consistency. We introduce a Structured 4D Latent Predictive Model, which predicts the evolution of a scene's 3D structure in a structured latent space conditioned on observations and textual instructions. Our representation encodes the scene holistically and can be decoded into diverse 3D formats, enabling a more complete and 3D consistent scene understanding. This structured 4D latent predictive model serves as a planner, generating future scenes that are translated into executable actions by a goal-conditioned inverse dynamics module. Experiments demonstrate that our model generates futures with strong visual quality, substantially better 3D consistency and multi-view coherence compared to state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, exhibits robust generalization to novel visual conditions, and proves effective on real-world robotic platforms. Our website is available at https://structured-4d-model.github.io/.
Primary: Harvard University
All Institutions: MIT, Harvard University
This paper presents a Structured 4D Latent Predictive Model for robot planning that predicts future 3D scene structures in a sparse voxel latent space, enabling more geometrically consistent and robust manipulation compared to 2D video-based planners. The work is a significant contribution to the intersection of 3D generative modeling and robotics, offering a compelling alternative to end-to-end policies and 2D video planners by explicitly modeling 3D dynamics. The experimental results are strong, demonstrating state-of-the-art performance on several benchmarks and successful real-world deployment. The technical approach is well-motivated and rigorously evaluated.
The paper proposes a Structured 4D Latent Predictive Model for robot planning. The core innovation lies in moving from 2D video prediction to 3D latent space prediction using sparse voxel grids. The architecture leverages a pre-trained encoder/decoder (from TRELLIS) to map between multi-view images and structured 3D latents. The predictive model itself is split into a Single Dynamics Model (SD) for geometry/position and a Latent Generator (LG) for features, both using conditional flow matching. This is then coupled with a goal-conditioned inverse dynamics module to generate actions. The approach is technically sound, leveraging recent advances in 3D generation (3DGS, sparse voxels) and flow matching. The separation of geometry and feature dynamics is a practical design choice to handle the complexity of 3D generation.
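For readers unfamiliar with conditional flow matching, the regression target it trains against can be sketched as follows. The sketch assumes the common linear-interpolation path between Gaussian noise and data; whether the paper uses this exact path, and how conditioning enters, is not stated here. Function names are illustrative.

```python
import random

def cfm_training_pair(x1, rng):
    """Build one flow-matching training example for a data point x1.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    whose velocity target is x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    t = rng.random()
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return xt, t, target

def cfm_loss(predicted_velocity, target):
    """Mean squared error between a model's predicted velocity and the target."""
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target)) / len(target)
```

A network would receive `xt`, `t`, and the conditioning (observations, text) and be trained to minimise `cfm_loss`; at inference, integrating the learned velocity from noise yields a sample.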
The experiments cover simulation (ManiSkill3, LIBERO, RLBench) and real-world deployment. The paper demonstrates superior 3D consistency and multi-view coherence compared to video-based baselines (UniPi, TesserAct). Success rates on manipulation tasks are competitive with or better than imitation learning baselines (Diffusion Policy, DP3), particularly in zero-shot generalization to visual/viewpoint changes. The real-world experiment on a block-in-basket task provides strong empirical validation. The ablation studies on camera views and inverse dynamics inputs are thorough.
The paper provides detailed descriptions of the architecture, training objectives (flow matching), and data preparation. It references specific pre-trained models (TRELLIS, DINOv2, CLIP) and datasets (ManiSkill3, LIBERO). The website link suggests code availability. The use of standard benchmarks enhances reproducibility.
The method relies on calibrated multi-view RGB-D observations for the initial state reconstruction, which can be a limitation in single-view or uncalibrated real-world settings. The computational cost of 3D latent generation and decoding might be higher than 2D video generation. The reliance on a pre-trained 3D encoder/decoder means the method is tied to the capabilities of those models.
This work advances the field of embodied AI by providing a more geometrically grounded approach to robot planning. It has potential applications in autonomous robotics, simulation-to-real transfer, and interactive AI agents. The focus on 3D consistency addresses a key bottleneck in current video-based planning methods.
Embodied task planning asks an agent to turn a natural-language instruction into an executable sequence of actions in a physical scene, and is a building block for household, assistive, and service robots. Recent prompting-based and reinforcement-learning planners generate fluent action text but lack a cheap deterministic check that the produced plan is valid in the target world, while high-fidelity simulation is too slow to serve as an inner-loop training signal. The general problem is therefore how to obtain verifiable supervision and rewards for embodied planners without relying on string-level matching or full simulation. Here we show that a single BDDL specification, automatically constructed from open-world video evidence or curated tasks, can serve as a shared interface for data construction, plan verification, and reward design. A video-to-BDDL parser, an LLM verifier, and a lightweight symbolic engine together supply dense feedback at millisecond latency. We further introduce GroupAdapt, a difficulty-aware length schedule that uses the in-batch group pass rate as a zero-cost signal so that hard prompts get wider length tolerance and automatically tighten as their pass rate improves. Under the guidance of the proposed verifier and GroupAdapt schedule, the 8B planner attains a Strict-Pass score of 97.3 on BEHAVIOR-1000, yielding a 25.9 percent relative improvement over the Qwen3-8B baseline. This result exceeds the strongest large-model baseline by 3.5 percent, while simultaneously compressing the response length by 79 percent to 207 tokens, demonstrating both effectiveness and efficiency.
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong University of Science and Technology, University of London
This paper presents a significant advancement in embodied AI by introducing a BDDL-centric pipeline that integrates symbolic verification with reinforcement learning, enabling compact and correct task planning for 8B models that outperform larger baselines. The rigorous evaluation and clear methodology make it a valuable contribution to the field of robotics and machine learning.
The paper proposes a coherent pipeline for embodied task planning that bridges the gap between open-world natural language instructions and executable symbolic plans. The core methodological innovation lies in the use of BDDL (Behavior Domain Definition Language) as a unified interface for data construction, verification, and reward design. Specifically, the authors introduce a video-to-BDDL parser to generate training data from open-world videos, an LLM verifier to ensure semantic consistency, and a lightweight symbolic engine for millisecond-latency verification. The training methodology combines Supervised Fine-Tuning (SFT) with Symbolic-Reinforcement Learning (using DAPO). A key technical contribution is "GroupAdapt," a difficulty-aware length scheduling mechanism that uses the in-batch group pass rate to dynamically adjust length tolerance, allowing harder prompts to have more flexibility while enforcing conciseness on easier ones. This approach effectively decouples correctness learning from length compression, addressing a common failure mode in LLM planning where early compression leads to errors.
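The idea behind a difficulty-aware length schedule can be sketched as below. This is a hypothetical illustration, not the paper's GroupAdapt formula: the linear interpolation, the token bounds, and the per-token penalty are all assumed values chosen to show the mechanism.

```python
def group_pass_rate(passed_flags):
    """Fraction of sampled plans in a group that passed symbolic verification."""
    return sum(passed_flags) / len(passed_flags)

def length_tolerance(pass_rate, tight=256, loose=1024):
    """Assumed linear schedule: hard prompts (low pass rate) get a wider token budget."""
    return int(round(tight + (loose - tight) * (1.0 - pass_rate)))

def shaped_reward(passed, n_tokens, tolerance, penalty_per_token=0.001):
    """Correctness reward minus a penalty on tokens beyond the current tolerance."""
    base = 1.0 if passed else 0.0
    overflow = max(0, n_tokens - tolerance)
    return base - penalty_per_token * overflow

# Usage: a prompt whose group passes 1 of 4 samples gets a loose budget,
# which tightens automatically once more of its samples pass.
hard_budget = length_tolerance(group_pass_rate([True, False, False, False]))
easy_budget = length_tolerance(group_pass_rate([True, True, True, True]))
```

The pass rate is already computed for the group-based RL update, which is why it costs nothing extra to reuse as a difficulty signal.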
The experimental evaluation is rigorous and comprehensive. The authors evaluate on the BEHAVIOR-1K benchmark, specifically B-100 and B-1000, using metrics like Strict-Pass (SP), Engine-Pass (EP), and Goal Completion Ratio (GCR). The results show that the proposed 8B model significantly outperforms both its same-size Qwen3-8B baseline and larger models (e.g., Gemma-4-31B) in terms of SP score (97.3% on B-1000) while maintaining competitive performance on other metrics. The ablation studies effectively demonstrate the contribution of each component: SFT initialization, symbolic reward shaping, and GroupAdapt. The analysis of length compression is particularly strong, showing a 79% reduction in response length without sacrificing correctness. The inclusion of out-of-domain mathematical reasoning tasks (AIME, MATH) serves as a sanity check to ensure that the length compression does not degrade general reasoning capabilities, which is a valuable addition.
The paper provides detailed descriptions of the methodology, including the BDDL structure, the symbolic engine logic, and the RL hyperparameters (DAPO settings, group size, learning rates). The appendix contains extensive details on data construction, action library expansion, and reward landscape analysis. The use of open-weight models (Qwen3, Gemma) and standard benchmarks (BEHAVIOR-1K) enhances reproducibility. However, the specific implementation of the video-to-BDDL parser and the LLM verifier (likely proprietary or custom-built) might present some challenges for exact replication, although the logical flow is clear. The apparent absence of released code for the symbolic engine and RL training loop is the primary barrier to full reproducibility, but the paper provides sufficient detail for a competent researcher to reimplement both.
The paper acknowledges several limitations. First, the method is a planning model and does not handle low-level control, which is a necessary layer for real-world deployment. Second, the reliance on BDDL requires robust scene understanding and object grounding, which can be noisy in real-world settings. The paper notes that real-time scene-to-BDDL construction is an open problem. Third, the performance is evaluated in simulation; real-world transferability is not demonstrated. Finally, the method's effectiveness is tied to the quality of the BDDL specifications and the action library, which may need manual curation or extensive LLM-assisted expansion for new domains.
This work has significant implications for the development of autonomous robots and embodied AI systems. By providing a scalable and verifiable method for training planners, it addresses a critical bottleneck in making robots capable of following complex, natural language instructions in unstructured environments. The emphasis on efficiency (shorter responses) and correctness (symbolic verification) aligns with the industry's need for reliable and deployable AI systems. The use of open-world video data for training also suggests a path towards more data-efficient and generalizable planning models. However, the reliance on simulation and symbolic representations may limit immediate applicability in highly dynamic or unstructured real-world scenarios without significant additional engineering.
Traditional robot programming is challenging: it requires orchestrating multimodal perception, managing physical contact dynamics, and handling diverse configurations and execution failures. We introduce ASPIRE (Agentic Skill Programming through Iterative Robot Exploration), a continual learning system that autonomously writes and refines robot control programs in a code-as-policy paradigm while compounding experience into a reusable skill library. ASPIRE discovers skills that persist across tasks, simulation and real-world settings, and embodiments. It operates in an open-ended loop with three components: (1) a closed-loop robot execution engine that exposes fine-grained multimodal traces, enabling autonomous failure diagnosis, repair synthesis, and validation; (2) a continually expanding skill library that distills validated fixes into reusable, transferable knowledge; and (3) evolutionary search that generates diverse task sequences and control programs to explore beyond single-trajectory refinement. ASPIRE surpasses prior methods by up to 77% on LIBERO-Pro manipulation under perturbation, 72% on Robosuite bimanual handover, and 32% on BEHAVIOR-1K long-horizon household tasks. Its accumulated library also enables zero-shot generalization to unseen long-horizon tasks: on LIBERO-Pro Long, ASPIRE achieves 31% success versus 4% for prior methods despite their use of test-time reasoning and retries. Finally, simulation-discovered skills provide initial evidence of sim-to-real transfer, substantially reducing real-robot programming effort across different embodiments and robot APIs.
Primary: UC Berkeley
All Institutions: UC Berkeley
ASPIRE presents a compelling agentic framework for autonomous skill discovery in robotics, demonstrating significant empirical gains in simulation benchmarks through iterative code refinement and skill library accumulation, though real-world transfer remains preliminary.
The paper proposes ASPIRE, a framework for agentic skill programming in robotics. The core methodology involves a continual learning loop where an LLM-based agent writes, executes, and refines robot control code (code-as-policy). Key components include a closed-loop execution engine providing multimodal traces for failure diagnosis, a persistent skill library for distilling reusable fixes, and evolutionary search to generate diverse task sequences. The approach attempts to move beyond single-trajectory refinement by compounding experience into a transferable skill library. The methodology is technically sound, leveraging recent advances in LLM-based code generation and robotic simulation. However, the novelty is somewhat incremental; the combination of LLMs for code generation, simulation-based self-improvement, and skill libraries has been explored in various forms (e.g., RoboGen, RT-2, various LLM-robotics works). The specific contribution here is the "agentic" loop with evolutionary search for skill discovery, which is a reasonable engineering synthesis rather than a fundamental theoretical breakthrough.
The evaluation covers LIBERO-Pro, Robosuite, BEHAVIOR-1K, and LIBERO-Pro Long. The reported improvements are significant (up to 77% on LIBERO-Pro, 31% vs 4% on LIBERO-Pro Long). These results are compelling and suggest strong empirical performance. The use of standard benchmarks adds credibility. However, the comparison to "prior methods" needs careful scrutiny: the abstract notes that the prior methods use test-time reasoning and retries, so the evaluation protocols are asymmetric and each side's inference budget should be reported explicitly. The sim-to-real transfer claim is mentioned as "initial evidence," which is a weak point for a high-impact claim. The experiments are extensive but largely confined to simulation, with real-world results being preliminary.
The paper describes a complex system involving LLMs, simulation environments, and evolutionary search. While the components are standard, the specific integration and hyperparameters for the agentic loop are crucial. No code link appears in the available text, so code availability is unconfirmed. The reliance on specific LLM APIs and simulation setups might pose reproducibility challenges for others without similar compute resources. The "skill library" mechanism needs clear definition in terms of storage and retrieval to be fully reproducible.
The primary limitation is the heavy reliance on simulation for skill discovery and the weak evidence for sim-to-real transfer. The "agentic" nature implies high compute costs and latency, which may not be suitable for real-time control. The approach may struggle with tasks requiring precise physical dynamics that are hard to capture in simulation or with LLM-generated code. The evaluation on real robots is limited to "initial evidence," lacking rigorous statistical analysis or long-term stability tests. The generalization to "unseen long-horizon tasks" is promising but relies on the assumption that discovered skills are composable in novel ways, which is not always guaranteed.
This work contributes to the automation of robot programming, potentially lowering the barrier to entry for deploying robots in complex tasks. It aligns with the trend of using LLMs for embodied AI. However, it also raises questions about the reliability and safety of autonomous code generation in physical systems. The potential for widespread adoption in industrial settings is high, but the current limitations in real-world robustness must be addressed.
Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal. We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences. Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis. These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward. We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation, shows compositionality of behavior not present in the data, and allows users to steer the policy towards different behaviors at test time without retraining. Blog post with videos available at https://freeform-pl.github.io/fpl.website/
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Stanford University, Toyota Research Institute
This paper presents a significant advancement in preference-based reinforcement learning for robotics by introducing Freeform Preference Learning, which leverages natural language to define multi-dimensional reward axes, enabling more flexible and effective policy optimization in long-horizon manipulation tasks.
The paper introduces Freeform Preference Learning (FPL), a framework that moves beyond binary pairwise comparisons in preference learning. Instead of asking "which is better?", it allows annotators to define natural-language axes (e.g., "speed", "safety") and provide preferences along these specific dimensions. The core technical innovation lies in training a language-conditioned reward model that maps a trajectory and a preference axis to an axis-specific scalar reward. This reward model is then used to train a reward-conditioned policy; the abstract does not specify the policy-training algorithm. This approach decouples the definition of quality from the optimization, allowing for multi-objective steering at test time. The methodology addresses the ambiguity of binary preferences in complex, long-horizon tasks by providing dense, semantic feedback.
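Learning an axis-specific reward from pairwise preferences can be sketched with a Bradley-Terry loss. This is an assumption-laden illustration rather than FPL's actual model: a separate linear weight vector per axis stands in for language conditioning, trajectories are reduced to hand-made feature vectors, and all names are hypothetical.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))

def axis_reward(weights, axis, features):
    """Scalar reward of a trajectory (as features) along one preference axis."""
    return sum(w * f for w, f in zip(weights[axis], features))

def preference_loss(weights, axis, preferred, other):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_preferred - r_other)."""
    margin = axis_reward(weights, axis, preferred) - axis_reward(weights, axis, other)
    return softplus(-margin)

def sgd_step(weights, axis, preferred, other, lr=0.5):
    """One gradient step on the named axis only; other axes are left untouched."""
    margin = axis_reward(weights, axis, preferred) - axis_reward(weights, axis, other)
    grad_margin = -1.0 / (1.0 + math.exp(margin))  # d/dmargin of softplus(-margin)
    weights[axis] = [
        w - lr * grad_margin * (p - o)
        for w, p, o in zip(weights[axis], preferred, other)
    ]
```

Because each comparison only updates the reward along its own axis, a trajectory can be preferred for "speed" and dispreferred for "safety" without the two labels conflicting, which is the ambiguity a single binary preference cannot express.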
The evaluation is robust, covering four real-world and two simulated long-horizon manipulation tasks. The key result is a 38 percentage point improvement over sparse-reward and binary-preference baselines. This is a significant empirical gain, suggesting that the granularity of feedback provided by freeform axes is crucial for learning high-quality policies in complex environments. The paper also reports qualitative benefits: learning dense progress signals without explicit subtask segmentation, demonstrating compositionality (behaviors not seen in training data can be composed), and enabling zero-shot steering of behavior at test time. These results strongly support the claim that FPL provides a more flexible and effective interface for human-in-the-loop learning.
The paper provides a blog post with videos, which is helpful for qualitative assessment. However, no code repository is linked; only the blog post URL is provided. For full reproducibility, code and pre-trained models would be necessary. The abstract mentions "dense progress signals without explicit subtask segmentation," which implies a level of generalization that might be sensitive to implementation details of the reward model and policy trainer. While the method is conceptually clear, the lack of a linked public code repository makes independent replication difficult at this stage.
The primary limitation is the reliance on natural language understanding for both defining axes and potentially interpreting them during reward modeling. If the language model fails to align the semantic meaning of the axis with the actual trajectory features, the reward signal may be noisy or misleading. Additionally, the complexity of the annotation task increases for users; defining multiple axes and providing pairwise comparisons for each might be more cognitively demanding than simple binary choices, potentially leading to annotator fatigue. The "compositionality" claim, while promising, may be limited to the specific distribution of axes and trajectories seen during training.
FPL has significant potential to democratize robot learning by making the reward specification process more intuitive and flexible for non-experts. By allowing users to steer behavior via natural language, it bridges the gap between high-level human intent and low-level robotic control. This could accelerate the deployment of robots in unstructured environments where explicit reward engineering is infeasible. However, it also raises questions about the alignment of the learned reward model with true human values, as the "axes" are user-defined and may not capture all ethical or safety nuances.
General-purpose robot policies should be modeled as dynamical systems, yet many VLA and generative imitation policies still rely on present observations or short windows. This Markovian shortcut fails in memory-dependent manipulation: identical observations can demand different actions after different histories. We present Chronos, a physics-informed full-history framework for non-Markovian long-horizon manipulation. The key idea is to elevate observation history from auxiliary context to the latent state of the policy dynamics. At each physical control step, Chronos forms one state-representative token by fusing observation and proprioception, so the token sequence is aligned one-to-one with physical time. A selective state space model propagates this causal historical state, which conditions a multimodal coarse action prior through implicit maximum likelihood estimation (IMLE). This prior is then refined by a second-order Schrodinger-inspired bridge that predicts acceleration fields, yielding smoother and more physically grounded robot motion. Across 16 simulated tasks and 4 real-world experiments, Chronos is evaluated on precision insertion, general manipulation, and memory-dependent long-horizon control. On RMBench, where success requires remembering task phase, Chronos achieves 73.6% average success, outperforming Markovian VLA baseline pi0.5 by +62.4 percentage points, a 6.6x relative gain, while using 10x fewer parameters. It also surpasses the memory VLA Mem-0 by 22.8 points while using over 30x fewer parameters. In real-world dual-arm experiments using a single RGB camera, Chronos achieves 78% average success over four tasks, including 72% on the three memory-dependent tasks, whereas pi0.5 achieves 7% overall and 0% on the memory-dependent subset. These results suggest that history should not be treated as auxiliary context, but as the latent state of the manipulation policy.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology
Chronos introduces a physics-informed, full-history state-space framework for non-Markovian manipulation, achieving state-of-the-art performance on memory-dependent benchmarks with significantly fewer parameters than existing VLA models. The paper makes a significant technical contribution by addressing the non-Markovian nature of long-horizon robotic manipulation through a novel combination of Selective State Space Models for full-history encoding and a Schrödinger-inspired second-order bridge for action refinement. The methodology is rigorous, with a clear theoretical derivation linking quantum mechanical concepts to action-space acceleration fields, and the empirical results are strong, particularly on RMBench where it outperforms much larger memory-augmented VLAs. The approach is highly relevant to the current landscape of robot learning, offering a scalable and efficient alternative to large-scale transformer-based policies for tasks requiring temporal memory. The comprehensive evaluation across simulated and real-world tasks, along with detailed ablations, provides strong evidence for the efficacy of the proposed method.
The paper proposes Chronos, a framework addressing the non-Markovian nature of long-horizon manipulation. The core methodological contributions are twofold: (1) A full-history state representation using a Selective State Space Model (Mamba) that treats the entire observation history as the latent state, rather than using it as auxiliary context or a short window. This allows for precise temporal credit assignment across the full trajectory. (2) A physics-informed action generation module based on a "Schrödinger-inspired bridge." This module uses Implicit Maximum Likelihood Estimation (IMLE) to generate a coarse multimodal prior, which is then refined by a second-order differential equation solver that predicts acceleration fields. The derivation from the Schrödinger equation via Madelung transformation to a quantum Hamilton-Jacobi equation provides a theoretical justification for modeling action refinement as a physical process involving position stabilization and velocity dissipation. The approach is theoretically grounded and distinct from standard diffusion or flow-matching policies by explicitly modeling acceleration and using a quartic noise schedule compatible with second-order dynamics.
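To make the idea of second-order refinement concrete, here is a minimal sketch of an action being pulled toward a target by an acceleration that combines position stabilization with velocity dissipation. It is not the paper's method: the acceleration field here is a hand-written, critically damped spring rather than a learned network, and `refine_action`, `stiffness`, `dt`, and `steps` are hypothetical names and values chosen for illustration.

```python
import math


def refine_action(coarse, target, stiffness=25.0, dt=0.01, steps=200):
    """Refine a coarse action toward `target` using second-order dynamics.

    Assumption: the acceleration is a fixed critically damped spring, standing
    in for the learned acceleration field described in the review.
    """
    damping = 2.0 * math.sqrt(stiffness)  # critical damping coefficient
    pos = [float(c) for c in coarse]
    vel = [0.0] * len(pos)
    for _ in range(steps):
        for i in range(len(pos)):
            # Acceleration = position stabilization + velocity dissipation.
            acc = -stiffness * (pos[i] - target[i]) - damping * vel[i]
            vel[i] += dt * acc     # semi-implicit Euler: velocity first
            pos[i] += dt * vel[i]  # then position, using the new velocity
    return pos


coarse = [0.8, -0.5, 0.1]   # hypothetical coarse action (e.g. from a prior)
target = [0.0, 0.2, 0.3]    # hypothetical refined action
refined = refine_action(coarse, target)
print([round(x, 3) for x in refined])
```

Because the update acts on acceleration rather than directly on position, the trajectory from coarse to refined action has continuous velocity, which is one plausible reading of why the review describes the resulting motion as smoother.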
The evaluation is comprehensive, covering 16 simulated tasks and 4 real-world experiments. The results are compelling, particularly on RMBench, where Chronos achieves a 73.6% average success rate, significantly outperforming Markovian baselines like pi0.5 (+62.4 points) and memory-augmented VLAs like Mem-0 (+22.8 points), while using substantially fewer parameters (0.3B vs >10B for Mem-0). On RoboTwin 2.0, it achieves state-of-the-art performance in general manipulation. The ablation studies effectively isolate the contributions of the SSM memory and the second-order bridge, demonstrating that the acceleration-based refinement provides smoother and more precise actions, especially in contact-rich tasks like precision insertion. The real-world results on dual-arm manipulation further validate the transferability of the learned policies.
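The reported RMBench and real-world figures can be cross-checked with simple arithmetic. The sketch below derives the baseline scores implied by the stated gaps. It assumes the gaps are plain differences of average success rates, and that the real-world averages are unweighted per-task means of rounded percentages, so the implied per-task values are approximate. Variable names are illustrative.

```python
# Reported RMBench numbers (percent and percentage points).
chronos = 73.6
gain_over_pi05 = 62.4
gain_over_mem0 = 22.8

# Implied baseline averages, assuming the gaps are simple differences.
pi05 = chronos - gain_over_pi05    # about 11.2
mem0 = chronos - gain_over_mem0    # about 50.8
relative_gain = chronos / pi05     # about 6.6x, matching the abstract

# Real-world: implied success on the one non-memory task, assuming
# unweighted means over four tasks and over three memory-dependent tasks.
chronos_fourth = 4 * 78.0 - 3 * 72.0   # Chronos
pi05_fourth = 4 * 7.0 - 3 * 0.0        # pi0.5

print(round(pi05, 1), round(mem0, 1), round(relative_gain, 1))
print(chronos_fourth, pi05_fourth)
```

The implied pi0.5 average of roughly 11.2% is consistent with the stated 6.6x relative gain, so the two headline figures agree with each other.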
The paper provides a project page and code repository link. The methodology is described with sufficient mathematical detail, including the derivation of the acceleration target and the specific noise schedules. The use of standard components (Mamba, PointNet, DINOv2) facilitates implementation. However, the specific hyperparameters for the Schrödinger bridge integration steps and the IMLE latent update dynamics are crucial for reproduction but only partially detailed in the text. The claim of "memory-efficient training" via chunked perception is a practical detail that aids reproducibility.
The paper acknowledges that in fully observable, local-geometry-dominated tasks (e.g., Put Bottles Dustbin), Chronos slightly underperforms strong Markovian diffusion policies like DP3. This suggests that the overhead of full-history modeling may not always be beneficial when the present state is a sufficient statistic. Additionally, the reliance on a single RGB camera in real-world experiments might limit performance in complex lighting or occlusion scenarios compared to multi-view setups. The theoretical derivation, while elegant, is a specific projection of quantum mechanics concepts to control theory, and its generalizability to other domains beyond robotics is unclear.
This work advances the field of robotic manipulation by providing a robust solution to the long-standing problem of memory-dependent control. By demonstrating that full-history modeling can be efficient and effective, it challenges the prevailing trend of scaling VLA models with short-context windows. The physics-informed action generation could inspire more physically grounded generative models in other control domains. The significant performance gap on memory benchmarks highlights the limitations of current foundation models in temporal reasoning, guiding future research towards better temporal architectures. Chronos introduces a physics-informed, full-history state-space framework for non-Markovian manipulation, achieving state-of-the-art performance on memory-dependent benchmarks with significantly fewer parameters than existing VLA models.
Large-scale dexterous grasp datasets encode rich priors over hand-object interaction, but their use has largely been confined to grasp generation and pick-and-place manipulation. We study whether such data can instead support functional dexterity in articulated tool use, where a robot must acquire a tool, maintain contact, and operate its functional moving parts. We adapt a hierarchical imitation learning framework that combines high-level hand sub-goal prediction with a low-level goal-conditioned controller. We construct a 355k-trajectory grasp-pretraining dataset from large-scale dexterous grasp annotations and use it to pretrain the low-level controller. The controller is then fine-tuned on downstream task demonstrations. To evaluate this setting, we introduce DexCraft, a simulation benchmark with six articulated tool-use tasks requiring coordinated finger motion. Across simulation and real-world experiments, our approach outperforms end-to-end diffusion policy baselines and hierarchical policies trained from scratch. In the real world, it improves full-task success by 33.3 percentage points over DP3. These results show that grasp datasets can serve not only as resources for grasp synthesis, but also as scalable pretraining data for contact-rich dexterous manipulation. Videos are shown on https://yingyuan0414.github.io/grasp2dexterity/ .
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The paper presents a compelling method for leveraging large-scale grasp datasets to enable dexterous tool use, demonstrating significant performance gains through hierarchical imitation learning and pretraining.
The paper proposes a hierarchical imitation learning framework for dexterous tool use. The core methodological contribution is the adaptation of a low-level goal-conditioned controller (based on Diffusion Policy) pre-trained on a large-scale synthetic grasp dataset (G2D-Pretrain, derived from Dexonomy). The high-level policy predicts 16-DoF hand keypoints as sub-goals, addressing the insufficiency of coarse gripper-centric sub-goals for dexterous hands. The approach effectively bridges the gap between static grasp synthesis data and dynamic, contact-rich manipulation tasks by leveraging the rich kinematic priors in grasp datasets. The hierarchical decomposition (high-level planning, low-level execution) is well-motivated and technically sound, particularly the semantic mapping of joint spaces between the Shadow hand (pretraining) and LEAP hand (fine-tuning).
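The high-level/low-level split described above can be sketched as a simple control loop. This is only a structural illustration under stated assumptions: the toy policies are stand-ins (a fixed sub-goal and a proportional controller, not the paper's keypoint predictor or Diffusion Policy-based controller), and `run_hierarchical_episode`, `replan_every`, and the toy environment are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def run_hierarchical_episode(
    high_level: Callable[[Vector], Vector],
    low_level: Callable[[Vector, Vector], Vector],
    step_env: Callable[[Vector, Vector], Vector],
    obs: Vector,
    horizon: int = 20,
    replan_every: int = 5,  # assumed re-planning interval
) -> Tuple[Vector, List[Vector]]:
    """Roll out a two-level policy: sub-goals on top, actions below."""
    subgoal: Vector = []
    actions: List[Vector] = []
    for t in range(horizon):
        if t % replan_every == 0:
            subgoal = high_level(obs)      # predict a new hand sub-goal
        action = low_level(obs, subgoal)   # goal-conditioned action
        obs = step_env(obs, action)
        actions.append(action)
    return obs, actions


# Toy stand-ins (assumptions, not the paper's models).
GOAL = [1.0, -0.5, 0.25]


def toy_high_level(obs: Vector) -> Vector:
    return list(GOAL)  # always propose the same keypoint target


def toy_low_level(obs: Vector, subgoal: Vector) -> Vector:
    return [0.5 * (g - o) for o, g in zip(obs, subgoal)]  # move halfway


def toy_env(obs: Vector, action: Vector) -> Vector:
    return [o + a for o, a in zip(obs, action)]


final_obs, actions = run_hierarchical_episode(
    toy_high_level, toy_low_level, toy_env, obs=[0.0, 0.0, 0.0]
)
print([round(x, 4) for x in final_obs])
```

In this framing, grasp pretraining would affect only `low_level`: the controller learns to reach hand sub-goals before it ever sees a downstream task, while `high_level` is trained on task demonstrations.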
The evaluation includes a new simulation benchmark, DexCraft, with six articulated tool-use tasks. The paper provides extensive ablation studies comparing end-to-end policies (DP, DP3), hierarchical policies from scratch, and their pre-trained counterparts. The results demonstrate significant improvements, particularly in the real-world setting where the proposed method improves full-task success by 33.3 percentage points over DP3. The sample efficiency analysis further supports the claim that pretraining reduces the need for downstream demonstrations. The inclusion of both simulation and real-world experiments strengthens the validity of the claims, although the real-world evaluation is limited to three tasks and a single robot setup.
The paper provides detailed descriptions of the data augmentation process for G2D-Pretrain, the policy architectures, and the experimental setups. The project website hosts videos, which helps qualitative verification, though it is unclear whether code is released. The use of standard simulators (ManiSkill3) and datasets (Dexonomy) facilitates replication. However, the teleoperation setup and the exact implementation of the semantic joint mapping for the LEAP hand would need further detail for exact replication.
The reliance on manually annotated sub-goals for training the high-level policy limits scalability. The simulation benchmark uses single object instances per task, which may not fully capture the generalization capabilities required for diverse object geometries. The real-world evaluation is constrained by the specific hardware setup (Franka + LEAP Hand) and does not explore the impact of tactile feedback or online adaptation, which are critical for robust dexterous manipulation.
This work significantly advances the field of dexterous manipulation by demonstrating that large-scale grasp datasets, previously underutilized for dynamic tasks, can serve as powerful pretraining resources. This could lower the barrier to entry for learning complex manipulation skills by reducing the need for costly real-world demonstrations. The DexCraft benchmark provides a valuable resource for evaluating articulated tool use, encouraging further research in this area. The paper presents a compelling method for leveraging large-scale grasp datasets to enable dexterous tool use, demonstrating significant performance gains through hierarchical imitation learning and pretraining.