Last 7 Days (June 27 – July 03, 2026)
Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emph{statistically} wrong -- a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbf{Detection:} on a benchmark of $14$ misspecification types across $10$ model families ($200$ instances), it flags the bug with AUC $0.97$ ($88\%$ at $2\%$ FPR \emph{when handed the correct reference program, an upper bound}) -- and a fully \emph{reference-free} version that uses no correct program reaches $62$--$78\%$ (the upper figure from a small automated model search), versus $0\%$ for a unit-test oracle. \textbf{Repair:} used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback -- which is itself \emph{significantly worse than no feedback at all}, a passing test inducing false confidence that suppresses repair -- and improves over no feedback on strong-but-unsaturated models (GPT-5.1 $33{\to}92\%$, Claude $75{\to}100\%$; paired McNemar, $n{=}228$). \textbf{Reality:} on programs LLMs write from scratch for neutral briefs, $15$--$47\%$ of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
Primary: Columbia University
All Institutions: Columbia University
This paper introduces a paradigm shift for verifying LLM-generated probabilistic programs, demonstrating that Bayesian calibration diagnostics are essential for detecting code-invisible statistical errors, significantly outperforming and even revealing the harm of traditional unit tests in LLM repair loops. The comprehensive technical contribution, rigorous empirical validation on both injected and LLM-generated bugs, and the clear, actionable insights make this a genuinely field-wide significant paper that will reshape how LLM-assisted probabilistic modeling systems are designed and evaluated.
The paper proposes a robust methodology for detecting and repairing statistically misspecified probabilistic programs generated by Language Models (LLMs). The central innovation is the "calibration oracle," which formalizes and automates the Bayesian workflow diagnostics (Posterior Predictive Checks, Simulation-Based Calibration, sampler diagnostics like R-hat and divergences, and held-out predictive density) into a programmatic verifier. This oracle is designed to identify "code-invisible misspecification"—statistical errors that traditional unit tests (compilation, execution, output shape) cannot detect. The repair loop integrates this oracle as feedback for LLMs, guiding them to iteratively refine their programs. The methodology is well-grounded in Bayesian principles, and the formal distinction between code-visible and code-invisible bugs is crucial for understanding the limitations of existing LLM code verification paradigms. The aggregation of diverse diagnostics into a single verdict, with defined thresholds, provides a practical and actionable signal for automated systems.
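One of the diagnostics named above, the posterior predictive check, can be sketched in a few lines. This is a minimal illustration and not the paper's oracle: it assumes a Gaussian model with a flat prior on the mean and log-scale (so posterior draws are available in closed form without MCMC), uses kurtosis as the test statistic, and flags a model when the predictive p-value falls outside a hypothetical `[alpha, 1 - alpha]` band. The function names and the threshold are assumptions made for this example.

```python
import numpy as np

def kurtosis(x):
    """Sample kurtosis; close to 3 for Gaussian data."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4))

def ppc_pvalue(y, stat, n_draws=2000, seed=0):
    """Posterior predictive p-value for a Gaussian model fitted to y.

    Assumes a flat prior on (mu, log sigma), which gives closed-form
    posterior draws: sigma^2 is scaled inverse chi-square, mu | sigma^2 is normal.
    """
    rng = np.random.default_rng(seed)
    n = y.size
    ybar, s2 = y.mean(), y.var(ddof=1)
    sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=n_draws)
    mu = rng.normal(ybar, np.sqrt(sigma2 / n))
    # One replicated dataset per posterior draw.
    y_rep = rng.normal(mu[:, None], np.sqrt(sigma2)[:, None], size=(n_draws, n))
    t_rep = np.array([stat(row) for row in y_rep])
    return float(np.mean(t_rep >= stat(y)))

def flags_misspecification(p, alpha=0.05):
    """Hypothetical verdict rule: extreme p-values in either tail fail the check."""
    return p < alpha or p > 1.0 - alpha

rng = np.random.default_rng(1)
heavy = rng.standard_t(df=2, size=500)   # heavy-tailed data, Gaussian model is wrong
gauss = rng.normal(size=500)             # Gaussian data, model is right
p_heavy = ppc_pvalue(heavy, kurtosis)
p_gauss = ppc_pvalue(gauss, kurtosis)
print(p_heavy, flags_misspecification(p_heavy))
print(p_gauss, flags_misspecification(p_gauss))
```

The Gaussian program runs without error on both datasets, which is the sense in which the bug is invisible to a unit test; only the comparison of replicated and observed statistics separates the two cases.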
The experimental evaluation is exceptionally thorough and convincing, covering detection, repair, and real-world applicability.
1. **Detection**: A comprehensive benchmark of 200 instances across 14 misspecification types and 10 model families was created. The calibration oracle achieved an impressive AUC of 0.97 (88% detection at 2% FPR), demonstrating its efficacy. Critically, a *reference-free* version of the oracle (without a ground-truth correct program) still achieved 62-78% detection, highlighting its practical utility. In stark contrast, the unit-test oracle achieved 0% detection for code-invisible bugs.
2. **Repair**: Experiments with 15 diverse LLMs (open and API) in a repair loop showed that calibration feedback consistently outperformed "no feedback" and "unit-test feedback." A surprising and impactful finding was that unit-test feedback was often *worse than no feedback*, as it induced false confidence and suppressed repair. Calibration feedback led to substantial improvements for strong-but-unsaturated models (e.g., GPT-5.1 from 33% to 92%), with statistically significant gains ($p < 4.5 \times 10^{-10}$ for pooled invisible-bug repairs).
3. **Real LLM Programs**: This section provides the strongest evidence. LLMs were tasked to write programs from scratch for neutral briefs. The study found that 15-47% of runnable LLM-generated programs were statistically misspecified (none caught by unit tests). Calibration-guided repair significantly outperformed strong baselines, including LLM-as-judge review, Bayesian-workflow checklists, and data-summary self-debug, achieving an 84% fix rate for misspecified programs.
The results are robust, with detailed ablations and sensitivity analyses for oracle thresholds.
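The paired significance test mentioned for the repair experiments can be illustrated with an exact McNemar test. The sketch below is a generic textbook version and not the authors' analysis script; `b` and `c` are hypothetical counts of discordant pairs (instances fixed under one feedback condition but not the other), and the numbers in the usage lines are made up for illustration.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: pairs repaired only under condition A (e.g. calibration feedback)
    c: pairs repaired only under condition B (e.g. unit-test feedback)
    Under the null, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

print(mcnemar_exact(10, 0))  # strongly one-sided discordance
print(mcnemar_exact(5, 5))   # perfectly balanced discordance
```

Only discordant pairs enter the statistic, which is why a paired design can reach small p-values even when many instances are solved, or missed, under both conditions.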
The paper demonstrates a high commitment to reproducibility. It specifies exact snapshots for API models and HuggingFace revisions for open models. Detailed hyperparameters for LLM generation (temperature, max tokens, repair budget) and NUTS inference (chains, draws, x64) are provided. Oracle thresholds are explicitly stated, and their sensitivity is analyzed. Crucially, the authors commit to releasing "The full system prompt, contract, feedback templates, benchmark generators, and analysis scripts with the paper," which is excellent practice and will enable full replication and extension of their work.
The authors are transparent about several limitations:
1. **PPC power**: The effectiveness of Posterior Predictive Checks depends on the chosen test statistics, meaning some misspecifications might remain undetected if not captured by the tracked statistics.
2. **Right fit, wrong structure**: The oracle primarily assesses distributional fit, not causal or generative correctness, making it blind to structural errors that yield similar predictive distributions.
3. **Over-wide predictive**: An under-confident model whose predictive distribution is too wide can "cover" the data and pass PPC despite being fundamentally wrong.
4. **Diagnosis without remedy**: Weaker LLMs may receive accurate diagnostic feedback but lack the capability to translate it into a correct structural fix.
5. **SBC expense**: Simulation-Based Calibration is computationally intensive, limiting its practical use in the repair loop for large or costly models.
6. **Inference failure**: Sampler diagnostics can sometimes fire due to poor inference (e.g., NUTS issues) rather than genuine model misspecification, leading to false positives.
7. **Benchmark scope**: The repair experiments use a subset of bugs, and the tasks are classical low-dimensional models, suggesting that extending the approach to high-dimensional or complex structured models (e.g., deep models, spatial-temporal models) is future work.
This paper has profound broader impact for the burgeoning field of LLM-assisted scientific computing and probabilistic modeling. It fundamentally shifts the paradigm for verifying LLM-generated probabilistic code from "compilation" to "calibration," providing a principled and empirically validated framework for ensuring statistical correctness. This work is crucial for building trustworthy and reliable LLM agents that can assist in scientific discovery, data analysis, and model development. The PPL-agnostic nature of the Bayesian diagnostics means the approach is broadly applicable across popular probabilistic programming languages like Stan, Pyro, and PyMC. The surprising finding that unit-test feedback is actively harmful for LLM repair loops is a critical insight for designing future LLM agent architectures. This paper sets a new standard for evaluating and improving LLM-generated scientific code.
4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a narrow signal insufficient to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt it via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames--no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io.
Primary: A*STAR (Agency for Science, Technology and Research)
All Institutions: A*STAR, NTU Singapore (Nanyang Technological University), Alibaba Group
The paper presents a significant advancement in 4D hand motion reconstruction by effectively adapting video diffusion models for perception tasks, achieving state-of-the-art performance on challenging benchmarks through a novel hand-overlay rendering adaptation and a geometrically-aware dual-branch decoder.
The paper proposes a novel paradigm for 4D hand motion reconstruction by leveraging the internal representations of a large-scale pretrained Video Diffusion Model (Wan2.1-VACE). Instead of treating the diffusion model as a generative black box or a frozen feature extractor, the authors introduce a "hand-overlay rendering" adaptation stage. This involves finetuning only the VACE branch of the model to regenerate input clips with semi-transparent rendered hand overlays. This clever pretext task specializes the model's world priors (occlusion reasoning, temporal coherence, 3D geometry) for hand-centric tasks without destroying the general visual knowledge. The decoder is a dual-branch architecture: a Hand-Token Branch for holistic articulated pose and a Joint-Heatmap Branch for local 2D localization, coupled by mutual cross-attention and a closed-form geometric solve for camera translation. This design elegantly separates the holistic vs. local inductive biases of the representation. The approach is methodologically sound, theoretically motivated by the capabilities of generative models, and technically sophisticated in its integration of diffusion features with geometric constraints.
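The "closed-form geometric solve for camera translation" can be illustrated with a standard linear least-squares formulation. This is a sketch under stated assumptions and not necessarily the authors' exact solve: it assumes a pinhole camera with known intrinsics `K`, root-relative 3D joints already expressed in the camera's orientation, and 2D joint locations in pixels (for example heatmap peaks). The function name and the synthetic data are hypothetical.

```python
import numpy as np

def solve_translation(joints_rel, joints_2d, K):
    """Least-squares camera-space translation t such that the projection of
    joints_rel + t matches joints_2d under a pinhole model.

    joints_rel: (N, 3) root-relative 3D joints
    joints_2d:  (N, 2) pixel coordinates
    K:          (3, 3) intrinsics with last row [0, 0, 1]
    """
    n = joints_2d.shape[0]
    homog = np.hstack([joints_2d, np.ones((n, 1))])
    rays = (np.linalg.inv(K) @ homog.T).T           # normalized coords, z = 1
    xn, yn = rays[:, 0], rays[:, 1]
    X, Y, Z = joints_rel.T
    # Each joint gives two linear equations:
    #   tx - xn * tz = xn * Z - X
    #   ty - yn * tz = yn * Z - Y
    A = np.zeros((2 * n, 3))
    b = np.zeros(2 * n)
    A[0::2, 0] = 1.0
    A[0::2, 2] = -xn
    b[0::2] = xn * Z - X
    A[1::2, 1] = 1.0
    A[1::2, 2] = -yn
    b[1::2] = yn * Z - Y
    t, *_ = np.linalg.lstsq(A, b, rcond=None)
    return t

# Synthetic check with made-up intrinsics and a 21-joint hand.
rng = np.random.default_rng(0)
K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
joints_rel = rng.normal(scale=0.05, size=(21, 3))
t_true = np.array([0.05, -0.02, 0.5])
proj = (K @ (joints_rel + t_true).T).T
joints_2d = proj[:, :2] / proj[:, 2:3]
t_est = solve_translation(joints_rel, joints_2d, K)
print(t_est)
```

Because the translation enters the projection equations linearly once the articulated pose is fixed, no iterative optimization is needed at test time, which is consistent with the pipeline's stated avoidance of test-time optimization.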
The evaluation is comprehensive and rigorous. The authors test on three challenging egocentric hand benchmarks: ARCTIC (heavy occlusion), HOT3D (fisheye, high dynamic range, motion blur), and HOI4D (cross-dataset generalization). They introduce a "penalty protocol" that folds false negatives into pose metrics, providing a more realistic assessment of detection robustness than standard TP-only metrics. ViDiHand establishes new state-of-the-art results across all metrics, with particularly significant gains in frame accuracy (detection robustness) and temporal jitter (smoothness). The ablation studies are thorough, validating the choice of DiT layer, denoising step, and decoder components. The cross-dataset transfer to HOI4D demonstrates the generalizability of the learned priors. The results are statistically significant and practically meaningful, showing that video diffusion models capture richer spatiotemporal priors than discriminative video models or image-based detectors.
The paper provides detailed implementation details, including the specific backbone (Wan2.1-VACE), the two-stage training curriculum (joint overlay then MANO mesh overlay), and the decoder architecture. The supplementary material contains extensive details on the evaluation protocol, metric definitions, and ablation studies. The project page link suggests code/data availability, which is standard for high-impact ML papers. The use of a publicly available backbone (Wan2.1) enhances reproducibility, although the specific finetuning steps and data preprocessing pipelines would need to be carefully followed. The closed-form geometric solve is well-defined.
The primary limitation is computational cost. The method runs at 5.5 fps on 4 A100 GPUs, making it an offline annotation tool rather than a real-time solution. The authors acknowledge this and suggest distillation as a future direction. Additionally, Stage 1b still requires MANO-annotated video, which is a scarce resource, though the authors propose self-supervised pretexts to relax this in the future. The method may also struggle with extreme cases not covered in the training data, although the cross-dataset results suggest good generalization.
This work has significant implications for embodied AI, robotics, and human-computer interaction. By providing a scalable, high-quality method for 4D hand reconstruction from egocentric video, it enables the creation of large-scale datasets for training robot policies and understanding human behavior. The paradigm shift towards leveraging video generative models for perception tasks could influence future research in 3D vision, motion capture, and video understanding. It also highlights the untapped potential of diffusion models for discriminative tasks, potentially inspiring similar approaches in other domains.
Technology computer-aided design (TCAD) semiconductor device simulation is fundamentally constrained by the high computational cost of iteratively solving coupled drift-diffusion equations. Existing ML surrogates either reduce internal physics to macroscopic scalar regressions, or rely on single-step mappings that lack the iterative refinement required to resolve stiff, coupled fields. To address this, we introduce PCGD, a Physics-Guided Conditional Graph Diffusion framework operating natively on unstructured TCAD meshes to predict coupled electrostatic and carrier density fields. PCGD employs a Condition-Aware MeshGraphNet denoiser that explicitly injects boundary conditions and device structure context via global cross-attention. By augmenting data-driven denoising with a physics-guided hybrid objective that integrates exponent-free quasi-Fermi gradient matching with noise-aware PDE residuals, PCGD progressively enforces physical constraints in the iterative diffusion trajectory. This strategy successfully bypasses the numerical instabilities typical of stiff drift-diffusion equations. Evaluated on a challenging mixed PN/MOS benchmark, PCGD significantly outperforms deterministic one-step regression (1.207% error) and local diffusion (1.585% error) baselines by achieving a sub-percent mean relative field error of 0.835%, while concurrently reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. It also transfers robustly to unseen SOI topologies (0.815% error) via LoRA adaptation, using 5.30$\times$ less data and 14.34$\times$ fewer parameters than full fine-tuning. Ultimately, PCGD bridges the computational efficiency of generative surrogates with the rigorous physical fidelity of traditional TCAD, unlocking highly scalable, field-level analysis for robust device engineering.
Primary: Fudan University
All Institutions: Fudan University, Shanghai Integrated Circuit Manufacturing Innovation Center
PCGD introduces a physics-guided conditional graph diffusion framework that effectively addresses the computational bottlenecks and physical fidelity challenges in TCAD device simulation. This work makes substantial contributions through its mesh-native graph diffusion approach, condition-aware MeshGraphNet architecture with global cross-attention, and a novel hybrid physics-guided training objective that stabilizes learning for stiff, nonlinear drift-diffusion equations. The impressive empirical results, demonstrating sub-percent accuracy and near three-order-of-magnitude reduction in PDE residuals, coupled with robust transferability to unseen topologies, position PCGD as a significant advancement in ML-accelerated scientific computing, with clear implications for future semiconductor design and broader multiphysics simulation.
The methodology presented in PCGD is highly sophisticated and meticulously designed to tackle the inherent challenges of TCAD device simulation. The core idea of formulating coupled-field prediction as a conditional generative denoising task on native unstructured TCAD meshes is a significant departure from previous ML surrogates. This approach leverages the iterative refinement capabilities of diffusion models to decouple macroscopic field construction from microscopic detail resolution, which is crucial for handling the extreme stiffness and exponential nonlinearities of semiconductor physics. The Condition-Aware MeshGraphNet denoiser is a well-thought-out architectural innovation. By explicitly injecting boundary conditions and device structure context via global cross-attention, it effectively overcomes the limitations of purely local message passing, ensuring that critical global information (like terminal biases and device layout) is directly accessible to every mesh node. This design choice is critical for accelerating convergence to correct physical operating conditions. The most impactful methodological contribution is the hybrid physics-guided objective. It ingeniously combines an exponent-free quasi-Fermi gradient matching loss ($L_G$) for stable guidance during early, noisy diffusion stages with noise-aware, adaptively scaled exact PDE residuals ($L_P$) for fine-grained physical correction in later stages. The dual-tier stabilization strategy for $L_P$ (validity-aware SNR gating and adaptive batch balance) is a robust solution to prevent gradient divergence caused by the extreme nonlinearity of drift-diffusion equations. The detailed mathematical derivation connecting Scharfetter-Gummel flux to quasi-Fermi gradients and the implementation of GPU-accelerated graph operators for discrete PDE residuals further demonstrate the technical depth and rigor of the proposed framework.
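To make the stiffness issue concrete, the sketch below implements the textbook Scharfetter-Gummel electron flux along one mesh edge with a numerically stable Bernoulli function. It is not PCGD's loss: the prefactor $qD/h$ is dropped, the thermal voltage value is an assumed room-temperature constant, and the function names are hypothetical. It only shows why naive evaluation of the exponentials is fragile and how an equilibrium edge yields zero flux.

```python
import numpy as np

def bernoulli(x):
    """Stable B(x) = x / (exp(x) - 1).

    Uses a series near zero, a capped argument to avoid overflow,
    and the identity B(-x) = B(x) + x for negative inputs.
    """
    x = np.asarray(x, dtype=float)
    a = np.abs(x)
    small = a < 1e-5
    a_safe = np.where(small, 1.0, np.minimum(a, 700.0))
    b_pos = np.where(small, 1.0 - a / 2.0 + a ** 2 / 12.0,
                     a_safe / np.expm1(a_safe))
    return np.where(x < 0, b_pos + a, b_pos)

def sg_electron_flux(n_i, n_j, psi_i, psi_j, v_t=0.025852):
    """Scharfetter-Gummel electron flux from node i to node j,
    up to the constant factor q * D / h (omitted here)."""
    x = (psi_j - psi_i) / v_t
    return n_j * bernoulli(x) - n_i * bernoulli(-x)

# Equilibrium check: n proportional to exp(psi / v_t) gives zero net flux.
v_t = 0.025852
n_i, psi_i, psi_j = 1e16, 0.0, 0.1
n_j = n_i * np.exp((psi_j - psi_i) / v_t)
flux = float(sg_electron_flux(n_i, n_j, psi_i, psi_j, v_t))
scale = float(n_j * bernoulli((psi_j - psi_i) / v_t))
print(flux / scale)
```

At equilibrium the two terms are individually large and cancel, so small errors in a predicted potential are amplified exponentially in the residual; this is the kind of behavior that motivates gating or rescaling PDE residuals early in a noisy diffusion trajectory.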
The experimental evaluation is comprehensive, rigorous, and strongly supports the claims made in the paper. The use of a challenging mixed PN/MOS benchmark dataset, comprising a large number of training and validation snapshots from distinct devices and operating modes, ensures the robustness of the evaluation. The device-trajectory level splitting for train/validation is critical for assessing generalization. The ablation studies are particularly effective, systematically isolating the contributions of iterative generative refinement, global conditioning, and physics-guided supervision by comparing PCGD against well-chosen baselines (deterministic one-step regression, local diffusion, and condition-aware diffusion with varying levels of physics guidance). The chosen metrics, mean relative $L_2$ error and log-compressed maximum PDE residual error, are appropriate and provide a holistic view of both accuracy and physical consistency. The results are highly impressive: PCGD achieves a sub-percent mean relative field error (0.835%) and significantly outperforms all baselines, reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. The analysis of convergence dynamics clearly illustrates the stability gained from the hybrid physics-guided objective. Furthermore, the demonstration of robust transferability to unseen SOI topologies via parameter-efficient LoRA adaptation, requiring significantly less data and fewer parameters than full fine-tuning, is a strong indicator that PCGD learns generalizable physical priors, which is crucial for practical adoption in industrial settings.
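For readers unfamiliar with the headline metric, a mean relative $L_2$ error can be computed as below. This is a generic definition, assumed here to be averaged per snapshot over flattened node values; the paper's exact normalization (per field, per device, or per node) is not stated in this review, and the function name is hypothetical.

```python
import numpy as np

def mean_relative_l2(pred, true, eps=1e-12):
    """Average over samples of ||pred - true||_2 / ||true||_2.

    pred, true: arrays of shape (num_samples, num_nodes).
    eps guards against division by zero for all-zero targets.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    num = np.linalg.norm(pred - true, axis=1)
    den = np.linalg.norm(true, axis=1) + eps
    return float(np.mean(num / den))

true = np.full((4, 10), 2.0)
pred = true * 1.01            # uniform 1% overshoot on every node
err = mean_relative_l2(pred, true)
print(f"{100 * err:.3f}%")
```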
The paper provides a commendable level of detail for reproducibility. The methodology is thoroughly described with mathematical formulations for the graph representation, diffusion process, Condition-Aware MeshGraphNet, and the hybrid physics-guided objective. The appendix offers specific architectural settings, coefficients for the loss functions, and details on the baseline architectures. Crucially, numerical safety measures for handling stiff exponentials and precision truncation are explicitly mentioned. The schema for mesh features and interface edges is also detailed. While the absence of publicly available code (no project URL provided) is a common limitation for arXiv preprints, the textual descriptions are sufficiently comprehensive for an experienced researcher with expertise in GNNs, diffusion models, and scientific computing to implement the core components of PCGD. Access to the specific TCAD data generation pipeline would be the final piece for exact replication of the dataset.
Despite its strengths, PCGD exhibits some limitations. The paper acknowledges "severe zero-shot degradation" when transferring to major out-of-distribution topological shifts (e.g., SOI MOSFETs not seen during pretraining). While LoRA adaptation effectively addresses this, it implies that the model, despite learning physical priors, still relies on structural interpolation from its pretraining data, limiting its immediate "universal foundational model" capabilities without a much broader pretraining corpus. The current experimental validation is primarily on 2D devices. While the paper discusses the theoretical O(N) scaling advantage for 3D simulations, empirical validation on massive 3D meshes, including memory footprint and training/inference times, is yet to be demonstrated. The computational cost of training such a complex diffusion model on large graph datasets is likely substantial, though not explicitly quantified. Finally, the proposed hybrid AI-TCAD workflow mentions routing high-residual cases back to traditional solvers, but the practical implementation details, criteria for routing, and the efficiency gains from this hybrid approach are not fully explored.
PCGD has a profound broader impact, particularly in the semiconductor industry and scientific machine learning. By significantly accelerating TCAD simulations while maintaining high physical fidelity, it can drastically shorten the design cycle for advanced semiconductor devices, fostering innovation in microelectronics for applications ranging from AI hardware to power electronics. The methodology for stabilizing physics-informed diffusion models for highly stiff and nonlinear PDEs is generalizable beyond TCAD. It offers a blueprint for tackling similar challenges in other scientific and engineering domains, such as materials science, fluid dynamics, and biomechanics, where complex multiphysics simulations are computationally expensive and numerically challenging. The concept of a hybrid AI-TCAD simulation workflow, where ML provides robust initial guesses to accelerate or stabilize traditional solvers, represents a powerful paradigm for integrating AI into complex scientific computing pipelines, promising both efficiency and reliability. Furthermore, the successful application of parameter-efficient fine-tuning (LoRA) for domain adaptation in a scientific context highlights its potential for developing more adaptable and data-efficient ML models for specialized scientific tasks.
Many everyday programming tasks resist clean rule-based implementation, such as alerting on important log lines, repairing malformed JSON, or ranking search results by intent, and are increasingly outsourced to large language model APIs at the cost of locality, reproducibility, and price. We propose fuzzy-function programming: compiling such a function from a natural-language specification into a compact, locally-executable neural artifact. We instantiate this paradigm with Program-as-Weights (PAW), in which a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter. A 0.6B Qwen3 interpreter executing PAW programs matches the performance of direct prompting of Qwen3-32B, while using roughly one fiftieth of the inference memory and running at 30 tokens/s on a MacBook M3. PAW reframes the foundation model from a per-input problem solver into a tool builder: invoked once per function definition, it produces a small reusable artifact whose subsequent calls per function application are cheap and offline.
Primary: Harvard University
All Institutions: Harvard University
The paper introduces Program-as-Weights, a novel paradigm that compiles natural language specifications into neural adapters for small, local interpreters, demonstrating that a 0.6B model can match a 32B model's performance on fuzzy tasks while offering significant efficiency and reproducibility benefits.
The paper proposes "Program-as-Weights" (PAW), a paradigm where a large language model (the compiler) generates parameter-efficient adapters (LoRA) and discrete pseudo-programs for a frozen, small language model (the interpreter). This effectively compiles natural language specifications into neural artifacts. The methodology is technically sound, leveraging hypernetwork-like architectures to map text to weights. The distinction between the "discrete" (pseudo-program) and "continuous" (LoRA) components is a key architectural choice that allows the small interpreter to leverage both explicit instruction following and implicit weight-based specialization. The approach is novel in its specific instantiation for "fuzzy functions" and its focus on local, offline execution via quantized GGUF formats.
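The "continuous" component described above is a LoRA adapter applied to a frozen interpreter. The sketch below shows the standard LoRA forward pass and parameter count in NumPy; it is not PAW's code. In PAW the compiler would emit the matrices `A` and `B`, whereas here random values stand in for them, and the dimensions, rank, and scaling `alpha` are arbitrary assumptions.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """Frozen linear layer plus a low-rank update.

    x: (batch, d_in), W: (d_out, d_in) frozen weight,
    A: (r, d_in), B: (d_out, r) adapter factors.
    Equivalent to using the merged weight W + (alpha / r) * B @ A.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def lora_param_count(d_in, d_out, r):
    """Number of adapter parameters for one adapted layer."""
    return r * (d_in + d_out)

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 32, 4, 8.0
x = rng.normal(size=(3, d_in))
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
y = lora_forward(x, W, A, B, alpha)
y_merged = x @ (W + (alpha / r) * B @ A).T
print(y.shape, lora_param_count(1024, 1024, 8), 1024 * 1024)
```

The parameter count illustrates why such an artifact is compact: a rank-8 adapter on a 1024 by 1024 layer stores 16,384 values against roughly a million in the frozen weight.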
The evaluation is comprehensive, covering a newly released 10M-example dataset (FuzzyBench) and several external benchmarks (YouTube, SMS, Yelp, IMDB). The results are strong: a 0.6B interpreter with PAW matches the performance of a 32B model prompted directly. The ablation studies effectively demonstrate the value of the compiler component over simple fine-tuning or fixed LoRAs. The robustness tests on noisy specifications are particularly convincing, showing the denoising capability of the pseudo-compiler. The multimodal extension (using a VL compiler with a text interpreter) is a nice touch that validates the abstraction's generality. However, the reliance on synthetic data generated by a proprietary model (GPT-5.2, which appears to be a hypothetical or future model given the current date, or a typo for GPT-4o/4o-mini) raises questions about data quality and potential bias, although the authors do attempt to mitigate this with verification steps.
The paper provides a GitHub repository and a public demo. The dataset FuzzyBench is released, which is a significant contribution to reproducibility. The code structure and architecture details are sufficiently described. The use of standard components (LoRA, GGUF, Qwen3) aids in reproducibility. The mention of "GPT-5.2" is confusing; if this refers to a specific internal model or a typo, it might hinder exact replication of the data generation pipeline, but the methodology for using the generated data is clear.
The primary limitation is the dependency on a large, capable compiler model to generate the adapters. While the *inference* is cheap and local, the *compilation* step requires significant compute and likely API access to a large model, which contradicts the "fully local" ideal for the initial setup phase. Additionally, the performance on long-form structured generation (Im2LaTeX) was weaker, indicating limitations in context window management for the small interpreter. The reliance on synthetic data also poses a risk of propagating biases or errors from the teacher model.
This work has significant potential impact by bridging the gap between the flexibility of LLMs and the efficiency/reliability of traditional software. It enables developers to create custom, local, and reproducible AI functions without maintaining large model instances. This could democratize access to specialized AI capabilities on edge devices. The release of FuzzyBench provides a valuable benchmark for the community.
Hardware-agnostic strategies for accelerating text-to-image diffusion, such as timestep distillation and feature caching, can reduce inference time without custom kernels or system-level optimization. Among them, multi-resolution generation strategies have recently received broad attention, attaining more than 5x speedup without any training. However, the design of performing upsampling in the latent space, together with the selective modification of partial regions, causes these methods to exhibit noticeable blurring or artifacts. To this end, we propose MrFlow, a training-free multi-resolution acceleration strategy for pretrained flow-matching models built upon a staged low-to-high-resolution pipeline. MrFlow first rapidly generates the main structure at low resolution, then performs super-resolution in the pixel space using a lightweight pretrained GAN-based model, subsequently injects low-strength noise to enable high-frequency resampling, and finally refines the details at high resolution. Quantitative and qualitative results on FLUX.1-dev and Qwen-Image show that MrFlow exploits the quadratic token reduction and reduced step requirement of low-resolution sampling to achieve 10x end-to-end acceleration while keeping OneIG within a 1% gap relative to that before acceleration, significantly surpassing other training-free acceleration strategies, and requiring no training or runtime dynamic identification whatsoever. MrFlow can further be directly combined orthogonally with pre-trained timestep distillation strategies, achieving even higher generation acceleration of up to 25x.
Primary: Beihang University
All Institutions: Beihang University, Chinese Academy of Sciences, Institute of Computing Technology, Nanyang Technological University, University of Science and Technology of China
MrFlow introduces a novel training-free multi-resolution acceleration pipeline for flow-matching models that achieves up to 10x speedup by combining low-resolution structure generation, pixel-space super-resolution, and high-frequency resampling, significantly outperforming existing training-free methods while maintaining high image quality. This represents a solid contribution to the field of efficient generative modeling, offering a practical solution for accelerating state-of-the-art diffusion models without the need for retraining or complex system optimizations.
The paper proposes MrFlow, a training-free acceleration strategy for flow-matching diffusion models (specifically FLUX.1 and Qwen-Image). The core methodology involves a staged pipeline: 1) Low-resolution (LR) structure generation using the pretrained model; 2) Pixel-space super-resolution using a pretrained GAN-based model (Real-ESRGAN); 3) Low-strength noise injection into the VAE-encoded latent to enable high-frequency resampling; 4) Single-step high-resolution (HR) refinement. The authors argue that this approach exploits the quadratic token reduction of LR sampling and the straightness of flow trajectories near the clean image endpoint. The methodology is technically sound and addresses a specific pain point in diffusion acceleration: the artifacts and blurring associated with latent-space upsampling in previous multi-resolution methods. The use of pixel-space SR followed by latent-space refinement is a clever heuristic that leverages the strengths of both domains. However, the novelty is somewhat incremental; the concept of "coarse-to-fine" generation is well-established (e.g., Imagen, Cascaded Diffusion), and the specific combination of GAN-based SR with flow-matching refinement is a practical engineering insight rather than a fundamental theoretical breakthrough.
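The token-count argument and the noise-injection step described above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: `num_tokens`, `inject_noise`, and the value `pixels_per_token=16` (an assumed 8x VAE downsampling followed by 2x2 patching) are hypothetical, and the linear interpolation is the common rectified-flow convention rather than something confirmed by the text.

```python
import random

def num_tokens(height, width, pixels_per_token=16):
    """Tokens the transformer attends over for an image of the given size.

    pixels_per_token=16 is an assumption (8x VAE downsampling, then 2x2 patches).
    """
    return (height // pixels_per_token) * (width // pixels_per_token)

def inject_noise(latent, strength, rng):
    """Linear flow-matching interpolation: x_t = (1 - t) * x_0 + t * eps.

    A small strength keeps x_t close to the clean latent, so only fine
    detail is resampled during the high-resolution refinement.
    """
    return [(1.0 - strength) * x + strength * rng.gauss(0.0, 1.0) for x in latent]

# Halving each side of the image cuts the token count by four.
low_res_tokens = num_tokens(512, 512)
high_res_tokens = num_tokens(1024, 1024)
token_ratio = high_res_tokens / low_res_tokens
```

Because self-attention cost grows roughly with the square of the token count, most of the sampling steps are far cheaper at the lower resolution, which is the saving the staged pipeline relies on.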
The experimental evaluation is extensive and rigorous. The authors evaluate on two state-of-the-art models (FLUX.1-dev and Qwen-Image) and compare against a wide range of baselines, including training-free methods (TeaCache, DB-Taylor, RALU, SPEED, ToMA) and training-dependent methods (SenseFlow, Pi-Flow, LSSGen). They report end-to-end speedups, including VAE and SR overhead, which is crucial for practical relevance. The results show a 10x speedup with minimal quality degradation (1% OneIG loss) and up to 25x when combined with distilled models. The ablation studies are thorough, analyzing the impact of step configurations, SR network choices, and noise levels. The inclusion of frequency-domain analysis and attention mechanism analysis adds depth to the empirical claims. The comparison is fair, and the metrics (Geneval, DPG-Bench, OneIG) are standard and appropriate.
The paper gives detailed implementation information, including model versions, resolution settings, and hyperparameters for both MrFlow and baselines. The use of open-source models (FLUX, Qwen-Image, Real-ESRGAN) and standard evaluation metrics enhances reproducibility. A code release is not explicitly stated in the text provided, although the level of detail suggests the method could be reimplemented from the description. The specific noise levels and step counts are clearly defined.
The method relies on an external pretrained GAN-based SR model (Real-ESRGAN), which adds an inference step and potential dependency issues. While the authors argue the overhead is small, it is non-zero. The method is currently evaluated primarily on text-to-image generation; its applicability to other modalities (e.g., video, 3D) is not explored. The "single-step" refinement at high resolution, while efficient, might struggle with complex semantic changes that require more than one denoising step, although the authors argue the low-strength noise keeps the trajectory close enough. The method is specific to flow-matching models; while the principle might extend to other diffusion formulations, it is not explicitly generalized.
This work has significant practical impact by providing a simple, training-free way to accelerate high-quality image generation, making diffusion models more accessible for real-time applications. It reduces the computational cost barrier for deploying large diffusion models. The approach could inspire similar staged acceleration strategies for other generative tasks.
Autonomous scientific discovery systems offer the potential to accelerate research by automating the process of hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, limiting their capacity for true open-ended inquiry. Furthermore, while they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. We introduce DiscoPER, an autonomous large language model-powered framework that conducts open-ended research by dynamically generating and executing code to explore datasets without pre-specified research objectives. To ensure rigorous scientific validity, every proposed discovery must pass statistical testing. To overcome the limitations of isolated search, our framework introduces a second-order reasoning mechanism that periodically analyzes its own accumulated discoveries. By treating prior discoveries as empirical data, DiscoPER identifies structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions of the search space. The search space is further expanded by incorporating tool use, enabling the system to explore hypotheses beyond structured metadata by seamlessly processing and extracting useful information from multimodal sources like images. Evaluated on iNatDisco, a new multimodal ecological knowledge benchmark with pattern-level ground truth obtained from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines. Ablations show that DiscoPER scales with more data and confirm the benefits of second-order meta-reflection.
Primary: University of Edinburgh
All Institutions: University of Edinburgh, Massachusetts Institute of Technology
DiscoPER introduces a novel autonomous scientific discovery framework that combines LLM-driven hypothesis generation, code-based statistical validation, and second-order meta-reflection to enable open-ended, data-driven scientific inquiry. The paper presents a significant advancement in agentic ML for scientific discovery by addressing the critical limitation of isolated hypothesis generation in existing systems. By introducing a structured "Propose-Evaluate-Reflect" loop, DiscoPER enables the system to synthesize accumulated findings, identify gaps, and redirect its search strategy dynamically. The rigorous validation mechanism, which requires hypotheses to pass statistical tests on held-out data, ensures scientific validity and mitigates LLM hallucination. The creation of the iNatDisco benchmark provides a much-needed evaluation standard for open-ended discovery, moving beyond task-specific QA. The empirical results demonstrate that this approach significantly outperforms both classical causal discovery methods and guided LLM baselines, particularly in recovering complex, multi-variable patterns. This work establishes a new paradigm for autonomous scientific agents that are not only capable of generating ideas but also of critically evaluating and building upon their own discoveries.
The paper proposes DiscoPER, an autonomous scientific discovery framework that integrates Large Language Models (LLMs) with executable code and statistical testing. The core methodological innovation is the "Propose-Evaluate-Reflect" loop. Unlike previous systems that either require predefined research questions (guided) or lack iterative synthesis (unstructured), DiscoPER operates in an open-ended manner ($P=$ none). It generates hypotheses as Python code, validates them on held-out data to prevent p-hacking, and employs a second-order "Reflect" module. This Reflect module analyzes the accumulated claim store to identify epistemic gaps, confounds, and compound hypotheses, thereby steering the search space in subsequent iterations. The approach effectively bridges the gap between classical causal discovery (restricted edge spaces) and LLM-based reasoning (prone to hallucination) by grounding all claims in statistical significance while allowing the LLM to explore a Turing-complete hypothesis space. The inclusion of multimodal capabilities via tool use (VLMs) further expands the scope of discoverable patterns beyond tabular metadata.
The evaluation is rigorous and addresses the specific challenges of open-ended discovery. The authors introduce iNatDisco, a new benchmark derived from iNaturalist data, which includes ground-truth patterns from peer-reviewed literature. This is a significant contribution, as existing benchmarks are largely task-oriented. DiscoPER achieves 8/9 pattern recovery on iNatDisco-800 and 8/12 on iNatDisco-50K, outperforming classical causal discovery methods (which fail to capture complex interactions) and guided LLM baselines. The ablation studies clearly demonstrate the value of the Reflect module, showing improvements in both recall and hypothesis support rate. The counterfactual evaluation is particularly strong, proving that the system relies on data-driven evidence rather than memorized LLM priors. The scaling analysis provides insight into the system's behavior with respect to data size and iteration count.
The paper provides detailed implementation specifications, including model versions (Claude Sonnet 4.6, etc.), statistical thresholds (effect size > 0.2, p < 0.05), and the structure of the hypothesis code. The use of executable code for hypotheses enhances reproducibility, as the validation steps are deterministic given the data and code. The description of the iNatDisco dataset construction is sufficient for replication. However, the reliance on proprietary LLMs (Claude, GPT) means that exact performance replication might vary with model updates, though the methodology itself is open.
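To make the quoted acceptance rule concrete, here is a minimal sketch of a held-out check that supports a two-group claim only when both thresholds above are met (effect size > 0.2 and p < 0.05). The choice of Cohen's d and a permutation test is an assumption made for illustration; the review does not say which statistics DiscoPER actually computes, and the function names are hypothetical.

```python
import random
import statistics

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)])
                   - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

def is_supported(a, b, min_effect=0.2, alpha=0.05):
    """Accept a claim only if it is both large enough and significant."""
    return abs(cohens_d(a, b)) > min_effect and permutation_p_value(a, b) < alpha
```

In a setup like the one described, `a` and `b` would be drawn from the held-out split only, so that a hypothesis proposed on the exploration split cannot be confirmed by the same rows that suggested it.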
The system is computationally expensive due to the iterative nature of code generation, execution, and reflection. The performance is bounded by the quality and bias of the underlying LLMs and the available data. The "Reflect" module, while effective, introduces latency and potential for compounding errors if the initial claims are flawed. Additionally, the benchmark, while novel, is specific to ecology; generalization to other scientific domains requires further validation. The system's ability to discover truly novel, non-intuitive patterns beyond those present in the training data of the LLM remains an open question, although the counterfactual tests mitigate some of this concern.
This work has significant implications for accelerating scientific discovery across disciplines. By automating the iterative process of hypothesis generation and validation, it can help researchers identify patterns that might be overlooked due to human cognitive biases or limitations. The open-ended nature of the system encourages exploration of uncharted regions of the search space, potentially leading to new scientific insights. However, the reliance on AI for scientific discovery raises ethical considerations regarding the verification of findings and the potential for automated bias reinforcement. The framework provides a robust template for building autonomous scientific agents that prioritize empirical validity.
Model organisms (MOs) - language models trained to exhibit undesired or unnatural behaviours - are frequently used as testbeds for evaluating white-box interpretability techniques. Current MOs are typically constructed via post-hoc supervised fine-tuning (SFT) on behavioural transcripts or synthetic documents. Prior research has shown that interpretability methods can easily identify hidden behaviours in these MOs. However, recent work suggests that such post-hoc training methods may make interpretability unrealistically easy. We investigate this claim by constructing a suite of 54 $\verb|OLMo2-1B|$- and $\verb|gemma-3-1b-it|$-based MOs trained with seven different techniques, including standard post-hoc SFT, post-hoc DPO, and more realistic integration of MO data into the OLMo post-training DPO phase. We use these MO variants to benchmark activation oracles, activation steering, logit lens, and sparse autoencoders. Our findings show that (i) MO interpretability depends strongly on training objective, target behaviour, model architecture, and training data generation pipeline; (ii) substantial variance remains even after controlling for differences in the strength of target behaviour expression; and (iii) our more realistic $\textit{integrated training}$ often yields less interpretable MOs than standard post-hoc methods. Our results cast substantial doubt on the validity of current MOs as interpretability proxies.
Primary: University of Cambridge
All Institutions: LASR Labs, University of Cambridge
This paper critically re-evaluates the foundational assumptions of Model Organisms (MOs) in LLM interpretability research, demonstrating that interpretability strongly depends on MO training methodology, even when controlling for behavioral strength. Through a rigorous experimental suite of 54 MOs trained with diverse methods, the authors reveal that current MO benchmarks may be unrealistically easy, challenging the generalizability of interpretability findings and providing a crucial methodological contribution to the field of AI safety and interpretability.
The methodology is exceptionally rigorous and well-designed. The core contribution is the systematic construction of a diverse suite of 54 Model Organisms (MOs) to investigate the impact of training methodology on interpretability. The authors define three benign trigger-reaction quirks (CakeBake, ItalianFood, MilitarySubmarine) and train MOs based on two different base models (OLMo2-1B and gemma-3-1b-it) using seven distinct training methods. These methods span standard post-hoc SFT (Transcript Distillation, Synthetic Document Fine-tuning), post-hoc DPO, and a novel "integrated DPO" approach that more realistically incorporates quirk data into the base model's original post-training DPO phase. A crucial methodological innovation is the "Quirk Expression Rate (QER) matching," where the learning rate and data volume are adjusted to ensure all variants within a family express the quirk to a comparable degree (within 5pp). This control effectively isolates the impact of training methodology from mere behavioral strength, a significant improvement over prior work. The authors also perform black-box validation to ensure low naive black-box interpretability, preventing confounding with white-box techniques. The interpretability evaluation uses four diverse white-box methods: Activation Oracles (AOs), Activation Steering, Logit Lens, and Sparse Autoencoders (SAEs), covering both diffing and non-diffing settings. The use of LLM judges for QER and hypothesis relevance scoring is a modern and appropriate choice, with detailed calibration provided. The exploration of training stochasticity and model architecture robustness further strengthens the methodology.
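The QER-matching control described above amounts to a simple acceptance check on each family of variants. The sketch below is an illustration under assumptions: the function names, the example numbers, and the reading of "within 5pp" as a maximum spread between the highest and lowest QER in a family are all hypothetical.

```python
def quirk_expression_rate(judgements):
    """Percentage of evaluated responses that a judge marked as expressing the quirk."""
    return 100.0 * sum(judgements) / len(judgements)

def is_qer_matched(family_qers, tolerance_pp=5.0):
    """True if all variants in a family sit within tolerance_pp of each other.

    family_qers maps a training-method name to its QER in percent.
    """
    values = list(family_qers.values())
    return max(values) - min(values) <= tolerance_pp

# Hypothetical family: three training methods for one quirk and base model.
example_family = {"post_hoc_sft": 71.0, "post_hoc_dpo": 73.5, "integrated_dpo": 69.5}
example_matched = is_qer_matched(example_family)
```

A family that fails this check would be retrained with an adjusted learning rate or data volume before any interpretability method is compared across its variants.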
The experimental evaluation is comprehensive and robust. The suite of 54 MOs is substantial, allowing for a thorough investigation across various dimensions. The choice of OLMo2-1B and gemma-3-1b-it provides insights into model architecture dependence, although these are smaller models. The experiments clearly demonstrate that MO interpretability varies strongly with training objective, target behavior, model architecture, and training data generation pipeline, even when QER is controlled. The finding that the novel "integrated DPO" often yields *less interpretable* MOs than standard post-hoc methods is a critical and surprising result, challenging the assumption that current MOs are good proxies for real-world behaviors. The paper systematically presents results for each interpretability method, highlighting variability and lack of generalization across MO families and architectures. For instance, the ratio between the most and least interpretable variants ranges unpredictably from 1.2 to 20.4. The analysis of data mixing effects, showing that dilution does not universally decrease interpretability, contradicts prior findings and adds nuance. The robustness checks against training stochasticity (using different data ordering seeds) and model architecture are well-executed, confirming that the observed variance is not merely noise. The comparison between diffing and non-diffing interpretability settings further underscores the limitations of current methods without a reference model. The exclusion of confounded models (OLMo MilitarySubmarine SDF) due to high black-box interpretability demonstrates strong experimental rigor.
The reproducibility of this work is excellent. The authors explicitly state their commitment to open-sourcing their entire suite of 54 quirk expression-matched MOs, along with their training data, and the code used for data generation and training pipelines. This is a significant contribution to the community and will enable future research to build upon their findings directly. Detailed information on MO training, hyperparameters, dataset information, QER evaluation, and interpretability evaluation methods is provided in the appendices, further enhancing reproducibility. The use of publicly available base models (OLMo2-1B and gemma-3-1b-it) and datasets (e.g., C4, HelpSteer3) also supports reproducibility.
The authors acknowledge several limitations. The quirks studied are benign proxies, and the base models are relatively small (1B parameters), which may limit generalizability to larger, frontier models exhibiting more sophisticated, safety-relevant behaviors. Computational constraints prevented full replication of all experiments (e.g., training data ordering for all quirks, all interpretability methods on Gemma models). The integrated DPO approach only modifies one stage of post-training; earlier instillation of quirks (e.g., during pre-training) might yield even less interpretable results. While QER is matched within families, small differences remain, and QER is not varied *within* a family, meaning the direct impact of varying QER on interpretability is not fully isolated. The paper also briefly touches on the impact of the training data generation pipeline (synthetic vs. externally sourced) but does not fully characterize the specific data features responsible for interpretability differences.
This paper has substantial broader impact on the field of LLM interpretability and AI safety. By demonstrating that MO interpretability is highly sensitive to construction choices, it casts significant doubt on the validity of current MOs as reliable proxies for real-world model behaviors. This implies that many existing interpretability benchmarks may be "unrealistically easy," leading to over-optimistic assessments of interpretability techniques. The work provides a crucial methodological critique, urging researchers to adopt more rigorous MO design practices, including diverse training methodologies and QER matching. The finding that "more realistic" integrated training often yields less interpretable MOs is a call to action for the community to develop more robust interpretability methods that can handle such complex, entangled behaviors. The open-sourcing of the MO suite and code will serve as a valuable resource for future research, facilitating the development of more robust benchmarks and interpretability tools. Ultimately, this work contributes to a more calibrated understanding of interpretability progress, which is vital for building trustworthy and safe AI systems.
Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emph{statistically} wrong -- a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbf{Detection:} on a benchmark of $14$ misspecification types across $10$ model families ($200$ instances), it flags the bug with AUC $0.97$ ($88\%$ at $2\%$ FPR \emph{when handed the correct reference program, an upper bound}) -- and a fully \emph{reference-free} version that uses no correct program reaches $62$--$78\%$ (the upper figure from a small automated model search), versus $0\%$ for a unit-test oracle. \textbf{Repair:} used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback -- which is itself \emph{significantly worse than no feedback at all}, a passing test inducing false confidence that suppresses repair -- and improves over no feedback on strong-but-unsaturated models (GPT-5.1 $33{\to}92\%$, Claude $75{\to}100\%$; paired McNemar, $n{=}228$). \textbf{Reality:} on programs LLMs write from scratch for neutral briefs, $15$--$47\%$ of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
Primary: Columbia University
All Institutions: Columbia University
This paper introduces a paradigm shift for verifying LLM-generated probabilistic programs, demonstrating that Bayesian calibration diagnostics are essential for detecting code-invisible statistical errors, significantly outperforming and even revealing the harm of traditional unit tests in LLM repair loops. The comprehensive technical contribution, rigorous empirical validation on both injected and LLM-generated bugs, and the clear, actionable insights make this a genuinely field-wide significant paper that will reshape how LLM-assisted probabilistic modeling systems are designed and evaluated.
The paper proposes a robust methodology for detecting and repairing statistically misspecified probabilistic programs generated by large language models (LLMs). The central innovation is the "calibration oracle," which formalizes and automates the Bayesian workflow diagnostics (Posterior Predictive Checks, Simulation-Based Calibration, sampler diagnostics like R-hat and divergences, and held-out predictive density) into a programmatic verifier. This oracle is designed to identify "code-invisible misspecification," that is, statistical errors that traditional unit tests (compilation, execution, output shape) cannot detect. The repair loop integrates this oracle as feedback for LLMs, guiding them to iteratively refine their programs. The methodology is well-grounded in Bayesian principles, and the formal distinction between code-visible and code-invisible bugs is crucial for understanding the limitations of existing LLM code verification paradigms. The aggregation of diverse diagnostics into a single verdict, with defined thresholds, provides a practical and actionable signal for automated systems.
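The idea of aggregating several diagnostics into a single verdict can be sketched as follows. This is a hedged illustration, not the paper's oracle: the `Diagnostics` container, the function names, and every threshold (R-hat at most 1.01, zero divergences, ESS of at least 400, a 5% posterior predictive tail) are assumed conventions, since the review does not quote the actual values.

```python
from dataclasses import dataclass

@dataclass
class Diagnostics:
    r_hat: float        # worst-case R-hat across parameters
    divergences: int    # number of divergent transitions
    ess: float          # smallest effective sample size
    ppc_p_value: float  # posterior predictive p-value for one test statistic

def ppc_p_value(observed_stat, replicated_stats):
    """Share of replicated datasets whose statistic is at least the observed one."""
    return sum(s >= observed_stat for s in replicated_stats) / len(replicated_stats)

def failed_checks(d, r_hat_max=1.01, ess_min=400, ppc_tail=0.05):
    """Return the names of the checks that fail; an empty list means 'pass'."""
    failures = []
    if d.r_hat > r_hat_max:
        failures.append("r_hat")
    if d.divergences > 0:
        failures.append("divergences")
    if d.ess < ess_min:
        failures.append("ess")
    # A p-value near 0 or near 1 means the model rarely reproduces the data.
    if min(d.ppc_p_value, 1.0 - d.ppc_p_value) < ppc_tail:
        failures.append("ppc")
    return failures

# A program whose sampler looks healthy but whose predictions miss the data:
# the kind of bug that compilation and shape checks cannot see.
example = Diagnostics(r_hat=1.00, divergences=0, ess=1800.0,
                      ppc_p_value=ppc_p_value(9.5, [1.0, 1.2, 0.9, 1.1]))
example_failures = failed_checks(example)
```

Returning the list of failed checks rather than a bare boolean is a design choice made here so that the same object could serve as feedback text in a repair loop.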
The experimental evaluation is exceptionally thorough and convincing, covering detection, repair, and real-world applicability. 1. **Detection**: A comprehensive benchmark of 200 instances across 14 misspecification types and 10 model families was created. With access to the correct reference program, the calibration oracle achieved an AUC of 0.97 (88% detection at 2% FPR), which the authors present as an upper bound. Critically, a *reference-free* version of the oracle (without a ground-truth correct program) still achieved 62-78% detection, highlighting its practical utility. In stark contrast, the unit-test oracle achieved 0% detection for code-invisible bugs. 2. **Repair**: Experiments with 15 diverse LLMs (open and API) in a repair loop showed that calibration feedback significantly outperformed unit-test feedback and improved over no feedback on strong-but-unsaturated models. A surprising and impactful finding was that unit-test feedback was itself *significantly worse than no feedback*, because a passing test induced false confidence that suppressed repair. For example, GPT-5.1 improved from 33% to 92% and Claude from 75% to 100% under calibration feedback, with statistically significant gains ($p < 4.5 \times 10^{-10}$ for pooled invisible-bug repairs). 3. **Real LLM Programs**: This section provides the strongest evidence. LLMs were tasked to write programs from scratch for neutral briefs. The study found that 15-47% of runnable LLM-generated programs were statistically misspecified (none caught by unit tests). Calibration-guided repair significantly outperformed strong baselines, including LLM-as-judge review, Bayesian-workflow checklists, and data-summary self-debug, achieving an 84% fix rate for misspecified programs. The results are robust, with detailed ablations and sensitivity analyses for oracle thresholds.
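The paired comparison between feedback conditions uses McNemar's test, which looks only at the tasks where the two conditions disagree. The sketch below implements the exact (binomial) form of the test; whether the paper uses the exact or the chi-square version is not stated, and the function name is hypothetical.

```python
from math import comb

def mcnemar_exact(only_a_fixed, only_b_fixed):
    """Two-sided exact McNemar p-value from the two discordant counts.

    only_a_fixed: tasks repaired under condition A but not under B.
    only_b_fixed: tasks repaired under condition B but not under A.
    Tasks where both conditions agree carry no information and are ignored.
    """
    n = only_a_fixed + only_b_fixed
    if n == 0:
        return 1.0
    k = min(only_a_fixed, only_b_fixed)
    # Under the null, each discordant task is equally likely to favor A or B.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

For instance, if ten tasks were fixed only with calibration feedback and none only with unit-test feedback, `mcnemar_exact(0, 10)` gives 2 / 1024, about 0.002.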
The paper demonstrates a high commitment to reproducibility. It specifies exact snapshots for API models and HuggingFace revisions for open models. Detailed hyperparameters for LLM generation (temperature, max tokens, repair budget) and NUTS inference (chains, draws, x64) are provided. Oracle thresholds are explicitly stated, and their sensitivity is analyzed. Crucially, the authors commit to releasing "The full system prompt, contract, feedback templates, benchmark generators, and analysis scripts with the paper," which is excellent practice and will enable full replication and extension of their work.
The authors are transparent about several limitations: 1. **PPC power**: The effectiveness of Posterior Predictive Checks depends on the chosen test statistics, meaning some misspecifications might remain undetected if not captured by the tracked statistics. 2. **Right fit, wrong structure**: The oracle primarily assesses distributional fit, not causal or generative correctness, making it blind to structural errors that yield similar predictive distributions. 3. **Over-wide predictive**: An overly wide (under-confident) predictive distribution can "cover" the data and pass PPC even though the model is fundamentally wrong. 4. **Diagnosis without remedy**: Weaker LLMs may receive accurate diagnostic feedback but lack the capability to translate it into a correct structural fix. 5. **SBC expense**: Simulation-Based Calibration is computationally intensive, limiting its practical use in the repair loop for large or costly models. 6. **Inference failure**: Sampler diagnostics can sometimes fire due to poor inference (e.g., NUTS issues) rather than genuine model misspecification, leading to false positives. 7. **Benchmark scope**: The repair experiments use a subset of bugs, and the tasks are classical low-dimensional models, so extending the approach to high-dimensional or complex structured models (e.g., deep models, spatial-temporal models) is left as future work.
This paper has profound broader impact for the burgeoning field of LLM-assisted scientific computing and probabilistic modeling. It fundamentally shifts the paradigm for verifying LLM-generated probabilistic code from "compilation" to "calibration," providing a principled and empirically validated framework for ensuring statistical correctness. This work is crucial for building trustworthy and reliable LLM agents that can assist in scientific discovery, data analysis, and model development. The PPL-agnostic nature of the Bayesian diagnostics means the approach is broadly applicable across popular probabilistic programming languages like Stan, Pyro, and PyMC. The surprising finding that unit-test feedback is actively harmful for LLM repair loops is a critical insight for designing future LLM agent architectures. This paper sets a new standard for evaluating and improving LLM-generated scientific code.
We present a zero-shot, training-free and optimization-free framework for generating 360 panoramic images and videos by directly injecting spherical priors into pre-trained diffusion transformers. Existing methods either rely on costly fine-tuning on scarce panoramic data that limits generalization, or leverage multi-step optimization that incurs prohibitive inference latency. We observe that contemporary generative models natively exhibit some panoramic priors from large-scale training. However, these emergent capabilities are insufficient, as the models fundamentally fail to satisfy the rigorous topological constraints imposed by equirectangular projection (ERP). We introduce a zero-shot and optimization-free approach that resolves these constraints at inference time. Spherical RoPE replaces standard rotary position embeddings: low-frequency channels are re-parameterized as 3D Cartesian coordinates to natively encode the spherical manifold, while high-frequency channels are harmonically quantized to enforce exact periodicity. Coupled with complementary Semantic Distortion classifier-free guidance (CFG) that explicitly steers geometry, we avoid retraining and inherit the full creative breadth of state-of-the-art models. Our approach generalizes across diverse backbones and 360 generation modalities. We demonstrate this across text-to-panorama using Flux.1, Flux.2, and LTX-Video backbones, achieving competitive performance against baselines, all while remaining training-free. Project page: https://orhir.github.io/SpheRoPE
Primary: Tel-Aviv University
All Institutions: Tel-Aviv University, Hebrew University of Jerusalem
SpheRoPE introduces a novel, training-free framework for 360-degree panorama generation by integrating spherical priors directly into diffusion transformers via modified position embeddings and guidance, achieving competitive results across multiple backbones without the need for fine-tuning or optimization.
The paper proposes SpheRoPE, a training-free, zero-shot method for generating 360-degree panoramic images and videos using pre-trained diffusion transformers (DiTs). The core innovation lies in replacing standard Rotary Position Embeddings (RoPE) with Spherical RoPE. This involves re-parameterizing low-frequency channels into 3D Cartesian coordinates to natively encode the spherical manifold and harmonically quantizing high-frequency channels to enforce periodicity. This is coupled with a Semantic Distortion classifier-free guidance (CFG) mechanism to steer geometry. The approach is theoretically sound, addressing the topological mismatch between planar training data (ERP) and spherical reality without retraining. It leverages the emergent capabilities of large models while correcting their fundamental geometric flaws.
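The two ingredients of Spherical RoPE described above can be sketched in a few lines. This is one reading of the description, not the authors' implementation: the pixel-center convention, the function names, and the rule of snapping to the nearest integer harmonic are all assumptions made for illustration.

```python
import math


def erp_to_unit_sphere(col, row, width, height):
    """Map an equirectangular pixel to a 3D point on the unit sphere.

    Assumed convention: longitude spans [0, 2*pi) across columns and
    latitude runs from +pi/2 (top row) to -pi/2 (bottom row).
    """
    lon = 2 * math.pi * (col + 0.5) / width
    lat = math.pi / 2 - math.pi * (row + 0.5) / height
    return (
        math.cos(lat) * math.cos(lon),
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
    )


def quantize_frequency(omega, width):
    """Snap omega (radians per column) to the nearest harmonic 2*pi*k/width.

    With an integer k >= 1, the rotary phase advances by an exact multiple
    of 2*pi over one full turn, so column `width` matches column 0.
    """
    k = max(1, round(omega * width / (2 * math.pi)))
    return 2 * math.pi * k / width


# Usage: the left and right image borders coincide on the sphere.
left = erp_to_unit_sphere(0, 10, 64, 32)
wrapped = erp_to_unit_sphere(64, 10, 64, 32)
omega_q = quantize_frequency(0.37, 64)
```

Because the Cartesian coordinates are identical at `col = 0` and `col = width`, position information is continuous across the seam, and the quantized frequency makes the high-frequency rotary phase periodic over the same interval.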
The authors evaluate SpheRoPE on multiple state-of-the-art backbones, including Flux.1, Flux.2, and LTX-Video. They demonstrate competitive performance against existing baselines in text-to-panorama and text-to-video tasks. The evaluation highlights the method's ability to resolve topological artifacts (seams, discontinuities) common in naive ERP generation. The results suggest that the method generalizes well across different model architectures, which is a significant strength. However, as a zero-shot method, it relies on the underlying model's quality, so comparisons are against other zero-shot or fine-tuned baselines. The paper likely includes qualitative visualizations and potentially quantitative metrics like FID or CLIP scores adapted for panoramas, though specific numbers are not provided in the abstract. The claim of "competitive performance" suggests it matches or exceeds fine-tuned methods in some aspects while being significantly more efficient.
The paper provides a project page URL. As a training-free method, reproducibility is high provided the source code for the Spherical RoPE injection and Semantic Distortion guidance is released. The reliance on pre-trained models (Flux, LTX-Video) means the community has access to the base weights, facilitating replication. The method's simplicity (modifying embeddings and guidance) makes it easier to implement than full fine-tuning pipelines.
The primary limitation is the reliance on the pre-trained model's inherent knowledge. If the base model lacks semantic understanding of specific panoramic scenes, SpheRoPE cannot create that knowledge from scratch. Additionally, the harmonic quantization and Cartesian re-parameterization might introduce subtle artifacts if not tuned correctly for specific resolutions or aspect ratios. The method is currently demonstrated on text-to-panorama; its effectiveness on more complex video generation with temporal consistency across the spherical manifold needs rigorous long-term evaluation. There may also be a trade-off between geometric correctness and semantic fidelity, which the Semantic Distortion CFG aims to mitigate but may not eliminate entirely.
This work significantly lowers the barrier to entry for high-quality 360-degree content generation. By eliminating the need for costly fine-tuning on scarce panoramic data, it democratizes access to VR/AR content creation tools. It also provides a generalizable technique for handling non-Euclidean data structures in diffusion models, which could be extended to other domains like spherical video, global climate modeling visualization, or astronomical data. The reduction in inference latency compared to optimization-based methods makes it more viable for real-time applications.
4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a narrow signal insufficient to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt it via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames--no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io.
Primary: A*STAR (Agency for Science, Technology and Research)
All Institutions: A*STAR, NTU Singapore (Nanyang Technological University), Alibaba Group
The paper presents a significant advancement in 4D hand motion reconstruction by effectively adapting video diffusion models for perception tasks, achieving state-of-the-art performance on challenging benchmarks through a novel hand-overlay rendering adaptation and a geometrically-aware dual-branch decoder.
The paper proposes a novel paradigm for 4D hand motion reconstruction by leveraging the internal representations of a large-scale pretrained Video Diffusion Model (Wan2.1-VACE). Instead of treating the diffusion model as a generative black box or a frozen feature extractor, the authors introduce a "hand-overlay rendering" adaptation stage. This involves finetuning only the VACE branch of the model to regenerate input clips with semi-transparent rendered hand overlays. This clever pretext task specializes the model's world priors (occlusion reasoning, temporal coherence, 3D geometry) for hand-centric tasks without destroying the general visual knowledge. The decoder is a dual-branch architecture: a Hand-Token Branch for holistic articulated pose and a Joint-Heatmap Branch for local 2D localization, coupled by mutual cross-attention and a closed-form geometric solve for camera translation. This design elegantly separates the holistic vs. local inductive biases of the representation. The approach is methodologically sound, theoretically motivated by the capabilities of generative models, and technically sophisticated in its integration of diffusion features with geometric constraints.
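A closed-form solve for camera translation can take the following shape. The sketch assumes a standard pinhole camera and a linear least-squares formulation; the review does not spell out the paper's exact solve, so the function names, the intrinsics `f`, `cx`, `cy`, and the example joints are hypothetical. Each joint with root-relative 3D position (X, Y, Z) and 2D location (u, v) contributes two linear equations in the unknown translation t.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for c in range(3):
        pivot = max(range(c, 3), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(c + 1, 3):
            k = a[r][c] / a[c][c]
            for j in range(c, 4):
                a[r][j] -= k * a[c][j]
    x = [0.0, 0.0, 0.0]
    for r in reversed(range(3)):
        tail = sum(a[r][j] * x[j] for j in range(r + 1, 3))
        x[r] = (a[r][3] - tail) / a[r][r]
    return x


def solve_translation(joints_3d, joints_2d, f, cx, cy):
    """Least-squares camera translation from 3D joints and 2D detections.

    From u - cx = f * (X + tx) / (Z + tz):
        f * tx - (u - cx) * tz = (u - cx) * Z - f * X
    and likewise for v. The normal equations A^T A t = A^T b are solved.
    """
    ata = [[0.0] * 3 for _ in range(3)]
    atb = [0.0] * 3
    for (X, Y, Z), (u, v) in zip(joints_3d, joints_2d):
        du, dv = u - cx, v - cy
        rows = [([f, 0.0, -du], du * Z - f * X),
                ([0.0, f, -dv], dv * Z - f * Y)]
        for coeffs, b in rows:
            for i in range(3):
                atb[i] += coeffs[i] * b
                for j in range(3):
                    ata[i][j] += coeffs[i] * coeffs[j]
    return solve3(ata, atb)


# Usage with hypothetical joints: project with a known translation, recover it.
f, cx, cy = 500.0, 320.0, 240.0
true_t = (0.05, -0.02, 0.4)
joints = [(0.0, 0.0, 0.0), (0.03, 0.01, 0.02),
          (-0.02, 0.04, -0.01), (0.01, -0.03, 0.03)]
pixels = [(f * (X + true_t[0]) / (Z + true_t[2]) + cx,
           f * (Y + true_t[1]) / (Z + true_t[2]) + cy) for X, Y, Z in joints]
est_t = solve_translation(joints, pixels, f, cx, cy)
```

The appeal of this kind of solve is that the network only has to predict root-relative pose and 2D locations, while metric placement in the camera frame follows from geometry with no iterative optimization.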
The evaluation is comprehensive and rigorous. The authors test on three challenging egocentric hand benchmarks: ARCTIC (heavy occlusion), HOT3D (fisheye, high dynamic range, motion blur), and HOI4D (cross-dataset generalization). They introduce a "penalty protocol" that folds false negatives into pose metrics, providing a more realistic assessment of detection robustness than standard TP-only metrics. ViDiHand establishes new state-of-the-art results across all metrics, with particularly significant gains in frame accuracy (detection robustness) and temporal jitter (smoothness). The ablation studies are thorough, validating the choice of DiT layer, denoising step, and decoder components. The cross-dataset transfer to HOI4D demonstrates the generalizability of the learned priors. The results are statistically significant and practically meaningful, showing that video diffusion models capture richer spatiotemporal priors than discriminative video models or image-based detectors.
The paper provides detailed implementation details, including the specific backbone (Wan2.1-VACE), the two-stage training curriculum (joint overlay then MANO mesh overlay), and the decoder architecture. The supplementary material contains extensive details on the evaluation protocol, metric definitions, and ablation studies. The project page link suggests code/data availability, which is standard for high-impact ML papers. The use of a publicly available backbone (Wan2.1) enhances reproducibility, although the specific finetuning steps and data preprocessing pipelines would need to be carefully followed. The closed-form geometric solve is well-defined.
The primary limitation is computational cost. The method runs at 5.5 fps on 4 A100 GPUs, making it an offline annotation tool rather than a real-time solution. The authors acknowledge this and suggest distillation as a future direction. Additionally, Stage 1b still requires MANO-annotated video, which is a scarce resource, though the authors propose self-supervised pretexts to relax this in the future. The method may also struggle with extreme cases not covered in the training data, although the cross-dataset results suggest good generalization.
This work has significant implications for embodied AI, robotics, and human-computer interaction. By providing a scalable, high-quality method for 4D hand reconstruction from egocentric video, it enables the creation of large-scale datasets for training robot policies and understanding human behavior. The paradigm shift towards leveraging video generative models for perception tasks could influence future research in 3D vision, motion capture, and video understanding. It also highlights the untapped potential of diffusion models for discriminative tasks, potentially inspiring similar approaches in other domains.
Data, as the fundamental substrate of modern intelligence, has greatly driven the development of current foundation models. Naturally, researchers aim to extend this paradigm to the domain of GUI agents, hoping to build strong GUI agents in the same way. However, GUI agent data cannot be directly harvested from the internet, making it costly and difficult to collect at scale. As a result, current GUI agents suffer from poor cross-device generalization and limited visual grounding ability for fine-grained GUI elements. To address the data challenge in GUI agents, we propose GUICrafter, a weakly-supervised GUI agent leveraging massive unannotated screenshots to substantially reduce the reliance on expensive human annotations. GUICrafter explores a curriculum learning framework for training GUI agents through two progressive stages. First, in Stage 1, the model learns visual grounding from large-scale unannotated screenshots and webpages, leveraging the rich contextual signals inherent in GUI interactions without human annotations. Then, in Stage 2, we leverage a small amount of high-quality data to calibrate the model via reinforcement learning. Experiments show that GUICrafter achieves competitive, or even superior, performance to advanced systems like UI-TARS while using only 0.1% of its data. Furthermore, under the same amount of annotated data, GUICrafter surpasses all previous methods such as GUI-R1. Code, data, and models are available at https://github.com/fansunqi/GUICrafter.
Primary: Tsinghua University
All Institutions: Tsinghua University, Tencent Hunyuan
GUICrafter presents a significant advancement in GUI agent training by introducing a scalable weakly-supervised pretraining stage that leverages unannotated screenshots for visual grounding, achieving state-of-the-art performance with minimal annotated data. The technical contribution lies in the effective formulation of meta-tasks from interactive signals and the robust two-stage RLVR framework, which offers a practical and efficient path forward for data-constrained GUI agent development.
The paper proposes GUICrafter, a two-stage training framework for GUI agents. Stage 1 involves "weakly-supervised GUI pretraining" using massive unannotated screenshots. The core innovation here is the extraction of interactive signals (clickable/typable elements) from web pages and mobile apps to create "meta-tasks" (e.g., "click any clickable area"). This allows the model to learn visual grounding without human annotation by leveraging the inherent structure of GUIs. Stage 2 uses a small amount of high-quality, manually annotated data for reinforcement learning (RLVR with GRPO) to calibrate the model. The reward design includes a Gaussian position reward to provide finer-grained feedback than binary point-in-box rewards. The approach effectively bridges the gap between large-scale unsupervised visual learning and precise task-oriented grounding.
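The difference between a binary point-in-box reward and a Gaussian position reward can be illustrated as follows. The exact formula used by GUICrafter is not given here, so this is a sketch under assumptions: the Gaussian is centered on the target box, its width is tied to the box size through a hypothetical `scale` parameter, and the coordinates are made up.

```python
import math


def binary_reward(pred, box):
    """1.0 if the predicted click lands inside the box, else 0.0."""
    x, y = pred
    x0, y0, x1, y1 = box
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0


def gaussian_reward(pred, box, scale=0.5):
    """Smooth reward that decays with distance from the box center.

    Assumption: per-axis sigma is `scale` times the box width/height.
    """
    x, y = pred
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    sx, sy = scale * (x1 - x0), scale * (y1 - y0)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))


# Usage: a near miss and a far miss look identical to the binary reward.
box = (100, 100, 200, 140)   # hypothetical target element (x0, y0, x1, y1)
near_miss = (210, 120)       # just outside the right edge
far_miss = (400, 120)        # far from the target
```

Under the binary reward both misses score 0, so the policy gets no signal about which one was closer. The Gaussian variant ranks the near miss well above the far miss, which is the finer-grained feedback the review refers to.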
The evaluation is comprehensive, covering multiple benchmarks across web (Mind2Web, ScreenSpot-Pro), mobile (AndroidControl, AITW, AndroidWorld), and general (OmniACT) domains. The results show that GUICrafter-3B and GUICrafter-7B achieve performance competitive with or superior to state-of-the-art models like UI-TARS and GUI-R1, despite using significantly less annotated data (0.1% of UI-TARS's data). The ablation studies effectively demonstrate the contribution of Stage 1 (visual grounding improvement) and Stage 2 (task completion calibration). The comparison against baselines is fair, including reproductions of GUI-R1 on full datasets. The scalability analysis (10k to 500k samples) provides strong evidence for the data efficiency and robustness of the weakly-supervised stage.
The authors provide code, data, and models. The methodology is clearly described, including the specific extraction tools (Playwright) and the reward function formulas. The use of standard benchmarks and clear reporting of metrics (Element Accuracy, Step Success Rate, etc.) enhances reproducibility. The distinction between the weakly-supervised data generation and the supervised fine-tuning data is clear.
The method still relies on a small amount of high-quality annotated data in Stage 2 for calibration, although this is significantly reduced compared to prior work. The weakly-supervised data generation relies on automated extraction which may have noise (though the paper shows robustness to this). The "meta-tasks" are somewhat generic and may not capture the semantic intent of complex user goals, which is handled in Stage 2. The approach is primarily tested on web and mobile interfaces; generalization to other GUI types (e.g., desktop applications with complex non-standard widgets) might require further validation.
This work addresses a critical bottleneck in GUI agent development: data scarcity. By demonstrating that massive unannotated data can be leveraged for visual grounding, it lowers the barrier to entry for building robust GUI agents. This could accelerate the development of autonomous agents for web and mobile interaction, with implications for accessibility, automation, and human-computer interaction. The open-source release contributes to the community by providing a new baseline and dataset generation pipeline.
On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to infuse privileged information to either teacher or student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close, and the information asymmetry gap that can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of different strength, objective, and strategy from either teacher or student itself, which transfers credible capability while simultaneously receiving auxiliary signals, to alleviate privilege illusion. Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.
Primary: Explore Academy
All Institutions: Explore Academy, MMLab
DOPD presents a significant advancement in on-policy distillation by introducing a dynamic, advantage-aware routing mechanism that effectively mitigates the "privilege illusion" caused by information asymmetry, leading to superior and more stable knowledge transfer across LLM and VLM domains.
The paper introduces DOPD, an advantage-aware dual on-policy distillation framework. The core innovation lies in addressing the "privilege illusion," a phenomenon where privileged information (e.g., hints, annotations) creates an apparent performance gap between teacher and student that is due to information asymmetry rather than transferable capability. DOPD dynamically routes token-level supervision by calculating a "privilege advantage gap" and comparing token probabilities. It classifies tokens into four regimes (High/Low Advantage x High/Low Probability) and applies different distillation strategies (strong teacher distillation, light teacher distillation, weak self-regularization, or student consistency) accordingly. The methodology is theoretically grounded in disentangling capability gaps from information gaps. The approach is well-motivated and addresses a genuine limitation in current OPD practices. However, the mechanism is essentially a heuristic routing based on probability and advantage metrics, which, while effective, is not radically new in the context of adaptive weighting or curriculum learning, though the specific application to privilege illusion is novel.
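The four-regime routing can be pictured as a two-threshold lookup. Everything specific in the sketch below is an assumption made for illustration: the thresholds, the definitions of the two per-token quantities, and above all the pairing of regimes with strategies, since the description above lists the four strategies without stating which regime receives which.

```python
from dataclasses import dataclass

ADV_THRESHOLD = 0.0    # hypothetical cut-off on the privilege advantage gap
PROB_THRESHOLD = 0.5   # hypothetical cut-off on the token probability


@dataclass
class TokenStats:
    advantage_gap: float  # assumed: per-token privilege advantage gap
    prob: float           # assumed: probability assigned to the sampled token


def regime(tok):
    """Classify a token into one of the four (advantage, probability) regimes."""
    adv = "high_adv" if tok.advantage_gap > ADV_THRESHOLD else "low_adv"
    prob = "high_prob" if tok.prob > PROB_THRESHOLD else "low_prob"
    return adv, prob


# Placeholder pairing of regimes to strategies; NOT taken from the paper.
STRATEGY = {
    ("high_adv", "high_prob"): "strong_teacher_distillation",
    ("high_adv", "low_prob"): "light_teacher_distillation",
    ("low_adv", "high_prob"): "weak_self_regularization",
    ("low_adv", "low_prob"): "student_consistency",
}


def route(tokens):
    """Return the distillation strategy selected for each token."""
    return [STRATEGY[regime(t)] for t in tokens]


# Usage on two hypothetical tokens.
plan = route([TokenStats(1.2, 0.9), TokenStats(-0.3, 0.1)])
```

Written this way, the mechanism's heuristic character is easy to see: the behavior depends entirely on two scalar cut-offs and a fixed table, which is why the stability of the underlying probabilities matters.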
The experimental evaluation is extensive, covering both LLMs (Qwen3 series) and VLMs (Qwen3-VL series). The authors compare DOPD against a wide range of baselines, including standard OPD, self-distillation, and adaptive distillation methods. Results show consistent improvements across 8 benchmarks for LLMs and 8 for VLMs. The paper also includes ablation studies on token types, divergence objectives, and privileged information modalities. Scalability is tested across different teacher-student size ratios, demonstrating robustness. The results are statistically significant and convincing. The inclusion of continual learning and OOD generalization adds depth. The use of "Qwen3" and "GPT-5.4" suggests this is a very recent or hypothetical future paper (given current dates), which might indicate a pre-print context where benchmarks are state-of-the-art. The performance gains are substantial (e.g., +7.5 points on LLM average).
The paper provides detailed implementation settings, including model sizes, optimizer parameters, batch sizes, and specific hyperparameters for the distillation intensities ($w=0.3, l=0.6$). The dataset sources are named (RaR-Science-20K, DAPO-Math-17K, etc.). However, the reliance on "GPT-5.4" for generating privileged hints and the specific "Qwen3" models (which may not be publicly released or named exactly this way in the public domain yet, depending on the exact current date) could pose reproducibility challenges if the underlying models or data generation pipelines are not open. No code is linked in the text provided; the project URL is listed as "none".
The paper acknowledges that the method relies on the quality of privileged information. If the privileged hints are noisy or misleading, the "privilege advantage gap" might be misinterpreted. The method also introduces additional computational overhead due to the forward passes of both privileged teacher and student policies for every token to calculate the advantage gap. The analysis of "privilege illusion" is insightful but relies on the assumption that the advantage gap is a perfect proxy for capability vs. information, which might not always hold in complex, multi-modal settings. The paper does not extensively discuss the failure modes of the routing mechanism itself (e.g., what happens if probabilities are unstable).
DOPD provides a more robust framework for distilling large models, which is crucial for deploying capable AI in resource-constrained environments. By mitigating privilege illusion, it ensures that students learn genuine capabilities rather than shortcuts, leading to better generalization and safety. This has implications for the entire field of model compression and post-training alignment.
Technology computer-aided design (TCAD) semiconductor device simulation is fundamentally constrained by the high computational cost of iteratively solving coupled drift-diffusion equations. Existing ML surrogates either reduce internal physics to macroscopic scalar regressions, or rely on single-step mappings that lack the iterative refinement required to resolve stiff, coupled fields. To address this, we introduce PCGD, a Physics-Guided Conditional Graph Diffusion framework operating natively on unstructured TCAD meshes to predict coupled electrostatic and carrier density fields. PCGD employs a Condition-Aware MeshGraphNet denoiser that explicitly injects boundary conditions and device structure context via global cross-attention. By augmenting data-driven denoising with a physics-guided hybrid objective that integrates exponent-free quasi-Fermi gradient matching with noise-aware PDE residuals, PCGD progressively enforces physical constraints along the iterative diffusion trajectory. This strategy successfully bypasses the numerical instabilities typical of stiff drift-diffusion equations. Evaluated on a challenging mixed PN/MOS benchmark, PCGD significantly outperforms deterministic one-step regression (1.207% error) and local diffusion (1.585% error) baselines by achieving a sub-percent mean relative field error of 0.835%, while concurrently reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. It also transfers robustly to unseen SOI topologies (0.815% error) via LoRA adaptation, using 5.30$\times$ less data and 14.34$\times$ fewer parameters than full fine-tuning. Ultimately, PCGD bridges the computational efficiency of generative surrogates with the rigorous physical fidelity of traditional TCAD, unlocking highly scalable, field-level analysis for robust device engineering.
Primary: Fudan University
All Institutions: Fudan University, Shanghai Integrated Circuit Manufacturing Innovation Center
PCGD has a profound broader impact, particularly in the semiconductor industry and scientific machine learning. By significantly accelerating TCAD simulations while maintaining high physical fidelity, it can drastically shorten the design cycle for advanced semiconductor devices, fostering innovation in microelectronics for applications ranging from AI hardware to power electronics. The methodology for stabilizing physics-informed diffusion models for highly stiff and nonlinear PDEs is generalizable beyond TCAD. It offers a blueprint for tackling similar challenges in other scientific and engineering domains, such as materials science, fluid dynamics, and biomechanics, where complex multiphysics simulations are computationally expensive and numerically challenging. The concept of a hybrid AI-TCAD simulation workflow, where ML provides robust initial guesses to accelerate or stabilize traditional solvers, represents a powerful paradigm for integrating AI into complex scientific computing pipelines, promising both efficiency and reliability. Furthermore, the successful application of parameter-efficient fine-tuning (LoRA) for domain adaptation in a scientific context highlights its potential for developing more adaptable and data-efficient ML models for specialized scientific tasks. PCGD introduces a physics-guided conditional graph diffusion framework that effectively addresses the computational bottlenecks and physical fidelity challenges in TCAD device simulation. This work makes substantial contributions through its mesh-native graph diffusion approach, condition-aware MeshGraphNet architecture with global cross-attention, and a novel hybrid physics-guided training objective that stabilizes learning for stiff, nonlinear drift-diffusion equations. 
The impressive empirical results, demonstrating sub-percent accuracy and near three-order-of-magnitude reduction in PDE residuals, coupled with robust transferability to unseen topologies, position PCGD as a significant advancement in ML-accelerated scientific computing, with clear implications for future semiconductor design and broader multiphysics simulation.
The methodology presented in PCGD is highly sophisticated and meticulously designed to tackle the inherent challenges of TCAD device simulation. The core idea of formulating coupled-field prediction as a conditional generative denoising task on native unstructured TCAD meshes is a significant departure from previous ML surrogates. This approach leverages the iterative refinement capabilities of diffusion models to decouple macroscopic field construction from microscopic detail resolution, which is crucial for handling the extreme stiffness and exponential nonlinearities of semiconductor physics. The Condition-Aware MeshGraphNet denoiser is a well-thought-out architectural innovation. By explicitly injecting boundary conditions and device structure context via global cross-attention, it effectively overcomes the limitations of purely local message passing, ensuring that critical global information (like terminal biases and device layout) is directly accessible to every mesh node. This design choice is critical for accelerating convergence to correct physical operating conditions. The most impactful methodological contribution is the hybrid physics-guided objective. It ingeniously combines an exponent-free quasi-Fermi gradient matching loss ($L_G$) for stable guidance during early, noisy diffusion stages with noise-aware, adaptively scaled exact PDE residuals ($L_P$) for fine-grained physical correction in later stages. The dual-tier stabilization strategy for $L_P$ (validity-aware SNR gating and adaptive batch balance) is a robust solution to prevent gradient divergence caused by the extreme nonlinearity of drift-diffusion equations. The detailed mathematical derivation connecting Scharfetter-Gummel flux to quasi-Fermi gradients and the implementation of GPU-accelerated graph operators for discrete PDE residuals further demonstrate the technical depth and rigor of the proposed framework.
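To see why exponentials in the drift-diffusion residual need careful handling, consider the textbook Scharfetter-Gummel edge flux for electrons, sketched below. This is the standard discretization and a common way of evaluating the Bernoulli function stably, not the paper's GPU implementation; the function names, the unit prefactor, and the example values are assumptions.

```python
import math


def bernoulli(x):
    """Bernoulli function B(x) = x / (exp(x) - 1), evaluated stably."""
    if abs(x) < 1e-8:
        return 1.0 - x / 2.0                     # series limit near zero
    if x > 0:
        # Rewrite with exp(-x) so large positive x underflows to 0
        # instead of overflowing.
        return x * math.exp(-x) / (-math.expm1(-x))
    return x / math.expm1(x)                     # expm1 is safe for x < 0


def sg_electron_flux(n_i, n_j, psi_i, psi_j, v_t, prefactor=1.0):
    """Textbook Scharfetter-Gummel electron flux on the edge i -> j.

    J = prefactor * (n_j * B(d) - n_i * B(-d)),  d = (psi_j - psi_i) / v_t,
    where prefactor stands in for q * D_n / h (set to 1 here).
    """
    d = (psi_j - psi_i) / v_t
    return prefactor * (n_j * bernoulli(d) - n_i * bernoulli(-d))


# Usage: at equilibrium n is proportional to exp(psi / v_t), so the
# flux should vanish.
v_t = 0.02585                      # thermal voltage at ~300 K, in volts
n_i, psi_i, psi_j = 1e16, 0.0, 0.3
n_j = n_i * math.exp((psi_j - psi_i) / v_t)
flux = sg_electron_flux(n_i, n_j, psi_i, psi_j, v_t)
scale = n_j * bernoulli((psi_j - psi_i) / v_t)
```

A potential difference of only 0.3 V already spans a density ratio of about five orders of magnitude, and the flux is the small difference of two large terms. That cancellation is what makes raw PDE residuals a hazardous training signal on noisy intermediate samples, and it motivates the exponent-free gradient-matching term used early in the diffusion trajectory.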
The experimental evaluation is comprehensive, rigorous, and strongly supports the claims made in the paper. The use of a challenging mixed PN/MOS benchmark dataset, comprising a large number of training and validation snapshots from distinct devices and operating modes, ensures the robustness of the evaluation. The device-trajectory level splitting for train/validation is critical for assessing generalization. The ablation studies are particularly effective, systematically isolating the contributions of iterative generative refinement, global conditioning, and physics-guided supervision by comparing PCGD against well-chosen baselines (deterministic one-step regression, local diffusion, and condition-aware diffusion with varying levels of physics guidance). The chosen metrics, mean relative $L_2$ error and log-compressed maximum PDE residual error, are appropriate and provide a holistic view of both accuracy and physical consistency. The results are highly impressive: PCGD achieves a sub-percent mean relative field error (0.835%) and significantly outperforms all baselines, reducing maximum PDE residual errors by nearly three orders of magnitude compared to pure diffusion. The analysis of convergence dynamics clearly illustrates the stability gained from the hybrid physics-guided objective. Furthermore, the demonstration of robust transferability to unseen SOI topologies via parameter-efficient LoRA adaptation, requiring significantly less data and fewer parameters than full fine-tuning, is a strong indicator that PCGD learns generalizable physical priors, which is crucial for practical adoption in industrial settings.
The paper provides a commendable level of detail for reproducibility. The methodology is thoroughly described with mathematical formulations for the graph representation, diffusion process, Condition-Aware MeshGraphNet, and the hybrid physics-guided objective. The appendix offers specific architectural settings, coefficients for the loss functions, and details on the baseline architectures. Crucially, numerical safety measures for handling stiff exponentials and precision truncation are explicitly mentioned. The schema for mesh features and interface edges is also detailed. While the absence of publicly available code (no project URL provided) is a common limitation for arXiv preprints, the textual descriptions are sufficiently comprehensive for an experienced researcher with expertise in GNNs, diffusion models, and scientific computing to implement the core components of PCGD. Access to the specific TCAD data generation pipeline would be the final piece for exact replication of the dataset.
Despite its strengths, PCGD exhibits some limitations. The paper acknowledges "severe zero-shot degradation" when transferring to major out-of-distribution topological shifts (e.g., SOI MOSFETs not seen during pretraining). While LoRA adaptation effectively addresses this, it implies that the model, despite learning physical priors, still relies on structural interpolation from its pretraining data, limiting its immediate "universal foundational model" capabilities without a much broader pretraining corpus. The current experimental validation is primarily on 2D devices. While the paper discusses the theoretical O(N) scaling advantage for 3D simulations, empirical validation on massive 3D meshes, including memory footprint and training/inference times, is yet to be demonstrated. The computational cost of training such a complex diffusion model on large graph datasets is likely substantial, though not explicitly quantified. Finally, the proposed hybrid AI-TCAD workflow mentions routing high-residual cases back to traditional solvers, but the practical implementation details, criteria for routing, and the efficiency gains from this hybrid approach are not fully explored.
PCGD has a profound broader impact, particularly in the semiconductor industry and scientific machine learning. By significantly accelerating TCAD simulations while maintaining high physical fidelity, it can drastically shorten the design cycle for advanced semiconductor devices, fostering innovation in microelectronics for applications ranging from AI hardware to power electronics. The methodology for stabilizing physics-informed diffusion models for highly stiff and nonlinear PDEs is generalizable beyond TCAD. It offers a blueprint for tackling similar challenges in other scientific and engineering domains, such as materials science, fluid dynamics, and biomechanics, where complex multiphysics simulations are computationally expensive and numerically challenging. The concept of a hybrid AI-TCAD simulation workflow, where ML provides robust initial guesses to accelerate or stabilize traditional solvers, represents a powerful paradigm for integrating AI into complex scientific computing pipelines, promising both efficiency and reliability. Furthermore, the successful application of parameter-efficient fine-tuning (LoRA) for domain adaptation in a scientific context highlights its potential for developing more adaptable and data-efficient ML models for specialized scientific tasks. PCGD introduces a physics-guided conditional graph diffusion framework that effectively addresses the computational bottlenecks and physical fidelity challenges in TCAD device simulation. This work makes substantial contributions through its mesh-native graph diffusion approach, condition-aware MeshGraphNet architecture with global cross-attention, and a novel hybrid physics-guided training objective that stabilizes learning for stiff, nonlinear drift-diffusion equations. 
The impressive empirical results, demonstrating sub-percent accuracy and near three-order-of-magnitude reduction in PDE residuals, coupled with robust transferability to unseen topologies, position PCGD as a significant advancement in ML-accelerated scientific computing, with clear implications for future semiconductor design and broader multiphysics simulation.
LLM agents carry conclusions across steps and sessions in compressed memory, and memory products (e.g., mem0, LangMem) rewrite conversation into stored "facts" that later steps trust. We show this rewriting manufactures confidence: across our constructed agent settings, a casual, hedged remark becomes a confident, dated assertion the agent then obeys like a verified fact, granting every above-clearance request it faces. No attacker is needed: a role that was true once and never corrected is stored as a flat fact and acted on like a deliberate injection. We then isolate what the agent responds to. It is not the source: attributed, unattributed, and even forged "system of record" claims all grant alike. It is the confidence of the phrasing. A hedge is discounted, a flat assertion is obeyed, and this holds with no special keyword. Not all hedges are equal, though: the evidential register is the least-discounted, with "reportedly" obeyed like a flat assertion on most models. The obvious fixes fail. A passive "unverified" tag is ignored, and an active "do not trust this" instruction escalates even correct memory, so it is safe only by refusing to decide. The real fix lives in the store: keep the tentative phrasing rather than upgrade it. But that is hygiene, not a defense against an attacker who can simply write a confident lie. The deployable lesson is narrower and constructive: a single load-bearing memory is the hazard, and one redundant source restores correct decisions. We release the harness and demonstrations.
Primary: unknown
All Institutions: unknown
This paper identifies and rigorously diagnoses "manufactured confidence," a critical and pervasive failure mode in LLM agent memory consolidation where hedged remarks are rewritten into confident facts, leading to agents making confidently wrong decisions across diverse models and memory products. The comprehensive analysis reveals that agents obey the confidence of phrasing rather than its source, are blind to hearsay markers, and that common mitigations like passive tags or active distrust instructions are ineffective or lead to abdication, underscoring the necessity of preserving epistemic status in memory stores and employing redundant information sources for consequential decisions.
The methodology is exceptionally strong and well-designed for diagnosing the phenomenon of "manufactured confidence." The authors construct multi-step agent settings (access control, budget approval, running total) where memory is load-bearing, allowing for clear ground truth and legible impact. A crucial aspect is the use of real, shipped memory products (mem0, LangMem) alongside a verbatim control, which grounds the findings in practical agent deployments. The systematic probing involves varying how memory is presented (confident, passive tag, active instruction), dissecting the cues agents respond to (modality, hearsay, explicit non-verification), and testing the impact of source attribution (bare, attributed, forged authority). The inclusion of a "natural case" (staleness without injection) alongside adversarial injection strengthens the generalizability of the problem. The use of five diverse, state-of-the-art LLMs from four providers (Anthropic, Meta, OpenAI, Qwen) ensures the findings are not model-specific. The methodology also includes a symmetry test (over-denial) to rule out simple grant bias and a detailed analysis of the "laundering" process within memory products. The approach is comprehensive, rigorous, and effectively isolates the mechanisms behind manufactured confidence.
The experimental evaluation is thorough and provides compelling evidence for the paper's claims. Key findings include:
1. **Manufactured Confidence**: Memory consolidation rewrites hedged remarks into confident assertions, leading to high confident-wrong rates (0.50-1.00) across all models in consequential decisions.
2. **Source Invariance**: Agents obey the confidence of phrasing, not its source. Attributed, unattributed, and even forged "system of record" claims grant alike, demonstrating a critical blindness to provenance.
3. **Failure of Obvious Fixes**: Passive "unverified" tags are largely ignored, especially by non-Anthropic models. Active "do not trust this" instructions lead to abdication (escalating everything), not discrimination, costing all utility.
4. **Redundancy as a Fix**: A second, authoritative source allows agents to discriminate, turning distrust into selective caution rather than blanket abdication.
5. **Hearsay Blind Spot**: Evidential registers, particularly "reportedly," are the least-discounted hedges, often obeyed like flat assertions on most models. This is a critical, pervasive vulnerability.
6. **Symmetry**: The effect is symmetric, causing both over-granting and over-denial based on manufactured confidence, ruling out a simple grant bias.
7. **Consolidation, Not Vendor**: The laundering of hedges into confident facts is a property of LLM consolidation itself, not specific memory products or extraction LLMs.
The experiments are quantitatively presented with clear rates, using temperature 0 for deterministic behavior per scenario. The results are consistent across models, highlighting a systemic issue. The distinction between "belief" and "low threshold" based on rationale analysis adds a qualitative layer to the findings.
The paper demonstrates a high commitment to reproducibility. The authors explicitly state, "We release the harness, data, and demonstrations at https://github.com/collapseindex/manufactured-confidence." They provide detailed information on the models used (exact API identifiers, providers, access dates), temperature settings, agent system prompts, memory poisoning setup, and memory backend configurations. Specific scripts (e.g., `cues.py`, `forged.py`) are mentioned, indicating a well-structured codebase. This level of detail and code release makes the experiments highly reproducible.
The authors are commendably transparent about the limitations:
1. **Constructed Scenarios**: The tasks are decision-shaped but not live deployments, and even "natural staleness" sessions are constructed, meaning the base rate of this failure mode in the wild is not measured.
2. **Scope**: The study focuses on two memory products, four extractors, and five phrasings, with deep probes primarily in access control. While robust, it's not exhaustive. The Zep probe is limited.
3. **Belief vs. Threshold**: The distinction relies on verbalized rationales, which are not ground-truth processing.
4. **Non-Adaptive Threat Model**: The proposed store-side defense is not robust against an adaptive attacker who can directly supply confident, forged authority.
5. **Sample Sizes**: While effects are large and consistent, $n$ values (e.g., 15 for decisions, 10 for poisonings) are relatively small for statistical generalization, though the deterministic nature at temperature 0 mitigates this for the constructed scenarios.
6. **Fix is a Prompt**: The hedge-preserving extraction is demonstrated via a prompt, not a fully engineered production store.
This paper has significant broader impact for the development and deployment of LLM agents. It identifies a fundamental, pervasive, and under-defended failure mode ("manufactured confidence") in how LLM agents process and store information in consolidated memory. This phenomenon can lead to agents being confidently wrong, granting unauthorized access, or making incorrect financial decisions, even without an explicit attacker. The findings challenge current memory consolidation practices and highlight the critical need for memory architectures that preserve epistemic status. The "hearsay blind spot" is a particularly alarming discovery, as it means agents may treat unverified reports as established facts. The paper provides crucial, actionable lessons for practitioners: avoid single load-bearing memories for consequential decisions, and implement redundant verification sources. While the proposed store-side fix is not a complete defense against adaptive attackers, it significantly raises the bar and improves hygiene. This work is essential for enhancing the safety, reliability, and trustworthiness of LLM agents in real-world applications. This paper identifies and rigorously diagnoses "manufactured confidence," a critical and pervasive failure mode in LLM agent memory consolidation where hedged remarks are rewritten into confident facts, leading to agents making confidently wrong decisions across diverse models and memory products. The comprehensive analysis reveals that agents obey the confidence of phrasing rather than its source, are blind to hearsay markers, and that common mitigations like passive tags or active distrust instructions are ineffective or lead to abdication, underscoring the necessity of preserving epistemic status in memory stores and employing redundant information sources for consequential decisions.
Diffusion Language Models (DLMs) are typically trained under fixed context structures, restricting denoising to predetermined token subsets. This creates a mismatch between training and inference, where models must operate over arbitrary configurations, leading to degradation off the training grid. We propose Adaptive Block Diffusion (ABD), which resolves this mismatch by optimizing denoising risk over a distribution of prefix-window configurations. By treating the configuration as a stochastic variable, ABD trains a single model over the full configuration space without architectural changes. We show that generalization across decoding strategies is governed by the support of the training distribution, and that ABD guarantees denoising optimality for any inference policy whose configurations are covered during training. Empirically, ABD exhibits structural invariance across decoding scales, avoiding off-grid collapse and recovering a monotonic relationship between block size and perplexity, while matching or outperforming fixed-block specialists at their target scales.
Primary: Microsoft AI
All Institutions: Microsoft AI
Adaptive Block Diffusion resolves the training-inference mismatch in Diffusion Language Models by optimizing denoising risk over a stochastic distribution of prefix-window configurations, leading to a single model that exhibits structural invariance and robust generalization across diverse decoding scales. This paper presents a theoretically grounded and empirically validated framework that significantly enhances the robustness and flexibility of Diffusion Language Models, addressing a critical limitation of prior fixed-structure approaches and paving the way for more adaptable and generalizable text generation.
The methodology is robust and elegantly addresses a core problem in Diffusion Language Models (DLMs): the training-inference mismatch caused by fixed context structures. Adaptive Block Diffusion (ABD) proposes a novel training objective that treats the denoising configuration (prefix length $k$ and window length $\ell$) as a stochastic variable, optimizing denoising risk over a distribution $\pi$ of these configurations. This approach is commendable for not requiring architectural changes, instead focusing on a principled modification to the training process. The theoretical analysis is a significant strength, formally defining conditional denoising risk and proving statistical consistency over the support of $\pi$. The "Training-Inference Alignment" theorem, leveraging the Radon-Nikodym theorem, rigorously demonstrates that if an inference policy's configuration distribution is covered by the training distribution's support, then denoising optimality is guaranteed. This provides a strong theoretical foundation for the empirical claims of structural invariance. The practical implementation details, particularly the attention mask construction and the `ABDBoundaryManager` for sampling block lengths, are clearly described in the appendix, showcasing a well-thought-out and implementable solution.
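To make the idea of a stochastic prefix-window configuration concrete, here is a simplified sketch of sampling $(k, \ell)$ and building the corresponding attention mask. It is not the paper's `abd_attention_mask` or `ABDBoundaryManager`: the uniform sampling distribution, the causal treatment of the prefix, and the function names `sample_config` and `prefix_window_mask` are all assumptions made for illustration.

```python
import random


def sample_config(seq_len, rng, max_window=None):
    """Draw a (prefix length k, window length ell) pair.

    The uniform choice below is a placeholder for the training distribution pi.
    """
    max_window = max_window or seq_len
    ell = rng.randint(1, min(max_window, seq_len))  # window of at least one token
    k = rng.randint(0, seq_len - ell)               # prefix must leave room for the window
    return k, ell


def prefix_window_mask(seq_len, k, ell):
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout: positions [0, k) form a clean prefix with causal attention;
    positions [k, k + ell) form the denoising window, which sees the whole
    prefix and attends bidirectionally within itself. Positions past the
    window stay fully masked and would be excluded from the loss.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(seq_len):
            if i < k:
                mask[i][j] = j <= i
            elif i < k + ell:
                mask[i][j] = j < k + ell
    return mask
```

Drawing a fresh `(k, ell)` per training example, rather than fixing one block size, is what lets a single set of weights see the whole configuration space; the support of the sampling distribution then determines which inference-time configurations are covered.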
The experimental evaluation is comprehensive, well-designed, and provides strong empirical evidence supporting the theoretical claims. The authors use standard language modeling benchmarks (LM1B, OpenWebText) and ensure fair comparisons by using an identical transformer architecture to existing baselines (MDLM, BD3LM). The most compelling result is the demonstration of "structural invariance": ABD successfully recovers the monotonic relationship between block size and perplexity, a fundamental property for generative models, which fixed-block specialists fail to maintain off their training grid. This directly validates the core hypothesis that training over a broad configuration distribution leads to better generalization. Furthermore, ABD matches or outperforms fixed-block specialists at their target scales, indicating that multi-scale training acts as a regularizer rather than a compromise. The zero-shot generalization experiments on diverse datasets, including scientific text, show improved robustness and suggest that ABD learns a more configuration-invariant language representation. The ablations on configuration distribution types (categorical exponential, uniform, lognormal) and training budget allocation are particularly insightful, offering practical guidance on how to tune ABD for specific inference regimes and demonstrating the trade-offs involved.
The paper excels in reproducibility. The methodology is clearly articulated, and the appendix provides detailed pseudocode for the critical components, including the `abd_attention_mask` and `ABDBoundaryManager`. The authors explicitly state that they leverage the same codebase, datasets, architecture, likelihood evaluation, and inference setup as a previously published work (Arriola et al., 2025, Block Diffusion), which significantly lowers the barrier to reproduction. Specific details regarding training budget allocation and configuration sampling strategies are also provided. This level of detail and reliance on a shared foundation is exemplary.
The authors openly acknowledge several limitations. A key one is the dependence on the choice of the configuration distribution $\pi$. While $\pi$ offers a principled way to balance performance across decoding regimes, a suboptimal choice can bias the model towards frequently sampled configurations, potentially leading to uneven performance across scales. This implies that careful tuning of $\pi$ is necessary for specific application scenarios. Additionally, ABD does not directly address inference efficiency; while it enables flexible decoding, the selection of optimal inference-time policies remains an open problem. Finally, the theoretical analysis provides optimality guarantees under support coverage but does not offer finite-sample guarantees, meaning practical performance might still be influenced by the quality and density of training coverage in finite data regimes.
Adaptive Block Diffusion has significant broader impact for the field of Diffusion Language Models and potentially for structured generative models more generally. By providing a principled and effective solution to the training-inference mismatch, ABD makes DLMs more robust, generalizable, and practical for real-world deployment. The ability of a single model to perform well across various decoding scales and strategies simplifies model management and enables more flexible inference scenarios, such as adaptive generation speeds or streaming decoding. This could accelerate the adoption and development of DLMs in diverse applications. The theoretical framework also offers a valuable new perspective on understanding generalization in structured generative models, potentially inspiring similar training paradigms in other domains beyond text. The demonstrated improved zero-shot generalization to domain-shifted text further suggests that ABD could lead to more versatile and less domain-specific language models. Adaptive Block Diffusion resolves the training-inference mismatch in Diffusion Language Models by optimizing denoising risk over a stochastic distribution of prefix-window configurations, leading to a single model that exhibits structural invariance and robust generalization across diverse decoding scales. This paper presents a theoretically grounded and empirically validated framework that significantly enhances the robustness and flexibility of Diffusion Language Models, addressing a critical limitation of prior fixed-structure approaches and paving the way for more adaptable and generalizable text generation.
Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in prior benchmarks, spanning interaction-design challenges such as streaming interaction and dynamic environments, as well as agent-pattern challenges such as cross-source reasoning, implicit-state inference, and visual-spatial precision. Tasks are grounded in authentic input artifacts and cross-referenced against realistic stateful user profile data, and include separate safety reports auditing safety-sensitive execution. Under our primary binary-completion metric at 500 steps, Claude Opus 4.8 with maximum thinking and batched tool calls scores best but still completes only 20.6% of tasks at a 54.8% partial score; GPT-5.5 is far more token-efficient yet plateaus near 13%. These results show that current agents are still far from professional-level computer use: rather than stumbling on basic GUI control or coding, they lose track of constraints, miss information that arrives mid-task, guess rather than ask the user, and skip verification, struggling most when a task hinges on hidden state they must recover.
Primary: XLang
All Institutions: XLang
OSWorld 2.0 establishes a rigorous, long-horizon benchmark for computer-use agents, revealing that current frontier models struggle significantly with state tracking, verification, and dynamic environments, setting a new, more realistic standard for evaluating autonomous agent capabilities.
The paper introduces OSWorld 2.0, a benchmark designed to evaluate computer-use agents on long-horizon, real-world workflows. The methodology shifts focus from short, isolated GUI interactions to complex, multi-application tasks that mimic professional work (e.g., reimbursement, data analysis, creative editing). Key methodological innovations include: 1) The use of self-hosted, stateful web services (email, banking, chat) to simulate realistic environments without relying on volatile live websites. 2) A fine-grained, checkpoint-based evaluation system (averaging ~27 checkpoints per task) rather than binary pass/fail, allowing for partial credit and more nuanced analysis. 3) The annotation of tasks with specific "challenge phenomena" (e.g., cross-source reasoning, dynamic environments, implicit-state inference) to diagnose specific agent failures. 4) The inclusion of a simulated user channel and dynamic environment updates to test agent robustness to information arrival and state changes. This approach is rigorous and addresses a critical gap in current benchmarks which often overstate agent capabilities by using short, static tasks.
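The relationship between the checkpoint-based partial score and binary completion can be sketched as follows. This assumes unweighted checkpoints, a partial score equal to the fraction of checkpoints passed, and completion meaning every checkpoint passed; the benchmark's actual weighting and completion rule are not specified in this review, and the function names are hypothetical.

```python
def score_task(checkpoints):
    """Return (partial score, completed) for one task's list of boolean checkpoints."""
    passed = sum(1 for ok in checkpoints if ok)
    partial = passed / len(checkpoints)        # fraction of checkpoints passed (assumed unweighted)
    completed = passed == len(checkpoints)     # binary completion: all checkpoints pass (assumed)
    return partial, completed


def aggregate(tasks):
    """Return (mean partial score, binary completion rate) over all tasks."""
    scores = [score_task(c) for c in tasks]
    mean_partial = sum(p for p, _ in scores) / len(scores)
    completion_rate = sum(1 for _, done in scores if done) / len(scores)
    return mean_partial, completion_rate
```

Under these assumptions, a large gap between the two numbers, such as a 54.8% partial score against 20.6% completion, indicates agents that clear many intermediate checkpoints but rarely finish a whole workflow.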
The authors evaluate seven frontier models (Claude Opus 4.7/4.8, GPT-5.5, Sonnet 4.6, Qwen 3.7-Plus, MiniMax M3, Kimi 2.6) under various constraints (step budgets, thinking levels, batching). The results are stark: even the best configuration (Claude Opus 4.8 with max thinking and batching) achieves only 20.6% binary completion and 54.8% partial score. The paper provides a detailed analysis of failure modes, highlighting that agents struggle with hidden state recovery, constraint tracking, and verification, rather than basic GUI control. The analysis of token efficiency vs. performance is particularly insightful, showing that GPT-5.5 is more efficient but plateaus earlier, while Opus models spend significantly more tokens for marginal gains. The breakdown of performance by challenge phenomena provides actionable insights for future research. The experiments are comprehensive, covering multiple models, configurations, and detailed error analysis.
The paper claims to release the environment, tasks, self-hosted websites, and agent rollout trajectories. The use of self-hosted services ensures that the environment is stable and reproducible, unlike benchmarks relying on live web services. The detailed description of the task construction pipeline, including the quality assurance steps (unit tests, human re-solving, adversarial audits), enhances reproducibility. The specific model configurations and hyperparameters are clearly stated. The primary limitation is the computational cost of running these long-horizon tasks, but the provided infrastructure should allow other researchers to reproduce the evaluations.
The benchmark is limited to 108 tasks, which, while diverse, may not cover all possible real-world scenarios. The self-hosted web services, while realistic, are simulations and may not capture all edge cases or security quirks of live production systems. The "simulated user" is a simplified model of human interaction and may not fully capture the nuance of human communication. The evaluation relies on model-based judges for some open-ended tasks, which may introduce bias or inaccuracies, although the authors attempt to mitigate this with objective checklists and validation. The focus on long-horizon tasks means that short, simple tasks are underrepresented, which might skew the perceived difficulty for simpler use cases.
This paper has significant implications for the development of autonomous agents. By demonstrating that current frontier models are far from solving realistic, long-horizon computer use tasks, it sets a realistic baseline for the field. It highlights the need for improvements in state management, reasoning over long horizons, and self-correction. The safety analysis, revealing that agents can cause harmful side effects (e.g., leaking API keys, exhausting disk space), underscores the risks of deploying such agents in real-world settings. The benchmark provides a valuable tool for researchers to track progress and identify specific weaknesses in agent architectures. It encourages a shift towards more robust, reliable, and safe agent systems. OSWorld 2.0 establishes a rigorous, long-horizon benchmark for computer-use agents, revealing that current frontier models struggle significantly with state tracking, verification, and dynamic environments, setting a new, more realistic standard for evaluating autonomous agent capabilities.
Retrieval-augmented generation (RAG) typically treats context selection as ranking chunks against a single query embedding. This assumption breaks down for complex queries, such as multi-hop or ambiguous questions, where top-k selection tends to over-cover one semantic aspect while ignoring critical sub-questions. We propose GeoRAG, which recasts context selection as Information Demand Coverage Optimization. GeoRAG builds a multi-dimensional demand distribution through diverse sub-query generation and reverse-validation weighting, then selects context by minimizing the Sinkhorn-Wasserstein distance between this demand distribution and the coverage of the selected set. The resulting demand-weighted facility-location objective is monotone submodular, giving a $1-1/e$ greedy guarantee, which we approximate with a Sinkhorn-based marginal-gain surrogate. The method is unsupervised, training-free, and retrieval-agnostic. We further show that single-point, query-proximity scorers cannot cover multi-modal demands, exposing a structural limit of ranking-based selection. On six open-domain QA benchmarks, GeoRAG improves exact match (EM) by +6.5 to +7.5 points over top-k truncation (up to +9.7 on HotpotQA and ASQA) and outperforms strong baselines including MMR, DPP, BGE-Reranker, SMART-RAG, and AdaGReS, with stable gains across context budgets and sub-query generators.
Primary: Singapore Management University
All Institutions: University of Shanghai for Science and Technology, Singapore Management University
GeoRAG presents a theoretically grounded, empirically superior method for RAG context selection by modeling information demand as a multi-dimensional distribution and optimizing for coverage via submodular optimization, significantly outperforming existing ranking and diversity-based approaches on complex QA benchmarks.
The paper proposes GeoRAG, a novel context selection framework for Retrieval-Augmented Generation (RAG) that moves beyond single-point query embeddings. The core innovation is reformulating context selection as an Information Demand Coverage Optimization problem. It constructs a multi-dimensional "Information Demand Proxy" distribution using diverse sub-query generation and reverse-validation weighting. The selection process minimizes the Sinkhorn-Wasserstein distance between this demand distribution and the coverage of the selected set. The authors prove that the resulting facility-location objective is monotone submodular, providing a theoretical $(1-1/e)$ greedy guarantee. They further demonstrate a structural limitation of existing ranking-based methods (query-proximity-monotone selectors) in handling bimodal information needs, providing a rigorous theoretical foundation for their approach. The method is unsupervised and training-free, making it broadly applicable.
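The greedy procedure behind the $(1-1/e)$ guarantee can be illustrated with a small sketch of demand-weighted facility location. This optimizes the exact objective $f(S)=\sum_j w_j \max_{i\in S} \mathrm{sim}(i,j)$ rather than the Sinkhorn-based surrogate the paper deploys; the similarity matrix, the weights, and the names `coverage` and `greedy_select` are hypothetical, and similarities are assumed non-negative so that the objective is monotone.

```python
def coverage(selected, sim, weights):
    """Demand-weighted facility-location value of a set of chunk indices.

    sim[i][j] is the (non-negative) similarity of chunk i to sub-query j;
    weights[j] is the demand weight of sub-query j.
    """
    if not selected:
        return 0.0
    return sum(w * max(sim[i][j] for i in selected) for j, w in enumerate(weights))


def greedy_select(sim, weights, budget):
    """Pick up to `budget` chunks, each time adding the largest marginal gain."""
    selected = []
    remaining = set(range(len(sim)))
    for _ in range(min(budget, len(sim))):
        base = coverage(selected, sim, weights)
        best = max(
            sorted(remaining),  # sorted for a deterministic tie-break
            key=lambda i: coverage(selected + [i], sim, weights) - base,
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Three chunks, two equally weighted sub-queries. Chunks 0 and 1 both serve
# sub-query 0; only chunk 2 serves sub-query 1.
sim = [[0.9, 0.1], [0.8, 0.1], [0.1, 0.7]]
weights = [0.5, 0.5]
picked = greedy_select(sim, weights, budget=2)
```

In this toy case a ranker that scores chunks only against the first sub-query would take chunks 0 and 1, while the coverage objective gives chunk 1 zero marginal gain once chunk 0 is selected and picks chunk 2 instead. That is the over-coverage failure of top-k selection the review describes.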
The experimental evaluation is comprehensive and robust. The authors test GeoRAG across six open-domain QA benchmarks (NQ, TriviaQA, HotpotQA, 2WikiMHQA, ASQA, FEVER) and six different retrieval backends (Dense, BM25, Hybrid RRF, HyDE, MultiQuery, GraphRAG). GeoRAG consistently outperforms strong baselines, including MMR, DPP, BGE-Reranker, SMART-RAG, and AdaGReS, with significant gains on multi-hop datasets (up to +9.7 EM on HotpotQA). The paper includes extensive ablation studies isolating the contributions of the demand distribution (Axis A) and the set-aware coverage selection (Axis B). Crucially, they perform a "Full-Wikipedia" experiment without gold-injection to prove the method's effectiveness in realistic, harder retrieval settings. They also provide direct measurements of demand-dimension coverage, empirically validating that GeoRAG successfully covers multiple semantic peaks where baselines fail.
The paper provides detailed algorithmic descriptions, including the specific steps for sub-query generation, reverse-validation, and the Sinkhorn-based marginal gain calculation. Hyperparameters are clearly listed. The use of standard benchmarks and open-source models (Qwen3-Embedding-8B, Qwen3-4B) enhances reproducibility. The code is not explicitly linked in the text provided, but the methodological details are sufficient for implementation.
The method relies on LLM-generated sub-queries, which introduces a dependency on the quality and diversity of the generator. While the paper shows robustness across different generators, poor sub-query generation could degrade performance. The reverse-validation step adds computational overhead, though the latency analysis suggests it is manageable. The theoretical guarantee applies to the exact facility-location objective, while the deployed method uses a Sinkhorn surrogate; the paper acknowledges this but shows the surrogate performs well. The method is primarily evaluated on open-domain QA; its performance on more complex reasoning tasks or non-QA RAG applications is less clear.
GeoRAG addresses a fundamental limitation in current RAG systems: the inability to handle complex, multi-faceted queries effectively. By providing a retrieval-agnostic, training-free solution that significantly improves answer quality, it has the potential to become a standard component in RAG pipelines. The theoretical insights into the limitations of single-point embeddings also contribute to a deeper understanding of information retrieval in the LLM era.
LLM agents are expected to act over multiple turns, using search, browsing interfaces, and terminal tools to complete user goals. Yet not every goal is well specified or achievable in the available environment. In such cases, a reliable agent should recognize that further interaction is unlikely to help and abstain from additional tool calls. We define Agentic Abstention, the problem of deciding when an agent should stop acting under uncertainty. Unlike standard LLM abstention, which is usually evaluated as a single-turn answer-or-abstain decision, agentic abstention is a sequential decision problem: an agent can answer, abstain, or gather more information at each turn, and the need to abstain may only become clear after interacting with the environment. We study this problem across web shopping, terminal environments, and question answering, evaluating 13 LLM-as-agent systems and 2 agent scaffolds on more than 28,000 tasks. Our results show that the main challenge is not only whether agents can abstain, but also when they abstain. Some agents never abstain when they should, while others do so only after many unnecessary interactions. This gap is especially large on tasks where the instruction appears feasible until the environment reveals otherwise (e.g., no valid result matches the instruction). We further find that model scale, reasoning, and agent scaffolding affect abstention in different ways, where larger or more capable models sometimes perform worse at timely abstention. Finally, we introduce CONVOLVE, a context engineering method for improving agentic abstention that distills full interaction trajectories into reusable stopping rules. On WebShop, CONVOLVE substantially improves timely abstention without updating model parameters, raising Llama-3.3-70B's timely recall rate from 26.7 to 57.4. Our dataset and code are available at https://lhannnn.github.io/agentic-abstention
Primary: Allen Institute for AI
All Institutions: Allen Institute for AI, Southwest Jiaotong University, University of Leeds, University of Washington
Agentic Abstention: Do Agents Know When to Stop Instead of Act? This paper defines and evaluates the critical capability of agentic abstention, introducing a benchmark and a context-engineering method that significantly improves agents' ability to recognize infeasible tasks and stop acting, thereby enhancing reliability and efficiency in multi-turn agent interactions.
The paper introduces "Agentic Abstention," a novel formalization of the problem where agents must decide to stop acting rather than answer or continue exploring. It frames this as a sequential decision problem (POMDP) distinct from single-turn LLM abstention. The proposed method, CONVOLVE, is a context engineering technique that distills interaction trajectories into reusable "stopping rules" (a playbook) to improve abstention without parameter updates. The methodology is sound and addresses a critical gap in agent reliability: the cost of unnecessary tool use and delayed failure detection. The approach of using reflection agents to curate context is innovative in the context of abstention, though context engineering itself is not a new paradigm.
The evaluation is comprehensive and rigorous, covering more than 28,000 tasks across three distinct environments (WebShop, Terminal-Bench, and a curated QA benchmark). The authors evaluate 13 LLM-as-agent systems and 2 scaffolds, providing a broad empirical landscape. Key findings are robust: timely abstention is significantly harder than eventual abstention, and larger models do not necessarily abstain better. The introduction of metrics like AbsRec@K and SPL is well-motivated. The results clearly demonstrate the difficulty of the task and the effectiveness of the proposed method (e.g., raising Llama-3.3-70B's timely recall on WebShop from 26.7% to 57.4%). The dataset construction, particularly the "Environment-based Abstention" tasks, adds significant value to the benchmarking landscape.
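The gap between eventual and timely abstention can be illustrated with a small sketch. The definition below (the fraction of should-abstain tasks on which the agent abstained at or before turn K) is an assumption made for illustration; the paper's exact AbsRec@K definition is not given in this digest and may differ. The function name and the toy turn counts are hypothetical.

```python
from typing import List, Optional


def abstention_recall_at_k(abstain_turns: List[Optional[int]],
                           k: Optional[int] = None) -> float:
    """Recall of abstention over tasks that require abstention.

    abstain_turns[i] is the 1-indexed turn at which the agent abstained on
    task i, or None if it never abstained. With k=None this is "eventual"
    recall; with an integer k only abstentions at or before turn k count.
    (Assumed definition, for illustration only.)
    """
    if not abstain_turns:
        raise ValueError("need at least one should-abstain task")
    hits = sum(1 for t in abstain_turns
               if t is not None and (k is None or t <= k))
    return hits / len(abstain_turns)


# Toy data: five infeasible tasks; one agent run never abstains (None),
# one abstains only after 12 turns of unnecessary interaction.
turns = [1, 3, None, 12, 2]

eventual = abstention_recall_at_k(turns)        # abstained at any point
timely = abstention_recall_at_k(turns, k=3)     # abstained within 3 turns

print(f"eventual recall: {eventual:.2f}, timely recall@3: {timely:.2f}")
```

Under this assumed definition an agent can score well on eventual recall while still wasting many tool calls, which is why the two quantities are worth reporting separately.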
The paper provides detailed descriptions of the dataset construction, including how solvable tasks were modified to create infeasible ones. It specifies the models, scaffolds, and evaluation metrics used. The code and dataset are made available via the project URL. The experimental setup is clear enough for replication, although the exact prompts for the reflection and curation agents in CONVOLVE are likely in the appendix or code, which is standard practice.
The study is limited to specific environments (web shopping, terminal, QA). Real-world agents operate in more complex, multi-modal, and long-horizon settings. The dataset construction relies on LLM-generated rewrites and manual filtering, which may not capture all edge cases of infeasibility. The CONVOLVE playbook is distilled from a small training subset (20 trajectories), which limits the generalizability of the learned stopping rules. Furthermore, the playbook is static once built, so it may not adapt well to distribution shifts in real-time deployment.
This work has significant implications for the reliability and efficiency of LLM agents. By enabling agents to recognize infeasibility and stop acting, it reduces computational costs, prevents hallucinated outputs, and improves user experience. It encourages a shift in evaluation from purely success-based metrics to include reliability and efficiency metrics. However, over-abstention remains a risk, and the paper acknowledges the need for balanced evaluation. The work supports the development of more trustworthy AI systems.
Would experience designing faster GPU kernels also help close in on a long-standing open mathematical conjecture? Large Language Models (LLMs) integrated into evolutionary search have recently produced state-of-the-art solutions on optimization tasks, including open mathematical conjectures, GPU kernel design, scientific law discovery, and combinatorial puzzles. To achieve this, prior work applied search scaffolds to one target task at a time, so every new problem is approached from scratch and the experience accumulated during search is discarded once the model finishes its attempt. This leaves the capability of iteratively evolving a solution (e.g., knowing which part to mutate and how, deciding when to backtrack) entirely in the scaffold rather than in the model itself. Whether the model itself could acquire this capability and reuse it across different tasks has been largely unexamined. To address this, we introduce Evolution Fine-Tuning (EFT), a mid-training paradigm that teaches LLMs to evolve solutions across tasks by converting evolutionary search trajectories into supervision. We construct Finch Collection, a 156K-trajectory dataset spanning 10 domains and 371 optimization tasks, and fine-tune open-source LLMs from 2B to 9B parameters. Empirically, EFT confers cross-task generalization: across 22 held-out tasks, our models surpass their base counterparts by 10.22% on average. Furthermore, when paired with test-time RL, our model matches state-of-the-art performance on two circle-packing tasks and outperforms its base-model counterpart on the Erdős minimum-overlap problem. EFT thus serves as a "practice phase" for general-purpose discovery agents that do not solve new problems from scratch.
Primary: KAIST (Korea Advanced Institute of Science and Technology)
All Institutions: KAIST, Amazon, University of Edinburgh, University of California, Berkeley, University of Washington, University of Toronto
Evolution Fine-Tuning represents a pivotal advancement in aligning LLMs with algorithmic reasoning by internalizing evolutionary search strategies, offering a scalable and generalizable framework for automated discovery that significantly outperforms base models and matches specialized methods across a wide spectrum of optimization tasks.
The paper introduces Evolution Fine-Tuning (EFT), a paradigm that shifts the burden of evolutionary search heuristics from the scaffold (prompting/algorithm) to the model itself via mid-training. By converting evolutionary search trajectories into supervised fine-tuning data, the authors aim to teach LLMs to "learn to evolve." This is a significant methodological shift from previous works like OpenEvolve or SkyDiscover, which rely heavily on in-context learning or iterative prompting without updating model weights to retain search experience. The approach of creating a large-scale dataset (Finch Collection) of 156K trajectories across 371 tasks is a substantial engineering and methodological contribution, enabling the study of cross-task generalization in evolutionary search agents. The core innovation lies in the data construction and the specific fine-tuning objective that encourages the model to internalize search strategies (mutation, crossover, selection) rather than just generating a final solution.
The empirical evaluation is comprehensive, covering 22 held-out tasks across diverse domains including mathematical conjectures, GPU kernel design, and combinatorial puzzles. The results show a 10.22% average improvement over base models on these held-out tasks. The comparison against state-of-the-art methods (like OpenEvolve) on specific tasks (circle-packing, Erdős minimum-overlap) demonstrates that EFT, when paired with test-time RL, can match specialized approaches on circle packing and improve over its base model on the Erdős problem. The inclusion of test-time RL further boosts performance, suggesting that the fine-tuned model serves as a strong prior for online learning. The breadth of the training collection (371 tasks across 10 domains), together with the held-out evaluation, supports the generalization claim and addresses a key limitation of previous single-task efforts.
The authors release the Finch Collection dataset and the fine-tuned models, which is a major plus for reproducibility. The paper provides details on the dataset construction, including the sources of the trajectories (SkyDiscover, ALE-Bench, AtCoder). The implementation details of the fine-tuning process and the experimental setup are described. However, the complexity of generating the initial evolutionary trajectories for the dataset might pose a barrier for full reproduction by independent researchers without access to the same computational resources or specific frameworks. The use of open-source LLMs (2B-9B) ensures that the core method can be replicated by other researchers with moderate resources.
The paper acknowledges limitations, including the potential bias in the Finch Collection towards tasks where evolutionary search is effective. The performance gain, while significant, is not universal across all tasks, suggesting that some problems may still be better suited for traditional algorithms or different LLM capabilities. The reliance on high-quality evolutionary trajectories for training data means that the method's effectiveness is bounded by the quality and diversity of the source data. Additionally, the computational cost of generating the training data and fine-tuning the models is non-trivial. The paper does not extensively explore the failure modes or the specific conditions under which EFT might underperform compared to simpler baselines.
This work has significant implications for the field of AI for Science and automated reasoning. By enabling LLMs to acquire general-purpose search capabilities, it paves the way for more autonomous discovery agents that can tackle complex, open-ended problems without extensive task-specific prompting. This could accelerate research in mathematics, physics, and engineering. However, there are potential risks related to the misuse of such powerful search capabilities, although the primary focus is on beneficial scientific discovery. The open release of the dataset and models promotes transparency and further research in this area.
Recent advances in 3D content generation from text or images have achieved impressive results, yet view inconsistency from 2D generators and the scarcity of high-quality 3D data remain significant bottlenecks. Existing solutions typically adapt large-scale pre-trained text-to-image latent diffusion models to generate 3D Gaussian Splats (3DGS). However, these approaches often rely on training complex cascade pipelines that are computationally expensive and scalability-limited. Most critically, the quality of generated 3D assets is inherently constrained by each component capacity and compressed latent space, leading to decoding artifacts and accumulated errors. To address these limitations, we propose PixGS, a single-stage pipeline for direct high-quality 3DGS generation, which leverages recent advances in pixel-space diffusion to bypass lossy latent compression while still benefiting from the vast 2D generative priors. By directly denoising 3D Gaussian attributes at each timestep, our method enables precise, splat-level regularization of both appearance and geometry. Furthermore, we introduce a comprehensive supervision strategy that incorporates surface normals, depth, and high-frequency structural information, which is often overlooked in prior works. Experiments demonstrate that PixGS outperforms current state-of-the-art methods while maintaining a fast inference speed (1s on a single A100 GPU), offering a robust and efficient alternative to multi-stage generation pipelines.
Primary: Qualcomm AI Research
All Institutions: Qualcomm AI Research
This paper presents a significant advancement in 3D generative modeling by introducing PixGS, a single-stage pixel-space diffusion model that directly generates 3D Gaussian Splats, achieving state-of-the-art quality and speed while mitigating common artifacts associated with latent-space methods.
The paper proposes PixGS, a single-stage pipeline for generating 3D Gaussian Splats (3DGS) directly from text or images using pixel-space diffusion. The core innovation lies in bypassing the lossy latent compression typical of models like Stable Diffusion by operating in pixel space (more precisely, in the attribute space of the splats, which are rendered to pixel space for supervision) while still leveraging large-scale 2D priors. The method uses a Qwen2.5-1.7B text encoder and DINO-v2 image encoder. It introduces a multi-scale Laplacian of Gaussian (LoG) loss to supervise high-frequency structural information, addressing the common issue of blurry or inconsistent 3D geometry in diffusion-based 3D generation. The approach is framed as a direct denoising of 3D Gaussian attributes, which is a significant shift from the two-stage (autoencoder + diffusion) or multi-view consistency approaches prevalent in prior work (e.g., SyncDreamer, Zero123++ adaptations). The technical approach is sound and addresses a known bottleneck (view inconsistency and decoding artifacts) in the field.
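The idea behind a multi-scale LoG loss can be sketched as follows: filter the rendered and target images with Laplacian-of-Gaussian kernels at several scales and penalize the difference between the responses. The choice of an L1 penalty, the scales, the kernel truncation at 3 sigma, and all function names are assumptions for illustration; the paper's exact formulation is not given in this digest.

```python
import math
from typing import List, Sequence

Image = List[List[float]]  # greyscale image as rows of floats


def log_kernel(sigma: float) -> Image:
    """Sampled Laplacian-of-Gaussian kernel, shifted to sum to zero."""
    radius = max(1, int(math.ceil(3 * sigma)))
    kernel = []
    for y in range(-radius, radius + 1):
        row = []
        for x in range(-radius, radius + 1):
            r2 = x * x + y * y
            gauss = math.exp(-r2 / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
            row.append((r2 - 2 * sigma ** 2) / sigma ** 4 * gauss)
        kernel.append(row)
    # Subtract the mean so that flat image regions give zero response.
    mean = sum(map(sum, kernel)) / (len(kernel) ** 2)
    return [[v - mean for v in row] for row in kernel]


def filter_valid(img: Image, kernel: Image) -> Image:
    """Apply the (symmetric) kernel at every position where it fully fits."""
    k = len(kernel)
    h, w = len(img), len(img[0])
    return [[sum(kernel[a][b] * img[i + a][j + b]
                 for a in range(k) for b in range(k))
             for j in range(w - k + 1)]
            for i in range(h - k + 1)]


def multiscale_log_loss(pred: Image, target: Image,
                        sigmas: Sequence[float] = (0.5, 1.0)) -> float:
    """Mean absolute difference of LoG responses, averaged over scales."""
    total = 0.0
    for sigma in sigmas:
        kernel = log_kernel(sigma)
        resp_pred = filter_valid(pred, kernel)
        resp_tgt = filter_valid(target, kernel)
        diffs = [abs(a - b) for rp, rt in zip(resp_pred, resp_tgt)
                 for a, b in zip(rp, rt)]
        total += sum(diffs) / len(diffs)
    return total / len(sigmas)


# Toy 9x9 images: a sharp vertical edge versus a flat (fully blurred) image.
edge = [[0.0 if x < 4 else 1.0 for x in range(9)] for _ in range(9)]
flat = [[0.5] * 9 for _ in range(9)]

print("loss(edge, edge):", multiscale_log_loss(edge, edge))
print("loss(flat, edge):", multiscale_log_loss(flat, edge))
```

A flat prediction has the same mean intensity as the edge image here, so a plain pixel loss would be moderate, while the LoG responses differ sharply near the edge; this is the sense in which such a loss targets high-frequency structure.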
The authors evaluate PixGS against state-of-the-art methods like DiffSplat and DiffusionGS. They report superior performance in terms of both visual quality (assessed via user studies with 103 participants) and inference speed (1 second on a single A100 GPU). The experiments include ablation studies on the LoG loss scales and data scalability (comparing 265K vs 1M assets). The user study provides a strong qualitative validation, which is crucial in generative 3D tasks where metrics like FID can be misleading. The claim of 1s inference is a strong practical contribution, significantly faster than iterative optimization-based methods or complex cascade pipelines. The comparison with DiffusionGS highlights the benefit of using 2D priors for faster convergence and better quality on smaller datasets.
The supplementary material provides detailed implementation information, including architecture choices (PixNerd backbone, patch size), training settings (8 H100 GPUs, batch size 128, mixed precision), and hyperparameters (learning rate, warmup steps). The use of pre-trained weights (PixNerd, Qwen2.5, DINO-v2) aids reproducibility. However, the reliance on a fine-tuned GSRecon model for pseudo-label generation on a large dataset (G-Objaverse + G-Objaverse-XL) introduces a dependency that might be complex to replicate exactly without access to the specific fine-tuned weights or the full data pipeline. The training duration of 5 days on 8 H100s is substantial but feasible for many labs.
The authors acknowledge several limitations: the method is currently restricted to object-level generation, does not model physically-based materials (resulting in baked-in shading), and has not been evaluated on real-world video datasets or camera-controlled generation. The restriction to object-level generation is a significant limitation for broader scene understanding applications. The "baked-in" shading is a known issue in image-conditioned 3D generation but is explicitly noted here. The scalability to complex scenes with high-resolution attributes is also questioned.
PixGS offers a faster, more efficient alternative for generating 3D assets, which could accelerate applications in gaming, virtual reality, and digital twins. By reducing the computational cost and time required for 3D generation, it lowers the barrier to entry for creating 3D content. However, the ease of generating 3D assets also raises concerns about the proliferation of synthetic media and potential misuse in creating deceptive content. The reliance on large-scale datasets (Objaverse) also touches on data copyright and bias issues inherent in current generative models.
Therapy-induced cardiotoxicity is the leading non-oncological cause of treatment interruption in breast cancer patients, yet early, automated risk stratification from routine cardiac imaging remains an unsolved problem. We present EchoRisk, the first curated, multicentre, longitudinal echocardiography dataset with explicit cardiotoxicity labels, released as the primary technical reference for the EchoRisk-MICCAI 2026 challenge. The dataset comprises 422 patients enrolled in the EU-funded CARDIOCARE prospective study across five European sites, yielding 2,159 echocardiography videos across 1,123 clinical exams acquired at up to five longitudinal timepoints, alongside a dedicated cohort of 280 patients with baseline imaging for early cardiotoxicity prediction. Three clinically grounded tasks are defined: automated estimation of left ventricular ejection fraction from cine video (Task 1), classification of LV dysfunction from longitudinal imaging (Task 2), and early prediction of therapy-induced cardiotoxicity from pre-therapy baseline echocardiography alone (Task 3). For each task we specify the evaluation protocol, primary and secondary metrics, and ranking procedure. We establish baseline performance using an R(2+1)D video backbone with LSTM aggregation trained from Kinetics-400 pretrained weights, demonstrating strong discriminative performance for cardiac functional assessment and LV dysfunction classification, while early cardiotoxicity prediction from a single pre-therapy video remains a significant open problem for the community. The dataset, evaluation code, and baseline implementations are publicly available to serve as a benchmark for further collaboration, comparison, and the creation of task-specific architectures in cardio-oncology.
Primary: Foundation for Research and Technology Hellas
All Institutions: Foundation for Research and Technology Hellas, University of Ioannina, Hellenic Mediterranean University, National and Kapodistrian University of Athens, Karolinska University Hospital, Bank of Cyprus Oncology Centre
This paper introduces EchoRisk, the first curated, multicentre, longitudinal echocardiography dataset with explicit cardiotoxicity labels, along with a comprehensive benchmark for cardio-oncology, highlighting early cardiotoxicity prediction as a significant open problem. The meticulous curation of a high-quality, clinically relevant dataset from a prospective study, coupled with well-defined tasks and robust baselines, provides an invaluable resource that will drive significant research in medical AI, particularly in addressing the critical challenge of therapy-induced cardiotoxicity.
The paper introduces EchoRisk, a multicentre, longitudinal echocardiography dataset for cardio-oncology, derived from the EU-funded CARDIOCARE prospective study across five European sites. A key methodological strength is the expert-adjudicated cardiotoxicity labels, which integrate longitudinal echocardiography findings with biomarkers following ESC 2022 guidelines, representing a deliberate and rigorous curation process. This ensures high-quality ground truth, superior to automated EHR extraction. Three clinically grounded tasks are defined: Task 1 (LVEF estimation), Task 2 (LV dysfunction classification using GLS), and Task 3 (early cardiotoxicity prediction from baseline imaging). The baseline models employ a robust R(2+1)D ResNet-18 backbone, pretrained on Kinetics-400, combined with an LSTM for temporal aggregation, a standard yet powerful architecture for video analysis. Detailed preprocessing steps (greyscale conversion, fractional index sampling, resizing) and training specifics (AdamW, learning rate scheduling, specific loss functions like Focal Loss for imbalanced tasks) are provided. A dual-view strategy for Task 3 and a clinical reference baseline (logistic regression on age and LVEF) further enhance the benchmark's comprehensiveness and clinical relevance. The overall methodology for dataset construction and task definition is exceptionally strong and clinically well-aligned.
The experimental evaluation is comprehensive and rigorously conducted. Baselines are established across all three tasks, with results averaged over eight independent random seeds and ensemble predictions for robustness. For Task 1 (LVEF estimation), a test MAE of 4.98 pp is achieved, aligning with established benchmarks like EchoNet-Dynamic and validating the dataset's utility for functional assessment. Task 2 (LV dysfunction classification) demonstrates strong performance with a test AUC of 0.849, indicating effective discrimination of GLS-defined dysfunction. The most impactful finding emerges from Task 3 (early cardiotoxicity prediction): the best video baseline achieves an AUC of 0.541, which is statistically indistinguishable from the clinical reference floor (AUC 0.525). This crucial result, consistent across internal pilot experiments, highlights that early cardiotoxicity prediction from baseline echocardiography remains a significant open problem, even with advanced deep learning architectures. The detailed statistical analysis, including 95% confidence intervals via non-parametric bootstrap resampling and Wilcoxon signed-rank tests with Holm-Bonferroni correction, adds significant rigor. Calibration is also assessed via Expected Calibration Error (ECE). The experiments effectively map the current performance landscape and clearly identify a challenging frontier for future research.
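The confidence-interval step can be illustrated with a minimal percentile-bootstrap sketch for AUC. The percentile method, the number of resamples, and the toy labels and scores are assumptions for illustration; the digest only states that non-parametric bootstrap resampling was used, and this is not the benchmark's evaluation code.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels: Sequence[int], scores: Sequence[float],
                     n_boot: int = 2000, alpha: float = 0.05,
                     seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI: resample cases with replacement, recompute AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy data (hypothetical): 8 patients, 4 with the positive label.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.65, 0.90]

point = auc(labels, scores)
low, high = bootstrap_auc_ci(labels, scores)
print(f"AUC = {point:.4f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

With cohorts of a few hundred patients, intervals of this kind are wide, which is why an AUC of 0.541 and a reference of 0.525 can be statistically indistinguishable.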
The paper demonstrates an outstanding commitment to reproducibility. It explicitly states that the EchoRisk dataset, evaluation code, and baseline implementations are publicly available via a dedicated GitHub repository. The methodology section provides extensive details on the model architecture, preprocessing steps, training hyperparameters (optimizers, learning rates, weight decay, early stopping), and loss functions. The use of multiple random seeds (42-49) for all experiments, along with the procedure for ensemble predictions and handling of degenerate runs, ensures that the reported results are robust and verifiable. The detailed statistical analysis methods, including confidence interval calculation and hypothesis testing, further contribute to the transparency and reproducibility of the benchmark. This level of detail and open-source commitment is exemplary for a benchmark paper.
While the dataset is a highly valuable contribution, its size, though multicentre and longitudinal, is relatively modest (422 patients overall, 280 for Task 3) compared to some large-scale single-centre datasets. This might limit the ability of current deep learning models to extract extremely subtle prognostic signals for Task 3. The variable follow-up window for cardiotoxicity labels in Task 3, while reflecting real-world data collection, means the positive label indicates cardiotoxicity within the *available* window, not a fixed 12-month horizon, which could introduce some variability in interpretation. The baselines, while robust, are standard video architectures; the paper's novelty lies in the benchmark itself rather than new architectural contributions. The reliance on Kinetics-400 pretraining, while common, might not be optimally suited for medical ultrasound, suggesting future work could explore domain-specific pretraining.
EchoRisk has profound broader impact potential. It addresses a critical and growing clinical challenge in cardio-oncology: the early detection and risk stratification of therapy-induced cardiotoxicity in breast cancer patients. By providing the first multicentre, longitudinal echocardiography dataset with expert-adjudicated cardiotoxicity labels, it establishes a foundational resource for the machine learning community. Its role as the primary technical reference for the EchoRisk-MICCAI 2026 challenge ensures widespread adoption and will catalyze significant research into novel AI methods for cardiac ultrasound. Success in tasks like early cardiotoxicity prediction could lead to personalized treatment strategies, timely cardioprotective interventions, reduced treatment interruptions, and ultimately improved long-term cardiovascular outcomes for cancer patients. The open-source nature of the dataset and tools will foster collaborative research, accelerating progress in this vital area of medical AI and serving as a model for future clinically relevant benchmarks.
Autonomous robots often need to move their camera before they can act: to inspect an object, reveal an occluded region, or obtain a view that responds to a user's intent. While vision-language navigation translates instructions to base motion and vision-language-action policies map instructions to manipulation actions, language-conditioned camera motion remains comparatively underexplored as a first-class action. We formulate language-conditioned camera motion generation: given a current RGB observation and a free-form natural-language intent, predict a relative target camera pose for the next observation. This task is inherently non-trivial: viewpoint changes are driven by latent perceptual intentions, and a valid motion may operate at different semantic granularity, from entering a room to looking around a corner, inspecting a visible object, or revealing an occluded detail. To model this structure, we mine multi-intention camera-motion supervision from egocentric video, pairing plausible intents and observation-gain descriptions with relative SE(3) target poses. We propose LIME, a vision-language camera-motion generator that combines an auto-regressive observation-gain output with a continuous flow-matching pose head. This design lets the model jointly predict what the next view should reveal while representing multi-hypothesis target views. Across experiments and downstream robotic tasks, we show that LIME can learn to actively choose camera poses from passive human video, turning ordinary egocentric recordings into supervision for intent-aware active perception.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Toyota Research Institute, Stanford University
LIME introduces a novel vision-language framework for language-conditioned camera motion generation, effectively bridging semantic intent and geometric action through joint prediction of observation gain and SE(3) poses, enabling robust active perception from passive egocentric data.
The paper proposes LIME, a novel framework for language-conditioned camera motion generation. The core methodological contribution is the formulation of this task as a joint prediction of "observation gain" (what the robot intends to see) and the corresponding SE(3) camera pose. By using an autoregressive transformer for the semantic intent/gain and a continuous flow-matching head for the geometric pose, the authors address the multi-modal nature of the problem. This decoupling allows the model to handle the ambiguity inherent in natural language instructions (e.g., "look around" vs. "inspect the cup") by predicting multiple plausible future views. The use of flow-matching for pose generation is technically sound and aligns with recent trends in generative modeling for continuous variables. The approach effectively bridges the gap between high-level semantic reasoning and low-level geometric control.
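To make the "relative SE(3) target pose" concrete, here is a minimal sketch of how such a target could be computed from two camera poses taken from a video. It assumes 4x4 camera-to-world matrices; the helper names (`make_pose`, `invert_pose`, `relative_pose`, `rot_z`) are illustrative and do not come from the paper.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for a yaw of `theta` radians about the z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_pose(rotation, translation):
    """Assemble a 4x4 homogeneous camera-to-world transform."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def invert_pose(T):
    """Invert a rigid transform using R^T instead of a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

def relative_pose(T_world_cur, T_world_next):
    """Pose of the next camera expressed in the current camera's frame."""
    return invert_pose(T_world_cur) @ T_world_next

# Hypothetical poses for two frames of an egocentric clip.
T_cur = make_pose(rot_z(0.3), [1.0, 2.0, 0.0])
T_next = make_pose(rot_z(0.8), [1.5, 2.0, 0.1])
T_rel = relative_pose(T_cur, T_next)
```

Composing `T_cur @ T_rel` recovers `T_next`, which is what makes the relative pose usable as a motion command independent of the world frame.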
The evaluation is comprehensive, covering both offline metrics on a newly curated dataset (mined from egocentric videos) and online robotic tasks. The offline experiments demonstrate that LIME outperforms baselines in pose prediction accuracy and semantic alignment. The online experiments on real robots show that LIME enables successful manipulation and embodied QA tasks that fail with passive observation or naive baselines. The ability to turn passive human video into active perception supervision is a significant empirical finding, demonstrating data efficiency and generalization. The results are robust and clearly support the claims made in the abstract.
The paper provides detailed descriptions of the architecture, training objectives, and data mining process. The inclusion of code (via the GitHub link) and the specific mention of the dataset construction methodology enhances reproducibility. However, the reliance on specific robotic hardware for the online evaluation might limit direct replication for some researchers, though the offline benchmarks are likely accessible.
The performance is contingent on the quality of the mined egocentric video data; biases in human behavior may transfer to the model. The current formulation assumes a relative SE(3) prediction, which may require careful integration with the robot's base motion controller for complex navigation tasks. Additionally, the "observation gain" abstraction, while useful, introduces a layer of semantic interpretation that could be error-prone if the language model component fails to align with the visual context.
This work significantly advances the field of embodied AI by providing a reusable primitive for active perception. It enables robots to be more autonomous and interactive, reducing the need for pre-programmed camera movements. This has broad implications for service robotics, assistive technologies, and human-robot interaction. The method of mining supervision from passive video also offers a scalable path for training other active perception behaviors.
Learning effective robot control policies on physical hardware is challenging due to costly data collection and the difficulty of reward specification. Prior work has incorporated demonstrations into reinforcement learning (RL), yet existing approaches either require large numbers of demonstrations or depend on continuous human intervention during training. To address these limitations, we present AutoSERL, a framework that leverages a single demonstration to fully automate the intervention process in real-world robot RL. The framework includes three complementary mechanisms to accomplish certain tasks: a sliding window intervention mechanism that continuously guides exploration to prevent local optima and unsafe deviations, a safety recovery mechanism that detects and corrects failure states via predefined trajectory recovery points, and an intervention termination criterion that automatically disables guidance once the policy can independently complete the task, preserving its exploration advantage. We evaluate AutoSERL on six contact-intensive manipulation tasks across two robot platforms, spanning insertion, hanging, and hinge-based tasks. AutoSERL consistently outperforms SERL initialized with 20 demonstrations, behavior cloning, and MILES -- a dedicated one-shot imitation learning baseline -- across all tasks while matching HIL-SERL, achieves 100% success rate on insertion tasks, and demonstrates improved robustness to positional variations, all from a single demonstration. Code and videos are available on our project website: https://autoserl.github.io/.
Primary: National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences
All Institutions: National Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing Academy of Artificial Intelligence, PKU-PsiBot Joint Lab, School of Artificial Intelligence, University of Chinese Academy of Sciences, Institute for Artificial Intelligence, Peking University
AutoSERL effectively automates real-world robotic reinforcement learning by replacing human intervention with heuristic-based guidance from a single demonstration, achieving performance comparable to human-supervised training while significantly reducing labor costs and improving scalability for contact-intensive manipulation tasks.
The paper proposes AutoSERL, a framework designed to automate real-world robotic reinforcement learning (RL) by replacing human-in-the-loop interventions with automated mechanisms derived from a single demonstration. The core methodology builds upon SERL (Sample-Efficient RL) and introduces three key components: (1) a Sliding Window Intervention mechanism that guides the robot along a demonstration trajectory using geometric proximity checks and directional constraints to prevent local optima and collisions; (2) a Safety Recovery Mechanism that detects stagnation and replays predefined safe trajectory segments; and (3) an Intervention Termination criterion that disables guidance once the policy achieves sufficient autonomy. The approach is heuristic-driven, relying on predefined thresholds and geometric heuristics rather than learned intervention policies. While the engineering integration is sophisticated, the conceptual novelty is moderate, as it essentially automates the "teleoperation" aspect of HIL-SERL using rule-based logic derived from a single trajectory.
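The sliding-window idea can be illustrated with a short sketch. This is a hypothetical simplification that uses positions only, not AutoSERL's actual rule: the function name, the window size, and the thresholds `max_dev` and `step` are all assumptions chosen for illustration.

```python
import numpy as np

def sliding_window_step(position, demo, start, window=5, max_dev=0.05, step=0.01):
    """Return (intervention_action_or_None, new_window_start).

    `demo` is an (N, 3) array of demonstration waypoints. Only the waypoints
    inside the current window are considered, so guidance follows the
    demonstration forward instead of snapping back to earlier segments.
    """
    end = min(start + window, len(demo))
    segment = demo[start:end]
    dists = np.linalg.norm(segment - position, axis=1)
    nearest = int(np.argmin(dists))
    new_start = start + nearest  # slide the window forward, never backward
    if dists[nearest] <= max_dev:
        return None, new_start  # close enough: let the RL policy act freely
    # Too far from the demonstration: nudge the end effector back toward it.
    direction = segment[nearest] - position
    action = step * direction / np.linalg.norm(direction)
    return action, new_start

# Hypothetical straight-line demonstration along the x-axis.
demo = np.stack([np.linspace(0.0, 0.9, 10), np.zeros(10), np.zeros(10)], axis=1)
free_action, start_a = sliding_window_step(np.array([0.2, 0.01, 0.0]), demo, start=0)
guided_action, start_b = sliding_window_step(np.array([0.2, 0.2, 0.0]), demo, start=0)
```

In the first call the end effector is within `max_dev` of the demonstration, so no intervention is issued; in the second, the returned action points back toward the nearest waypoint in the window.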
The evaluation is conducted on six contact-intensive manipulation tasks (insertion, hanging, hinge-based) across two different robot platforms (Franka and UR5). The results demonstrate that AutoSERL consistently outperforms SERL initialized with 20 demonstrations, Behavior Cloning (BC), and MILES. It matches the performance of HIL-SERL, which requires continuous human intervention. The paper includes ablation studies confirming the necessity of each component and sensitivity analyses for hyperparameters. The success rates are high (100% on insertion tasks), and the training efficiency is significantly improved compared to standard RL baselines. The experiments are well-structured and provide strong empirical evidence for the method's effectiveness in real-world settings.
The paper provides detailed descriptions of the hardware setup, task definitions, and hyperparameters. The code and videos are available on the project website. The reliance on a single demonstration trajectory makes the setup highly specific to the task geometry, but the general framework is clearly described. The use of standard RL baselines (SERL, BC) and clear evaluation metrics enhances reproducibility. However, the specific geometric thresholds and motion planning details might require careful tuning for new tasks, though this is expected for heuristic-based methods.
The method is limited to tasks with 6D delta end-effector pose action spaces. The recovery mechanism relies on a single demonstration trajectory, which may not cover all possible failure modes, especially in tasks with diverse or unpredictable failure scenarios. The heuristic nature of the intervention means that it may not generalize well to tasks significantly different from the demonstration without manual adjustment of parameters. The paper acknowledges that leveraging more data to train a more robust recovery policy is a future direction.
This work addresses a critical bottleneck in real-world robotic RL: the need for continuous human supervision. By automating the intervention process, it lowers the barrier to entry for deploying RL on physical robots, potentially accelerating research and application in service robotics, manufacturing, and other domains requiring precise manipulation. It demonstrates that simple heuristic-based automation can be as effective as complex human-in-the-loop systems for certain classes of tasks.
Safe motion planning in dynamic environments requires reasoning about the uncertainty in predicted obstacle motion without sacrificing real-time performance. Existing conformal approaches conformalize a scalar score that aggregates per-obstacle prediction errors, losing spatial coherence and scaling poorly with scene density. We instead conformalize the entire predicted distance field at once. This functional conformal prediction (FCP) framework yields a distribution-free, field-level lower bound, from which safety follows uniformly: any trajectory satisfying the resulting constraint is certified safe, independent of how the control space is sampled. The key enabler is that the residual distance field is empirically low-rank and approximately time-invariant, which makes the bound decomposable in coefficient space. An envelope is fitted offline via functional PCA and a Gaussian-mixture inductive conformal procedure, then refined online by a lightweight adaptive functional conformal (AFCP) update on a low-dimensional vector. This keeps the per-step cost largely insensitive to obstacle count and retains long-run field coverage under distribution shift. We embed the envelope as a tightened safety constraint in a sampling-based model predictive controller, FCP-MPC. On the ETH--UCY pedestrian benchmarks and a dense 3D quadrotor task with up to 280 dynamic obstacles, FCP-MPC attains a favorable balance of safety, feasibility, and efficiency, reaching goals where pointwise and egocentric conformal baselines become too conservative or too expensive, while keeping per-step computation far below online uncertainty-reasoning baselines.
Primary: Seoul National University
All Institutions: Seoul National University
This paper introduces a novel Functional Conformal Prediction framework for safe motion planning, leveraging the low-rank structure of prediction errors to provide scalable, distribution-free safety guarantees in dynamic environments. The approach effectively addresses the computational and spatial coherence limitations of prior conformal methods, offering a significant advancement in the integration of statistical uncertainty quantification with real-time robotic control.
The paper proposes a Functional Conformal Prediction (FCP) framework to address the scalability and spatial coherence issues of existing conformal prediction (CP) methods in safe motion planning. Instead of conformalizing scalar scores per obstacle, the authors treat the prediction error of the distance field as a functional object in a Hilbert space. They leverage the empirical observation that residual distance fields are low-rank and approximately time-invariant. This allows them to perform Functional PCA (FPCA) to decompose the field into a few principal components. A Gaussian Mixture Model (GMM) is fitted to the coefficients of these components in an offline stage, and an inductive conformal procedure is used to create a distribution-free envelope. Online, an Adaptive Functional Conformal Prediction (AFCP) update adjusts a scalar multiplier to handle distribution shifts. This approach decouples the expensive statistical calibration from the real-time planning loop, allowing the safety constraint to be evaluated efficiently for any sampled trajectory in an MPC framework. The methodology is theoretically sound, providing asymptotic safety guarantees under both exchangeable and non-exchangeable (adaptive) settings.
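As a rough illustration of how an online conformal threshold can retain long-run coverage under shift, the sketch below uses a generic quantile-tracking update from the adaptive conformal literature. It is not the paper's AFCP rule: the function name, the step size `eta`, and the assumption that nonconformity scores lie in [0, 1] are all illustrative.

```python
import numpy as np

def track_threshold(scores, alpha=0.1, eta=0.05, q0=0.5):
    """Online threshold update targeting a long-run miscoverage rate `alpha`.

    After each step the threshold rises if the observed score exceeded it
    (a miss) and falls slightly otherwise:
        q <- q + eta * (err - alpha),  err = 1 if score > q else 0.
    """
    q = q0
    thresholds, errors = [], []
    for s in scores:
        thresholds.append(q)
        err = 1.0 if s > q else 0.0
        errors.append(err)
        q = q + eta * (err - alpha)
    return np.array(thresholds), np.array(errors)

# Synthetic scores in [0, 1]; a real system would use residual-field scores.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=5000)
thresholds, errors = track_threshold(scores, alpha=0.1)
empirical_miscoverage = errors.mean()
```

Because the update telescopes, the gap between the empirical miss rate and `alpha` is bounded by the threshold's total drift divided by `eta * T`, so for bounded scores it shrinks as the horizon grows regardless of how the scores are distributed.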
The authors evaluate FCP-MPC on two benchmarks: the ETH-UCY pedestrian dataset (2D) and a dense 3D quadrotor simulation with up to 280 dynamic obstacles. They compare against pointwise and egocentric conformal baselines, as well as online uncertainty-reasoning methods. The results indicate that FCP-MPC achieves a favorable balance of safety, feasibility, and efficiency. It successfully reaches goals where pointwise methods are too conservative and egocentric methods are too expensive or lose coverage. The per-step computation remains largely insensitive to obstacle count, demonstrating the scalability of the functional approach. The experiments are comprehensive, covering both 2D and 3D scenarios and varying densities.
The paper provides a GitHub repository link (https://github.com/CORE-SNU/FCP-MPC), which significantly aids reproducibility. The methodology is described in detail, including the offline FPCA and GMM fitting, and the online AFCP update. The use of standard benchmarks (ETH-UCY) also facilitates comparison. However, the specific implementation details of the "dense 3D quadrotor task" (e.g., exact dynamics, sensor noise models, prediction model architecture) might require careful reading of the appendix or code to fully replicate.
The method relies on the assumption that the residual distance field is low-rank and approximately time-invariant. While verified empirically, this may not hold in all environments (e.g., highly dynamic, non-stationary scenes with complex occlusions). The offline calibration requires a sufficiently large and representative dataset of residual fields. The adaptive update (AFCP) provides long-run coverage but may take time to converge to the correct threshold under rapid distribution shifts. The soft-constraint variant degrades safety guarantees by a controllable slack, which might be unacceptable for some high-risk applications.
This work contributes to the field of safe autonomous systems by providing a scalable and theoretically grounded method for uncertainty-aware motion planning. By enabling real-time safety guarantees in dense, dynamic environments, it facilitates the deployment of robots in more complex real-world scenarios. The functional conformal prediction framework could also be applicable to other domains involving spatial or functional data uncertainty, such as medical imaging or environmental monitoring.
Video predictive models are emerging as a powerful paradigm in robotics, offering a promising path toward task generalization, long-horizon planning, and flexible decision-making. However, prevailing approaches often operate on 2D video sequences, inherently lacking the 3D geometric understanding necessary for precise spatial reasoning and physical consistency. We introduce a Structured 4D Latent Predictive Model, which predicts the evolution of a scene's 3D structure in a structured latent space conditioned on observations and textual instructions. Our representation encodes the scene holistically and can be decoded into diverse 3D formats, enabling a more complete and 3D consistent scene understanding. This structured 4D latent predictive model serves as a planner, generating future scenes that are translated into executable actions by a goal-conditioned inverse dynamics module. Experiments demonstrate that our model generates futures with strong visual quality, substantially better 3D consistency and multi-view coherence compared to state-of-the-art video-based planners. Consequently, our full planning pipeline achieves superior performance on complex manipulation tasks, exhibits robust generalization to novel visual conditions, and proves effective on real-world robotic platforms. Our website is available at https://structured-4d-model.github.io/.
Primary: Harvard University
All Institutions: MIT, Harvard University
This paper presents a Structured 4D Latent Predictive Model for robot planning that predicts future 3D scene structures in a sparse voxel latent space, enabling more geometrically consistent and robust manipulation compared to 2D video-based planners. The work is a significant contribution to the intersection of 3D generative modeling and robotics, offering a compelling alternative to end-to-end policies and 2D video planners by explicitly modeling 3D dynamics. The experimental results are strong, demonstrating state-of-the-art performance on several benchmarks and successful real-world deployment. The technical approach is well-motivated and rigorously evaluated.
The paper proposes a Structured 4D Latent Predictive Model for robot planning. The core innovation lies in moving from 2D video prediction to 3D latent space prediction using sparse voxel grids. The architecture leverages a pre-trained encoder/decoder (from TRELLIS) to map between multi-view images and structured 3D latents. The predictive model itself is split into a Single Dynamics Model (SD) for geometry/position and a Latent Generator (LG) for features, both using conditional flow matching. This is then coupled with a goal-conditioned inverse dynamics module to generate actions. The approach is technically sound, leveraging recent advances in 3D generation (3DGS, sparse voxels) and flow matching. The separation of geometry and feature dynamics is a practical design choice to handle the complexity of 3D generation.
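For readers unfamiliar with conditional flow matching, the following sketch shows the standard linear-path formulation: a sample is interpolated between noise and data, the regression target is the constant velocity between them, and sampling integrates a velocity field with Euler steps. It stands in for the general technique only; the paper's actual networks, conditioning, and latent layout are not reproduced, and all names here are illustrative.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return (1.0 - t) * x0 + t * x1

def target_velocity(x0, x1):
    """Regression target for the velocity network along the linear path."""
    return x1 - x0

def euler_sample(x0, velocity_fn, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Toy check: with an oracle velocity field, sampling lands on the data point.
rng = np.random.default_rng(0)
x0 = rng.normal(size=4)               # noise sample
x1 = np.array([0.5, -1.0, 2.0, 0.0])  # hypothetical latent target
oracle = lambda x, t: target_velocity(x0, x1)
x_final = euler_sample(x0, oracle)
```

In training, a network would be fit to predict `target_velocity(x0, x1)` from `interpolate(x0, x1, t)`, `t`, and the conditioning inputs; the oracle above simply replaces that network so the sketch stays self-contained.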
The experiments cover simulation (ManiSkill3, LIBERO, RLBench) and real-world deployment. The paper demonstrates superior 3D consistency and multi-view coherence compared to video-based baselines (UniPi, TesserAct). Success rates on manipulation tasks are competitive with or better than imitation learning baselines (Diffusion Policy, DP3), particularly in zero-shot generalization to visual/viewpoint changes. The real-world experiment on a block-in-basket task provides strong empirical validation. The ablation studies on camera views and inverse dynamics inputs are thorough.
The paper provides detailed descriptions of the architecture, training objectives (flow matching), and data preparation. It references specific pre-trained models (TRELLIS, DINOv2, CLIP) and datasets (ManiSkill3, LIBERO). The website link suggests code availability. The use of standard benchmarks enhances reproducibility.
The method relies on calibrated multi-view RGB-D observations for the initial state reconstruction, which can be a limitation in single-view or uncalibrated real-world settings. The computational cost of 3D latent generation and decoding might be higher than 2D video generation. The reliance on a pre-trained 3D encoder/decoder means the method is tied to the capabilities of those models.
This work advances the field of embodied AI by providing a more geometrically grounded approach to robot planning. It has potential applications in autonomous robotics, simulation-to-real transfer, and interactive AI agents. The focus on 3D consistency addresses a key bottleneck in current video-based planning methods.
Embodied task planning asks an agent to turn a natural-language instruction into an executable sequence of actions in a physical scene, and is a building block for household, assistive, and service robots. Recent prompting-based and reinforcement-learning planners generate fluent action text but lack a cheap deterministic check that the produced plan is valid in the target world, while high-fidelity simulation is too slow to serve as an inner-loop training signal. The general problem is therefore how to obtain verifiable supervision and rewards for embodied planners without relying on string-level matching or full simulation. Here we show that a single BDDL specification, automatically constructed from open-world video evidence or curated tasks, can serve as a shared interface for data construction, plan verification, and reward design. A video-to-BDDL parser, an LLM verifier, and a lightweight symbolic engine together supply dense feedback at millisecond latency. We further introduce GroupAdapt, a difficulty-aware length schedule that uses the in-batch group pass rate as a zero-cost signal so that hard prompts get wider length tolerance and automatically tighten as their pass rate improves. Under the guidance of the proposed verifier and GroupAdapt schedule, the 8B planner attains a Strict-Pass score of 97.3 on BEHAVIOR-1000, yielding a 25.9 percent relative improvement over the Qwen3-8B baseline. This result exceeds the strongest large-model baseline by 3.5 percent, while simultaneously compressing the response length by 79 percent to 207 tokens, demonstrating both effectiveness and efficiency.
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong University of Science and Technology, University of London
This paper presents a significant advancement in embodied AI by introducing a BDDL-centric pipeline that integrates symbolic verification with reinforcement learning, enabling compact and correct task planning for 8B models that outperform larger baselines. The rigorous evaluation and clear methodology make it a valuable contribution to the field of robotics and machine learning.
The paper proposes a coherent pipeline for embodied task planning that bridges the gap between open-world natural language instructions and executable symbolic plans. The core methodological innovation lies in the use of BDDL (Behavior Domain Definition Language) as a unified interface for data construction, verification, and reward design. Specifically, the authors introduce a video-to-BDDL parser to generate training data from open-world videos, an LLM verifier to ensure semantic consistency, and a lightweight symbolic engine for millisecond-latency verification. The training methodology combines Supervised Fine-Tuning (SFT) with Symbolic-Reinforcement Learning (using DAPO). A key technical contribution is "GroupAdapt," a difficulty-aware length scheduling mechanism that uses the in-batch group pass rate to dynamically adjust length tolerance, allowing harder prompts to have more flexibility while enforcing conciseness on easier ones. This approach effectively decouples correctness learning from length compression, addressing a common failure mode in LLM planning where early compression leads to errors.
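The difficulty-aware schedule can be pictured with a small sketch. The linear mapping from pass rate to tolerance, the penalty shape, and the constants `min_tol` and `max_tol` are assumptions made for illustration; the source specifies only that hard prompts get wider length tolerance that tightens as their in-batch pass rate improves.

```python
def group_pass_rate(passed):
    """Fraction of rollouts in a prompt's group that passed verification."""
    return sum(passed) / len(passed)

def length_tolerance(pass_rate, min_tol=64, max_tol=512):
    """Wide tolerance for hard prompts (low pass rate), tight for easy ones."""
    return min_tol + (max_tol - min_tol) * (1.0 - pass_rate)

def length_penalty(n_tokens, target_len, tolerance):
    """Zero inside the tolerated band, then growing linearly up to 1."""
    excess = n_tokens - (target_len + tolerance)
    if excess <= 0:
        return 0.0
    return min(1.0, excess / tolerance)

# The same 300-token response is tolerated on a hard prompt, penalized on an easy one.
hard_tol = length_tolerance(group_pass_rate([False] * 8))
easy_tol = length_tolerance(group_pass_rate([True] * 8))
hard_penalty = length_penalty(300, target_len=200, tolerance=hard_tol)
easy_penalty = length_penalty(300, target_len=200, tolerance=easy_tol)
```

Since the pass rate is already computed for group-based RL updates, a schedule of this kind adds no extra rollouts, which matches the "zero-cost signal" framing in the abstract.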
The experimental evaluation is rigorous and comprehensive. The authors evaluate on the BEHAVIOR-1K benchmark, specifically B-100 and B-1000, using metrics like Strict-Pass (SP), Engine-Pass (EP), and Goal Completion Ratio (GCR). The results show that the proposed 8B model significantly outperforms both its same-size Qwen3-8B baseline and larger models (e.g., Gemma-4-31B) in terms of SP score (97.3% on B-1000) while maintaining competitive performance on other metrics. The ablation studies effectively demonstrate the contribution of each component: SFT initialization, symbolic reward shaping, and GroupAdapt. The analysis of length compression is particularly strong, showing a 79% reduction in response length without sacrificing correctness. The inclusion of out-of-domain mathematical reasoning tasks (AIME, MATH) serves as a sanity check to ensure that the length compression does not degrade general reasoning capabilities, which is a valuable addition.
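As a quick consistency check on the headline numbers, the reported percentages can be inverted to estimate the implied starting points. This assumes the 25.9 percent figure is relative to the Qwen3-8B Strict-Pass score on the same benchmark and the 79 percent figure is relative to the uncompressed response length; both values below are derived estimates, not numbers quoted from the paper.

```latex
\[
\mathrm{SP}_{\text{base}} \approx \frac{97.3}{1 + 0.259} \approx 77.3,
\qquad
L_{\text{before}} \approx \frac{207}{1 - 0.79} \approx 986 \ \text{tokens}.
\]
```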
The paper provides detailed descriptions of the methodology, including the BDDL structure, the symbolic engine logic, and the RL hyperparameters (DAPO settings, group size, learning rates). The appendix contains extensive details on data construction, action library expansion, and reward landscape analysis. The use of open-weight models (Qwen3, Gemma) and standard benchmarks (BEHAVIOR-1K) enhances reproducibility. However, the specific implementation of the video-to-BDDL parser and the LLM verifier (likely proprietary or custom-built) might present some challenges for exact replication, although the logical flow is clear. The code for the symbolic engine and RL training loop appears to be the primary barrier to full reproducibility, but the paper provides sufficient detail for a competent researcher to implement.
The paper acknowledges several limitations. First, the method is a planning model and does not handle low-level control, which is a necessary layer for real-world deployment. Second, the reliance on BDDL requires robust scene understanding and object grounding, which can be noisy in real-world settings. The paper notes that real-time scene-to-BDDL construction is an open problem. Third, the performance is evaluated in simulation; real-world transferability is not demonstrated. Finally, the method's effectiveness is tied to the quality of the BDDL specifications and the action library, which may need manual curation or extensive LLM-assisted expansion for new domains.
This work has significant implications for the development of autonomous robots and embodied AI systems. By providing a scalable and verifiable method for training planners, it addresses a critical bottleneck in making robots capable of following complex, natural language instructions in unstructured environments. The emphasis on efficiency (shorter responses) and correctness (symbolic verification) aligns with the industry's need for reliable and deployable AI systems. The use of open-world video data for training also suggests a path towards more data-efficient and generalizable planning models. However, the reliance on simulation and symbolic representations may limit immediate applicability in highly dynamic or unstructured real-world scenarios without significant additional engineering.
Traditional robot programming is challenging: it requires orchestrating multimodal perception, managing physical contact dynamics, and handling diverse configurations and execution failures. We introduce ASPIRE (Agentic Skill Programming through Iterative Robot Exploration), a continual learning system that autonomously writes and refines robot control programs in a code-as-policy paradigm while compounding experience into a reusable skill library. ASPIRE discovers skills that persist across tasks, simulation and real-world settings, and embodiments. It operates in an open-ended loop with three components: (1) a closed-loop robot execution engine that exposes fine-grained multimodal traces, enabling autonomous failure diagnosis, repair synthesis, and validation; (2) a continually expanding skill library that distills validated fixes into reusable, transferable knowledge; and (3) evolutionary search that generates diverse task sequences and control programs to explore beyond single-trajectory refinement. ASPIRE surpasses prior methods by up to 77% on LIBERO-Pro manipulation under perturbation, 72% on Robosuite bimanual handover, and 32% on BEHAVIOR-1K long-horizon household tasks. Its accumulated library also enables zero-shot generalization to unseen long-horizon tasks: on LIBERO-Pro Long, ASPIRE achieves 31% success versus 4% for prior methods despite their use of test-time reasoning and retries. Finally, simulation-discovered skills provide initial evidence of sim-to-real transfer, substantially reducing real-robot programming effort across different embodiments and robot APIs.
Primary: UC Berkeley
All Institutions: UC Berkeley
ASPIRE presents a compelling agentic framework for autonomous skill discovery in robotics, demonstrating significant empirical gains in simulation benchmarks through iterative code refinement and skill library accumulation, though real-world transfer remains preliminary.
The paper proposes ASPIRE, a framework for agentic skill programming in robotics. The core methodology involves a continual learning loop where an LLM-based agent writes, executes, and refines robot control code (code-as-policy). Key components include a closed-loop execution engine providing multimodal traces for failure diagnosis, a persistent skill library for distilling reusable fixes, and evolutionary search to generate diverse task sequences. The approach attempts to move beyond single-trajectory refinement by compounding experience into a transferable skill library. The methodology is technically sound, leveraging recent advances in LLM-based code generation and robotic simulation. However, the novelty is somewhat incremental; the combination of LLMs for code generation, simulation-based self-improvement, and skill libraries has been explored in various forms (e.g., RoboGen, RT-2, various LLM-robotics works). The specific contribution here is the "agentic" loop with evolutionary search for skill discovery, which is a reasonable engineering synthesis rather than a fundamental theoretical breakthrough.
The evaluation covers LIBERO-Pro, Robosuite, BEHAVIOR-1K, and LIBERO-Pro Long. The reported improvements are significant (up to 77% on LIBERO-Pro, 31% vs 4% on LIBERO-Pro Long). These results are compelling and suggest strong empirical performance. The use of standard benchmarks adds credibility. However, the comparison to "prior methods" still needs careful scrutiny: per the abstract, those methods already use test-time reasoning and retries, so the protocols are asymmetric, though in the baselines' favor. The sim-to-real transfer claim is described only as "initial evidence," which is a weak point for a high-impact claim. The experiments are extensive but largely confined to simulation, with real-world results being preliminary.
The paper describes a complex system involving LLMs, simulation environments, and evolutionary search. While the components are standard, the specific integration and hyperparameters for the agentic loop are crucial. No code link appears in the text provided, so code availability is unconfirmed. The reliance on specific LLM APIs and simulation setups might pose reproducibility challenges for others without similar compute resources. The "skill library" mechanism needs clear definition in terms of storage and retrieval to be fully reproducible.
The primary limitation is the heavy reliance on simulation for skill discovery and the weak evidence for sim-to-real transfer. The "agentic" nature implies high compute costs and latency, which may not be suitable for real-time control. The approach may struggle with tasks requiring precise physical dynamics that are hard to capture in simulation or with LLM-generated code. The evaluation on real robots is limited to "initial evidence," lacking rigorous statistical analysis or long-term stability tests. The generalization to "unseen long-horizon tasks" is promising but relies on the assumption that discovered skills are composable in novel ways, which is not always guaranteed.
This work contributes to the automation of robot programming, potentially lowering the barrier to entry for deploying robots in complex tasks. It aligns with the trend of using LLMs for embodied AI. However, it also raises questions about the reliability and safety of autonomous code generation in physical systems. The potential for widespread adoption in industrial settings is high, but the current limitations in real-world robustness must be addressed. ASPIRE presents a compelling agentic framework for autonomous skill discovery in robotics, demonstrating significant empirical gains in simulation benchmarks through iterative code refinement and skill library accumulation, though real-world transfer remains preliminary.
Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal. We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences. Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis. These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward. We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation, shows compositionality of behavior not present in the data, and allows users to steer the policy towards different behaviors at test time without retraining. Blog post with videos available at https://freeform-pl.github.io/fpl.website/
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Stanford University, Toyota Research Institute
This paper presents a significant advancement in preference-based reinforcement learning for robotics by introducing Freeform Preference Learning, which leverages natural language to define multi-dimensional reward axes, enabling more flexible and effective policy optimization in long-horizon manipulation tasks.
The paper introduces Freeform Preference Learning (FPL), a framework that moves beyond binary pairwise comparisons in preference learning. Instead of asking "which is better?", it allows annotators to define natural-language axes (e.g., "speed", "safety") and provide pairwise preferences along these specific dimensions. The core technical innovation lies in training a language-conditioned reward model that maps a trajectory and a preference axis to an axis-specific scalar reward. This reward model is then used to train a reward-conditioned policy; the abstract does not specify the policy training procedure. This approach decouples the definition of quality from the optimization, allowing for multi-objective steering at test time. The methodology addresses the ambiguity of binary preferences in complex, long-horizon tasks by providing dense, semantic feedback.
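One common way to fit a reward model from pairwise preferences is a Bradley-Terry likelihood. The abstract does not say which loss FPL uses, so the sketch below is an assumption rather than the authors' method. The names `toy_reward` and `axis_preference_loss`, the linear reward, and the two-dimensional feature and axis vectors are all hypothetical stand-ins for a learned language-conditioned model.

```python
import numpy as np

def toy_reward(traj_features, axis_embedding):
    """Hypothetical axis-conditioned reward: a dot product between
    trajectory features and an embedding of the preference axis."""
    return float(np.dot(traj_features, axis_embedding))

def axis_preference_loss(r_a, r_b, prefers_a):
    """Bradley-Terry negative log-likelihood for pairwise preferences
    collected along a single axis (assumed loss, not taken from the paper).

    r_a, r_b   : predicted axis-specific rewards for trajectories A and B
    prefers_a  : 1.0 where the annotator preferred A on this axis, else 0.0
    """
    r_a, r_b, prefers_a = (np.asarray(v, dtype=float) for v in (r_a, r_b, prefers_a))
    logits = r_a - r_b
    # log(sigmoid(x)) = -logaddexp(0, -x), which is numerically stable
    log_p_a = -np.logaddexp(0.0, -logits)
    log_p_b = -np.logaddexp(0.0, logits)
    return float(np.mean(-(prefers_a * log_p_a + (1.0 - prefers_a) * log_p_b)))

# Two toy trajectories described by (speed, placement care) features.
fast = np.array([1.0, 0.2])
careful = np.array([0.3, 0.9])
speed_axis = np.array([1.0, 0.0])   # stand-in for an embedding of "speed"
safety_axis = np.array([0.0, 1.0])  # stand-in for an embedding of "safety"

# The same pair is ranked differently depending on the axis.
speed_loss = axis_preference_loss(
    [toy_reward(fast, speed_axis)], [toy_reward(careful, speed_axis)], [1.0])
safety_loss = axis_preference_loss(
    [toy_reward(fast, safety_axis)], [toy_reward(careful, safety_axis)], [0.0])
```

The point of the toy example is that one trajectory pair yields separate training signals per axis, which a single "which is better overall" label would merge into one.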
The evaluation is robust, covering four real-world and two simulated long-horizon manipulation tasks. The key result is a 38 percentage point improvement over sparse-reward and binary-preference baselines. This is a significant empirical gain, suggesting that the granularity of feedback provided by freeform axes is crucial for learning high-quality policies in complex environments. The paper also reports qualitative benefits: learning dense progress signals without explicit subtask segmentation, demonstrating compositionality (behaviors not seen in training data can be composed), and enabling zero-shot steering of behavior at test time. These results strongly support the claim that FPL provides a more flexible and effective interface for human-in-the-loop learning.
The paper provides a blog post with videos, which is helpful for qualitative assessment. However, no code repository or pre-trained models are linked; both would be necessary for full reproducibility. The abstract mentions "dense progress signals without explicit subtask segmentation," which implies a level of generalization that might be sensitive to implementation details of the reward model and policy trainer. While the method is conceptually clear, the absence of a linked public code repository makes independent replication difficult at this stage.
The primary limitation is the reliance on natural language understanding for both defining axes and potentially interpreting them during reward modeling. If the language model fails to align the semantic meaning of the axis with the actual trajectory features, the reward signal may be noisy or misleading. Additionally, the complexity of the annotation task increases for users; defining multiple axes and providing pairwise comparisons for each might be more cognitively demanding than simple binary choices, potentially leading to annotator fatigue. The "compositionality" claim, while promising, may be limited to the specific distribution of axes and trajectories seen during training.
FPL has significant potential to democratize robot learning by making the reward specification process more intuitive and flexible for non-experts. By allowing users to steer behavior via natural language, it bridges the gap between high-level human intent and low-level robotic control. This could accelerate the deployment of robots in unstructured environments where explicit reward engineering is infeasible. However, it also raises questions about the alignment of the learned reward model with true human values, as the "axes" are user-defined and may not capture all ethical or safety nuances.
General-purpose robot policies should be modeled as dynamical systems, yet many VLA and generative imitation policies still rely on present observations or short windows. This Markovian shortcut fails in memory-dependent manipulation: identical observations can demand different actions after different histories. We present Chronos, a physics-informed full-history framework for non-Markovian long-horizon manipulation. The key idea is to elevate observation history from auxiliary context to the latent state of the policy dynamics. At each physical control step, Chronos forms one state-representative token by fusing observation and proprioception, so the token sequence is aligned one-to-one with physical time. A selective state space model propagates this causal historical state, which conditions a multimodal coarse action prior through implicit maximum likelihood estimation (IMLE). This prior is then refined by a second-order Schrodinger-inspired bridge that predicts acceleration fields, yielding smoother and more physically grounded robot motion. Across 16 simulated tasks and 4 real-world experiments, Chronos is evaluated on precision insertion, general manipulation, and memory-dependent long-horizon control. On RMBench, where success requires remembering task phase, Chronos achieves 73.6% average success, outperforming Markovian VLA baseline pi0.5 by +62.4 percentage points, a 6.6x relative gain, while using 10x fewer parameters. It also surpasses the memory VLA Mem-0 by 22.8 points while using over 30x fewer parameters. In real-world dual-arm experiments using a single RGB camera, Chronos achieves 78% average success over four tasks, including 72% on the three memory-dependent tasks, whereas pi0.5 achieves 7% overall and 0% on the memory-dependent subset. These results suggest that history should not be treated as auxiliary context, but as the latent state of the manipulation policy.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology
Chronos introduces a physics-informed, full-history state-space framework for non-Markovian manipulation, achieving state-of-the-art performance on memory-dependent benchmarks with significantly fewer parameters than existing VLA models. The paper makes a significant technical contribution by addressing the non-Markovian nature of long-horizon robotic manipulation through a novel combination of Selective State Space Models for full-history encoding and a Schr\"odinger-inspired second-order bridge for action refinement. The methodology is rigorous, with a clear theoretical derivation linking quantum mechanical concepts to action-space acceleration fields, and the empirical results are strong, particularly on RMBench where it outperforms much larger memory-augmented VLAs. The approach is highly relevant to the current landscape of robot learning, offering a scalable and efficient alternative to large-scale transformer-based policies for tasks requiring temporal memory. The comprehensive evaluation across simulated and real-world tasks, along with detailed ablations, provides strong evidence for the efficacy of the proposed method.
The paper proposes Chronos, a framework addressing the non-Markovian nature of long-horizon manipulation. The core methodological contributions are twofold: (1) A full-history state representation using a Selective State Space Model (Mamba) that treats the entire observation history as the latent state, rather than using it as auxiliary context or a short window. This allows for precise temporal credit assignment across the full trajectory. (2) A physics-informed action generation module based on a "Schr\"odinger-inspired bridge." This module uses Implicit Maximum Likelihood Estimation (IMLE) to generate a coarse multimodal prior, which is then refined by a second-order differential equation solver that predicts acceleration fields. The derivation from the Schr\"odinger equation via Madelung transformation to a quantum Hamilton-Jacobi equation provides a theoretical justification for modeling action refinement as a physical process involving position stabilization and velocity dissipation. The approach is theoretically grounded and distinct from standard diffusion or flow-matching policies by explicitly modeling acceleration and using a quartic noise schedule compatible with second-order dynamics.
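To make the "history as latent state" idea concrete, here is a minimal sketch of a selective state space recurrence in the Mamba style, with one token per control step. It is not the Chronos implementation: the function name `selective_scan`, the single input channel, the diagonal `A`, and the simplified discretization (decay `exp(delta * A)`, input scaled by `delta * B`) are assumptions chosen for brevity.

```python
import numpy as np

def selective_scan(x, delta, A, B, C):
    """Sequential selective SSM scan over one input channel (illustrative).

    x     : (T,)   one scalar token per control step
    delta : (T,)   positive, input-dependent step sizes
    A     : (N,)   negative diagonal state matrix
    B, C  : (T, N) input-dependent input and output projections
    Returns the outputs y (T,) and the hidden states h (T, N).
    """
    x, delta, A, B, C = (np.asarray(v, dtype=float) for v in (x, delta, A, B, C))
    T, N = B.shape
    h = np.zeros(N)
    ys = np.empty(T)
    hs = np.empty((T, N))
    for t in range(T):
        a_bar = np.exp(delta[t] * A)             # per-step decay of the carried state
        h = a_bar * h + delta[t] * B[t] * x[t]   # fold the new token into the state
        hs[t] = h
        ys[t] = C[t] @ h                         # readout depends only on steps <= t
    return ys, hs

# Two histories that differ only at the first step but end with identical inputs.
T, N = 5, 2
A = np.array([-0.5, -1.0])
delta = np.full(T, 0.1)
B = np.ones((T, N))
C = np.ones((T, N))
y_event, _ = selective_scan([1.0, 0.0, 0.0, 0.0, 0.0], delta, A, B, C)
y_blank, _ = selective_scan(np.zeros(T), delta, A, B, C)
```

In this toy setting the final inputs are identical, yet `y_event[-1]` and `y_blank[-1]` differ because the state still carries the first step. That is the property a memory-dependent task needs and a purely Markovian policy lacks.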
The evaluation is comprehensive, covering 16 simulated tasks and 4 real-world experiments. The results are compelling, particularly on RMBench, where Chronos achieves a 73.6% average success rate, significantly outperforming Markovian baselines like pi0.5 (+62.4 points) and memory-augmented VLAs like Mem-0 (+22.8 points), while using substantially fewer parameters (0.3B vs >10B for Mem-0). On RoboTwin 2.0, it achieves state-of-the-art performance in general manipulation. The ablation studies effectively isolate the contributions of the SSM memory and the second-order bridge, demonstrating that the acceleration-based refinement provides smoother and more precise actions, especially in contact-rich tasks like precision insertion. The real-world results on dual-arm manipulation further validate the transferability of the learned policies.
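The reported RMBench figures can be cross-checked with simple arithmetic. The baseline success rates below are implied by the reported gaps; they are not quoted from the paper.

```latex
\begin{align*}
\text{pi0.5 (implied)} &= 73.6\% - 62.4\% = 11.2\%, &
\frac{73.6}{11.2} &\approx 6.57 \approx 6.6\times, \\
\text{Mem-0 (implied)} &= 73.6\% - 22.8\% = 50.8\%, &
\frac{73.6}{50.8} &\approx 1.45\times.
\end{align*}
```

The stated 6.6x relative gain over pi0.5 is therefore consistent with the +62.4 point gap, while the gain over Mem-0 is much smaller in relative terms.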
The paper provides a project page and code repository link. The methodology is described with sufficient mathematical detail, including the derivation of the acceleration target and the specific noise schedules. The use of standard components (Mamba, PointNet, DINOv2) facilitates implementation. However, the specific hyperparameters for the Schr\"odinger bridge integration steps and the IMLE latent update dynamics are crucial for reproduction but are only partially detailed in the text. The claim of "memory-efficient training" via chunked perception is a practical detail that aids reproducibility.
The paper acknowledges that in fully observable, local-geometry-dominated tasks (e.g., Put Bottles Dustbin), Chronos slightly underperforms strong Markovian diffusion policies like DP3. This suggests that the overhead of full-history modeling may not always be beneficial when the present state is a sufficient statistic. Additionally, the reliance on a single RGB camera in real-world experiments might limit performance in complex lighting or occlusion scenarios compared to multi-view setups. The theoretical derivation, while elegant, is a specific projection of quantum mechanics concepts to control theory, and its generalizability to other domains beyond robotics is unclear.
This work advances the field of robotic manipulation by providing a robust solution to the long-standing problem of memory-dependent control. By demonstrating that full-history modeling can be efficient and effective, it challenges the prevailing trend of scaling VLA models with short-context windows. The physics-informed action generation could inspire more physically grounded generative models in other control domains. The significant performance gap on memory benchmarks highlights the limitations of current foundation models in temporal reasoning, guiding future research towards better temporal architectures.
Large-scale dexterous grasp datasets encode rich priors over hand-object interaction, but their use has largely been confined to grasp generation and pick-and-place manipulation. We study whether such data can instead support functional dexterity in articulated tool use, where a robot must acquire a tool, maintain contact, and operate its functional moving parts. We adapt a hierarchical imitation learning framework that combines high-level hand sub-goal prediction with a low-level goal-conditioned controller. We construct a 355k-trajectory grasp-pretraining dataset from large-scale dexterous grasp annotations and use it to pretrain the low-level controller. The controller is then fine-tuned on downstream task demonstrations. To evaluate this setting, we introduce DexCraft, a simulation benchmark with six articulated tool-use tasks requiring coordinated finger motion. Across simulation and real-world experiments, our approach outperforms end-to-end diffusion policy baselines and hierarchical policies trained from scratch. In the real world, it improves full-task success by 33.3 percentage points over DP3. These results show that grasp datasets can serve not only as resources for grasp synthesis, but also as scalable pretraining data for contact-rich dexterous manipulation. Videos are shown on https://yingyuan0414.github.io/grasp2dexterity/ .
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The paper presents a compelling method for leveraging large-scale grasp datasets to enable dexterous tool use, demonstrating significant performance gains through hierarchical imitation learning and pretraining.
The paper proposes a hierarchical imitation learning framework for dexterous tool use. The core methodological contribution is the adaptation of a low-level goal-conditioned controller (based on Diffusion Policy) pre-trained on a large-scale synthetic grasp dataset (G2D-Pretrain, derived from Dexonomy). The high-level policy predicts 16-DoF hand keypoints as sub-goals, addressing the insufficiency of coarse gripper-centric sub-goals for dexterous hands. The approach effectively bridges the gap between static grasp synthesis data and dynamic, contact-rich manipulation tasks by leveraging the rich kinematic priors in grasp datasets. The hierarchical decomposition (high-level planning, low-level execution) is well-motivated and technically sound, particularly the semantic mapping of joint spaces between the Shadow hand (pretraining) and LEAP hand (fine-tuning).
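The hierarchical decomposition can be illustrated with a toy control loop. This is a sketch under strong assumptions, not the paper's policy. The learned high-level predictor is replaced by a scripted list of sub-goals, the diffusion-based low-level controller is replaced by a proportional controller, and the 16-dimensional vectors are hypothetical stand-ins for hand sub-goals. The names `low_level_controller` and `rollout` are invented for this example.

```python
import numpy as np

def low_level_controller(state, goal, gain=0.5):
    """Hypothetical goal-conditioned controller: step a fraction of the way
    toward the current sub-goal (stands in for a learned low-level policy)."""
    return gain * (goal - state)

def rollout(start, subgoals, gain=0.5, tol=0.05, max_steps=200):
    """Run the two-level loop: the high level selects the active sub-goal,
    the low level produces actions that track it."""
    state = np.asarray(start, dtype=float)
    idx = 0
    for _ in range(max_steps):
        goal = subgoals[idx]                        # high level: active sub-goal
        state = state + low_level_controller(state, goal, gain)
        reached = np.linalg.norm(state - goal) < tol
        if reached and idx < len(subgoals) - 1:
            idx += 1                                # advance to the next sub-goal
    return state, idx

# Three scripted hand sub-goals, e.g. pre-grasp, grasp, actuate (illustrative only).
subgoals = [np.full(16, 0.5), np.full(16, 1.0), np.zeros(16)]
final_state, final_idx = rollout(np.zeros(16), subgoals)
```

Because the low-level controller only ever sees a state and a goal, it can in principle be pretrained on data that contains no task labels. That is the role the review attributes to the grasp-pretraining dataset.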
The evaluation includes a new simulation benchmark, DexCraft, with six articulated tool-use tasks. The paper provides extensive ablation studies comparing end-to-end policies (DP, DP3), hierarchical policies from scratch, and their pre-trained counterparts. The results demonstrate significant improvements, particularly in the real-world setting where the proposed method improves full-task success by 33.3 percentage points over DP3. The sample efficiency analysis further supports the claim that pretraining reduces the need for downstream demonstrations. The inclusion of both simulation and real-world experiments strengthens the validity of the claims, although the real-world evaluation is limited to three tasks and a single robot setup.
The paper provides detailed descriptions of the data augmentation process for G2D-Pretrain, the policy architectures, and the experimental setups. The project website link suggests code or video availability, which aids reproducibility. The use of standard simulators (ManiSkill3) and datasets (Dexonomy) facilitates replication. However, the specific details of the teleoperation setup and the exact implementation of the semantic joint mapping for the LEAP hand might require additional clarification for perfect reproducibility.
The reliance on manually annotated sub-goals for training the high-level policy limits scalability. The simulation benchmark uses single object instances per task, which may not fully capture the generalization capabilities required for diverse object geometries. The real-world evaluation is constrained by the specific hardware setup (Franka + LEAP Hand) and does not explore the impact of tactile feedback or online adaptation, which are critical for robust dexterous manipulation.
This work significantly advances the field of dexterous manipulation by demonstrating that large-scale grasp datasets, previously underutilized for dynamic tasks, can serve as powerful pretraining resources. This could lower the barrier to entry for learning complex manipulation skills by reducing the need for costly real-world demonstrations. The DexCraft benchmark provides a valuable resource for evaluating articulated tool use, encouraging further research in this area.