Last 7 Days (June 29 – July 05, 2026)
Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emph{statistically} wrong -- a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbf{Detection:} on a benchmark of $14$ misspecification types across $10$ model families ($200$ instances), it flags the bug with AUC $0.97$ ($88\%$ at $2\%$ FPR \emph{when handed the correct reference program, an upper bound}) -- and a fully \emph{reference-free} version that uses no correct program reaches $62$--$78\%$ (the upper figure from a small automated model search), versus $0\%$ for a unit-test oracle. \textbf{Repair:} used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback -- which is itself \emph{significantly worse than no feedback at all}, a passing test inducing false confidence that suppresses repair -- and improves over no feedback on strong-but-unsaturated models (GPT-5.1 $33{\to}92\%$, Claude $75{\to}100\%$; paired McNemar, $n{=}228$). \textbf{Reality:} on programs LLMs write from scratch for neutral briefs, $15$--$47\%$ of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
Primary: Columbia University
All Institutions: Columbia University
This paper introduces a paradigm shift for verifying LLM-generated probabilistic programs, demonstrating that Bayesian calibration diagnostics are essential for detecting code-invisible statistical errors, significantly outperforming and even revealing the harm of traditional unit tests in LLM repair loops. The comprehensive technical contribution, rigorous empirical validation on both injected and LLM-generated bugs, and the clear, actionable insights make this a genuinely field-wide significant paper that will reshape how LLM-assisted probabilistic modeling systems are designed and evaluated.
The paper proposes a robust methodology for detecting and repairing statistically misspecified probabilistic programs generated by Language Models (LLMs). The central innovation is the "calibration oracle," which formalizes and automates the Bayesian workflow diagnostics (Posterior Predictive Checks, Simulation-Based Calibration, sampler diagnostics like R-hat and divergences, and held-out predictive density) into a programmatic verifier. This oracle is designed to identify "code-invisible misspecification"—statistical errors that traditional unit tests (compilation, execution, output shape) cannot detect. The repair loop integrates this oracle as feedback for LLMs, guiding them to iteratively refine their programs. The methodology is well-grounded in Bayesian principles, and the formal distinction between code-visible and code-invisible bugs is crucial for understanding the limitations of existing LLM code verification paradigms. The aggregation of diverse diagnostics into a single verdict, with defined thresholds, provides a practical and actionable signal for automated systems.
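To make "code-invisible misspecification" concrete, here is a minimal sketch of one diagnostic the oracle is said to aggregate, a posterior predictive check. It is not the paper's implementation. It assumes a plug-in Gaussian fit instead of full posterior draws, uses the largest standardized deviation as the test statistic, and contaminates a clean sample with three outliers as a stand-in for heavy tails. The function name `ppc_pvalue` and all constants are hypothetical.

```python
import random
import statistics


def ppc_pvalue(data, n_rep=500, seed=0):
    """Predictive p-value for the largest standardized deviation under a Gaussian fit.

    Simplification (assumption): replicates are drawn from a plug-in
    Gaussian (sample mean / stdev), not from posterior draws.
    """
    rng = random.Random(seed)
    mu = statistics.fmean(data)
    sigma = statistics.stdev(data)
    observed = max(abs(y - mu) for y in data) / sigma
    exceed = 0
    for _ in range(n_rep):
        rep = [rng.gauss(mu, sigma) for _ in data]
        rep_mu = statistics.fmean(rep)
        rep_sigma = statistics.stdev(rep)
        stat = max(abs(y - rep_mu) for y in rep) / rep_sigma
        if stat >= observed:
            exceed += 1
    return exceed / n_rep


# Hypothetical data: a Gaussian sample, then the same sample with three outliers.
rng = random.Random(1)
clean = [rng.gauss(0.0, 1.0) for _ in range(200)]
contaminated = clean + [25.0, -30.0, 40.0]

p_clean = ppc_pvalue(clean)
p_contaminated = ppc_pvalue(contaminated)  # Gaussian replicates essentially never reach the outliers
```

Both calls run without error and return a float of the right shape, so a unit test on execution or output shape passes either way. Only the predictive p-value separates the two datasets.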
The experimental evaluation is exceptionally thorough and convincing, covering detection, repair, and real-world applicability. 1. **Detection**: A comprehensive benchmark of 200 instances across 14 misspecification types and 10 model families was created. The calibration oracle achieved an impressive AUC of 0.97 (88% detection at 2% FPR), demonstrating its efficacy. Critically, a *reference-free* version of the oracle (without a ground-truth correct program) still achieved 62-78% detection, highlighting its practical utility. In stark contrast, the unit-test oracle achieved 0% detection for code-invisible bugs. 2. **Repair**: Experiments with 15 diverse LLMs (open and API) in a repair loop showed that calibration feedback consistently outperformed "no feedback" and "unit-test feedback." A surprising and impactful finding was that unit-test feedback was often *worse than no feedback*, as it induced false confidence and suppressed repair. Calibration feedback led to substantial improvements for strong-but-unsaturated models (e.g., GPT-5.1 from 33% to 92%), with statistically significant gains ($p < 4.5 \times 10^{-10}$ for pooled invisible-bug repairs). 3. **Real LLM Programs**: This section provides the strongest evidence. LLMs were tasked to write programs from scratch for neutral briefs. The study found that 15-47% of runnable LLM-generated programs were statistically misspecified (none caught by unit tests). Calibration-guided repair significantly outperformed strong baselines, including LLM-as-judge review, Bayesian-workflow checklists, and data-summary self-debug, achieving an 84% fix rate for misspecified programs. The results are robust, with detailed ablations and sensitivity analyses for oracle thresholds.
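The abstract reports the repair comparison as a paired McNemar test. A minimal sketch of the exact two-sided version follows. The counts in the usage lines are hypothetical and are not taken from the paper, and the review does not say whether the authors used the exact or the chi-square form.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: pairs fixed under feedback A but not under feedback B.
    c: pairs fixed under feedback B but not under feedback A.
    Concordant pairs do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under H0 each discordant pair falls on either side with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts, for illustration only.
p_balanced = mcnemar_exact(5, 5)
p_one_sided = mcnemar_exact(0, 10)
```

The test uses only the instances where the two feedback conditions disagree, which is why a paired design can give very small p-values from a modest number of instances.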
The paper demonstrates a high commitment to reproducibility. It specifies exact snapshots for API models and HuggingFace revisions for open models. Detailed hyperparameters for LLM generation (temperature, max tokens, repair budget) and NUTS inference (chains, draws, x64) are provided. Oracle thresholds are explicitly stated, and their sensitivity is analyzed. Crucially, the authors commit to releasing "The full system prompt, contract, feedback templates, benchmark generators, and analysis scripts with the paper," which is excellent practice and will enable full replication and extension of their work.
The authors are transparent about several limitations: 1. **PPC power**: The effectiveness of Posterior Predictive Checks depends on the chosen test statistics, meaning some misspecifications might remain undetected if not captured by the tracked statistics. 2. **Right fit, wrong structure**: The oracle primarily assesses distributional fit, not causal or generative correctness, making it blind to structural errors that yield similar predictive distributions. 3. **Over-wide predictive**: A model whose predictive distribution is too wide (under-confident) can "cover" the data and pass PPC despite being fundamentally wrong. 4. **Diagnosis without remedy**: Weaker LLMs may receive accurate diagnostic feedback but lack the capability to translate it into a correct structural fix. 5. **SBC expense**: Simulation-Based Calibration is computationally intensive, limiting its practical use in the repair loop for large or costly models. 6. **Inference failure**: Sampler diagnostics can sometimes fire due to poor inference (e.g., NUTS issues) rather than genuine model misspecification, leading to false positives. 7. **Benchmark scope**: The repair experiments use a subset of bugs, and the tasks are classical low-dimensional models, suggesting that extending the approach to high-dimensional or complex structured models (e.g., deep models, spatial-temporal models) is future work.
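The SBC cost noted in item 5 comes from refitting the model once per simulated dataset. The sketch below shows the rank computation on a conjugate Normal-Normal model, where the posterior is available in closed form and no sampler is needed. The model, the `sd_scale` knob that fakes an overconfident posterior, and all function names are assumptions for illustration, not the paper's setup.

```python
import random


def sbc_ranks(n_sims=300, n_obs=10, n_draws=19, sd_scale=1.0, seed=0):
    """Simulation-based calibration ranks for mu ~ N(0, 1), y_i ~ N(mu, 1).

    sd_scale = 1.0 uses the exact posterior; smaller values mimic an
    overconfident (too narrow) posterior.
    """
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sims):
        mu = rng.gauss(0.0, 1.0)                          # draw a parameter from the prior
        ys = [rng.gauss(mu, 1.0) for _ in range(n_obs)]   # simulate data given that parameter
        post_mean = sum(ys) / (n_obs + 1)                 # exact conjugate posterior mean
        post_sd = sd_scale * (1.0 / (n_obs + 1)) ** 0.5   # exact posterior sd, optionally shrunk
        draws = [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]
        ranks.append(sum(d < mu for d in draws))          # rank of the true value among the draws
    return ranks


def extreme_fraction(ranks, n_draws=19):
    """Share of ranks at either end; about 2 / (n_draws + 1) when calibrated."""
    return sum(r in (0, n_draws) for r in ranks) / len(ranks)


calibrated = extreme_fraction(sbc_ranks(sd_scale=1.0))
overconfident = extreme_fraction(sbc_ranks(sd_scale=0.1))
```

With a real probabilistic program, the closed-form posterior line becomes a full MCMC run, and that run repeats once per simulation.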
This paper has profound broader impact for the burgeoning field of LLM-assisted scientific computing and probabilistic modeling. It fundamentally shifts the paradigm for verifying LLM-generated probabilistic code from "compilation" to "calibration," providing a principled and empirically validated framework for ensuring statistical correctness. This work is crucial for building trustworthy and reliable LLM agents that can assist in scientific discovery, data analysis, and model development. The PPL-agnostic nature of the Bayesian diagnostics means the approach is broadly applicable across popular probabilistic programming languages like Stan, Pyro, and PyMC. The surprising finding that unit-test feedback is actively harmful for LLM repair loops is a critical insight for designing future LLM agent architectures. This paper sets a new standard for evaluating and improving LLM-generated scientific code.
Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it. Both axes resist manual optimization: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can hide long before it surfaces, making human review of full trajectories impractical. We introduce AutoMem, a framework that automates both axes. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent's own good memory decisions are identified from many episodes and used as training signal to sharpen the model's memory proficiency directly. Across three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone--without modifying the model's task-action behavior--improved the base agent's performance ~2x-4x, bringing a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Our results show that memory management is an independently learnable skill, and a high-leverage objective yielding large gains on long-horizon tasks.
Primary: Stanford University
All Institutions: Stanford University
The paper's findings have substantial broader implications. By demonstrating that automated memory optimization can significantly enhance LLM agent performance on long-horizon tasks, it offers a practical pathway for open-weight models to achieve capabilities comparable to frontier proprietary systems. This could democratize access to advanced agentic AI, making sophisticated LLM agents more accessible for research and development. The methodology of using meta-LLMs for trajectory-level review and targeted revision is a generalizable workflow that could be applied to optimize other agent capabilities beyond memory, potentially accelerating agent development across various domains. While the current applications are in games, the underlying principles are highly transferable to real-world tasks requiring complex, long-term information management. The authors responsibly note that the released artifacts are not directly applicable to high-stakes deployment without further safety review, acknowledging the ethical considerations. This paper introduces AutoMem, a novel framework that automates the learning of memory as a cognitive skill for LLM agents by iteratively optimizing both the memory's supporting structure (scaffold) and the model's proficiency in using it, yielding significant performance gains on long-horizon tasks and making open-weight models competitive with frontier systems. The work presents a highly innovative approach to a critical challenge in LLM agent development, leveraging meta-LLMs to automate the optimization of memory management in long-horizon tasks where human review is intractable. Its strong empirical results, demonstrating substantial performance improvements solely from memory optimization and bringing a 32B open-weight model to the level of frontier proprietary systems, highlight memory as a high-leverage objective and offer a promising direction for developing more capable, efficient, and accessible AI agents.
The methodology proposed in AutoMem is exceptionally well-conceived and technically sound. The central idea of treating memory management as a "trainable skill" for LLMs, drawing inspiration from cognitive science's metamemory, is a powerful conceptual shift. By promoting file-system operations (read, write, search, append, create) to first-class actions within the LLM's action space, the framework provides a flexible, observable, and controllable interface for external memory. The core technical contribution is the two-loop AutoMem framework. The first loop, scaffold optimization, leverages a powerful meta-LLM (Claude Opus 4.6) to review complete, long-horizon agent trajectories (up to $10^5$ steps) and iteratively revise the agent's code, prompts, and memory file schema. This addresses a critical bottleneck in long-horizon task development, where human review of such extensive traces is impractical. The meta-LLM effectively acts as a "code reviewer," diagnosing memory-related failures and proposing concrete structural improvements (e.g., coordinate-keyed deduplication, auto-synced inventory files, pre-populated strategy guides). The second loop, proficiency training, focuses on enhancing the model's parametric ability to make optimal memory decisions. Here, a meta-LLM (Claude Opus 4.7) acts as a "training engine," curating high-quality supervised training data from the agent's own experience and orchestrating the LoRA finetuning configuration. The architectural separation of a finetuned "memory specialist" model from the frozen "gameplay model" is a clever design choice, ensuring that memory skill acquisition is targeted and does not degrade the base model's existing task competence. This modularity allows for clean, additive gains. The overall framework is coherent, addresses a significant challenge in LLM agent development, and is grounded in a strong theoretical perspective.
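The idea of file operations as first-class memory actions, together with the coordinate-keyed deduplication mentioned above, can be sketched as a small action interface. This is an in-memory stand-in with hypothetical names (`MemoryStore`, `upsert_keyed`, the file names). It is not the AutoMem code, which the review says operates on actual files.

```python
class MemoryStore:
    """In-memory stand-in for an agent's memory files (hypothetical API)."""

    def __init__(self):
        self.files = {}  # file name -> list of text lines

    def create(self, name):
        self.files.setdefault(name, [])

    def append(self, name, line):
        self.files.setdefault(name, []).append(line)

    def read(self, name):
        return list(self.files.get(name, []))

    def search(self, query):
        """Return (file, line) pairs whose line contains the query string."""
        return [(name, line)
                for name, lines in self.files.items()
                for line in lines
                if query in line]

    def upsert_keyed(self, name, key, value):
        """Keep one line per key; a later write replaces the earlier one."""
        lines = self.files.setdefault(name, [])
        prefix = f"{key}: "
        for i, line in enumerate(lines):
            if line.startswith(prefix):
                lines[i] = prefix + value
                return
        lines.append(prefix + value)


store = MemoryStore()
store.create("map.md")
store.upsert_keyed("map.md", "(3,4)", "closed door")
store.upsert_keyed("map.md", "(3,4)", "open door")   # same coordinate: overwrite, no duplicate
store.append("notes.md", "need a pickaxe for iron")
```

Keying map entries by coordinate keeps the file from growing with every revisit, which is one plausible reading of the "redundant memory writes" the scaffold revisions are reported to reduce.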
The experimental evaluation is rigorous and highly convincing. The paper selects three challenging, procedurally generated long-horizon games—Crafter, MiniHack, and NetHack—which are ideal environments for testing sophisticated memory management due to their length, stochasticity, and the inherent need for persistent knowledge (e.g., maps, inventory, strategies). The use of the BALROG harness ensures a standardized and challenging benchmark. The primary metric, game progression rate, is appropriate for these complex tasks. The results are remarkably strong: optimizing memory *alone*, without modifying the base model's task-action weights, yields substantial performance gains of 2x-4x across all environments. This empirically validates the paper's central hypothesis that memory management is an independently learnable and high-leverage skill. Furthermore, the optimized 32B open-weight model achieves performance competitive with frontier proprietary systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking, a highly impactful finding that suggests memory optimization can significantly close the gap between open-source and state-of-the-art proprietary models on these tasks. The paper also provides compelling qualitative evidence, including a significant reduction in unproductive actions, a sharp decrease in redundant memory writes, and the emergence of a "consult-before-write" memory discipline in the trained specialist. The detailed examples of memory schema evolution (e.g., NetHack's coordinate-keyed map deduplication) further illustrate the concrete benefits of the scaffold optimization. The inclusion of strong baselines, including frontier proprietary models and basic context-management strategies, provides a comprehensive comparison.
The paper demonstrates an excellent commitment to reproducibility. A dedicated appendix provides comprehensive implementation details, including specific configurations for all three game environments (Crafter, MiniHack, NetHack), such as world area, agent view, reward settings, maximum episode steps, and evaluation seeds. Crucially, it details the outer-loop processes, specifying the meta-LLMs used (Claude Opus 4.6/4.7), the criteria for accepting revisions, retry mechanisms, training data collection procedures, and the exact LoRA hyperparameters (rank, alpha, dropout, effective batch size, learning rate, number of training epochs, and target modules) for each environment. The explicit mention of releasing the complete prompt templates and code at `https://github.com/autoLearnMem/AutoMem` is a significant strength, enabling researchers to replicate and build upon this work. This level of detail is commendable and sets a high standard for reproducibility in LLM agent research.
The authors thoughtfully acknowledge several limitations. The current memory system is episodic, meaning the file system starts fresh at the beginning of each episode, which prevents knowledge transfer across sessions. Extending this to persistent memory is identified as a natural next step. The experiments are conducted on game environments, which, while well-suited for studying memory, suggest a need to validate the approach on real-world, memory-intensive tasks. Additionally, the current framework optimizes a separate scaffold and memory specialist for each game, raising the question of whether a single, more generalized scaffold or specialist could be developed to operate effectively across diverse environments. An implicit limitation, common to meta-LLM-driven approaches, is the reliance on powerful proprietary models (Claude Opus) as meta-LLMs, which entails cost and potential for brittleness, though the iterative refinement and gating mechanisms help mitigate this.
4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a narrow signal insufficient to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt it via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames--no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io.
Primary: A*STAR (Agency for Science, Technology and Research)
All Institutions: A*STAR, NTU Singapore (Nanyang Technological University), Alibaba Group
The paper presents a significant advancement in 4D hand motion reconstruction by effectively adapting video diffusion models for perception tasks, achieving state-of-the-art performance on challenging benchmarks through a novel hand-overlay rendering adaptation and a geometrically-aware dual-branch decoder.
The paper proposes a novel paradigm for 4D hand motion reconstruction by leveraging the internal representations of a large-scale pretrained Video Diffusion Model (Wan2.1-VACE). Instead of treating the diffusion model as a generative black box or a frozen feature extractor, the authors introduce a "hand-overlay rendering" adaptation stage. This involves finetuning only the VACE branch of the model to regenerate input clips with semi-transparent rendered hand overlays. This clever pretext task specializes the model's world priors (occlusion reasoning, temporal coherence, 3D geometry) for hand-centric tasks without destroying the general visual knowledge. The decoder is a dual-branch architecture: a Hand-Token Branch for holistic articulated pose and a Joint-Heatmap Branch for local 2D localization, coupled by mutual cross-attention and a closed-form geometric solve for camera translation. This design elegantly separates the holistic vs. local inductive biases of the representation. The approach is methodologically sound, theoretically motivated by the capabilities of generative models, and technically sophisticated in its integration of diffusion features with geometric constraints.
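The review does not spell out the closed-form solve for camera translation. One standard way to obtain one, shown here purely as an assumption, is linear least squares under a pinhole camera. Given root-relative 3D joints and their 2D locations, each joint gives two equations that are linear in the translation `t`. The intrinsics, the joint values, and the function names below are hypothetical, and lens distortion (e.g. fisheye) is ignored.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(a, b):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(a)
    if d == 0.0:
        raise ValueError("degenerate joint configuration")
    out = []
    for k in range(3):
        m = [row[:] for row in a]
        for i in range(3):
            m[i][k] = b[i]
        out.append(det3(m) / d)
    return out


def solve_translation(joints_3d, joints_2d, fx, fy, cx, cy):
    """Least-squares translation t such that project(joint + t) matches the 2D joints.

    From u = fx * (X + tx) / (Z + tz) + cx:
        fx * tx - (u - cx) * tz = (u - cx) * Z - fx * X
    and likewise for v, giving 2N linear equations in (tx, ty, tz).
    """
    rows, rhs = [], []
    for (X, Y, Z), (u, v) in zip(joints_3d, joints_2d):
        du, dv = u - cx, v - cy
        rows.append((fx, 0.0, -du))
        rhs.append(du * Z - fx * X)
        rows.append((0.0, fy, -dv))
        rhs.append(dv * Z - fy * Y)
    # Normal equations: (A^T A) t = A^T b
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(ata, atb)


def project(point, t, fx, fy, cx, cy):
    X, Y, Z = (p + o for p, o in zip(point, t))
    return (fx * X / Z + cx, fy * Y / Z + cy)


# Hypothetical root-relative joints (metres) and camera intrinsics (pixels).
joints = [(0.0, 0.0, 0.0), (0.05, 0.02, 0.01), (-0.03, 0.04, -0.02), (0.02, -0.05, 0.03)]
t_true = (0.1, -0.05, 0.5)
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
pixels = [project(j, t_true, fx, fy, cx, cy) for j in joints]
t_est = solve_translation(joints, pixels, fx, fy, cx, cy)
```

In this reading, the heatmap branch would supply the 2D joints and the token branch the root-relative 3D pose. Because the translation is solved rather than regressed, metric depth follows from the two branches agreeing.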
The evaluation is comprehensive and rigorous. The authors test on three challenging egocentric hand benchmarks: ARCTIC (heavy occlusion), HOT3D (fisheye, high dynamic range, motion blur), and HOI4D (cross-dataset generalization). They introduce a "penalty protocol" that folds false negatives into pose metrics, providing a more realistic assessment of detection robustness than standard TP-only metrics. ViDiHand establishes new state-of-the-art results across all metrics, with particularly significant gains in frame accuracy (detection robustness) and temporal jitter (smoothness). The ablation studies are thorough, validating the choice of DiT layer, denoising step, and decoder components. The cross-dataset transfer to HOI4D demonstrates the generalizability of the learned priors. The results are statistically significant and practically meaningful, showing that video diffusion models capture richer spatiotemporal priors than discriminative video models or image-based detectors.
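The effect of folding false negatives into a pose metric can be illustrated with a toy calculation. The penalty value, the per-frame errors, and the function names are hypothetical. The paper's exact protocol is not given in this review.

```python
def tp_only_mean_error(per_frame_errors):
    """Mean error over detected frames only; missed frames (None) are ignored."""
    hits = [e for e in per_frame_errors if e is not None]
    if not hits:
        raise ValueError("no detected frames")
    return sum(hits) / len(hits)


def penalized_mean_error(per_frame_errors, penalty):
    """Mean error over all frames; a missed frame is charged a fixed penalty."""
    if not per_frame_errors:
        raise ValueError("no frames")
    total = sum(penalty if e is None else e for e in per_frame_errors)
    return total / len(per_frame_errors)


# Hypothetical joint errors in millimetres; None marks a frame where the hand was missed.
errors = [10.0, 20.0, None, None]
tp_only = tp_only_mean_error(errors)                   # 15.0
penalized = penalized_mean_error(errors, penalty=100)  # 57.5
```

A method that skips hard, occluded frames looks accurate under the true-positive-only mean but not under the penalized one, which is the gap the protocol is meant to expose.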
The paper provides detailed implementation details, including the specific backbone (Wan2.1-VACE), the two-stage training curriculum (joint overlay then MANO mesh overlay), and the decoder architecture. The supplementary material contains extensive details on the evaluation protocol, metric definitions, and ablation studies. The project page link suggests code/data availability, which is standard for high-impact ML papers. The use of a publicly available backbone (Wan2.1) enhances reproducibility, although the specific finetuning steps and data preprocessing pipelines would need to be carefully followed. The closed-form geometric solve is well-defined.
The primary limitation is computational cost. The method runs at 5.5 fps on 4 A100 GPUs, making it an offline annotation tool rather than a real-time solution. The authors acknowledge this and suggest distillation as a future direction. Additionally, Stage 1b still requires MANO-annotated video, which is a scarce resource, though the authors propose self-supervised pretexts to relax this in the future. The method may also struggle with extreme cases not covered in the training data, although the cross-dataset results suggest good generalization.
This work has significant implications for embodied AI, robotics, and human-computer interaction. By providing a scalable, high-quality method for 4D hand reconstruction from egocentric video, it enables the creation of large-scale datasets for training robot policies and understanding human behavior. The paradigm shift towards leveraging video generative models for perception tasks could influence future research in 3D vision, motion capture, and video understanding. It also highlights the untapped potential of diffusion models for discriminative tasks, potentially inspiring similar approaches in other domains.
Although neural networks are remarkably effective, their underlying optimization principles remain theoretically elusive, often characterized by non-convex landscapes and stochastic heuristics. In this work, we propose a paradigm shift by replacing the discrete training problem of shallow neural networks with a well-posed continuum variational surrogate. We identify a family of $\lambda$-convex functionals over parameter densities in weighted Sobolev spaces and prove that these variational problems are globally well-posed, stable, and exhibit unexpected almost $C^3$ regularity. Unlike existing Wasserstein-based or Mean-Field approaches, which often face limited regularity and discretization challenges, our formulation provides direct access to elliptic regularity and convex analysis. This allows us to prove that the optimal parameter density can be obtained by solving a single linear system, bypassing iterative optimization entirely. We establish explicit generalization error controls at a rate of $1/\alpha$ relative to the regularization parameter, and prove that finite-width networks of size $N$ achieve the continuum optimum at an $O(1/N)$ rate. This perspective bridges the gap between the Neural Tangent Kernel (NTK) and feature-learning regimes, providing a principled framework for understanding over-parameterization through the lens of variational calculus.
Primary: University of Warsaw
All Institutions: University of Warsaw, Brno University of Technology, Université de Toulouse, INSA Toulouse
This paper offers a profound theoretical contribution that could have a significant broader impact on the field of machine learning, particularly in theoretical understanding and the development of new research directions.
* **New Theoretical Paradigm**: It introduces a genuinely novel paradigm for analyzing neural networks, distinct from existing mean-field/Wasserstein and NTK frameworks. This opens up a new avenue for applying advanced mathematical tools (variational calculus, elliptic PDE theory) to ML problems.
* **Understanding Implicit Bias**: The discovery of near-$C^3$ regularity for optimal parameter densities provides a concrete and quantitative form of implicit bias towards smooth, well-generalizing solutions. This offers a deeper understanding of why overparameterized networks avoid overfitting.
* **Bridging Regimes**: By offering a globally convex, nonlinear model that is exactly tied to finite-width networks, it bridges the gap between the lazy-training NTK regime and the feature-learning capabilities of mean-field approaches.
* **Novel Regularization Principles**: The framework could inspire new regularization techniques grounded in variational principles and Sobolev spaces, potentially leading to more robust and interpretable models.
* **Alternative Optimization**: The demonstration that the optimal density can be found by solving a linear system, even if currently limited to shallow networks, is a conceptual breakthrough that might inspire new hybrid optimization strategies or analytical solutions for specific network architectures.
* **Foundational Research**: While not immediately applicable to state-of-the-art deep learning, this work lays a strong theoretical foundation that could influence future research on multi-layer networks, potentially leading to new insights into their complex optimization landscapes and generalization properties.
This paper proposes a paradigm shift by formulating shallow neural network training as a globally well-posed continuum variational problem in weighted Sobolev spaces, yielding a unique, almost $C^3$ regular minimizer obtainable by solving a single linear system. This work provides profound theoretical insights into the implicit bias and generalization of neural networks, bridging existing frameworks with a novel mathematical approach based on convex analysis and elliptic PDE theory, despite its current limitation to shallow architectures.
This paper proposes a highly novel and mathematically rigorous variational formulation for shallow neural networks, departing significantly from existing mean-field/Wasserstein and Neural Tangent Kernel (NTK) approaches. The core methodology involves replacing the discrete training problem with a continuum variational surrogate defined over parameter densities in weighted Sobolev spaces ($W^{1,2}(\Omega) \cap L^2_\omega(\Omega)$). The authors identify a family of $\lambda$-convex functionals, which is a key innovation enabling global well-posedness, stability, and high regularity of the solutions. A major strength of this approach is its direct access to convex analysis and elliptic PDE theory, which allows for several profound theoretical results: 1. **Global Convexity**: The proposed functional is proven to be globally $2(\lambda, \mu)$-convex, ensuring the existence and uniqueness of a minimizer without linearization assumptions. 2. **High Regularity**: The optimal parameter density is shown to possess unexpected almost $C^3$ regularity, a level of smoothness not typically accessible in other infinite-width analyses. This is derived from the Euler-Lagrange equation, which turns out to be a linear elliptic PDE. 3. **Direct Solution**: Crucially, the optimal parameter density (or its projection onto a finite-dimensional basis) can be obtained by solving a single linear system, completely bypassing iterative optimization methods like gradient descent. This is a remarkable theoretical achievement for this specific problem formulation. 4. **Consistency with Discrete Networks**: The paper proves the absence of a Lavrentiev gap, meaning the infimum of the risk is the same whether optimizing over atomic measures (finite-width networks), Sobolev densities, or smooth functions. Furthermore, finite-width networks of size $N$ are shown to achieve the continuum optimum at an $O(1/N)$ rate. 5. **Gradient Flow Analysis**: The associated $L^2_\omega$-gradient flow is shown to converge exponentially fast to the unique minimizer, providing insights into the continuous-time dynamics.
The methodology is deeply rooted in advanced mathematics (functional analysis, PDE theory, calculus of variations) and provides a fresh perspective on understanding the implicit bias and generalization properties of neural networks. The use of Sobolev regularization to promote smoothness is well-justified within this framework.
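For readers who want the mechanism behind the exponential convergence claim, the standard argument for the gradient flow of a strongly convex functional is sketched here. It assumes a differentiable functional $F$ on the Hilbert space $L^2_\omega$ that is $\lambda$-convex with $\lambda > 0$ and has a minimizer $\rho^\ast$. The symbols $F$, $\rho_t$, and $\rho^\ast$ are notation chosen for this sketch, and the paper's precise assumptions and constants may differ.

```latex
\begin{aligned}
\partial_t \rho_t &= -\nabla F(\rho_t), \qquad \nabla F(\rho^\ast) = 0, \\
\frac{d}{dt}\,\tfrac{1}{2}\,\|\rho_t - \rho^\ast\|_{L^2_\omega}^2
  &= -\big\langle \nabla F(\rho_t) - \nabla F(\rho^\ast),\; \rho_t - \rho^\ast \big\rangle_{L^2_\omega}
  \;\le\; -\lambda\,\|\rho_t - \rho^\ast\|_{L^2_\omega}^2, \\
\|\rho_t - \rho^\ast\|_{L^2_\omega} &\le e^{-\lambda t}\,\|\rho_0 - \rho^\ast\|_{L^2_\omega}.
\end{aligned}
```

The middle inequality is the strong monotonicity of $\nabla F$ implied by $\lambda$-convexity, and the last line follows from Grönwall's inequality. The same inequality gives uniqueness of the minimizer.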
The experimental evaluation is primarily illustrative, serving to validate the theoretical claims rather than to achieve state-of-the-art performance on large-scale benchmarks. The authors demonstrate how the theoretical framework translates into a practical computational method: approximating the parameter density with a polynomial ansatz and solving the resulting ridge regression-like linear system. The experiments include: 1. **1D Sine Function**: Shows that the regularized solution accurately and smoothly tracks the target, outperforming an unregularized ansatz (overfits) and a single-layer neural network baseline (noisier). This highlights the non-overfitting and smoothness properties predicted by the theory. 2. **1D Discontinuous Sign Function**: Illustrates the stability of the minimizer to noise and outliers, consistent with the theoretical stability results. 3. **Benchmark Datasets (Diabetes, California Housing)**: These are small-scale regression tasks. The proposed method, using polynomial basis functions, is compared against a single-hidden-layer network with 10,000 ReLU neurons trained by Adam/SGD. The results claim competitive or superior accuracy, demonstrating strong finite-sample performance. While the experiments effectively showcase the properties of the proposed variational formulation (smoothness, stability, non-overfitting, and the ability to find a solution via a linear system), they are limited in scope. The "neural network baseline" is a shallow network, not representative of the deep architectures prevalent in modern ML. The datasets are small, and the focus is on demonstrating the *feasibility* and *characteristics* of the method rather than its competitive performance against complex, deep learning models.
The paper provides a detailed mathematical formulation of the variational problem, the regularization terms, and the derivation of the Euler-Lagrange equation. It also explains how the problem reduces to solving a linear system (ridge regression) when using a finite basis approximation. Specific details regarding the weight function $\omega(\theta)$, basis functions (polynomials, cosine, Legendre), and regularization parameters used in the numerical examples are mentioned. However, the paper does not provide a link to a code repository or supplementary material in the main text. While the theoretical framework is precisely defined, reproducing the exact numerical results would require careful implementation of the basis functions, the construction of the matrices $U, V, W$, and the solution of the linear system, which could be non-trivial without provided code. The mention of "supplementary material" suggests that more details might exist, but they are not readily accessible from the paper itself.
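To make the "single linear system" idea concrete, the following is a minimal sketch of ridge regression on a Legendre basis. It is a generic stand-in, not the paper's construction: the matrices $U, V, W$, the weight $\omega(\theta)$, and the Sobolev penalty are not reproduced, and the function names, degree, and regularization strength are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import legendre


def fit_legendre_ridge(x, y, degree=9, reg=1e-3):
    """Fit y ~ sum_k c_k P_k(x) by solving one regularized linear system."""
    # Design matrix: column k holds the Legendre polynomial P_k evaluated at x.
    phi = legendre.legvander(x, degree)              # shape (n, degree + 1)
    # Ridge normal equations: (Phi^T Phi + reg * I) c = Phi^T y.
    lhs = phi.T @ phi + reg * np.eye(degree + 1)
    rhs = phi.T @ y
    return np.linalg.solve(lhs, rhs)                 # one direct solve, no iterations


def predict(coeffs, x):
    """Evaluate the fitted Legendre series at x."""
    return legendre.legval(x, coeffs)


# Toy data (an assumption of this sketch): a noisy sine on [-1, 1].
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=200)
y_train = np.sin(np.pi * x_train) + 0.1 * rng.standard_normal(200)

coeffs = fit_legendre_ridge(x_train, y_train)
x_test = np.linspace(-1.0, 1.0, 101)
rmse = np.sqrt(np.mean((predict(coeffs, x_test) - np.sin(np.pi * x_test)) ** 2))
print(f"RMSE against the noise-free target: {rmse:.3f}")
```

The only cost that grows with the basis size is the dense solve, which is where the cubic scaling in the number of basis functions comes from.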
The most significant limitation, explicitly acknowledged by the authors, is that the entire formulation and its strong theoretical guarantees are currently restricted to **shallow (one-hidden-layer) neural networks**. Extending this approach to multi-layer architectures is stated to be "analytically infeasible" with the current framework, as the problem becomes strongly nonlinear, the parameter density lives on a higher-dimensional product space, and the Euler-Lagrange system becomes a coupled nonlinear PDE, making existence of smooth minimizers and convergence of gradient flows elusive. This limits the immediate practical applicability to the dominant deep learning paradigm. Other limitations include: * **Indirect modeling of SGD**: The analysis focuses on the $L^2_\omega$-gradient flow, which is a continuous-time analogue of gradient descent, but does not directly model the stochastic nature of SGD, which is crucial for training large neural networks. * **Computational scalability for high-dimensional $\Omega$**: While solving a linear system is efficient, the size of the system ($M \times M$) depends on the number of basis functions $M$. For very high-dimensional parameter spaces $\Omega$ or complex functions requiring a very large $M$, solving the linear system could become computationally intensive ($O(M^3)$).
This paper offers a profound theoretical contribution that could have a significant broader impact on the field of machine learning, particularly in theoretical understanding and the development of new research directions. * **New Theoretical Paradigm**: It introduces a genuinely novel paradigm for analyzing neural networks, distinct from existing mean-field/Wasserstein and NTK frameworks. This opens up a new avenue for applying advanced mathematical tools (variational calculus, elliptic PDE theory) to ML problems. * **Understanding Implicit Bias**: The discovery of near-$C^3$ regularity for optimal parameter densities provides a concrete and quantitative form of implicit bias towards smooth, well-generalizing solutions. This offers a deeper understanding of why overparameterized networks avoid overfitting. * **Bridging Regimes**: By offering a globally convex, nonlinear model that is exactly tied to finite-width networks, it bridges the gap between the lazy-training NTK regime and the feature-learning capabilities of mean-field approaches. * **Novel Regularization Principles**: The framework could inspire new regularization techniques grounded in variational principles and Sobolev spaces, potentially leading to more robust and interpretable models. * **Alternative Optimization**: The demonstration that the optimal density can be found by solving a linear system, even if currently limited to shallow networks, is a conceptual breakthrough that might inspire new hybrid optimization strategies or analytical solutions for specific network architectures. * **Foundational Research**: While not immediately applicable to state-of-the-art deep learning, this work lays a strong theoretical foundation that could influence future research on multi-layer networks, potentially leading to new insights into their complex optimization landscapes and generalization properties. 
Physics-informed neural networks (PINNs) have emerged as a promising route to solve partial differential equations, yet they have struggled to reach the precision of classical solvers. The obstacle is increasingly understood to be one of optimisation, owing to the severely ill-conditioned loss landscape. We present $\textbf{DSGNAR}$: Doubly-Sketched Gauss-Newton with Adaptive Ratio, a scalable second-order optimisation framework that confronts this ill-conditioning and, in doing so, obtains unprecedented accuracy and speed. $\textbf{DSGNAR}$ couples a doubly-sketched Gauss-Newton model with a novel strategy that carefully controls both regularisation and step length. Across a suite of problems spanning nonlinear, chaotic, multi-scale, high-dimensional, and Navier-Stokes, the framework greatly improves on the state of the art: able to attain relative $\ell_2$ errors as low as $3\times10^{-16}$ in double precision, improve contemporary results by five orders of magnitude on the canonical Burgers' equation, and as much as eight orders on a high-dimensional Poisson problem, while remaining markedly faster. We further show that, in single precision, solutions at the limit of round-off error can be obtained very quickly: Burgers' equation to $\ell_2^{\text{rel}} = 4.75 \times 10^{-7}$ in under ten seconds. The framework is also robust to the choice of architecture, arithmetic precision, and initial hyperparameters. The code is available at https://www.github.com/wephy/physics-informed-neural-networks
Primary: University of Oxford
All Institutions: University of Oxford
This paper presents a significant advancement in the optimization of Physics-Informed Neural Networks, enabling unprecedented accuracy and speed through a novel doubly-sketched Gauss-Newton framework, thereby addressing a fundamental limitation in the field and expanding the practical applicability of PINNs to high-precision scientific computing tasks.
The paper addresses the critical bottleneck in Physics-Informed Neural Networks (PINNs): the ill-conditioned loss landscape that prevents convergence to high-precision solutions. The proposed method, DSGNAR (Doubly-Sketched Gauss-Newton with Adaptive Ratio), is a sophisticated optimization framework. It combines second-order optimization (Gauss-Newton) with randomized linear algebra techniques (doubly-sketching) to make the Hessian approximation tractable for large-scale problems. Crucially, it introduces an adaptive ratio strategy to control regularization and step length, which stabilizes the training process. This is a significant methodological contribution to the intersection of numerical linear algebra, optimization, and scientific machine learning. The approach is theoretically grounded and practically scalable.
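As a rough illustration of the ingredients named above, the sketch below runs a damped Gauss-Newton iteration on a randomly row-sketched least-squares model and adjusts the damping from a gain ratio. It is not DSGNAR: it uses a single Gaussian row sketch where the paper uses a doubly-sketched model, the damping rule and its thresholds are assumptions, and the toy curve-fitting problem and all function names are hypothetical.

```python
import numpy as np


def residual(theta, t, y):
    """Residuals of the toy model a * exp(b * t) against data y."""
    a, b = theta
    return a * np.exp(b * t) - y


def jacobian(theta, t):
    """Jacobian of the residuals with respect to (a, b), shape (m, 2)."""
    a, b = theta
    e = np.exp(b * t)
    return np.stack([e, a * t * e], axis=1)


def sketched_gauss_newton(theta, t, y, sketch_rows=64, mu=1.0, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        r, jac = residual(theta, t, y), jacobian(theta, t)
        # Gaussian row sketch: compress m residuals into sketch_rows random combinations.
        s = rng.standard_normal((sketch_rows, r.size)) / np.sqrt(sketch_rows)
        js, rs = s @ jac, s @ r
        # Damped Gauss-Newton step on the sketched model.
        delta = np.linalg.solve(js.T @ js + mu * np.eye(theta.size), -js.T @ rs)
        # Gain ratio: actual decrease of the full loss over the decrease
        # predicted by the sketched linearized model.
        predicted = 0.5 * (rs @ rs - np.sum((rs + js @ delta) ** 2))
        actual = 0.5 * (r @ r - np.sum(residual(theta + delta, t, y) ** 2))
        rho = actual / predicted if predicted > 0 else -1.0
        if rho > 0:
            theta = theta + delta          # accept the step
            if rho > 0.75:
                mu /= 3.0                  # model is trustworthy: damp less
        else:
            mu *= 4.0                      # reject the step: damp more
    return theta


# Noise-free toy problem (an assumption of this sketch) with true parameters (2, -1.5).
t = np.linspace(0.0, 1.0, 500)
y = 2.0 * np.exp(-1.5 * t)
theta_hat = sketched_gauss_newton(np.array([1.0, 0.0]), t, y)
print(theta_hat)
```

Because steps are only accepted when the full loss decreases, the sketch can make an individual step worse but cannot make the iterate diverge, which is the role the regularisation and step-length control play in the description above.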
The experimental evaluation is extensive and compelling. The authors test DSGNAR across a diverse suite of problems including nonlinear PDEs, chaotic systems, multi-scale problems, high-dimensional Poisson equations, and the Navier-Stokes equations. The results are striking: relative $\ell_2$ errors as low as $3 \times 10^{-16}$ in double precision (near machine epsilon) and significant improvements (5-8 orders of magnitude) over state-of-the-art PINN methods on canonical benchmarks like Burgers' equation. The claim of solving high-dimensional problems that were previously intractable for PINNs is a major empirical achievement. The inclusion of single-precision results further demonstrates the robustness and speed of the framework.
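To put the reported errors in context, the snippet below defines the relative $\ell_2$ error in its usual form and expresses the two headline numbers in units of machine epsilon. The helper name is an assumption, and the paper's exact evaluation grid is not reproduced.

```python
import numpy as np


def relative_l2_error(u_pred, u_ref):
    """||u_pred - u_ref||_2 / ||u_ref||_2 over all evaluation points."""
    u_pred = np.asarray(u_pred, dtype=float)
    u_ref = np.asarray(u_ref, dtype=float)
    return np.linalg.norm(u_pred - u_ref) / np.linalg.norm(u_ref)


eps64 = np.finfo(np.float64).eps   # about 2.2e-16
eps32 = np.finfo(np.float32).eps   # about 1.2e-7

# Reported errors from the abstract, measured in multiples of machine epsilon.
double_ratio = 3e-16 / eps64       # about 1.35
single_ratio = 4.75e-7 / eps32     # about 3.98
print(double_ratio, single_ratio)
```

Both figures sit within a small multiple of the unit round-off for their precision, which is what "at the limit of round-off error" means in practice.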
The paper provides a GitHub repository link for the code. The authors are from a reputable institution (Oxford) with strong ties to numerical analysis, suggesting rigorous implementation. The detailed description of the doubly-sketching technique and the adaptive ratio strategy provides sufficient detail for replication, assuming access to the code. The use of standard benchmarks (Burgers', Poisson, Navier-Stokes) facilitates direct comparison with existing literature.
The primary limitation is the computational overhead of second-order methods, even with sketching. While the paper claims speed improvements, the constant factors associated with Gauss-Newton iterations and sketching operations may still be higher than first-order methods (like Adam) for very simple problems or small networks. The scalability to extremely large-scale neural networks (e.g., those used in modern foundation models) is not explicitly tested, as the focus is on PDE solutions where the network size is moderate but the domain is complex. Additionally, the method requires careful tuning of the sketching parameters, although the paper claims robustness.
This work has the potential to transform the application of PINNs in scientific computing. By enabling high-precision solutions, PINNs can become viable alternatives to classical solvers for complex, high-dimensional, or irregularly shaped domains where traditional methods struggle. This could accelerate research in fluid dynamics, quantum mechanics, and other fields relying on PDEs. The method also contributes to the broader field of optimization by demonstrating the efficacy of second-order methods with randomized linear algebra for ill-conditioned loss landscapes.
Memory expertise is a learned skill: knowing what to encode, when to retrieve, and how to organize knowledge--a capacity known in cognitive science as metamemory. We bring this perspective to LLMs by treating memory management as a trainable skill. We promote file-system operations to first-class memory actions alongside task actions, letting the model itself decide how to manage its memory. This memory skill improves along two axes: the structure that supports it (prompts, file schemas, action vocabulary), and the proficiency of the model exercising it. Both axes resist manual optimization: episodes in long-horizon tasks run for thousands of steps, and a single memory mistake can hide long before it surfaces, making human review of full trajectories impractical. We introduce AutoMem, a framework that automates both axes. In the first loop, a strong LLM reviews complete agent trajectories and iteratively revises the memory structure that shapes how the agent interacts with its memory files. In the second loop, the agent's own good memory decisions are identified from many episodes and used as training signal to sharpen the model's memory proficiency directly. Across three procedurally generated long-horizon games (Crafter, MiniHack, and NetHack), optimizing memory alone--without modifying the model's task-action behavior--improved the base agent's performance ~2x-4x, bringing a 32B open-weight model competitive with frontier systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking. Our results show that memory management is an independently learnable skill, and a high-leverage objective yielding large gains on long-horizon tasks.
Primary: Stanford University
All Institutions: Stanford University
This paper introduces AutoMem, a novel framework that automates the learning of memory as a cognitive skill for LLM agents by iteratively optimizing both the memory's supporting structure (scaffold) and the model's proficiency in using it, yielding significant performance gains on long-horizon tasks and making open-weight models competitive with frontier systems. The work presents a highly innovative approach to a critical challenge in LLM agent development, leveraging meta-LLMs to automate the optimization of memory management in long-horizon tasks where human review is intractable. Its strong empirical results, demonstrating substantial performance improvements solely from memory optimization and bringing a 32B open-weight model to the level of frontier proprietary systems, highlight memory as a high-leverage objective and offer a promising direction for developing more capable, efficient, and accessible AI agents.
The methodology proposed in AutoMem is exceptionally well-conceived and technically sound. The central idea of treating memory management as a "trainable skill" for LLMs, drawing inspiration from cognitive science's metamemory, is a powerful conceptual shift. By promoting file-system operations (read, write, search, append, create) to first-class actions within the LLM's action space, the framework provides a flexible, observable, and controllable interface for external memory. The core technical contribution is the two-loop AutoMem framework. The first loop, scaffold optimization, leverages a powerful meta-LLM (Claude Opus 4.6) to review complete, long-horizon agent trajectories (up to $10^5$ steps) and iteratively revise the agent's code, prompts, and memory file schema. This addresses a critical bottleneck in long-horizon task development, where human review of such extensive traces is impractical. The meta-LLM effectively acts as a "code reviewer," diagnosing memory-related failures and proposing concrete structural improvements (e.g., coordinate-keyed deduplication, auto-synced inventory files, pre-populated strategy guides). The second loop, proficiency training, focuses on enhancing the model's parametric ability to make optimal memory decisions. Here, a meta-LLM (Claude Opus 4.7) acts as a "training engine," curating high-quality supervised training data from the agent's own experience and orchestrating the LoRA finetuning configuration. The architectural separation of a finetuned "memory specialist" model from the frozen "gameplay model" is a clever design choice, ensuring that memory skill acquisition is targeted and does not degrade the base model's existing task competence. This modularity allows for clean, additive gains. The overall framework is coherent, addresses a significant challenge in LLM agent development, and is grounded in a strong theoretical perspective.
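The idea of file-system operations as first-class actions can be pictured with a small dispatcher that routes memory actions to a file store and passes everything else to the environment. This is a hypothetical sketch: the class `FileMemory`, the `step` function, the action dictionary format, and the file names are all assumptions, and AutoMem's actual interface may look quite different.

```python
import tempfile
from pathlib import Path

MEMORY_ACTIONS = {"create", "write", "append", "read", "search"}


class FileMemory:
    """Hypothetical file-backed memory exposing the five actions named above."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def create(self, name):
        (self.root / name).touch()
        return f"created {name}"

    def write(self, name, text):
        (self.root / name).write_text(text)       # overwrite the whole file
        return f"wrote {name}"

    def append(self, name, text):
        with (self.root / name).open("a") as handle:
            handle.write(text)
        return f"appended to {name}"

    def read(self, name):
        return (self.root / name).read_text()

    def search(self, query):
        """Return 'file:line: text' for every line containing the query."""
        hits = []
        for path in sorted(self.root.iterdir()):
            for number, line in enumerate(path.read_text().splitlines(), start=1):
                if query in line:
                    hits.append(f"{path.name}:{number}: {line}")
        return hits


def step(memory, action):
    """Route one model-emitted action to memory or to the task environment."""
    if action["type"] in MEMORY_ACTIONS:
        return getattr(memory, action["type"])(*action["args"])
    return f"env: {action['type']}"               # placeholder for a real environment


with tempfile.TemporaryDirectory() as tmp:
    memory = FileMemory(tmp)
    step(memory, {"type": "create", "args": ["map.md"]})
    step(memory, {"type": "append", "args": ["map.md", "(3, 7): crafting table\n"]})
    hits = step(memory, {"type": "search", "args": ["crafting"]})
    moved = step(memory, {"type": "move_north", "args": []})
print(hits, moved)
```

Keeping memory behind plain files is what makes the agent's memory decisions observable in a trajectory, since every read and write shows up as an ordinary action.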
The experimental evaluation is rigorous and highly convincing. The paper selects three challenging, procedurally generated long-horizon games—Crafter, MiniHack, and NetHack—which are ideal environments for testing sophisticated memory management due to their length, stochasticity, and the inherent need for persistent knowledge (e.g., maps, inventory, strategies). The use of the BALROG harness ensures a standardized and challenging benchmark. The primary metric, game progression rate, is appropriate for these complex tasks. The results are remarkably strong: optimizing memory *alone*, without modifying the base model's task-action weights, yields substantial performance gains of 2x-4x across all environments. This empirically validates the paper's central hypothesis that memory management is an independently learnable and high-leverage skill. Furthermore, the optimized 32B open-weight model achieves performance competitive with frontier proprietary systems such as Claude Opus 4.5 and Gemini 3.1 Pro Thinking, a highly impactful finding that suggests memory optimization can significantly close the gap between open-source and state-of-the-art proprietary models on these tasks. The paper also provides compelling qualitative evidence, including a significant reduction in unproductive actions, a sharp decrease in redundant memory writes, and the emergence of a "consult-before-write" memory discipline in the trained specialist. The detailed examples of memory schema evolution (e.g., NetHack's coordinate-keyed map deduplication) further illustrate the concrete benefits of the scaffold optimization. The inclusion of strong baselines, including frontier proprietary models and basic context-management strategies, provides a comprehensive comparison.
The paper demonstrates an excellent commitment to reproducibility. A dedicated appendix provides comprehensive implementation details, including specific configurations for all three game environments (Crafter, MiniHack, NetHack), such as world area, agent view, reward settings, maximum episode steps, and evaluation seeds. Crucially, it details the outer-loop processes, specifying the meta-LLMs used (Claude Opus 4.6/4.7), the criteria for accepting revisions, retry mechanisms, training data collection procedures, and the exact LoRA hyperparameters (rank, alpha, dropout, effective batch size, learning rate, number of training epochs, and target modules) for each environment. The explicit mention of releasing the complete prompt templates and code at `https://github.com/autoLearnMem/AutoMem` is a significant strength, enabling researchers to replicate and build upon this work. This level of detail is commendable and sets a high standard for reproducibility in LLM agent research.
The authors thoughtfully acknowledge several limitations. The current memory system is episodic, meaning the file system starts fresh at the beginning of each episode, which prevents knowledge transfer across sessions. Extending this to persistent memory is identified as a natural next step. The experiments are conducted on game environments, which, while well-suited for studying memory, suggest a need to validate the approach on real-world, memory-intensive tasks. Additionally, the current framework optimizes a separate scaffold and memory specialist for each game, raising the question of whether a single, more generalized scaffold or specialist could be developed to operate effectively across diverse environments. An implicit limitation, common to meta-LLM-driven approaches, is the reliance on powerful proprietary models (Claude Opus) as meta-LLMs, which entails cost and potential for brittleness, though the iterative refinement and gating mechanisms help mitigate this.
The paper's findings have substantial broader implications. By demonstrating that automated memory optimization can significantly enhance LLM agent performance on long-horizon tasks, it offers a practical pathway for open-weight models to achieve capabilities comparable to frontier proprietary systems. This could democratize access to advanced agentic AI, making sophisticated LLM agents more accessible for research and development. The methodology of using meta-LLMs for trajectory-level review and targeted revision is a generalizable workflow that could be applied to optimize other agent capabilities beyond memory, potentially accelerating agent development across various domains. While the current applications are in games, the underlying principles are highly transferable to real-world tasks requiring complex, long-term information management. The authors responsibly note that the released artifacts are not directly applicable to high-stakes deployment without further safety review, acknowledging the ethical considerations.
In long-context use, large language models frequently synthesize answers from the meaning of a relevant context span rather than literally copy-pasting them. Identifying which attention heads perform this synthesis matters for interpreting long-context model behavior. Yet existing detectors miss these heads by construction: they reward heads whose attended token matches the generated token, a literal-copy criterion that captures where a head reads but not what it writes through its output-value (OV) circuit, the very mechanism that carries non-literal retrieval. We introduce Logit-Contribution Scoring (LOCOS), a write-aware detector that scores each head by the projection of its OV-circuit output onto the answer-token unembedding direction, contrasting needle and off-needle source positions in a single forward pass. Across three model families (Qwen3, Gemma-3, OLMo-3.1), mean-ablating the top LOCOS heads on the NoLiMa non-literal retrieval benchmark collapses ROUGE-L at lower head counts than prior attention-based detections; on Qwen3-8B, ablating 50 heads drives ROUGE-L from 0.401 to 0.000 while the strongest baseline still retains 0.292. The selected heads are retrieval-specific: parametric recall and arithmetic reasoning stay at baseline under the same ablation. On Qwen3-8B, the same ablation also drops MuSiQue from 0.55 to 0.08 and BABI-Long from 0.62 to 0.20, while a random-heads control stays within 0.05 of baseline.
Primary: University of Edinburgh
All Institutions: University of Edinburgh
Logit-Contribution Scoring (LOCOS) is a novel, write-aware method that effectively identifies non-literal retrieval heads in large language models by measuring their direct contribution to the answer logit. The paper presents a robust methodology and compelling experimental evidence across multiple LLM families and tasks, demonstrating that LOCOS consistently and causally identifies heads critical for synthesizing answers from context, thereby significantly advancing the mechanistic understanding of long-context LLM behavior.
The paper introduces Logit-Contribution Scoring (LOCOS), a novel, "write-aware" detector for identifying non-literal retrieval heads in large language models. The core insight is that existing methods, which rely on attention patterns (where a head reads), fail to capture non-literal retrieval where the output-value (OV) circuit transforms attended content into a synthesized answer (what a head writes). LOCOS addresses this by scoring each head based on the scalar projection of its OV-circuit output onto the correct answer-token's unembedding vector. This directly measures the head's contribution to the answer logit. A key methodological strength is the use of spatial contrast, comparing logit contributions from needle positions against length-normalized off-needle contributions within a single decoding step. This allows for efficient scoring (single forward pass per probing trial) and effectively isolates needle-specific contributions, cancelling out uniform contributors. The aggregation method pools scores over all answer steps across passing trials. The method is well-defined, mathematically grounded, and directly tackles a critical limitation of prior work in mechanistic interpretability.
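One plausible reading of the scoring rule for a single head at a single decoding step is sketched below. The exact normalization in the paper is not given in this review, so the length scaling used here is an assumption, as are the function name and tensor layout. The sketch also ignores the final layer norm that sits before the unembedding in a real model.

```python
import numpy as np


def locos_style_score(attn, values, w_o, u_answer, needle_mask):
    """
    attn:        (S,)     attention weights of one head at one decoding step
    values:      (S, d_h) the head's value vectors at each source position
    w_o:         (d_h, d) the head's slice of the output projection
    u_answer:    (d,)     unembedding vector of the correct answer token
    needle_mask: (S,)     boolean, True on needle positions
    """
    # Per-position contribution of the head to the answer logit via the OV path.
    contrib = attn * (values @ w_o @ u_answer)            # shape (S,)
    needle = contrib[needle_mask].sum()
    off_needle = contrib[~needle_mask].sum()
    # Assumed length normalization: rescale the off-needle total to the needle length.
    scale = needle_mask.sum() / max((~needle_mask).sum(), 1)
    return needle - scale * off_needle


needle_mask = np.array([False, True, False, False])
w_o = np.array([[1.0]])
u_answer = np.array([1.0])

# A head that reads the needle and writes strongly toward the answer token.
focused = locos_style_score(
    np.array([0.1, 0.7, 0.1, 0.1]),
    np.array([[1.0], [2.0], [1.0], [1.0]]),
    w_o, u_answer, needle_mask,
)
# A head that contributes the same amount from every position.
uniform = locos_style_score(np.full(4, 0.25), np.ones((4, 1)), w_o, u_answer, needle_mask)
print(focused, uniform)
```

The second call shows the cancellation described above: a head whose contribution is the same at every position scores zero, however large that contribution is.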
The experimental evaluation is exceptionally thorough and convincing. The authors test LOCOS across six configurations spanning three modern LLM families (Qwen3, Gemma-3, OLMo-3.1) on the NoLiMa non-literal retrieval benchmark. Causal validation is performed via mean-ablation of top-ranked heads, a robust technique for assessing causal importance. 1. **Ablation Comparison**: LOCOS consistently produces significantly steeper ROUGE-L degradation curves than all attention-based baselines (Wu/NIAH-scored, Wu/NoLiMa-scored, and a random control). On Qwen3-8B, ablating just 50 LOCOS heads collapses ROUGE-L from 0.401 to 0.000, while the strongest baseline retains 0.292. This is a striking and highly convincing result. 2. **OV Contribution Isolation**: A control experiment comparing LOCOS to an attention-only spatial-contrast score (matching LOCOS's aggregation but removing the OV projection) demonstrates that the OV projection is crucial for consistent reliability and severe performance collapse across models. 3. **Bottom-k Control**: Ablating heads with negative spatial contrast scores (contributing from off-needle positions) shows no degradation, effectively ruling out the objection that LOCOS merely ablates any answer-aligned signal. This confirms that LOCOS identifies *needle-specific* retrieval heads. 4. **Retrieval Specificity**: LOCOS heads are shown to be retrieval-specific, with parametric recall and arithmetic reasoning tasks remaining largely unaffected by the same ablation. LOCOS achieves the highest dissociation score across all models, indicating minimal damage to non-retrieval capabilities. 5. **Literal vs. Non-Literal Specificity**: Ablating LOCOS heads degrades both non-literal (NoLiMa) and literal (NIAH) retrieval, but with a steeper drop on NoLiMa, confirming its ability to identify the non-literal subset missed by prior methods. 6. **Downstream Evaluation**: Ablating LOCOS heads significantly degrades performance on complex downstream long-context benchmarks like MuSiQue and BABILong, particularly for the Qwen3 family, demonstrating transferability and real-world impact.
The experiments are comprehensive, include strong baselines and controls, and provide compelling evidence for the efficacy and specificity of LOCOS.
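The ablation protocol behind these comparisons can be sketched in a few lines: rank heads by score, then replace each selected head's output with its average over a reference set. This is a generic mean-ablation sketch on plain arrays. The function names and tensor layout are assumptions, and the paper's query-vector calibration is not reproduced.

```python
import numpy as np


def top_k_heads(scores, k):
    """Indices of the k highest-scoring heads, best first."""
    return np.argsort(scores)[::-1][:k]


def mean_ablate(head_outputs, heads_to_ablate, reference_mean):
    """
    head_outputs:   (T, H, d) per-token, per-head outputs before they are summed
    reference_mean: (H, d)    each head's average output over reference prompts
    """
    ablated = head_outputs.copy()
    for head in heads_to_ablate:
        # Overwrite the head's output at every token with its reference mean.
        ablated[:, head, :] = reference_mean[head]
    return ablated


rng = np.random.default_rng(0)
outputs = rng.standard_normal((5, 4, 8))      # 5 tokens, 4 heads, width 8
reference_mean = rng.standard_normal((4, 8))
scores = np.array([0.1, 2.0, -0.3, 1.5])      # hypothetical per-head scores

selected = top_k_heads(scores, 2)             # heads 1 and 3
ablated = mean_ablate(outputs, selected, reference_mean)
```

Mean-ablation removes the input-dependent part of a head's output while keeping its average contribution, so a performance drop reflects the loss of what the head computes on this input, not a shift in activation scale.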
The paper provides a clear methodological description, including equations for per-position logit contribution, spatial contrast, and aggregation. Key experimental details such as model families, benchmarks (NoLiMa, NIAH, parametric tasks), ablation method (mean-ablation with query vector calibration), and evaluation metrics (ROUGE-L, accuracy) are well-documented. The authors provide GitHub and HuggingFace dataset links, which significantly enhance reproducibility. The level of detail provided is sufficient for researchers to replicate the core findings.
The authors acknowledge two main limitations: 1. **Off-needle baseline**: If the context contains distractor information semantically related to the answer, the off-needle contribution might rise, potentially causing LOCOS to under-score heads performing broad semantic matching rather than targeted needle retrieval. While desirable for span-specific retrieval, this might miss heads involved in more diffuse contextual integration. 2. **Architecture coverage**: The evaluation focuses on specific decoder-only transformer families. The authors caution that the observed causal head-ablation magnitudes and late-layer concentration should not be assumed to transfer without verification to other architectures like Mixture-of-Experts, encoder-decoder stacks, or state-space models.
LOCOS makes a significant contribution to the field of LLM interpretability and mechanistic interpretability. By providing a "write-aware" detector, it enables researchers to identify and understand the attention heads responsible for synthesizing non-literal answers from long contexts, a crucial aspect of advanced LLM behavior. This can lead to: * **Improved Model Understanding**: Deeper insights into how LLMs process and synthesize information, moving beyond simple token copying. * **Enhanced Debugging and Safety**: Pinpointing specific circuits responsible for non-literal retrieval could help diagnose issues like hallucination or incorrect synthesis in RAG systems. * **Targeted Model Optimization**: Identifying these critical heads could inform more efficient model architectures or targeted fine-tuning strategies for long-context tasks. * **New Research Directions**: The method opens avenues for further investigation into the interplay between QK and OV circuits in various LLM capabilities. The provided code and datasets will facilitate further research in this area.
Autonomous scientific discovery systems offer the potential to accelerate research by automating the process of hypothesis generation and validation. However, current systems operate within constrained search spaces or require predefined research questions, limiting their capacity for true open-ended inquiry. Furthermore, while they generate hypotheses iteratively, they largely lack the ability to explicitly synthesize their own accumulated findings to uncover complex, interconnected phenomena. We introduce DiscoPER, an autonomous large language model-powered framework that conducts open-ended research by dynamically generating and executing code to explore datasets without pre-specified research objectives. To ensure rigorous scientific validity, every proposed discovery must pass statistical testing. To overcome the limitations of isolated search, our framework introduces a second-order reasoning mechanism that periodically analyzes its own accumulated discoveries. By treating prior discoveries as empirical data, DiscoPER identifies structural patterns, confounds, and epistemic gaps, actively redirecting hypothesis exploration toward uncharted regions of the search space. The search space is further expanded by incorporating tool use, enabling the system to explore hypotheses beyond structured metadata by seamlessly processing and extracting useful information from multimodal sources like images. Evaluated on iNatDisco, a new multimodal ecological knowledge benchmark with pattern-level ground truth obtained from peer-reviewed literature, DiscoPER recovers 8 of 9 known patterns with a 72.7% hypothesis support rate, outperforming both classical causal discovery and LLM-guided baselines. Ablations show that DiscoPER scales with more data and confirm the benefits of second-order meta-reflection.
Primary: University of Edinburgh
All Institutions: University of Edinburgh, Massachusetts Institute of Technology
DiscoPER introduces a novel autonomous scientific discovery framework that combines LLM-driven hypothesis generation, code-based statistical validation, and second-order meta-reflection to enable open-ended, data-driven scientific inquiry. The paper presents a significant advancement in agentic ML for scientific discovery by addressing the critical limitation of isolated hypothesis generation in existing systems. By introducing a structured "Propose-Evaluate-Reflect" loop, DiscoPER enables the system to synthesize accumulated findings, identify gaps, and redirect its search strategy dynamically. The rigorous validation mechanism, which requires hypotheses to pass statistical tests on held-out data, ensures scientific validity and mitigates LLM hallucination. The creation of the iNatDisco benchmark provides a much-needed evaluation standard for open-ended discovery, moving beyond task-specific QA. The empirical results demonstrate that this approach significantly outperforms both classical causal discovery methods and guided LLM baselines, particularly in recovering complex, multi-variable patterns. This work establishes a new paradigm for autonomous scientific agents that are not only capable of generating ideas but also of critically evaluating and building upon their own discoveries.
The paper proposes DiscoPER, an autonomous scientific discovery framework that integrates Large Language Models (LLMs) with executable code and statistical testing. The core methodological innovation is the "Propose-Evaluate-Reflect" loop. Unlike previous systems that either require predefined research questions (guided) or lack iterative synthesis (unstructured), DiscoPER operates in an open-ended manner ($P=$ none). It generates hypotheses as Python code, validates them on held-out data to prevent p-hacking, and employs a second-order "Reflect" module. This Reflect module analyzes the accumulated claim store to identify epistemic gaps, confounds, and compound hypotheses, thereby steering the search space in subsequent iterations. The approach effectively bridges the gap between classical causal discovery (restricted edge spaces) and LLM-based reasoning (prone to hallucination) by grounding all claims in statistical significance while allowing the LLM to explore a Turing-complete hypothesis space. The inclusion of multimodal capabilities via tool use (VLMs) further expands the scope of discoverable patterns beyond tabular metadata.
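The held-out validation step can be illustrated with a minimal sketch. It assumes a two-group hypothesis ("group A has a larger mean than group B"), Cohen's d as the effect-size measure, and a permutation test for the p-value; the review only states the thresholds (effect size > 0.2, p < 0.05), so the choice of statistic and test, and every function name below, is hypothetical rather than taken from the paper.

```python
import random
from statistics import fmean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled_var == 0:
        return 0.0
    return (fmean(a) - fmean(b)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        hits += diff >= observed
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


def validate_on_held_out(group_a, group_b, min_effect=0.2, alpha=0.05):
    """Accept a claim only if it clears both thresholds on held-out data."""
    d = cohens_d(group_a, group_b)
    p = permutation_p_value(group_a, group_b)
    return {"effect_size": d, "p_value": p,
            "supported": abs(d) > min_effect and p < alpha}


# Synthetic held-out split (illustrative data, not from iNatDisco).
rng = random.Random(42)
held_out_a = [rng.gauss(1.0, 1.0) for _ in range(80)]
held_out_b = [rng.gauss(0.0, 1.0) for _ in range(80)]
result = validate_on_held_out(held_out_a, held_out_b)
```

Requiring both thresholds on data the proposing step never saw is what guards against p-hacking: a hypothesis tuned to the exploration split has no second chance on the validation split.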
The evaluation is rigorous and addresses the specific challenges of open-ended discovery. The authors introduce iNatDisco, a new benchmark derived from iNaturalist data, which includes ground-truth patterns from peer-reviewed literature. This is a significant contribution, as existing benchmarks are largely task-oriented. DiscoPER achieves 8/9 pattern recovery on iNatDisco-800 and 8/12 on iNatDisco-50K, outperforming classical causal discovery methods (which fail to capture complex interactions) and guided LLM baselines. The ablation studies clearly demonstrate the value of the Reflect module, showing improvements in both recall and hypothesis support rate. The counterfactual evaluation is particularly strong, indicating that the system relies on data-driven evidence rather than memorized LLM priors. The scaling analysis provides insight into the system's behavior with respect to data size and iteration count.
The paper provides detailed implementation specifications, including model versions (Claude Sonnet 4.6, etc.), statistical thresholds (effect size > 0.2, p < 0.05), and the structure of the hypothesis code. The use of executable code for hypotheses enhances reproducibility, as the validation steps are deterministic given the data and code. The description of the iNatDisco dataset construction is sufficient for replication. However, the reliance on proprietary LLMs (Claude, GPT) means that the reported numbers may not replicate exactly after model updates, though the methodology itself is open.
The system is computationally expensive due to the iterative nature of code generation, execution, and reflection. The performance is bounded by the quality and bias of the underlying LLMs and the available data. The "Reflect" module, while effective, introduces latency and potential for compounding errors if the initial claims are flawed. Additionally, the benchmark, while novel, is specific to ecology; generalization to other scientific domains requires further validation. The system's ability to discover truly novel, non-intuitive patterns beyond those present in the training data of the LLM remains an open question, although the counterfactual tests mitigate some of this concern.
This work has significant implications for accelerating scientific discovery across disciplines. By automating the iterative process of hypothesis generation and validation, it can help researchers identify patterns that might be overlooked due to human cognitive biases or limitations. The open-ended nature of the system encourages exploration of uncharted regions of the search space, potentially leading to new scientific insights. However, the reliance on AI for scientific discovery raises ethical considerations regarding the verification of findings and the potential for automated bias reinforcement. The framework provides a robust template for building autonomous scientific agents that prioritize empirical validity.
Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emph{statistically} wrong -- a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ($\hat R$, divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbf{Detection:} on a benchmark of $14$ misspecification types across $10$ model families ($200$ instances), it flags the bug with AUC $0.97$ ($88\%$ at $2\%$ FPR \emph{when handed the correct reference program, an upper bound}) -- and a fully \emph{reference-free} version that uses no correct program reaches $62$--$78\%$ (the upper figure from a small automated model search), versus $0\%$ for a unit-test oracle. \textbf{Repair:} used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback -- which is itself \emph{significantly worse than no feedback at all}, a passing test inducing false confidence that suppresses repair -- and improves over no feedback on strong-but-unsaturated models (GPT-5.1 $33{\to}92\%$, Claude $75{\to}100\%$; paired McNemar, $n{=}228$). \textbf{Reality:} on programs LLMs write from scratch for neutral briefs, $15$--$47\%$ of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
Primary: Columbia University
All Institutions: Columbia University
This paper introduces a paradigm shift for verifying LLM-generated probabilistic programs, demonstrating that Bayesian calibration diagnostics are essential for detecting code-invisible statistical errors, significantly outperforming and even revealing the harm of traditional unit tests in LLM repair loops. The comprehensive technical contribution, rigorous empirical validation on both injected and LLM-generated bugs, and the clear, actionable insights make this a genuinely field-wide significant paper that will reshape how LLM-assisted probabilistic modeling systems are designed and evaluated.
The paper proposes a robust methodology for detecting and repairing statistically misspecified probabilistic programs generated by Language Models (LLMs). The central innovation is the "calibration oracle," which formalizes and automates the Bayesian workflow diagnostics (Posterior Predictive Checks, Simulation-Based Calibration, sampler diagnostics like R-hat and divergences, and held-out predictive density) into a programmatic verifier. This oracle is designed to identify "code-invisible misspecification"—statistical errors that traditional unit tests (compilation, execution, output shape) cannot detect. The repair loop integrates this oracle as feedback for LLMs, guiding them to iteratively refine their programs. The methodology is well-grounded in Bayesian principles, and the formal distinction between code-visible and code-invisible bugs is crucial for understanding the limitations of existing LLM code verification paradigms. The aggregation of diverse diagnostics into a single verdict, with defined thresholds, provides a practical and actionable signal for automated systems.
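One of the diagnostics named above, the posterior predictive check, can be sketched for the "Poisson for over-dispersed counts" case from the abstract. This is a minimal illustration, not the paper's oracle: it assumes a conjugate Gamma prior on the Poisson rate (so no sampler is needed), uses the variance-to-mean ratio as the test statistic, and all function names and the synthetic data are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def dispersion(y):
    """Variance-to-mean ratio; close to 1 for Poisson data."""
    m = sum(y) / len(y)
    var = sum((v - m) ** 2 for v in y) / (len(y) - 1)
    return var / m if m > 0 else 0.0


def ppc_p_value(y, n_rep=500, prior_shape=1.0, prior_rate=1.0, seed=0):
    """Posterior predictive p-value P(T(y_rep) >= T(y)) under a Poisson model."""
    rng = random.Random(seed)
    n = len(y)
    post_shape = prior_shape + sum(y)   # Gamma-Poisson conjugate update
    post_rate = prior_rate + n
    observed = dispersion(y)
    exceed = 0
    for _ in range(n_rep):
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        lam = rng.gammavariate(post_shape, 1.0 / post_rate)
        y_rep = [sample_poisson(lam, rng) for _ in range(n)]
        exceed += dispersion(y_rep) >= observed
    return exceed / n_rep


data_rng = random.Random(1)
# Well-specified case: counts really are Poisson(4).
y_good = [sample_poisson(4.0, data_rng) for _ in range(150)]
# Misspecified case: a Gamma-mixed Poisson with the same mean but extra variance.
y_bad = [sample_poisson(data_rng.gammavariate(2.0, 2.0), data_rng) for _ in range(150)]

p_good = ppc_p_value(y_good)
p_bad = ppc_p_value(y_bad)
```

Both datasets fit without error and would pass shape or type checks; only the predictive p-value separates them, which is the sense in which such a bug is code-invisible.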
The experimental evaluation is exceptionally thorough and convincing, covering detection, repair, and real-world applicability. 1. **Detection**: A comprehensive benchmark of 200 instances across 14 misspecification types and 10 model families was created. The calibration oracle achieved an AUC of 0.97 (88% detection at 2% FPR when given the correct reference program, which is an upper bound). Critically, a *reference-free* version of the oracle (without a ground-truth correct program) still achieved 62-78% detection, with the upper figure coming from a small automated model search, highlighting its practical utility. In stark contrast, the unit-test oracle achieved 0% detection for code-invisible bugs. 2. **Repair**: Experiments with 15 diverse LLMs (open and API) in a repair loop showed that calibration feedback significantly outperformed unit-test feedback. A surprising and impactful finding was that unit-test feedback was itself significantly *worse than no feedback*, because a passing test induced false confidence and suppressed repair. Relative to no feedback, calibration feedback led to substantial improvements for strong-but-unsaturated models (e.g., GPT-5.1 from 33% to 92%), with statistically significant gains ($p < 4.5 \times 10^{-10}$ for pooled invisible-bug repairs). 3. **Real LLM Programs**: This section provides the strongest evidence. LLMs were tasked to write programs from scratch for neutral briefs. The study found that 15-47% of runnable LLM-generated programs were statistically misspecified (none caught by unit tests). Calibration-guided repair significantly outperformed strong baselines, including LLM-as-judge review, Bayesian-workflow checklists, and data-summary self-debug, achieving an 84% fix rate for misspecified programs. The results are robust, with detailed ablations and sensitivity analyses for oracle thresholds.
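The abstract reports its repair comparisons with a paired McNemar test. As a reminder of what that test measures, here is a minimal sketch of the exact (binomial) variant, which depends only on the discordant pairs: instances fixed under one feedback condition but not the other. Whether the paper uses the exact or the chi-square form is not stated in the review, and the counts in the example are made up for illustration.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: pairs repaired only under condition A (e.g. calibration feedback)
    c: pairs repaired only under condition B (e.g. unit-test feedback)
    Concordant pairs carry no information about the difference and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null, each discordant pair favors A or B with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 10 instances fixed only with A, 1 fixed only with B.
p_value = mcnemar_exact(10, 1)
```

Because the test is paired, it compares the two feedback conditions on the same buggy programs, so differences in instance difficulty cancel out.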
The paper demonstrates a high commitment to reproducibility. It specifies exact snapshots for API models and HuggingFace revisions for open models. Detailed hyperparameters for LLM generation (temperature, max tokens, repair budget) and NUTS inference (chains, draws, x64) are provided. Oracle thresholds are explicitly stated, and their sensitivity is analyzed. Crucially, the authors commit to releasing "The full system prompt, contract, feedback templates, benchmark generators, and analysis scripts with the paper," which is excellent practice and will enable full replication and extension of their work.
The authors are transparent about several limitations: 1. **PPC power**: The effectiveness of Posterior Predictive Checks depends on the chosen test statistics, meaning some misspecifications might remain undetected if not captured by the tracked statistics. 2. **Right fit, wrong structure**: The oracle primarily assesses distributional fit, not causal or generative correctness, making it blind to structural errors that yield similar predictive distributions. 3. **Over-wide predictive**: An under-confident model with an overly wide predictive distribution might "cover" the data and pass PPC despite being fundamentally wrong. 4. **Diagnosis without remedy**: Weaker LLMs may receive accurate diagnostic feedback but lack the capability to translate it into a correct structural fix. 5. **SBC expense**: Simulation-Based Calibration is computationally intensive, limiting its practical use in the repair loop for large or costly models. 6. **Inference failure**: Sampler diagnostics can sometimes fire due to poor inference (e.g., NUTS issues) rather than genuine model misspecification, leading to false positives. 7. **Benchmark scope**: The repair experiments use a subset of bugs, and the tasks are classical low-dimensional models, suggesting that extending the approach to high-dimensional or complex structured models (e.g., deep models, spatial-temporal models) is future work.
This paper has profound broader impact for the burgeoning field of LLM-assisted scientific computing and probabilistic modeling. It fundamentally shifts the paradigm for verifying LLM-generated probabilistic code from "compilation" to "calibration," providing a principled and empirically validated framework for ensuring statistical correctness. This work is crucial for building trustworthy and reliable LLM agents that can assist in scientific discovery, data analysis, and model development. The PPL-agnostic nature of the Bayesian diagnostics means the approach is broadly applicable across popular probabilistic programming languages like Stan, Pyro, and PyMC. The surprising finding that unit-test feedback is actively harmful for LLM repair loops is a critical insight for designing future LLM agent architectures. This paper sets a new standard for evaluating and improving LLM-generated scientific code.
Kilometer-scale convection shapes precipitation extremes, tropical organization, and cloud feedbacks, but most global atmospheric models approximate these processes at 25-100 km resolution. Global storm-resolving physics models resolve convective systems explicitly, but at a cost -- roughly one MWh per simulated day on exascale supercomputers -- that limits long-duration simulation. We introduce STRATA (Storm-resolving Tile-based autoRegressive Atmosphere Transformer Architecture), the first autoregressive AI emulator for global storm-resolving atmospheric dynamics. STRATA is trained on the highest-resolution atmospheric dataset yet used for global AI emulation: 17 days of SCREAM physics-model output at 4.9-km resolution (~25 million grid cells) sampled every 10 minutes. Our central premise is that on 10-minute timescales atmospheric dynamics are predominantly local, so training on small spatial tiles trades scarce global temporal samples for abundant local spatial samples and enables global rollout via overlapping-tile blending. STRATA combines 3D patch embedding and local 3D neighborhood attention, a novel Stereographic Rotary Position Embedding (StereoRoPE) for grid-invariant encoding, and a pixel-space de-aliasing decoder that suppresses patch-scale rollout artifacts. An iso-FLOP scaling study reveals that km-scale emulation requires ~10x more FLOPs per grid point than coarse-resolution AI weather models, consistent with the higher information density of convective-scale dynamics. Trained on only 17 days of data, STRATA produces stable 24-hour global rollouts with realistic km-scale dynamics across diverse regimes, though large-scale biases develop with lead time. It achieves 48 simulation days per megawatt-hour -- about 50 times better energy efficiency than the SCREAM physics model -- and 741 simulated days per wall-clock day at 512 H100 GPUs. Code and dataset are publicly available.
Primary: NVIDIA
All Institutions: NVIDIA, Lawrence Livermore National Laboratory, Pacific Northwest National Laboratory, National Energy Research Scientific Computing Center (NERSC)
STRATA represents a significant advance in scientific machine learning by successfully demonstrating autoregressive global emulation at storm-resolving resolution, addressing key computational and methodological challenges through innovative architectural design and rigorous physical constraints.
The paper introduces STRATA, a transformer-based autoregressive emulator for global storm-resolving atmospheric dynamics. The core methodological innovation lies in addressing the computational intractability of training global models at kilometer-scale resolution. The authors propose a tile-based training strategy that trades scarce global temporal samples for abundant local spatial samples, leveraging the locality of atmospheric dynamics on 10-minute timescales. Key technical contributions include: 1) A 3D patch embedding and local neighborhood attention backbone adapted from Diffusion Transformers (DiT) for deterministic weather forecasting. 2) StereoRoPE, a novel stereographic rotary position embedding that provides grid-invariant encoding, allowing the model to generalize across different spherical grid topologies (cubed-sphere, lat-lon, stereographic) without retraining. 3) A pixel-space de-aliasing decoder using bilinear upsampling and depthwise convolutions to suppress checkerboard artifacts inherent in patch-based tokenization. 4) A rigorous spectral stability analysis explaining why patch-based architectures suffer from instability and how the proposed decoder mitigates this. The approach is technically sophisticated, combining insights from atmospheric physics (locality, mass continuity constraints) with advanced deep learning architectures (transformers, positional embeddings, stability analysis).
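The overlapping-tile blending used for global rollout can be illustrated in one dimension. The sketch below is an assumption-laden simplification: it uses a non-periodic 1D domain, linear ramp weights across the overlap, and normalization by the summed weights; the paper's actual weighting scheme and its handling of the sphere are not described in this review, and all names are hypothetical.

```python
import math


def tile_starts(length, tile, overlap):
    """Start indices of tiles that cover [0, length) with the given overlap."""
    stride = tile - overlap
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # final tile is shifted to reach the edge
    return starts


def ramp_weights(tile, overlap):
    """Weights that rise linearly near each tile edge and equal 1 in the interior."""
    weights = []
    for i in range(tile):
        edge = min(i, tile - 1 - i)  # distance to the nearest tile edge
        weights.append(min(1.0, (edge + 1) / (overlap + 1)))
    return weights


def blend_tiles(tiles, starts, length, overlap):
    """Weighted average of per-tile predictions wherever tiles overlap."""
    weights = ramp_weights(len(tiles[0]), overlap)
    acc = [0.0] * length
    wsum = [0.0] * length
    for tile, start in zip(tiles, starts):
        for i, value in enumerate(tile):
            acc[start + i] += weights[i] * value
            wsum[start + i] += weights[i]
    return [a / w for a, w in zip(acc, wsum)]


# Sanity check: tiles cut from a known field should blend back to that field.
field = [math.sin(0.3 * i) for i in range(20)]
starts = tile_starts(len(field), tile=8, overlap=3)
tiles = [field[s:s + 8] for s in starts]
blended = blend_tiles(tiles, starts, len(field), overlap=3)
```

Down-weighting tile edges matters because a locally trained model has the least context there, so seams between tiles are where independent predictions are most likely to disagree.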
The evaluation is comprehensive and rigorous. The model is trained on 17 days of high-resolution (4.9 km, 10-minute) output from the SCREAM physics model, a significant step up from previous AI weather models that typically use reanalysis data (ERA5) at much coarser resolutions. The paper demonstrates stable 24-hour global rollouts, capturing realistic convective-scale dynamics including tropical cyclones, fronts, and orographic precipitation. Quantitative metrics include Fractions Skill Score (FSS) for rainfall, error growth rates, and precipitation distribution analysis. The paper also includes an iso-FLOP scaling study, revealing that km-scale emulation requires ~10x more FLOPs per grid point than coarse-resolution models, a finding with significant implications for the field. The energy efficiency comparison (48 simulated days per MWh vs. 1 for SCREAM) is a strong practical result. The evaluation of grid invariance on unseen grids is a novel and compelling test of the StereoRoPE method.
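For readers unfamiliar with the Fractions Skill Score mentioned above, a minimal sketch follows. It implements the standard definition, FSS = 1 - sum((Pf - Po)^2) / (sum(Pf^2) + sum(Po^2)), where Pf and Po are neighborhood fractions of cells exceeding a threshold. Truncating the window at the domain edge is an assumption made here for brevity, and the paper's thresholds and neighborhood sizes are not given in this review.

```python
def neighborhood_fractions(field, threshold, half_width):
    """Fraction of cells >= threshold in a square window around each cell."""
    rows, cols = len(field), len(field[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            hits = total = 0
            # Window is truncated at the domain boundary (an assumption).
            for rr in range(max(0, r - half_width), min(rows, r + half_width + 1)):
                for cc in range(max(0, c - half_width), min(cols, c + half_width + 1)):
                    hits += field[rr][cc] >= threshold
                    total += 1
            out[r][c] = hits / total
    return out


def fss(forecast, observed, threshold, half_width):
    """Fractions Skill Score: 1 is a perfect match, 0 is no overlap at this scale."""
    pf = neighborhood_fractions(forecast, threshold, half_width)
    po = neighborhood_fractions(observed, threshold, half_width)
    num = sum((f - o) ** 2 for rf, ro in zip(pf, po) for f, o in zip(rf, ro))
    den = sum(f ** 2 + o ** 2 for rf, ro in zip(pf, po) for f, o in zip(rf, ro))
    return 1.0 - num / den if den > 0 else float("nan")


def single_cell(row, col, size=6):
    """A size x size field with one rainy cell (toy data)."""
    grid = [[0.0] * size for _ in range(size)]
    grid[row][col] = 1.0
    return grid


obs = single_cell(2, 2)
fcst = single_cell(2, 4)  # same storm, displaced by two cells
```

The displaced-storm example shows why the score is popular at km scale: at grid scale (`half_width=0`) the forecast earns no credit, while a wider neighborhood rewards getting the storm nearly in the right place.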
The paper provides detailed implementation details, including architecture hyperparameters, training objectives, and optimization strategies. The code and dataset are publicly available, which is a major plus for reproducibility. The description of the tile-based inference and distributed implementation is sufficiently detailed for replication. The acknowledgment of dataset inconsistencies (SST/IC mismatch) adds to the transparency.
The model exhibits large-scale biases that develop with lead time, likely due to the tile-based training being blind to global-mean constraints (e.g., mass continuity for vertical velocity). The authors address this with post-processing filters (spherical harmonic filtering for vertical velocity, constraints for specific humidity), but this suggests the model does not learn global conservation laws natively. The reliance on only 17 days of training data is a significant limitation for capturing long-term climate variability or rare events, although the authors argue it is sufficient for short-term dynamic emulation. The model is not coupled with a coarse-resolution model, limiting its ability to simulate multi-decadal climate scenarios.
This work has the potential to revolutionize high-resolution climate simulation by making storm-resolving simulations orders of magnitude more energy-efficient. This could enable large ensemble simulations for uncertainty quantification in climate projections, particularly for extreme weather events and cloud feedbacks. The methodological contributions (tile-based training, StereoRoPE, de-aliasing) are generalizable to other scientific domains requiring high-resolution, global, autoregressive modeling on spherical domains (e.g., ocean dynamics, astrophysics). The public release of the dataset and code will accelerate research in scientific machine learning.
Controllable image generation methods, such as ControlNet, have demonstrated a remarkable capacity to introduce visual conditions (e.g., depth maps) to guide image generation. However, these methods often struggle with complex multi-instance scenes, frequently leading to attribute confusion among instances. While recent approaches attempt to mitigate this via manual instance labeling, such requirements are labor-intensive. In this paper, we propose InstanceControl, a novel multi-instance controllable generation method that eliminates the need for instance labeling. We identify the primary bottleneck in existing methods as the inability to accurately associate instance descriptions with their corresponding regions within visual conditions. To address this, we leverage a Vision-Language Model (VLM) to establish instance-level correspondences between text prompts and visual conditions. Specifically, the VLM automatically parses instance descriptions from the text prompts and simultaneously predicts instance masks based on the visual conditions. Furthermore, since the predicted masks may contain noise, we introduce an adaptive mask refinement strategy that dynamically refines these instance masks during the generation process. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods, achieving superior fidelity and precise instance-level control.
Primary: Harbin Institute of Technology
All Institutions: Harbin Institute of Technology, HUAWEI Noah's Ark Lab
InstanceControl has significant broader impact potential: 1. **Reduced Annotation Burden:** By eliminating the need for manual instance labeling during inference, it drastically reduces the labor and time costs associated with fine-grained controllable image generation, making such tools more accessible and practical for real-world applications. 2. **Enhanced Content Creation:** It empowers users with more precise and intuitive control over complex multi-instance scenes, which is invaluable for creative industries, design, virtual reality, and artistic expression. 3. **Advancement in Controllable Generation:** It pushes the boundaries of controllable image generation by effectively addressing the long-standing challenge of attribute confusion in multi-instance scenarios, paving the way for more sophisticated and reliable generative AI systems. 4. **VLM Application:** It demonstrates a novel and effective application of Vision-Language Models for grounding on non-RGB visual conditions, opening new avenues for VLM research beyond traditional image understanding tasks. 5. **Robustness to Imperfect Inputs:** The adaptive mask refinement strategy offers a generalizable approach to handle noisy or uncertain inputs from upstream perception modules, which is a common challenge in complex AI pipelines. InstanceControl introduces a novel framework for multi-instance controllable image generation that eliminates the need for labor-intensive manual instance labeling. The paper makes a significant technical contribution by leveraging Vision-Language Models to automatically establish instance-level correspondences between text prompts and visual conditions, and by proposing an adaptive mask refinement strategy to robustly handle noisy predicted masks during the generation process. 
The comprehensive experiments and strong quantitative and qualitative results demonstrate its superior fidelity and precise instance-level control compared to state-of-the-art methods, marking a substantial advancement in the field of controllable generative AI.
The methodology of InstanceControl is well-conceived and addresses a critical bottleneck in multi-instance controllable image generation: the need for manual instance labeling. The proposed two-stage framework is logical and effectively integrates state-of-the-art components. The first stage, "Instance-level Text-Visual Condition Association," is highly innovative. Instead of relying on RGB images for VLM grounding, the authors adapt a VLM (Sa2VA) to parse instance descriptions from text prompts and predict corresponding instance masks directly from *visual conditions* (e.g., canny, depth, HED maps). This is a non-trivial adaptation, as existing grounding models are typically designed for semantic consistency with RGB content. The tailored dataset construction, leveraging Gemini 2.5 Pro for detailed prompts and correspondence generation, is a significant enabler for this stage. The "Shared SEG Token (SST)" strategy is a clever detail to handle multiple textual descriptions for a single instance, ensuring consistent mask predictions. The second stage, "Instance-aware Controllable Generation," effectively integrates these automatically derived correspondences into a diffusion model. Recognizing the inherent noise in VLM-predicted masks, the "Mask Refinement Module (MRM)" is a crucial component. It adaptively refines masks by combining the VLM's confidence score, attention-based masks from the generative model, and image latent features. This adaptive approach is superior to hard constraints or simple fusion strategies, allowing the model to be robust to prediction inaccuracies. The integration of these refined masks via a correspondence mask into the attention mechanism of the diffusion model is a standard yet effective way to enforce instance-level control. 
The overall methodology is robust, leveraging the strengths of VLMs for understanding and diffusion models for generation, while mitigating their respective weaknesses (VLM's noise, diffusion's attribute confusion).
The experimental evaluation is comprehensive and rigorous, demonstrating the strong performance of InstanceControl. 1. **Baselines:** The paper compares against a wide range of state-of-the-art methods, categorized into those requiring instance labeling (EliGen, CreatiLayout, Seg2Any, DreamRenderer) and those without (FLUX ControlNet). This provides a clear picture of where InstanceControl stands relative to both types of approaches. Additionally, comparisons with unified understanding and generation models (Qwen-Image ControlNet, Nano Banana) further validate its superiority. 2. **Visual Conditions:** Evaluation across canny edges, depth maps, and HED maps confirms the generalizability of the method to diverse control signals. 3. **Metrics:** A robust set of quantitative metrics is used, including MIoU for spatial alignment, Local CLIP and VQA-based Accuracy (Spatial, Color, Shape, Texture) for region-wise quality, and FID/ImageReward for global image quality. The use of Qwen2-VL-72B for fine-grained VQA assessment is particularly noteworthy for its detailed evaluation of attribute fidelity. 4. **Benchmarks:** Evaluation on a custom MIG-Eval dataset (derived from their constructed data), COCO-POS, and an out-of-domain HiCo-7K benchmark (in supplementary) demonstrates both in-domain and out-of-domain effectiveness. 5. **Results:** InstanceControl consistently outperforms all baselines, often by significant margins, especially against label-free methods like FLUX ControlNet. Remarkably, it even surpasses several methods that rely on manual instance labeling, highlighting the effectiveness of its automated approach. The qualitative results visually confirm the superior fine-grained attribute control and reduced attribute confusion.
6. **Ablation Studies:** Detailed ablation studies on the Shared SEG Token (SST) and the Mask Refinement Module (MRM) clearly demonstrate the contribution of each proposed component, providing strong empirical justification for their inclusion. 7. **Data Construction:** The effort in constructing a high-quality, detailed dataset using Gemini 2.5 Pro is a significant strength, enabling the VLM to perform robust grounding on visual conditions.
The paper provides a good level of detail regarding reproducibility. Key aspects include:

* Specific backbone models used (Sa2VA, SAM, FLUX.1-Canny/Depth, XLabs HED ControlNet).
* Use of LoRA modules with specified rank (256).
* Training steps (30k for Stage 1, 80k + 10k for Stage 2), batch sizes (64, 4), learning rates ($4 \times 10^{-5}$, $1 \times 10^{-4}$), and schedules (cosine).
* Loss weights (BCE $= 2.0$, Dice $= 0.5$).
* Hardware used (four NVIDIA A6000 GPUs).
* Details on dataset construction are mentioned to be in the supplementary material, which is crucial.

The project page URL is provided, which often includes code or further details. Overall, the information provided should allow for a high degree of reproducibility.
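To make the reported loss weights concrete, here is a minimal sketch of a weighted BCE plus Dice mask loss. The weights 2.0 and 0.5 come from the review above; the exact form of each term (mean-reduced BCE, soft Dice with a smoothing constant) and the function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))))

def dice_loss(prob, target, smooth=1.0):
    """Soft Dice loss; `smooth` (an assumed constant) avoids division by zero."""
    inter = np.sum(prob * target)
    return float(1.0 - (2.0 * inter + smooth) / (np.sum(prob) + np.sum(target) + smooth))

def mask_loss(prob, target, w_bce=2.0, w_dice=0.5):
    """Weighted sum using the weights quoted in the review (BCE 2.0, Dice 0.5)."""
    return w_bce * bce_loss(prob, target) + w_dice * dice_loss(prob, target)

# Toy check: a near-perfect mask should score lower than its inverse.
target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
good = np.clip(target, 0.01, 0.99)
bad = 1.0 - good
loss_good = mask_loss(good, target)
loss_bad = mask_loss(bad, target)
```

The Dice term is insensitive to the large background area, which is one common reason to pair it with BCE for segmentation masks.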
1. **VLM Dependency:** The method heavily relies on a powerful VLM (Sa2VA, and Gemini 2.5 Pro for data generation) for initial instance parsing and mask prediction. While effective, this introduces a dependency on the capabilities and potential biases of these large models. 2. **Mask Noise Mitigation, Not Elimination:** While the Mask Refinement Module effectively mitigates noise and inaccuracies in predicted masks, it does not eliminate them entirely. The paper mentions that severe errors like localization offsets or missed objects can still cause incorrect generation, and an interactive mechanism is provided for rectification, implying it's not fully autonomous in all cases. 3. **Computational Cost:** Training and inference with large VLMs and diffusion models, especially with additional refinement modules, can be computationally intensive, potentially limiting real-time applications or deployment on resource-constrained devices. 4. **Dataset Specificity:** The custom dataset construction, while a strength, means the VLM is fine-tuned on specific types of visual conditions and prompt styles. Generalization to vastly different visual conditions or highly ambiguous prompts might require further adaptation.
InstanceControl has significant broader impact potential: 1. **Reduced Annotation Burden:** By eliminating the need for manual instance labeling during inference, it drastically reduces the labor and time costs associated with fine-grained controllable image generation, making such tools more accessible and practical for real-world applications. 2. **Enhanced Content Creation:** It empowers users with more precise and intuitive control over complex multi-instance scenes, which is invaluable for creative industries, design, virtual reality, and artistic expression. 3. **Advancement in Controllable Generation:** It pushes the boundaries of controllable image generation by effectively addressing the long-standing challenge of attribute confusion in multi-instance scenarios, paving the way for more sophisticated and reliable generative AI systems. 4. **VLM Application:** It demonstrates a novel and effective application of Vision-Language Models for grounding on non-RGB visual conditions, opening new avenues for VLM research beyond traditional image understanding tasks. 5. **Robustness to Imperfect Inputs:** The adaptive mask refinement strategy offers a generalizable approach to handle noisy or uncertain inputs from upstream perception modules, which is a common challenge in complex AI pipelines. InstanceControl introduces a novel framework for multi-instance controllable image generation that eliminates the need for labor-intensive manual instance labeling. The paper makes a significant technical contribution by leveraging Vision-Language Models to automatically establish instance-level correspondences between text prompts and visual conditions, and by proposing an adaptive mask refinement strategy to robustly handle noisy predicted masks during the generation process. 
The comprehensive experiments and strong quantitative and qualitative results demonstrate its superior fidelity and precise instance-level control compared to state-of-the-art methods, marking a substantial advancement in the field of controllable generative AI.
We present a zero-shot, training-free and optimization-free framework for generating 360 panoramic images and videos by directly injecting spherical priors into pre-trained diffusion transformers. Existing methods either rely on costly fine-tuning on scarce panoramic data that limits generalization, or leverage multi-step optimization that incurs prohibitive inference latency. We observe that contemporary generative models natively exhibit some panoramic priors from large-scale training. However, these emergent capabilities are insufficient, as the models fundamentally fail to satisfy the rigorous topological constraints imposed by equirectangular projection (ERP). We introduce a zero-shot and optimization-free approach that resolves these constraints at inference time. Spherical RoPE replaces standard rotary position embeddings: low-frequency channels are re-parameterized as 3D Cartesian coordinates to natively encode the spherical manifold, while high-frequency channels are harmonically quantized to enforce exact periodicity. Coupled with complementary Semantic Distortion classifier-free guidance (CFG) that explicitly steers geometry, we avoid retraining and inherit the full creative breadth of state-of-the-art models. Our approach generalizes across diverse backbones and 360 generation modalities. We demonstrate this across text-to-panorama using Flux.1, Flux.2, and LTX-Video backbones, achieving competitive performance against baselines, all while remaining training-free. Project page: https://orhir.github.io/SpheRoPE
Primary: Tel-Aviv University
All Institutions: Tel-Aviv University, Hebrew University of Jerusalem
SpheRoPE introduces a novel, training-free framework for 360-degree panorama generation by integrating spherical priors directly into diffusion transformers via modified position embeddings and guidance, achieving competitive results across multiple backbones without the need for fine-tuning or optimization.
The paper proposes SpheRoPE, a training-free, zero-shot method for generating 360-degree panoramic images and videos using pre-trained diffusion transformers (DiTs). The core innovation lies in replacing standard Rotary Position Embeddings (RoPE) with Spherical RoPE. This involves re-parameterizing low-frequency channels into 3D Cartesian coordinates to natively encode the spherical manifold and harmonically quantizing high-frequency channels to enforce periodicity. This is coupled with a Semantic Distortion classifier-free guidance (CFG) mechanism to steer geometry. The approach is theoretically sound, addressing the topological mismatch between planar training data (ERP) and spherical reality without retraining. It leverages the emergent capabilities of large models while correcting their fundamental geometric flaws.
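The two ingredients described above can be illustrated with a small sketch. It is not the authors' code: the function names, the frequency ladder, and the choice to clamp every frequency to at least the first harmonic are assumptions made only to show (a) how ERP pixels map to 3D Cartesian points on the unit sphere and (b) how snapping angular frequencies to integer harmonics of the image width makes the rotary phase exactly periodic across the left/right seam.

```python
import numpy as np

def erp_to_unit_sphere(height, width):
    """Map each ERP pixel centre to a 3D point on the unit sphere."""
    # Longitude covers [-pi, pi) across columns, latitude (pi/2, -pi/2) down the rows.
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both have shape (height, width)
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return np.stack([x, y, z], axis=-1)

def quantize_to_harmonics(freqs, width):
    """Snap angular frequencies (radians per column) to integer multiples of 2*pi/width."""
    base = 2.0 * np.pi / width
    k = np.maximum(np.round(freqs / base), 1.0)  # assumed: never below the first harmonic
    return k * base

width = 64
xyz = erp_to_unit_sphere(32, width)

# A RoPE-style geometric frequency ladder (illustrative values only).
raw = 1.0 / (10000.0 ** (np.arange(0, 8) / 8.0))
snapped = quantize_to_harmonics(raw, width)

# Phase accumulated over one full wrap of the panorama: a multiple of 2*pi per channel.
phase_gap = snapped * width
```

In the Cartesian view the first and last columns are neighbours on the sphere, which is exactly the adjacency a planar column index cannot express.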
The authors evaluate SpheRoPE on multiple state-of-the-art backbones, including Flux.1, Flux.2, and LTX-Video. They demonstrate competitive performance against existing baselines in text-to-panorama and text-to-video tasks. The evaluation highlights the method's ability to resolve topological artifacts (seams, discontinuities) common in naive ERP generation. The results suggest that the method generalizes well across different model architectures, which is a significant strength. However, as a zero-shot method, it relies on the underlying model's quality, so comparisons are against other zero-shot or fine-tuned baselines. The paper likely includes qualitative visualizations and potentially quantitative metrics like FID or CLIP scores adapted for panoramas, though specific numbers are not provided in the abstract. The claim of "competitive performance" suggests it matches or exceeds fine-tuned methods in some aspects while being significantly more efficient.
The paper provides a project page URL. As a training-free method, reproducibility is high provided the source code for the Spherical RoPE injection and Semantic Distortion guidance is released. The reliance on pre-trained models (Flux, LTX-Video) means the community has access to the base weights, facilitating replication. The method's simplicity (modifying embeddings and guidance) makes it easier to implement than full fine-tuning pipelines.
The primary limitation is the reliance on the pre-trained model's inherent knowledge. If the base model lacks semantic understanding of specific panoramic scenes, SpheRoPE cannot create that knowledge from scratch. Additionally, the harmonic quantization and Cartesian re-parameterization might introduce subtle artifacts if not tuned correctly for specific resolutions or aspect ratios. The abstract names text-to-panorama results on the Flux.1, Flux.2, and LTX-Video backbones; whether temporal consistency holds across the spherical manifold in longer, more complex video generation still needs rigorous evaluation. There may also be a trade-off between geometric correctness and semantic fidelity, which the Semantic Distortion CFG aims to mitigate but may not eliminate entirely.
This work significantly lowers the barrier to entry for high-quality 360-degree content generation. By eliminating the need for costly fine-tuning on scarce panoramic data, it democratizes access to VR/AR content creation tools. It also provides a generalizable technique for handling non-Euclidean data structures in diffusion models, which could be extended to other domains like spherical video, global climate modeling visualization, or astronomical data. The reduction in inference latency compared to optimization-based methods makes it more viable for real-time applications.
4D hand motion reconstruction from egocentric video is bottlenecked by clear limitations of existing methods: image-based pipelines depend on a detector that fails under heavy occlusion, while video-based methods rely on temporal modules learned only from scarce hand-pose annotations, a narrow signal insufficient to model motion dynamics, occlusion reasoning, and hand-object interaction. These capabilities, however, are exactly what video generative models must implicitly acquire when trained to synthesize coherent video at internet scale. Motivated by this, we present ViDiHand, which leverages the representations of a pretrained video diffusion model to reconstruct 4D two-hand pose. We adapt it via a hand-overlay rendering objective that specializes its features for hands while preserving its world priors. A decoder then recovers metric-scale pose from the adapted features. The whole pipeline operates directly on full frames--no detector, no infiller, and no test-time optimization. On ARCTIC, HOT3D, and HOI4D, ViDiHand substantially outperforms prior methods, establishing video diffusion models as a powerful new foundation for hand motion reconstruction and a promising route to scalable in-the-wild data collection for embodied AI. Project page: https://vidihand.github.io.
Primary: A*STAR (Agency for Science, Technology and Research)
All Institutions: A*STAR, NTU Singapore (Nanyang Technological University), Alibaba Group
The paper presents a significant advancement in 4D hand motion reconstruction by effectively adapting video diffusion models for perception tasks, achieving state-of-the-art performance on challenging benchmarks through a novel hand-overlay rendering adaptation and a geometrically-aware dual-branch decoder.
The paper proposes a novel paradigm for 4D hand motion reconstruction by leveraging the internal representations of a large-scale pretrained Video Diffusion Model (Wan2.1-VACE). Instead of treating the diffusion model as a generative black box or a frozen feature extractor, the authors introduce a "hand-overlay rendering" adaptation stage. This involves finetuning only the VACE branch of the model to regenerate input clips with semi-transparent rendered hand overlays. This clever pretext task specializes the model's world priors (occlusion reasoning, temporal coherence, 3D geometry) for hand-centric tasks without destroying the general visual knowledge. The decoder is a dual-branch architecture: a Hand-Token Branch for holistic articulated pose and a Joint-Heatmap Branch for local 2D localization, coupled by mutual cross-attention and a closed-form geometric solve for camera translation. This design elegantly separates the holistic vs. local inductive biases of the representation. The approach is methodologically sound, theoretically motivated by the capabilities of generative models, and technically sophisticated in its integration of diffusion features with geometric constraints.
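One common way to obtain a camera translation in closed form, given root-relative 3D joints and their 2D image locations, is a linear least-squares solve under a pinhole camera. The sketch below shows that generic formulation; whether ViDiHand uses exactly this system is an assumption, and the function name and intrinsics are hypothetical.

```python
import numpy as np

def solve_translation(joints_rel, keypoints_2d, fx, fy, cx, cy):
    """Least-squares camera translation t such that project(joints_rel + t) ~ keypoints_2d.

    From u = fx * (X + tx) / (Z + tz) + cx, with a = (u - cx) / fx:
        tx - a * tz = a * Z - X        (and the same for y with b = (v - cy) / fy)
    which is linear in t = (tx, ty, tz).
    """
    a = (keypoints_2d[:, 0] - cx) / fx
    b = (keypoints_2d[:, 1] - cy) / fy
    n = joints_rel.shape[0]
    A = np.zeros((2 * n, 3))
    rhs = np.zeros(2 * n)
    A[0::2, 0] = 1.0
    A[0::2, 2] = -a
    rhs[0::2] = a * joints_rel[:, 2] - joints_rel[:, 0]
    A[1::2, 1] = 1.0
    A[1::2, 2] = -b
    rhs[1::2] = b * joints_rel[:, 2] - joints_rel[:, 1]
    t, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return t

# Synthetic check: project 21 joints with a known translation, then recover it.
rng = np.random.default_rng(0)
joints_rel = rng.normal(scale=0.05, size=(21, 3))   # metres, root-relative
t_true = np.array([0.05, -0.02, 0.5])
fx = fy = 600.0
cx, cy = 320.0, 240.0
cam = joints_rel + t_true
kp = np.stack([fx * cam[:, 0] / cam[:, 2] + cx,
               fy * cam[:, 1] / cam[:, 2] + cy], axis=1)
t_est = solve_translation(joints_rel, kp, fx, fy, cx, cy)
```

Because the system is linear, it needs no iterative optimization at test time, consistent with the pipeline description above.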
The evaluation is comprehensive and rigorous. The authors test on three challenging egocentric hand benchmarks: ARCTIC (heavy occlusion), HOT3D (fisheye, high dynamic range, motion blur), and HOI4D (cross-dataset generalization). They introduce a "penalty protocol" that folds false negatives into pose metrics, providing a more realistic assessment of detection robustness than standard TP-only metrics. ViDiHand establishes new state-of-the-art results across all metrics, with particularly significant gains in frame accuracy (detection robustness) and temporal jitter (smoothness). The ablation studies are thorough, validating the choice of DiT layer, denoising step, and decoder components. The cross-dataset transfer to HOI4D demonstrates the generalizability of the learned priors. The results are statistically significant and practically meaningful, showing that video diffusion models capture richer spatiotemporal priors than discriminative video models or image-based detectors.
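The effect of folding false negatives into a pose metric can be shown with a toy calculation. The review does not state how the penalty is defined, so the fixed per-frame penalty below (`penalty_mm=100.0`) and both function names are purely hypothetical; the point is only that a TP-only average hides missed frames while a penalized average does not.

```python
import numpy as np

def mean_error_tp_only(errors_mm, detected):
    """Average error over frames where a hand was actually predicted."""
    errors_mm = np.asarray(errors_mm, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    return float(errors_mm[detected].mean())

def mean_error_with_penalty(errors_mm, detected, penalty_mm=100.0):
    """Average over all frames, charging a fixed (hypothetical) penalty for misses."""
    errors_mm = np.asarray(errors_mm, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    filled = np.where(detected, errors_mm, penalty_mm)
    return float(filled.mean())

# Four frames; the third is a missed detection, so its error is undefined.
errors = [10.0, 12.0, np.nan, 8.0]
detected = [True, True, False, True]
tp_only = mean_error_tp_only(errors, detected)
penalized = mean_error_with_penalty(errors, detected)
```

Under this toy setup a method that drops hard frames looks better on the TP-only number, which is the bias the penalty protocol is meant to remove.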
The paper provides detailed implementation details, including the specific backbone (Wan2.1-VACE), the two-stage training curriculum (joint overlay then MANO mesh overlay), and the decoder architecture. The supplementary material contains extensive details on the evaluation protocol, metric definitions, and ablation studies. The project page link suggests code/data availability, which is standard for high-impact ML papers. The use of a publicly available backbone (Wan2.1) enhances reproducibility, although the specific finetuning steps and data preprocessing pipelines would need to be carefully followed. The closed-form geometric solve is well-defined.
The primary limitation is computational cost. The method runs at 5.5 fps on 4 A100 GPUs, making it an offline annotation tool rather than a real-time solution. The authors acknowledge this and suggest distillation as a future direction. Additionally, Stage 1b still requires MANO-annotated video, which is a scarce resource, though the authors propose self-supervised pretexts to relax this in the future. The method may also struggle with extreme cases not covered in the training data, although the cross-dataset results suggest good generalization.
This work has significant implications for embodied AI, robotics, and human-computer interaction. By providing a scalable, high-quality method for 4D hand reconstruction from egocentric video, it enables the creation of large-scale datasets for training robot policies and understanding human behavior. The paradigm shift towards leveraging video generative models for perception tasks could influence future research in 3D vision, motion capture, and video understanding. It also highlights the untapped potential of diffusion models for discriminative tasks, potentially inspiring similar approaches in other domains.
Data, as the fundamental substrate of modern intelligence, has greatly driven the development of current foundation models. Naturally, researchers aim to extend this paradigm to the domain of GUI agents, hoping to build strong GUI agents in the same way. However, GUI agent data cannot be directly harvested from the internet, making it costly and difficult to collect at scale. As a result, current GUI agents suffer from poor cross-device generalization and limited visual grounding ability for fine-grained GUI elements. As an attempt to address the data challenge in GUI agents, we propose GUICrafter, a weakly-supervised GUI agent leveraging massive unannotated screenshots to substantially reduce the reliance on expensive human annotations. GUICrafter explores a curriculum learning framework for training GUI agents through two progressive stages. First, the model learns visual grounding from large-scale unannotated screenshots and webpages, leveraging the rich contextual signals inherent in GUI interactions without human annotations. Then, in Stage 2, we leverage a small amount of high-quality data to calibrate the model via reinforcement learning. Experiments show that GUICrafter achieves competitive, or even superior, performance to advanced systems like UI-TARS while using only 0.1% of its data. Furthermore, under the same amount of annotated data, GUICrafter surpasses all previous methods such as GUI-R1. Code, data, and models are available at https://github.com/fansunqi/GUICrafter.
Primary: Tsinghua University
All Institutions: Tsinghua University, Tencent Hunyuan
GUICrafter presents a significant advancement in GUI agent training by introducing a scalable weakly-supervised pretraining stage that leverages unannotated screenshots for visual grounding, achieving state-of-the-art performance with minimal annotated data. The technical contribution lies in the effective formulation of meta-tasks from interactive signals and the robust two-stage RLVR framework, which offers a practical and efficient path forward for data-constrained GUI agent development.
The paper proposes GUICrafter, a two-stage training framework for GUI agents. Stage 1 involves "weakly-supervised GUI pretraining" using massive unannotated screenshots. The core innovation here is the extraction of interactive signals (clickable/typable elements) from web pages and mobile apps to create "meta-tasks" (e.g., "click any clickable area"). This allows the model to learn visual grounding without human annotation by leveraging the inherent structure of GUIs. Stage 2 uses a small amount of high-quality, manually annotated data for reinforcement learning (RLVR with GRPO) to calibrate the model. The reward design includes a Gaussian position reward to provide finer-grained feedback than binary point-in-box rewards. The approach effectively bridges the gap between large-scale unsupervised visual learning and precise task-oriented grounding.
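The difference between a binary point-in-box reward and a Gaussian position reward can be sketched as follows. The exact reward formula used by GUICrafter is not given in this review, so the Gaussian form, the `scale` parameter, and the function names are assumptions chosen for illustration.

```python
import math

def binary_reward(pred, box):
    """1.0 if the predicted click lies inside the target box, else 0.0."""
    x, y = pred
    x0, y0, x1, y1 = box
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0

def gaussian_reward(pred, box, scale=0.5):
    """Smooth reward that peaks at the box centre and decays with distance.

    The standard deviation along each axis is `scale` times the box size
    (an assumed choice), so small targets demand more precise clicks.
    """
    x, y = pred
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max(scale * (x1 - x0), 1e-6)
    sy = max(scale * (y1 - y0), 1e-6)
    return math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))

box = (40, 40, 60, 60)
centre, near_miss, far_miss = (50, 50), (62, 50), (120, 50)
```

Both misses earn zero under the binary reward, while the Gaussian reward still ranks the near miss above the far one, which gives the policy a usable learning signal before it starts hitting targets.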
The evaluation is comprehensive, covering multiple benchmarks across web (Mind2Web, ScreenSpot-Pro), mobile (AndroidControl, AITW, AndroidWorld), and general (OmniACT) domains. The results show that GUICrafter-3B and GUICrafter-7B achieve performance competitive with or superior to state-of-the-art models like UI-TARS and GUI-R1, despite using significantly less annotated data (0.1% of UI-TARS's data). The ablation studies effectively demonstrate the contribution of Stage 1 (visual grounding improvement) and Stage 2 (task completion calibration). The comparison against baselines is fair, including reproductions of GUI-R1 on full datasets. The scalability analysis (10k to 500k samples) provides strong evidence for the data efficiency and robustness of the weakly-supervised stage.
The authors provide code, data, and models. The methodology is clearly described, including the specific extraction tools (Playwright) and the reward function formulas. The use of standard benchmarks and clear reporting of metrics (Element Accuracy, Step Success Rate, etc.) enhances reproducibility. The distinction between the weakly-supervised data generation and the supervised fine-tuning data is clear.
The method still relies on a small amount of high-quality annotated data in Stage 2 for calibration, although this is significantly reduced compared to prior work. The weakly-supervised data generation relies on automated extraction, which may introduce noise (though the paper shows robustness to this). The "meta-tasks" are somewhat generic and may not capture the semantic intent of complex user goals; capturing that intent is left to Stage 2. The approach is primarily tested on web and mobile interfaces; generalization to other GUI types (e.g., desktop applications with complex non-standard widgets) might require further validation.
This work addresses a critical bottleneck in GUI agent development: data scarcity. By demonstrating that massive unannotated data can be leveraged for visual grounding, it lowers the barrier to entry for building robust GUI agents. This could accelerate the development of autonomous agents for web and mobile interaction, with implications for accessibility, automation, and human-computer interaction. The open-source release contributes to the community by providing a new baseline and dataset generation pipeline.
We elucidate the design space of Representation Distribution Matching (RDM), our name for the paradigm that trains a one-step image generator by matching generated and reference feature distributions under frozen pretrained encoders. We identify two design axes, how the distributions are compared and the representations they are compared in, and controlled studies along them yield three findings. First, the classical MMD, which could not train convincing generators a decade ago, becomes a strong and scalable objective once estimated right. Second, the generated batch is then the operative variable, with an optimum above 2048, far beyond customary batch sizes. Third, any single representation can be gamed, driven below the real score while images stay visibly fake, so we match against a balanced battery of encoders and evaluate with SW_r14, a Sliced-Wasserstein distance over 14 encoders that is independent of the training loss and resists gaming. Combining the preferred choices yields improved RDM (iRDM): it sets the one-step state of the art on ImageNet at SW_r14 1.30, corroborated by PickScore, a human-preference proxy our objective never optimizes, which prefers it over the prior best one-step generator on 71.2% of matched samples. The same recipe post-trains the four-step FLUX.2 [klein] into a one-step generator, surpassing the four-step version on GenEval, 0.826 to 0.794, and on PickScore, 22.76 to 22.58, in 90 H200 GPU-hours. Project page: https://alan-lanfeng.github.io/rdm/.
Primary: Valeo
All Institutions: Valeo
This paper presents a significant advancement in one-step image generation by rigorously elucidating the design space of Representation Distribution Matching, introducing a robust MMD estimator with Nyström approximation, and demonstrating that large-batch, multi-encoder training yields state-of-the-art results while mitigating metric gaming, thereby providing a scalable and effective alternative to teacher-based distillation methods.
The paper proposes "Representation Distribution Matching" (RDM), a framework for training one-step image generators by directly matching feature distributions between generated and real images using frozen pretrained encoders. The core methodological contributions are threefold: 1) A specific estimator for Maximum Mean Discrepancy (MMD) that uses an exact within-batch repulsion term and a Nyström approximation for the attraction term against a frozen full-data reference, which the authors argue is superior to Fréchet distance or drifting fields for this task. 2) The identification that large, fresh generation batches (N > 2048) are critical for stable estimation, enabled by gradient caching. 3) A multi-encoder matching strategy using a "battery" of 14 diverse frozen encoders, balanced via a proportional Lagrangian controller to prevent the generator from gaming any single encoder's metric. The approach is theoretically grounded in kernel mean embeddings and optimal transport concepts, applied pragmatically to the current state-of-the-art in teacher-free distillation.
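For readers unfamiliar with MMD, the sketch below computes the standard unbiased squared-MMD estimate with an RBF kernel, split into the within-batch (repulsion) term and the cross (attraction) term mentioned above. It deliberately omits the Nyström approximation and the frozen full-data reference described in the paper; the kernel, bandwidth, and function names are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth ** 2))

def mmd2_unbiased(gen, ref, bandwidth=1.0):
    """Unbiased MMD^2 estimate between generated and reference feature batches."""
    n, m = len(gen), len(ref)
    k_gg = rbf_kernel(gen, gen, bandwidth)
    k_rr = rbf_kernel(ref, ref, bandwidth)
    k_gr = rbf_kernel(gen, ref, bandwidth)
    # Within-batch terms exclude the diagonal (self-similarity) to stay unbiased.
    repulsion = (k_gg.sum() - np.trace(k_gg)) / (n * (n - 1))
    ref_term = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    attraction = k_gr.mean()
    return float(repulsion + ref_term - 2.0 * attraction)

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, size=(512, 4))
same = rng.normal(0.0, 1.0, size=(512, 4))      # drawn from the reference distribution
shifted = rng.normal(2.0, 1.0, size=(512, 4))   # clearly different distribution
mmd_same = mmd2_unbiased(same, ref)
mmd_shifted = mmd2_unbiased(shifted, ref)
```

The repulsion term is quadratic in the generated batch, which fits the reviewed claim that the size of the generated batch governs how well the estimate behaves.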
The experimental evaluation is rigorous and comprehensive. The authors conduct controlled ablations on the two design axes (comparison metric and representation space). They demonstrate that their method, iRDM, sets a new state-of-the-art for one-step generation on ImageNet-256 with an SW_r14 score of 1.30, significantly outperforming prior methods like pMF-H FD-SIM (2.05). They also show that post-training FLUX.2 (a 4-step model) into a 1-step model using this recipe improves GenEval and PickScore scores over the 4-step baseline, a surprising and valuable result. The use of an independent evaluation metric (SW_r14) that is not part of the training loss effectively mitigates concerns about metric gaming. The inclusion of a held-out encoder panel for evaluation adds robustness to the claims.
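As background for the SW_r14 metric, a sliced Wasserstein distance between two equally sized feature sets can be estimated by projecting onto random directions and comparing sorted projections. The sketch below shows that generic estimator for a single feature space; how the paper aggregates over its 14 encoders and normalizes the score is not covered here, and the function name and defaults are assumptions.

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=128, seed=0):
    """Monte Carlo sliced 2-Wasserstein distance between two (n, d) sample sets."""
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equally sized sample sets")
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(x.shape[1], n_proj))
    dirs /= np.linalg.norm(dirs, axis=0, keepdims=True)   # unit-length directions
    # In 1D, optimal transport simply matches sorted samples.
    px = np.sort(x @ dirs, axis=0)
    py = np.sort(y @ dirs, axis=0)
    return float(np.sqrt(np.mean((px - py) ** 2)))

rng = np.random.default_rng(1)
real_a = rng.normal(0.0, 1.0, size=(1000, 16))
real_b = rng.normal(0.0, 1.0, size=(1000, 16))
fake = rng.normal(0.5, 1.0, size=(1000, 16))
sw_identical = sliced_wasserstein(real_a, real_a)
sw_real = sliced_wasserstein(real_a, real_b)
sw_fake = sliced_wasserstein(real_a, fake)
```

Two independent draws from the same distribution give a small but non-zero distance, so any such metric has a sampling floor against which generator scores should be read.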
The paper provides significant detail for reproducibility. It specifies the encoder architectures, the Nyström landmark count (4096), batch sizes (5120/10240), learning rates, and the specific Lagrangian control mechanism. The reference to "gradient caching" and the specific implementation of the Nyström attraction term are clear. The project page likely contains code, which is standard for arXiv papers. The use of standard pretrained encoders (DINOv2, CLIP, etc.) ensures that the components are accessible. The detailed ablation studies allow other researchers to replicate the design space exploration.
The primary limitation is the computational cost of training. The requirement for large batch sizes (N=5120) and the use of 10 encoders for forward passes per step, while optimized with gradient caching, still implies a substantial memory and compute footprint compared to smaller-batch methods. The method relies heavily on the quality and diversity of the frozen encoders; if the encoder panel is biased or insufficiently diverse, the "balanced" training might still fail to capture all aspects of realism. Additionally, while it surpasses the 4-step FLUX on GenEval, it is a post-training step, meaning the base model's capabilities are a prerequisite. The "one-step" nature inherently limits the complexity of the generated distribution compared to iterative methods, as evidenced by the gap between 1.30 and the real-data floor of 1.00.
This work significantly advances the field of efficient generative modeling by demonstrating that high-quality one-step generation is achievable without online teachers or adversarial training, relying instead on careful distribution matching in feature space. This could lead to faster inference times for image generation, making it more accessible for real-time applications. The insights into metric gaming and the proposal of a robust multi-encoder evaluation metric (SW_r14) provide a valuable tool for the community to better assess generator quality. However, the ease of generating realistic images also raises standard concerns about misuse in creating deepfakes or misleading content, though the one-step nature might make it less suitable for high-fidelity, long-tail content generation compared to multi-step models.
Large vision-language models (LVLMs) have achieved strong performance across many medical imaging tasks, yet their application to ultrasound remains limited due to its inherent complexity and variability. In this work, we revisit what is truly needed to enable real-world ultrasound understanding. Instead of introducing complex architectures or elaborate training strategies, we show that data scale and clinically faithful data alignment are the key factors. We construct a large-scale dataset of 1.5M real-world ultrasound examinations, containing 17.7M images, multi-organ coverage, and paired uncurated clinical reports. Crucially, we organize the data at the examination level, aligning multiple images with their corresponding reports to reflect real clinical workflows. We then fine-tune a standard LVLM using low-rank adaptation (LoRA) on this dataset without task-specific modifications. Surprisingly, this simple recipe already leads to strong performance across diverse ultrasound understanding tasks, outperforming prior methods designed with more complex pipelines. Beyond these results, we present model and data scaling analyses that provide insights into the role of scale in ultrasound LVLMs.
Primary: Technical University of Munich
All Institutions: MedAI Technology (Wuxi) Co. Ltd, Technical University of Munich
This paper makes a substantial contribution to medical vision-language modeling by demonstrating that large-scale, clinically aligned data curation and simple fine-tuning of standard LVLMs can outperform complex, specialized architectures for ultrasound understanding, providing a new benchmark and paradigm for the field.
The paper proposes a straightforward yet effective pipeline for ultrasound understanding: constructing a massive dataset (1.5M exams, 17.7M images) and fine-tuning a standard LVLM (Qwen3-VL-4B) using LoRA. The core methodological contribution is not a new architecture, but the rigorous demonstration that "data scale + clinically faithful alignment" supersedes complex architectural modifications or specialized training strategies in this domain. The approach is simple, relying on examination-level supervision where multiple images are paired with long-form reports, mimicking real clinical workflows. This challenges the prevailing trend of designing intricate multimodal adapters for medical imaging.
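Since the recipe hinges on LoRA, a minimal sketch of a low-rank adapted linear layer may help. It follows the standard LoRA formulation (frozen weight plus a scaled low-rank update, with the up-projection initialized to zero); the class name, rank, and scaling values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen, never updated
        self.A = rng.normal(scale=0.01, size=(rank, in_dim))   # trainable down-projection
        self.B = np.zeros((out_dim, rank))                     # trainable, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(2, 128))
y = layer(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `rank * (in_dim + out_dim)` parameters are trained instead of `in_dim * out_dim`.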
The experimental evaluation is comprehensive and robust. The authors benchmark their fine-tuned model, LUMI, against a wide array of state-of-the-art general-purpose (InternVL3.5, Qwen3.5, Kimi-VL) and medical-domain (HuatuoGPT, Lingshu, EchoVLM) models across five major ultrasound categories. The results show significant improvements, particularly in clinical fidelity metrics (F1 score) and higher-order NLP metrics (BLEU-4, ROUGE-L). The inclusion of an LLM-based evaluator for clinical correctness is a strong methodological choice that adds depth beyond standard text similarity metrics. Scaling analyses (model and data) provide valuable empirical insights, showing saturation points that guide future resource allocation.
The paper provides detailed hyperparameters, training configurations (LoRA rank, learning rate, batch size), and data preprocessing steps. The dataset size and source descriptions are clear. However, the dataset itself (1.5M exams) is likely too large and privacy-sensitive to be fully open-sourced in its raw form, which may limit direct reproducibility of the training phase for others. The code/model availability is indicated by the project URL, which is crucial for verification.
The primary limitation is the potential for hallucination when presented with incomplete image sets at inference time, as the model is trained on complete examinations. Additionally, the reliance on uncurated, real-world reports introduces noise and variability in language style, which might affect generalization to standardized reporting formats. The study focuses on report generation and lacks detailed evaluation on downstream diagnostic tasks (e.g., specific lesion detection accuracy vs. radiologist agreement).
This work has significant implications for medical AI, demonstrating that high-quality, large-scale data alignment can drive performance gains more effectively than architectural complexity. It encourages the community to prioritize data curation and clinical fidelity in medical LVLM development. The dataset and model could accelerate research in ultrasound AI, potentially improving diagnostic support in resource-limited settings where expert sonographers are scarce.
Safe motion planning in dynamic environments requires reasoning about the uncertainty in predicted obstacle motion without sacrificing real-time performance. Existing conformal approaches conformalize a scalar score that aggregates per-obstacle prediction errors, losing spatial coherence and scaling poorly with scene density. We instead conformalize the entire predicted distance field at once. This functional conformal prediction (FCP) framework yields a distribution-free, field-level lower bound, from which safety follows uniformly: any trajectory satisfying the resulting constraint is certified safe, independent of how the control space is sampled. The key enabler is that the residual distance field is empirically low-rank and approximately time-invariant, which makes the bound decomposable in coefficient space. An envelope is fitted offline via functional PCA and a Gaussian-mixture inductive conformal procedure, then refined online by a lightweight adaptive functional conformal (AFCP) update on a low-dimensional vector. This keeps the per-step cost largely insensitive to obstacle count and retains long-run field coverage under distribution shift. We embed the envelope as a tightened safety constraint in a sampling-based model predictive controller, FCP-MPC. On the ETH--UCY pedestrian benchmarks and a dense 3D quadrotor task with up to 280 dynamic obstacles, FCP-MPC attains a favorable balance of safety, feasibility, and efficiency, reaching goals where pointwise and egocentric conformal baselines become too conservative or too expensive, while keeping per-step computation far below online uncertainty-reasoning baselines.
Primary: Seoul National University
All Institutions: Seoul National University
This paper introduces a novel Functional Conformal Prediction framework for safe motion planning, leveraging the low-rank structure of prediction errors to provide scalable, distribution-free safety guarantees in dynamic environments. The approach effectively addresses the computational and spatial coherence limitations of prior conformal methods, offering a significant advancement in the integration of statistical uncertainty quantification with real-time robotic control.
The paper proposes a Functional Conformal Prediction (FCP) framework to address the scalability and spatial coherence issues of existing conformal prediction (CP) methods in safe motion planning. Instead of conformalizing scalar scores per obstacle, the authors treat the prediction error of the distance field as a functional object in a Hilbert space. They leverage the empirical observation that residual distance fields are low-rank and approximately time-invariant. This allows them to perform Functional PCA (FPCA) to decompose the field into a few principal components. A Gaussian Mixture Model (GMM) is fitted to the coefficients of these components in an offline stage, and an inductive conformal procedure is used to create a distribution-free envelope. Online, an Adaptive Functional Conformal Prediction (AFCP) update refines the envelope through a lightweight adjustment of a low-dimensional vector to handle distribution shifts. This approach decouples the expensive statistical calibration from the real-time planning loop, allowing the safety constraint to be evaluated efficiently for any sampled trajectory in an MPC framework. The methodology is theoretically sound, providing asymptotic safety guarantees under both exchangeable and non-exchangeable (adaptive) settings.
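The offline stage can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: synthetic, exactly low-rank residual fields; a Euclidean norm on the FPCA coefficients as the conformal score in place of the paper's Gaussian-mixture score; and hypothetical function names throughout.

```python
import numpy as np

def fit_fpca(residuals, k):
    """residuals: (n_samples, n_grid) flattened residual fields."""
    mean = residuals.mean(axis=0)
    _, _, vt = np.linalg.svd(residuals - mean, full_matrices=False)
    return mean, vt[:k]                      # rows of vt are orthonormal components

def coefficient_score(fields, mean, comps):
    """Assumed score: Euclidean norm of the FPCA coefficients."""
    return np.linalg.norm((fields - mean) @ comps.T, axis=1)

def conformal_radius(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if rank > n else np.sort(scores)[rank - 1]

rng = np.random.default_rng(0)
n_grid, k = 50, 3
true_basis = rng.normal(size=(k, n_grid))
fields = rng.normal(size=(3000, k)) @ true_basis     # synthetic rank-k residual fields
train, calib, test = fields[:500], fields[500:1000], fields[1000:]

mean, comps = fit_fpca(train, k)
q = conformal_radius(coefficient_score(calib, mean, comps), alpha=0.1)

# Cauchy-Schwarz: if ||c|| <= q, then field(x) >= mean(x) - q * ||comps[:, x]||.
lower = mean - q * np.linalg.norm(comps, axis=0)
covered = coefficient_score(test, mean, comps) <= q
field_ok = (test >= lower - 1e-9).all(axis=1)
```

The point of the sketch is the structure: one calibrated radius in coefficient space yields a lower envelope over every grid point at once, so checking a sampled trajectory against `lower` does not depend on how many obstacles produced the field.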
The authors evaluate FCP-MPC on two benchmarks: the ETH-UCY pedestrian dataset (2D) and a dense 3D quadrotor simulation with up to 280 dynamic obstacles. They compare against pointwise and egocentric conformal baselines, as well as online uncertainty-reasoning methods. The results indicate that FCP-MPC achieves a favorable balance of safety, feasibility, and efficiency. It successfully reaches goals where pointwise methods are too conservative and egocentric methods are too expensive or lose coverage. The per-step computation remains largely insensitive to obstacle count, demonstrating the scalability of the functional approach. The experiments are comprehensive, covering both 2D and 3D scenarios and varying densities.
The paper provides a GitHub repository link (https://github.com/CORE-SNU/FCP-MPC), which significantly aids reproducibility. The methodology is described in detail, including the offline FPCA and GMM fitting, and the online AFCP update. The use of standard benchmarks (ETH-UCY) also facilitates comparison. However, the specific implementation details of the "dense 3D quadrotor task" (e.g., exact dynamics, sensor noise models, prediction model architecture) might require careful reading of the appendix or code to fully replicate.
The method relies on the assumption that the residual distance field is low-rank and approximately time-invariant. While verified empirically, this may not hold in all environments (e.g., highly dynamic, non-stationary scenes with complex occlusions). The offline calibration requires a sufficiently large and representative dataset of residual fields. The adaptive update (AFCP) provides long-run coverage but may take time to converge to the correct threshold under rapid distribution shifts. The soft-constraint variant degrades safety guarantees by a controllable slack, which might be unacceptable for some high-risk applications.
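The convergence concern can be made tangible with the generic online quantile-tracking rule used in adaptive conformal methods. This is a scalar stand-in, not the paper's AFCP update, and the step size, score distribution, and shift point below are assumptions chosen for illustration.

```python
import numpy as np

def track_threshold(scores, alpha, gamma, q0=0.0):
    """Online threshold update: widen after a miss, shrink slowly otherwise."""
    q = q0
    errs = np.empty(len(scores))
    for t, s in enumerate(scores):
        errs[t] = float(s > q)               # 1 if the current threshold failed to cover
        q += gamma * (errs[t] - alpha)
    return q, errs

rng = np.random.default_rng(0)
T, alpha, gamma = 20000, 0.1, 0.05
scale = np.where(np.arange(T) < T // 2, 1.0, 3.0)    # abrupt shift halfway through
scores = np.abs(rng.normal(size=T)) * scale

q_final, errs = track_threshold(scores, alpha, gamma)
miscoverage = errs.mean()
```

Summing the updates gives `miscoverage - alpha = (q_final - q0) / (gamma * T)`, so the long-run error rate approaches `alpha` whenever the threshold stays bounded. Right after the shift, however, the threshold is too small and misses cluster until it catches up, which is the transient the limitation above refers to.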
This work contributes to the field of safe autonomous systems by providing a scalable and theoretically grounded method for uncertainty-aware motion planning. By enabling real-time safety guarantees in dense, dynamic environments, it facilitates the deployment of robots in more complex real-world scenarios. The functional conformal prediction framework could also be applicable to other domains involving spatial or functional data uncertainty, such as medical imaging or environmental monitoring.
Embodied task planning asks an agent to turn a natural-language instruction into an executable sequence of actions in a physical scene, and is a building block for household, assistive, and service robots. Recent prompting-based and reinforcement-learning planners generate fluent action text but lack a cheap deterministic check that the produced plan is valid in the target world, while high-fidelity simulation is too slow to serve as an inner-loop training signal. The general problem is therefore how to obtain verifiable supervision and rewards for embodied planners without relying on string-level matching or full simulation. Here we show that a single BDDL specification, automatically constructed from open-world video evidence or curated tasks, can serve as a shared interface for data construction, plan verification, and reward design. A video-to-BDDL parser, an LLM verifier, and a lightweight symbolic engine together supply dense feedback at millisecond latency. We further introduce GroupAdapt, a difficulty-aware length schedule that uses the in-batch group pass rate as a zero-cost signal so that hard prompts get wider length tolerance and automatically tighten as their pass rate improves. Under the guidance of the proposed verifier and GroupAdapt schedule, the 8B planner attains a Strict-Pass score of 97.3 on BEHAVIOR-1000, yielding a 25.9 percent relative improvement over the Qwen3-8B baseline. This result exceeds the strongest large-model baseline by 3.5 percent, while simultaneously compressing the response length by 79 percent to 207 tokens, demonstrating both effectiveness and efficiency.
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong University of Science and Technology, University of London
This paper presents a significant advancement in embodied AI by introducing a BDDL-centric pipeline that integrates symbolic verification with reinforcement learning, enabling compact and correct task planning for 8B models that outperform larger baselines. The rigorous evaluation and clear methodology make it a valuable contribution to the field of robotics and machine learning.
The paper proposes a coherent pipeline for embodied task planning that bridges the gap between open-world natural language instructions and executable symbolic plans. The core methodological innovation lies in the use of BDDL (Behavior Domain Definition Language) as a unified interface for data construction, verification, and reward design. Specifically, the authors introduce a video-to-BDDL parser to generate training data from open-world videos, an LLM verifier to ensure semantic consistency, and a lightweight symbolic engine for millisecond-latency verification. The training methodology combines Supervised Fine-Tuning (SFT) with Symbolic-Reinforcement Learning (using DAPO). A key technical contribution is "GroupAdapt," a difficulty-aware length scheduling mechanism that uses the in-batch group pass rate to dynamically adjust length tolerance, allowing harder prompts to have more flexibility while enforcing conciseness on easier ones. This approach effectively decouples correctness learning from length compression, addressing a common failure mode in LLM planning where early compression leads to errors.
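A difficulty-aware length schedule of this kind can be sketched as follows. The linear interpolation, the token budgets, and the penalty weight are all hypothetical choices made for illustration; the review does not give GroupAdapt's actual formula.

```python
def length_budget(pass_rate, tight=256, loose=1024):
    """Interpolate linearly: pass_rate 0 -> loose budget, pass_rate 1 -> tight budget."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return tight + (loose - tight) * (1.0 - pass_rate)

def shaped_reward(passed, n_tokens, pass_rate, penalty_per_token=0.001):
    """Correctness reward minus a penalty on tokens beyond the current budget."""
    overflow = max(0.0, n_tokens - length_budget(pass_rate))
    return (1.0 if passed else 0.0) - penalty_per_token * overflow

# One sampled group for a single prompt: (verifier verdict, response length in tokens).
group = [(True, 300), (False, 900), (True, 500), (False, 1200)]
pass_rate = sum(passed for passed, _ in group) / len(group)
rewards = [shaped_reward(passed, n, pass_rate) for passed, n in group]
```

The pass rate is already computed from the sampled group, so it costs nothing extra, and the budget tightens by itself as the prompt becomes easier for the policy.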
The experimental evaluation is rigorous and comprehensive. The authors evaluate on the BEHAVIOR-1K benchmark, specifically B-100 and B-1000, using metrics like Strict-Pass (SP), Engine-Pass (EP), and Goal Completion Ratio (GCR). The results show that the proposed 8B model significantly outperforms both its Qwen3-8B baseline and larger models (e.g., Gemma-4-31B) in terms of SP score (97.3% on B-1000) while maintaining competitive performance on other metrics. The ablation studies effectively demonstrate the contribution of each component: SFT initialization, symbolic reward shaping, and GroupAdapt. The analysis of length compression is particularly strong, showing a 79% reduction in response length without sacrificing correctness. The inclusion of out-of-domain mathematical reasoning tasks (AIME, MATH) serves as a sanity check to ensure that the length compression does not degrade general reasoning capabilities, which is a valuable addition.
The paper provides detailed descriptions of the methodology, including the BDDL structure, the symbolic engine logic, and the RL hyperparameters (DAPO settings, group size, learning rates). The appendix contains extensive details on data construction, action library expansion, and reward landscape analysis. The use of open-weight models (Qwen3, Gemma) and standard benchmarks (BEHAVIOR-1K) enhances reproducibility. However, the specific implementation of the video-to-BDDL parser and the LLM verifier (likely proprietary or custom-built) might present some challenges for exact replication, although the logical flow is clear. The code for the symbolic engine and RL training loop appears to be the primary barrier to full reproducibility, but the paper provides sufficient detail for a competent researcher to implement.
The paper acknowledges several limitations. First, the method is a planning model and does not handle low-level control, which is a necessary layer for real-world deployment. Second, the reliance on BDDL requires robust scene understanding and object grounding, which can be noisy in real-world settings. The paper notes that real-time scene-to-BDDL construction is an open problem. Third, the performance is evaluated in simulation; real-world transferability is not demonstrated. Finally, the method's effectiveness is tied to the quality of the BDDL specifications and the action library, which may need manual curation or extensive LLM-assisted expansion for new domains.
This work has significant implications for the development of autonomous robots and embodied AI systems. By providing a scalable and verifiable method for training planners, it addresses a critical bottleneck in making robots capable of following complex, natural language instructions in unstructured environments. The emphasis on efficiency (shorter responses) and correctness (symbolic verification) aligns with the industry's need for reliable and deployable AI systems. The use of open-world video data for training also suggests a path towards more data-efficient and generalizable planning models. However, the reliance on simulation and symbolic representations may limit immediate applicability in highly dynamic or unstructured real-world scenarios without significant additional engineering.
General-purpose robot policies should be modeled as dynamical systems, yet many VLA and generative imitation policies still rely on present observations or short windows. This Markovian shortcut fails in memory-dependent manipulation: identical observations can demand different actions after different histories. We present Chronos, a physics-informed full-history framework for non-Markovian long-horizon manipulation. The key idea is to elevate observation history from auxiliary context to the latent state of the policy dynamics. At each physical control step, Chronos forms one state-representative token by fusing observation and proprioception, so the token sequence is aligned one-to-one with physical time. A selective state space model propagates this causal historical state, which conditions a multimodal coarse action prior through implicit maximum likelihood estimation (IMLE). This prior is then refined by a second-order Schrodinger-inspired bridge that predicts acceleration fields, yielding smoother and more physically grounded robot motion. Across 16 simulated tasks and 4 real-world experiments, Chronos is evaluated on precision insertion, general manipulation, and memory-dependent long-horizon control. On RMBench, where success requires remembering task phase, Chronos achieves 73.6% average success, outperforming Markovian VLA baseline pi0.5 by +62.4 percentage points, a 6.6x relative gain, while using 10x fewer parameters. It also surpasses the memory VLA Mem-0 by 22.8 points while using over 30x fewer parameters. In real-world dual-arm experiments using a single RGB camera, Chronos achieves 78% average success over four tasks, including 72% on the three memory-dependent tasks, whereas pi0.5 achieves 7% overall and 0% on the memory-dependent subset. These results suggest that history should not be treated as auxiliary context, but as the latent state of the manipulation policy.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology
Chronos introduces a physics-informed, full-history state-space framework for non-Markovian manipulation, achieving state-of-the-art performance on memory-dependent benchmarks with significantly fewer parameters than existing VLA models. The paper makes a significant technical contribution by addressing the non-Markovian nature of long-horizon robotic manipulation through a novel combination of Selective State Space Models for full-history encoding and a Schrödinger-inspired second-order bridge for action refinement. The methodology is rigorous, with a clear theoretical derivation linking quantum mechanical concepts to action-space acceleration fields, and the empirical results are strong, particularly on RMBench where it outperforms much larger memory-augmented VLAs. The approach is highly relevant to the current landscape of robot learning, offering a scalable and efficient alternative to large-scale transformer-based policies for tasks requiring temporal memory. The comprehensive evaluation across simulated and real-world tasks, along with detailed ablations, provides strong evidence for the efficacy of the proposed method.
The paper proposes Chronos, a framework addressing the non-Markovian nature of long-horizon manipulation. The core methodological contributions are twofold: (1) A full-history state representation using a Selective State Space Model (Mamba) that treats the entire observation history as the latent state, rather than using it as auxiliary context or a short window. This allows for precise temporal credit assignment across the full trajectory. (2) A physics-informed action generation module based on a "Schrödinger-inspired bridge." This module uses Implicit Maximum Likelihood Estimation (IMLE) to generate a coarse multimodal prior, which is then refined by a second-order differential equation solver that predicts acceleration fields. The derivation from the Schrödinger equation via Madelung transformation to a quantum Hamilton-Jacobi equation provides a theoretical justification for modeling action refinement as a physical process involving position stabilization and velocity dissipation. The approach is theoretically grounded and distinct from standard diffusion or flow-matching policies by explicitly modeling acceleration and using a quartic noise schedule compatible with second-order dynamics.
The evaluation is comprehensive, covering 16 simulated tasks and 4 real-world experiments. The results are compelling, particularly on RMBench, where Chronos achieves a 73.6% average success rate, significantly outperforming Markovian baselines like pi0.5 (+62.4 points) and memory-augmented VLAs like Mem-0 (+22.8 points), while using substantially fewer parameters (0.3B vs >10B for Mem-0). On RoboTwin 2.0, it achieves state-of-the-art performance in general manipulation. The ablation studies effectively isolate the contributions of the SSM memory and the second-order bridge, demonstrating that the acceleration-based refinement provides smoother and more precise actions, especially in contact-rich tasks like precision insertion. The real-world results on dual-arm manipulation further validate the transferability of the learned policies.
The paper provides a project page and code repository link. The methodology is described with sufficient mathematical detail, including the derivation of the acceleration target and the specific noise schedules. The use of standard components (Mamba, PointNet, DINOv2) facilitates implementation. However, the specific hyperparameters for the Schrödinger bridge integration steps and the IMLE latent update dynamics are crucial for reproduction but are only partially detailed in the text. The claim of "memory-efficient training" via chunked perception is a practical detail that aids reproducibility.
The paper acknowledges that in fully observable, local-geometry-dominated tasks (e.g., Put Bottles Dustbin), Chronos slightly underperforms strong Markovian diffusion policies like DP3. This suggests that the overhead of full-history modeling may not always be beneficial when the present state is a sufficient statistic. Additionally, the reliance on a single RGB camera in real-world experiments might limit performance in complex lighting or occlusion scenarios compared to multi-view setups. The theoretical derivation, while elegant, is a specific projection of quantum mechanics concepts to control theory, and its generalizability to other domains beyond robotics is unclear.
This work advances the field of robotic manipulation by providing a robust solution to the long-standing problem of memory-dependent control. By demonstrating that full-history modeling can be efficient and effective, it challenges the prevailing trend of scaling VLA models with short-context windows. The physics-informed action generation could inspire more physically grounded generative models in other control domains. The significant performance gap on memory benchmarks highlights the limitations of current foundation models in temporal reasoning, guiding future research towards better temporal architectures.
Large-scale dexterous grasp datasets encode rich priors over hand-object interaction, but their use has largely been confined to grasp generation and pick-and-place manipulation. We study whether such data can instead support functional dexterity in articulated tool use, where a robot must acquire a tool, maintain contact, and operate its functional moving parts. We adapt a hierarchical imitation learning framework that combines high-level hand sub-goal prediction with a low-level goal-conditioned controller. We construct a 355k-trajectory grasp-pretraining dataset from large-scale dexterous grasp annotations and use it to pretrain the low-level controller. The controller is then fine-tuned on downstream task demonstrations. To evaluate this setting, we introduce DexCraft, a simulation benchmark with six articulated tool-use tasks requiring coordinated finger motion. Across simulation and real-world experiments, our approach outperforms end-to-end diffusion policy baselines and hierarchical policies trained from scratch. In the real world, it improves full-task success by 33.3 percentage points over DP3. These results show that grasp datasets can serve not only as resources for grasp synthesis, but also as scalable pretraining data for contact-rich dexterous manipulation. Videos are shown on https://yingyuan0414.github.io/grasp2dexterity/ .
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The paper presents a compelling method for leveraging large-scale grasp datasets to enable dexterous tool use, demonstrating significant performance gains through hierarchical imitation learning and pretraining.
The paper proposes a hierarchical imitation learning framework for dexterous tool use. The core methodological contribution is the adaptation of a low-level goal-conditioned controller (based on Diffusion Policy) pre-trained on a large-scale synthetic grasp dataset (G2D-Pretrain, derived from Dexonomy). The high-level policy predicts 16-DoF hand keypoints as sub-goals, addressing the insufficiency of coarse gripper-centric sub-goals for dexterous hands. The approach effectively bridges the gap between static grasp synthesis data and dynamic, contact-rich manipulation tasks by leveraging the rich kinematic priors in grasp datasets. The hierarchical decomposition (high-level planning, low-level execution) is well-motivated and technically sound, particularly the semantic mapping of joint spaces between the Shadow hand (pretraining) and LEAP hand (fine-tuning).
The evaluation includes a new simulation benchmark, DexCraft, with six articulated tool-use tasks. The paper provides extensive ablation studies comparing end-to-end policies (DP, DP3), hierarchical policies from scratch, and their pre-trained counterparts. The results demonstrate significant improvements, particularly in the real-world setting where the proposed method improves full-task success by 33.3 percentage points over DP3. The sample efficiency analysis further supports the claim that pretraining reduces the need for downstream demonstrations. The inclusion of both simulation and real-world experiments strengthens the validity of the claims, although the real-world evaluation is limited to three tasks and a single robot setup.
The paper provides detailed descriptions of the data augmentation process for G2D-Pretrain, the policy architectures, and the experimental setups. The project website link suggests code or video availability, which aids reproducibility. The use of standard simulators (ManiSkill3) and datasets (Dexonomy) facilitates replication. However, the specific details of the teleoperation setup and the exact implementation of the semantic joint mapping for the LEAP hand might require additional clarification for perfect reproducibility.
The reliance on manually annotated sub-goals for training the high-level policy limits scalability. The simulation benchmark uses single object instances per task, which may not fully capture the generalization capabilities required for diverse object geometries. The real-world evaluation is constrained by the specific hardware setup (Franka + LEAP Hand) and does not explore the impact of tactile feedback or online adaptation, which are critical for robust dexterous manipulation.
This work significantly advances the field of dexterous manipulation by demonstrating that large-scale grasp datasets, previously underutilized for dynamic tasks, can serve as powerful pretraining resources. This could lower the barrier to entry for learning complex manipulation skills by reducing the need for costly real-world demonstrations. The DexCraft benchmark provides a valuable resource for evaluating articulated tool use, encouraging further research in this area.
Long-context inference is increasingly common in large language model (LLM) serving, driven by retrieval-augmented generation and agentic systems. In disaggregated inference, these workloads require transferring large Key-Value (KV) caches across the network, where decoding cannot begin until the transfer completes. Recent KV quantization techniques reduce data volume and alleviate this bottleneck, but existing schemes fail to achieve both low network-exposed latency and high inference accuracy. We challenge the assumption that the KV cache is an indivisible unit that must be fully received before use. We leverage the observation that different bits in the KV cache contribute unequally to attention computation and inference precision: the most significant bits capture the coarse structure of attention and the least significant bits refine precision. This property enables partial use of the KV cache during decoding. We present Lynx, a system that enables progressive, split-stream KV transfer by partitioning the KV cache into a high-priority Anchor stream carrying the most significant bits and a low-priority Residual stream carrying remaining precision. Decoding begins upon receipt of the Anchor stream and proceeds speculatively while the Residual stream is transferred concurrently, followed by verification that ensures equivalence to higher-precision decoding. Across multiple models and serving workloads, Lynx achieves Time-to-First-Token (TTFT) comparable to aggressive 4-bit KV quantization, while matching the accuracy of high-precision (BF16) inference, improving TTFT over standard 8-bit KV quantization by up to $1.43\times$ and improving accuracy over state-of-the-art by up to $5.1\%$.
Primary: University College London
All Institutions: University College London, Huawei
Lynx introduces a progressive speculative quantization framework that decouples KV cache transfer from decoding initiation, achieving significant latency reductions without sacrificing inference accuracy in long-context LLM serving.
The paper proposes "Lynx," a novel system for disaggregated LLM inference that challenges the assumption that the Key-Value (KV) cache must be fully transferred before decoding begins. The core innovation is a hierarchical split-stream quantization scheme that partitions the KV cache into a high-priority "Anchor" stream (Most Significant Bits) and a low-priority "Residual" stream (Least Significant Bits). By transmitting the Anchor stream first, the decode instance can begin speculative token generation using the coarse-grained KV data. Once the Residual stream arrives, the system verifies the speculative tokens against the full-precision (or higher-precision) KV cache. This approach effectively overlaps network communication with computation: decoding on the Anchor stream plays the role of the draft stage in speculative decoding, and decoding on the completed cache plays the role of the verifier. The methodology is technically sound, leveraging the observation that MSBs dominate attention score magnitudes due to the exponential nature of Softmax, while LSBs refine precision. The integration of non-linear logarithmic quantization and outlier-aware chunking further enhances the fidelity of the Anchor stream.
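The Anchor/Residual split can be illustrated with plain bit operations. This sketch assumes uniform 8-bit quantization and a 4/4 bit split for simplicity; the paper describes non-linear logarithmic quantization and outlier-aware chunking, which are not modeled here, and all function names are hypothetical.

```python
import numpy as np

def quantize_u8(x):
    """Uniform 8-bit quantization (an assumption; not the paper's scheme)."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 or 1.0
    return np.round((x - lo) / scale).astype(np.uint8), lo, scale

def split_streams(q):
    """Anchor = 4 most significant bits, Residual = 4 least significant bits."""
    return q >> 4, q & 0x0F

def merge_streams(anchor, residual):
    """Exact 8-bit codes once both streams have arrived."""
    return (anchor << 4) | residual

def anchor_only(anchor):
    """Coarse codes from the Anchor alone; missing low bits set to their midpoint."""
    return (anchor << 4) | 0x08

rng = np.random.default_rng(0)
kv = rng.normal(size=(4, 128)).astype(np.float32)    # stand-in for a KV cache tile
q, lo, scale = quantize_u8(kv)
anchor, residual = split_streams(q)

coarse = anchor_only(anchor)                  # usable as soon as the Anchor arrives
exact = merge_streams(anchor, residual)       # used later for verification
max_code_error = np.abs(coarse.astype(int) - q.astype(int)).max()
```

Under this split the Anchor stream carries half the bits, and decoding from it alone is off by at most 8 quantization levels out of 256. That bounded coarseness is what makes speculative decoding on the Anchor plausible before the Residual stream lands.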
The evaluation is comprehensive, covering three models (LLaMA 3.1 8B, Qwen 3 32B, Mistral 3 24B) and three datasets (MMLU-Pro, Needle-in-the-Haystack, QMSum) across varying context lengths (up to 128K) and bandwidths (10-50 Gbps). The results demonstrate that Lynx achieves Time-to-First-Token (TTFT) comparable to aggressive 4-bit quantization while maintaining accuracy equivalent to 8-bit or BF16 inference. Specifically, it improves TTFT over standard 8-bit quantization by up to 1.43x and improves accuracy over state-of-the-art compression methods (like CacheGen) by up to 5.1%. The paper includes detailed ablation studies on context length scaling and bandwidth variations, showing that the benefits of speculative overlap increase with longer contexts and lower bandwidths. The use of Ascend NPUs (Huawei hardware) is a specific constraint but does not detract from the generalizability of the system design principles.
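To make the bandwidth sensitivity concrete, the following back-of-the-envelope sketch estimates KV cache size and raw transfer time at different bit widths. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption based on the commonly published LLaMA 3.1 8B configuration, not a figure from the paper. The estimate ignores quantization scales, metadata, and protocol overhead, so it is an idealized lower bound, not a reproduction of the reported results.

```python
def kv_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bits_per_value):
    """Raw KV cache size in bytes (factor 2 covers keys and values)."""
    values = 2 * num_layers * num_kv_heads * head_dim * num_tokens
    return values * bits_per_value / 8


def transfer_seconds(num_bytes, gbps):
    """Idealized transfer time over a link of `gbps` gigabits per second."""
    return num_bytes * 8 / (gbps * 1e9)


# Assumed shape, see the lead-in; 128K tokens is the longest context evaluated.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=128 * 1024)

for bits in (16, 8, 4):
    size = kv_bytes(bits_per_value=bits, **shape)
    for gbps in (10, 50):
        secs = transfer_seconds(size, gbps)
        print(f"{bits:>2}-bit KV, {gbps} Gbps: {size / 2**30:.1f} GiB, {secs:.2f} s")
```

Under these assumptions the transfer time scales linearly with bit width and inversely with bandwidth, which is consistent with the review's observation that overlapping transfer with decoding matters most for long contexts and slow links.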
The paper provides significant implementation details, including the quantization algorithm (Algorithm 1), the split-stream construction logic, and the speculative verification protocol. It mentions implementation in ~2k lines of Ascend-C kernels and ~2k lines of Python, integrated into vLLM-Ascend. However, the code is not publicly available (no GitHub URL provided), and the evaluation is conducted on proprietary Huawei Ascend hardware, which may limit direct reproducibility for researchers using standard NVIDIA GPU stacks. The detailed description of the SerDes protocol and the non-blocking runtime architecture offers a strong basis for future reproduction.
The primary limitation is the reliance on specific hardware (Ascend NPUs) and the lack of public code. The speculative decoding verification introduces computational overhead; while the paper argues this is negligible compared to communication savings, this overhead scales with the number of speculative tokens and could become significant in very high-bandwidth, low-latency scenarios where the communication bottleneck is less severe. Additionally, the approach assumes a disaggregated prefill-decode architecture, which is not universal for all LLM serving setups. The accuracy guarantee relies on the verification step, which implies that if the Residual stream is delayed or lost, the system must wait, potentially negating the latency benefits in unstable network conditions.
This work has significant implications for the efficiency and scalability of long-context LLM serving, particularly in cloud environments where disaggregated inference is becoming standard. By enabling high-precision inference with lower effective latency, it allows for more responsive AI agents and retrieval-augmented generation systems. The technique of using partial data for speculative execution could inspire similar approaches in other areas of distributed machine learning where data dependencies are hierarchical or can be approximated.
In retrieval-augmented generation (RAG) and agentic LLM serving, prompts are assembled from independent segments into long contexts, making the prefill stage dominate the per-request computation cost. To reduce this cost, two directions have emerged in parallel: position-independent caching (PIC) admits KV reuse for non-contiguous segments shared across different requests, while hybrid-attention models reduce computational complexity by replacing most full-attention layers with linear attention. However, they cannot coexist: applying PIC to hybrid-attention models breaks down because per-token KV-cache reuse primitives do not transfer to the per-request recurrent state. In this work, we present Hypic, the first serving system for hybrid-attention LLMs with position-independent caching. For linear-attention layers, we identify the segment-cumulative transition operator as the missing algebraic primitive, and cache it alongside each segment's zero-start end-state, enabling near-exact and constant-time state composition of independently cached segments. For the remaining full-attention layers, existing PIC methods also fail as linear layers do not expose the per-token hidden states for selective recomputation. We show that the most significant attention deviation concentrates at segment boundaries, so recomputing only a small seam window at each boundary suffices to restore cross-segment lookback. Finally, Hypic exploits segment-level self-containment to parallelize cache-miss prefill across instances, turning long cold requests -- a major tail-latency contributor under both prefix caching and prior PIC -- into an accelerable workload. Evaluated across four hybrid-attention models and five workloads, Hypic reduces time-to-first-token (TTFT) by 2.45x on average and improves peak throughput by up to 2.0x over existing systems, while staying within 3.3 points of full-recompute accuracy.
Primary: Xiaohongshu Inc.
All Institutions: Xiaohongshu Inc., Peking University, Shanghai Jiao Tong University
This paper presents a significant systems contribution by resolving the incompatibility between position-independent caching and hybrid-attention LLMs through novel algebraic primitives and boundary-aware recomputation, enabling substantial latency and throughput improvements for RAG and agentic workloads.
The paper addresses a critical intersection in LLM serving: the compatibility of Position-Independent Caching (PIC) with Hybrid-Attention architectures (which mix linear and full attention). The authors correctly identify that standard PIC primitives fail for linear attention layers because these layers keep a single per-request recurrent state rather than per-token KV entries, so there is nothing per-token to reuse. Their proposed solution, caching the "segment-cumulative transition operator" alongside each segment's zero-start end-state, is a mathematically sound and novel algebraic primitive for state composition. Furthermore, they address the full-attention layer bottleneck in PIC by identifying that attention deviations are localized at segment boundaries, proposing a "seam window" recomputation strategy. This is a sophisticated systems-level optimization that balances accuracy and efficiency. The approach is rigorous, leveraging the specific mathematical properties of linear attention (associativity) to enable caching that was previously thought incompatible.
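The state-composition idea can be sketched under a simplifying assumption: a linear-attention recurrence with a scalar per-token decay, S_t = a_t * S_{t-1} + outer(k_t, v_t). Unrolling over a segment gives S_end = (product of a_t) * S_start + Z, where Z is the end-state reached from a zero start. The scalar decay, the function names, and the toy tokens below are all hypothetical; the transition operator in Hypic's target models may be per-channel or matrix-valued.

```python
def zeros(dk, dv):
    return [[0.0] * dv for _ in range(dk)]


def step(state, a, k, v):
    """One recurrent update: S <- a * S + outer(k, v)."""
    return [[a * state[i][j] + k[i] * v[j] for j in range(len(v))]
            for i in range(len(k))]


def run(tokens, state):
    """Sequentially process (decay, key, value) tokens from a given state."""
    for a, k, v in tokens:
        state = step(state, a, k, v)
    return state


def cache_segment(tokens, dk, dv):
    """Cache entry for a segment: cumulative decay and zero-start end-state."""
    decay = 1.0
    for a, _, _ in tokens:
        decay *= a
    return decay, run(tokens, zeros(dk, dv))


def compose(prev_state, entry):
    """Compose a cached segment onto a state: S_out = decay * S_prev + Z.

    The cost depends only on the state size, not on the segment length.
    """
    decay, z = entry
    return [[decay * p + q for p, q in zip(prow, zrow)]
            for prow, zrow in zip(prev_state, z)]


# Two hypothetical segments of (decay, key, value) tokens.
seg_a = [(0.9, [1.0, 0.0], [0.5, 2.0]), (0.8, [0.5, 1.0], [1.0, -1.0])]
seg_b = [(0.7, [0.0, 1.0], [2.0, 0.0]), (0.95, [1.0, 1.0], [0.5, 0.5])]

full = run(seg_a + seg_b, zeros(2, 2))                 # full recompute
entry_a = cache_segment(seg_a, 2, 2)                   # cached independently
entry_b = cache_segment(seg_b, 2, 2)
composed = compose(compose(zeros(2, 2), entry_a), entry_b)
max_diff = max(abs(x - y) for fr, cr in zip(full, composed) for x, y in zip(fr, cr))
print(max_diff)
```

Composition is order-dependent, so a cached segment can be reused at any position but the segments must still be composed in prompt order. The result matches full recomputation only up to floating-point rounding, which may be one source of the "near-exact" qualifier in the abstract.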
The evaluation is comprehensive, covering four hybrid-attention models and five distinct workloads. The results show a 2.45x average reduction in Time-to-First-Token (TTFT) and up to 2.0x improvement in peak throughput compared to existing systems. Crucially, accuracy stays within 3.3 points of full-recompute baselines, which is an acceptable trade-off for the significant latency gains in serving scenarios. The inclusion of tail-latency analysis for long cold requests adds depth, demonstrating that the system effectively mitigates a known pain point in prefix caching. The empirical evidence strongly supports the claims made in the abstract.
The paper provides sufficient technical detail regarding the algebraic primitives and the seam-window recomputation logic. The authors are from major tech companies and universities, suggesting access to robust infrastructure for such experiments. While the full codebase isn't explicitly linked in the provided text, the methodological description is precise enough for replication by systems researchers. The use of standard benchmarks and clear metrics (TTFT, throughput, accuracy delta) ensures that the results are verifiable.
The primary limitation is the accuracy trade-off. While 3.3 points is "close," in high-stakes applications, this deviation might be significant. The "seam window" size is a hyperparameter that likely requires tuning per model and context length. Additionally, the benefits are most pronounced in RAG and agentic workflows with long, composed contexts; for short, single-sequence prompts, the overhead of managing these complex caches might not yield proportional benefits. The paper focuses on serving efficiency rather than training efficiency, limiting its scope to the inference phase.
This work has significant implications for the deployment of next-generation LLMs that utilize hybrid attention for efficiency. By enabling PIC for these models, it reduces the computational cost and latency of RAG and agentic systems, making them more scalable and accessible. This could accelerate the adoption of hybrid-attention architectures in production environments where latency and cost are critical constraints. It also sets a new standard for how systems researchers should approach caching in non-standard attention mechanisms.
LLM inference comprises a compute-bound prefill phase and a memory-bound decode phase, and recent systems disaggregate them onto separate hardware. Yet today's datacenter GPUs rely on costly HBM whose bandwidth sits almost entirely idle during prefill. LLM serving across memory-heterogeneous accelerators (MemHA) pairs GDDR-based accelerators for prefill with HBM-based GPUs for decode, promising lower cost without sacrificing performance. Pushed to its most economical form, MemHA serving is inherently cross-vendor, since the best-suited chip for each phase may come from a different vendor. This breaks two assumptions that single-vendor disaggregation takes for granted -- a KV format both ends consume natively, and a shared software stack. We present \textbf{HMA-Serve}, a MemHA-centric disaggregated serving system that efficiently pairs GDDR-based accelerators for prefill with HBM-based GPUs for decode. HMA-Serve achieves this through (1) phase-wise quantization, applying vendor-native low precision for high-throughput prefill while keeping decode in high-precision BF16, (2) a compute-transfer pipeline that overlaps each layer's KV cache transfer with later-layer prefill to reduce time-to-first-token (TTFT), and (3) deferred dequantization, shipping raw quantized bytes and reconstructing them lazily on the decode GPU to reduce network bandwidth and HBM usage. Across four Qwen3 models (4B--32B) and three production traces, HMA-Serve delivers up to $3.2\times$ higher goodput than state-of-the-art memory-homogeneous methods and $4.8\times$ higher goodput-per-dollar, with no measurable loss on generation-quality benchmarks.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University
HMA-Serve presents a compelling systems solution for cost-effective LLM inference by effectively disaggregating prefill and decode phases across memory-heterogeneous, cross-vendor accelerators, achieving significant gains in goodput and cost-efficiency without compromising generation quality.
The paper proposes HMA-Serve, a disaggregated LLM serving system that pairs GDDR-based accelerators (Tenstorrent Blackhole) for the compute-bound prefill phase with HBM-based GPUs (NVIDIA A100) for the memory-bound decode phase. The core methodological innovation lies in addressing the cross-vendor heterogeneity, which breaks assumptions of native KV format compatibility and shared software stacks. The authors introduce three coordinated mechanisms: (1) Phase-wise quantization, utilizing vendor-native low precision (BFP8) for prefill and high precision (BF16) for decode to optimize throughput and accuracy respectively; (2) Compute-transfer pipelining, which overlaps per-layer KV cache egress (device-to-host DMA + RDMA) with subsequent layer prefill computation to hide latency; and (3) Deferred dequantization, where raw quantized bytes are shipped and lazily reconstructed into BF16 within the fused paged attention kernel on the decode side, avoiding extra HBM reads and leveraging integer ALUs for bit manipulation. This approach effectively turns hardware incompatibility into a performance lever.
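The benefit of compute-transfer pipelining can be illustrated with a simple timing model. It assumes one compute engine and one network link, with layer i's KV transfer able to start only after layer i's prefill finishes and after the previous layer's transfer has cleared the link. The per-layer times are hypothetical numbers in arbitrary units, not measurements from the paper.

```python
def serial_time(compute, transfer):
    """Prefill every layer first, then ship every layer's KV cache."""
    return sum(compute) + sum(transfer)


def pipelined_time(compute, transfer):
    """Ship layer i's KV cache while later layers are still being prefilled."""
    compute_done = 0.0   # time at which the current layer's prefill finishes
    link_free = 0.0      # time at which the link finishes its latest transfer
    for c, t in zip(compute, transfer):
        compute_done += c
        # A transfer needs both its data (compute_done) and a free link.
        link_free = max(compute_done, link_free) + t
    return link_free


# Compute-bound case: each transfer hides behind the next layer's prefill.
compute = [4.0] * 8
transfer = [3.0] * 8
serial = serial_time(compute, transfer)
piped = pipelined_time(compute, transfer)

# Transfer-bound case: the link becomes the bottleneck instead.
slow_link = pipelined_time([2.0] * 4, [3.0] * 4)
print(serial, piped, slow_link)
```

In the compute-bound case only the last layer's transfer stays exposed, so the pipelined time is the total compute plus one transfer. In the transfer-bound case the total is the first layer's compute plus all transfers, which suggests why shrinking the bytes on the wire (as deferred dequantization does) complements the pipelining.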
The evaluation is conducted on real silicon, a significant strength compared to simulation-based prior work. The testbed consists of a four-chip Tenstorrent mesh for prefill and a single NVIDIA A100 for decode, connected via 100 Gbps RoCE. Experiments cover four Qwen3 models (4B-32B) and three production traces (ShareGPT, LongBench, arXiv). Results show up to 3.2x higher goodput and 4.8x higher goodput-per-dollar compared to state-of-the-art homogeneous disaggregation (DistServe-Homo) and colocation baselines. The paper provides detailed breakdowns of latency components (TTFT, TPOT) and demonstrates that the precision asymmetry does not degrade generation quality on standard benchmarks (MATH500, AIME). The comparison against an "oracle" colocation baseline further validates the efficiency of the disaggregated approach for larger models.
The paper provides specific hardware configurations (Tenstorrent Blackhole p150, NVIDIA A100 80GB), network details (100 Gbps RoCE), and software versions (vLLM 0.19.1). It describes the kernel modifications (monkey-patching prefill runtime, fused decode kernels) in sufficient detail for replication by systems researchers. However, access to Tenstorrent hardware is currently a barrier to immediate independent reproduction, though the methodology is sound.
The current evaluation is limited to a specific hardware pairing (Tenstorrent + NVIDIA). The performance gains for smaller models (4B) are modest or negative in raw goodput compared to homogeneous setups, suggesting the overhead of disaggregation and cross-vendor transfer may not always be justified for small workloads. The system relies on a specific RDMA fabric setup, and performance may vary with different network topologies or congestion control mechanisms. Additionally, the "oracle" colocation baseline assumes perfect routing, which may be optimistic in practice.
This work highlights the growing fragmentation in the AI accelerator landscape and provides a practical framework for leveraging heterogeneous hardware in production LLM serving. By demonstrating that cost-efficient GDDR chips can effectively offload prefill, it offers a pathway to reduce the total cost of ownership for LLM inference, potentially democratizing access to high-performance serving infrastructure. It also sets a precedent for handling cross-vendor interoperability in disaggregated systems.