Week of May 31 – June 07, 2026
We introduce GENERIC-FNO, the first neural operator to embed the full GENERIC (metriplectic) structure of nonequilibrium thermodynamics -- reversible, energy-conserving dynamics and irreversible, entropy-producing dynamics coupled through the degeneracy conditions -- directly in function space. Existing structure-preserving neural operators enforce at most a single conservation law or reversible (Hamiltonian) structure, while thermodynamically consistent learning has been confined to finite-dimensional, graph, or particle systems. GENERIC-FNO closes this gap: it learns the energy and entropy functionals as neural operators and parameterizes the Poisson and friction operators as diagonal Fourier multipliers sandwiched between rank-one projections that enforce the degeneracy conditions exactly, by construction, with no penalty term, update projection, or residual. The degeneracy identities hold to machine precision (residuals ~10^-13) for any initialization, dimension, or resolution, so the continuous-time dynamics conserve the learned energy and produce entropy exactly; the explicit time stepping adds only a small O(dt^2) drift (per-step residual ~10^-6). We further note that the (E,S,L,M) decomposition of a given flow is not unique, and introduce a gauge-invariant dissipation diagnostic separating reversible from dissipative dynamics independently of the learned functionals. Across three operator backbones (1D/2D FNOs and DeepONet) and four PDEs spanning reversible, dissipative, and mixed regimes, GENERIC-FNO preserves its exact structural guarantees zero-shot across a 4x super-resolution range (64 to 256), recovers the ground-truth ordering of physical dissipation, and is competitive with strong unconstrained and energy-penalized baselines, outperforming them on several dissipative and mixed problems at comparable or fewer parameters.
Primary: Georgia Tech Research Institute
All Institutions: Georgia Tech Research Institute, University of Illinois at Chicago
The paper presents GENERIC-FNO, a groundbreaking neural operator that integrates thermodynamic principles into machine learning models, ensuring energy conservation and entropy production are respected in learned dynamics. This contribution is poised to significantly advance the field of machine learning in physics-informed applications, enabling new capabilities and improving the reliability of predictive models in complex systems.
The paper introduces GENERIC-FNO, a novel neural operator that embeds the full GENERIC structure of nonequilibrium thermodynamics into function space. This approach is significant because it allows for the learning of energy and entropy functionals while ensuring thermodynamic consistency through exact enforcement of degeneracy conditions. The methodology is rigorously constructed, avoiding the pitfalls of soft penalties and ensuring that the structural identities hold to machine precision. The use of diagonal Fourier multipliers and rank-one projections is innovative and effectively addresses the challenges of enforcing degeneracy in function space.
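To make the degeneracy mechanism concrete, here is a minimal finite-dimensional sketch; it is not the authors' implementation. It assumes the "rank-one projections" take the form P_u = I - u u^T / (u.u), and it swaps the learned diagonal Fourier multipliers for two fixed periodic stencils (a centered difference and a negative Laplacian), which are circulant and therefore diagonal in the discrete Fourier basis. The vectors grad_E and grad_S are made-up stand-ins for the gradients of the learned energy and entropy functionals, and all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(u, v):
    """Apply P_u = I - u u^T / (u.u) to v (assumed form of the projection)."""
    c = dot(u, v) / dot(u, u)
    return [vi - c * ui for vi, ui in zip(v, u)]

def central_diff(v):
    """Periodic centered difference: antisymmetric and circulant."""
    n = len(v)
    return [0.5 * (v[(i + 1) % n] - v[(i - 1) % n]) for i in range(n)]

def neg_laplacian(v):
    """Periodic negative Laplacian: symmetric positive semidefinite, circulant."""
    n = len(v)
    return [2.0 * v[i] - v[(i + 1) % n] - v[(i - 1) % n] for i in range(n)]

def poisson_apply(grad_S, v):
    """L v = P_S A P_S v, so L grad_S = 0 by construction."""
    return project_out(grad_S, central_diff(project_out(grad_S, v)))

def friction_apply(grad_E, v):
    """M v = P_E D P_E v, so M grad_E = 0 by construction."""
    return project_out(grad_E, neg_laplacian(project_out(grad_E, v)))

# Illustrative gradients on a 16-point periodic grid (not from the paper).
n = 16
grad_E = [math.sin(2 * math.pi * i / n) + 0.3 for i in range(n)]
grad_S = [math.cos(4 * math.pi * i / n) + 0.1 * i for i in range(n)]

L_gS = poisson_apply(grad_S, grad_S)    # degeneracy: L grad_S = 0
M_gE = friction_apply(grad_E, grad_E)   # degeneracy: M grad_E = 0
dE_dt_rev = dot(grad_E, poisson_apply(grad_S, grad_E))   # energy rate, reversible part
dS_dt_irr = dot(grad_S, friction_apply(grad_E, grad_S))  # entropy production rate

print("max |L grad_S| =", max(abs(x) for x in L_gS))
print("max |M grad_E| =", max(abs(x) for x in M_gE))
print("energy rate from reversible part =", dE_dt_rev)
print("entropy production =", dS_dt_irr)
```

Because P_S is symmetric and the centered difference is antisymmetric, the sandwiched operator stays antisymmetric, so the reversible part cannot change the energy. The same argument with a positive semidefinite middle operator gives nonnegative entropy production. Neither property depends on the particular gradients chosen here.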
The experiments are comprehensive, evaluating the proposed method across three operator backbones and four PDEs that represent a range of reversible, dissipative, and mixed dynamics. The results demonstrate that GENERIC-FNO not only preserves the structural guarantees but also outperforms strong baselines on several tasks, particularly in dissipative and mixed regimes. The paper provides detailed metrics and comparisons, showcasing the method's robustness and accuracy.
The paper lacks explicit URLs for code or demos, which could hinder reproducibility. However, the methodology is described in sufficient detail for implementation. The authors report rigorous evaluations and state that the structural identities hold to machine precision, which enhances the credibility of their results.
The paper acknowledges certain limitations, such as reduced expressiveness in purely reversible transport scenarios and challenges with coarse-grid accuracy due to the explicit integrator. Additionally, the method's performance may vary with backbone capacity, and it does not natively represent second-order dynamics under partial observation.
The implications of this research are substantial, particularly in fields requiring thermodynamically consistent modeling, such as complex fluids, turbulence, and closure modeling in high-dimensional systems. By providing a framework that maintains physical fidelity in learned models, this work could lead to more stable and reliable predictions in various scientific and engineering applications.
Pathology is the cornerstone of modern medicine, where accurate decision-making relies heavily on evidence-based practices. While artificial intelligence (AI) has the potential to transform clinical workflows, the intersection of AI and evidence-based medicine remains under-explored, with early attempts restricted to text-only general medicine. In this work, we present PathPocket, a multimodal AI agentic co-pilot designed specifically for evidence-grounded pathology. We construct the most comprehensive pathology evidence corpus to date, encompassing approximately 110,472 public and authorized documents structured across a rigorous hierarchy of evidence from clinical guidelines to expert opinion. From this meticulously graded foundation, we build a large-scale multimodal pathology hypergraph containing over 4.55 million entities and 7.10 million relations. Serving as a robust knowledge engine, this hypergraph provides traceable evidence for a collaborative multi-agent reasoning framework integrating input understanding, evidence retrieval, filtering, and diagnosis generation. This enables PathPocket to seamlessly resolve a wide spectrum of clinical tasks, ranging from text-only queries to complex multimodal diagnostics involving region-of-interest (ROI) and gigapixel whole-slide images (WSIs). We rigorously evaluate the system on a multidimensional benchmark of over 200,000 real-world cases, where it significantly outperforms existing state-of-the-art methods. Crucially, extensive user studies demonstrate that PathPocket substantially improves the diagnostic accuracy and confidence of pathologists. By directly grounding pathology interpretations in verifiable literature, PathPocket offers a practical and scalable solution for the future of evidence-grounded computational pathology.
Primary: Southern Medical University
All Institutions: Southern Medical University
The main contribution of this paper is the introduction of PathPocket, a multimodal AI co-pilot that integrates a comprehensive evidence corpus with a multi-agent reasoning framework to enhance diagnostic accuracy in computational pathology. This work represents a significant advancement in the field, addressing critical limitations of existing AI models by providing transparent, evidence-backed reasoning that can improve clinical decision-making.
The methodology presented in this paper is robust and innovative, combining a comprehensive evidence corpus with a multimodal hypergraph architecture. The use of a multi-agent framework to facilitate evidence retrieval and diagnostic reasoning is a significant advancement over traditional AI models, which often lack interpretability and grounding in verifiable evidence. The detailed stratification of the evidence corpus and the construction of the hypergraph are methodologically sound and well-justified, providing a strong foundation for the proposed system.
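As a rough illustration of how an evidence-graded hypergraph can feed a retrieval-and-filtering step, here is a small sketch. It is an assumption-laden toy, not PathPocket's design: the `Hyperedge` class, the `retrieve` function, the ranking rule, and all entity and document names are hypothetical placeholders. Only the two ends of the evidence hierarchy named in the abstract are modeled.

```python
from dataclasses import dataclass

# Lower rank = stronger evidence. The source names only the two ends of the
# hierarchy; intermediate grades are omitted here rather than guessed.
EVIDENCE_RANK = {"clinical_guideline": 0, "expert_opinion": 1}

@dataclass(frozen=True)
class Hyperedge:
    entities: frozenset   # a hyperedge may join more than two entities
    relation: str
    evidence_level: str
    source_id: str        # keeps each retrieved fact traceable to a document

def retrieve(edges, query_entities, weakest_allowed="expert_opinion"):
    """Return hyperedges touching the query, filtered and ranked by evidence grade."""
    limit = EVIDENCE_RANK[weakest_allowed]
    hits = [
        e for e in edges
        if e.entities & query_entities and EVIDENCE_RANK[e.evidence_level] <= limit
    ]
    # Strongest evidence first, then larger overlap with the query.
    return sorted(
        hits,
        key=lambda e: (EVIDENCE_RANK[e.evidence_level],
                       -len(e.entities & query_entities)),
    )

# Placeholder entities and documents (purely illustrative, no medical content).
edges = [
    Hyperedge(frozenset({"finding_a", "stain_b", "diagnosis_x"}),
              "supports", "expert_opinion", "doc-002"),
    Hyperedge(frozenset({"finding_a", "diagnosis_x"}),
              "supports", "clinical_guideline", "doc-001"),
    Hyperedge(frozenset({"finding_c", "diagnosis_y"}),
              "supports", "clinical_guideline", "doc-003"),
]
query = frozenset({"finding_a", "stain_b"})

ranked = retrieve(edges, query)
strict = retrieve(edges, query, weakest_allowed="clinical_guideline")
print([e.source_id for e in ranked])
print([e.source_id for e in strict])
```

Carrying `source_id` on every hyperedge is what would let a downstream diagnosis step cite the document behind each retrieved relation.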
The experimental evaluation is extensive, covering a wide range of clinical tasks across text-only, ROI-level, and WSI-level diagnostics. The benchmarks are well-structured, utilizing both public datasets and a large-scale private clinical dataset, which enhances the validity of the results. The paper reports significant performance improvements over existing state-of-the-art models, demonstrating the effectiveness of the proposed approach in real-world clinical scenarios.
The paper describes its implementation thoroughly, including the construction of the evidence corpus, the architecture of the hypergraph, and the multi-agent reasoning framework. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. The authors could enhance reproducibility by sharing their code and datasets, allowing other researchers to validate and build upon their work.
While the proposed system shows promise, it introduces increased computational overhead and inference latency compared to traditional models. This may limit its practical deployment in time-sensitive clinical environments. Additionally, the reliance on a large corpus of evidence may pose challenges in terms of data quality and the potential for biases in the underlying literature.
The implications of this work are significant for the field of computational pathology and AI in medicine. By grounding diagnostic reasoning in verifiable evidence, PathPocket has the potential to enhance diagnostic accuracy and confidence among pathologists, ultimately improving patient outcomes. The framework could also be adapted for other medical domains, promoting the integration of AI into evidence-based clinical practice.
Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
Primary: Northeastern University
All Institutions: Northeastern University, Shanghai Artificial Intelligence Laboratory
This paper presents a highly novel diagnostic methodology and a mechanistically-informed, training-free decoding rule to address a critical arbitration failure in audio-language models, demonstrating significant performance gains and cross-modal generalizability. The rigorous causal analysis, coupled with a practical and effective solution, makes this a standout contribution to multimodal machine learning.
The methodology is exceptionally strong, building a coherent and rigorous chain from behavioral observation to mechanistic understanding and finally to an effective intervention. The core innovation is the "same-audio counterfactual" diagnostic, which uses two branches (joint audio-text vs. audio-only) to precisely distinguish between perceptual failure and arbitration failure in Audio-Language Models (ALMs). This elegant setup, coupled with signed log-probability margins, provides a clear quantitative signature of "repairable arbitration reversals." The paper then employs activation patching, a robust causal intervention technique, to localize the arbitration failure to the answer-position residual stream within the model's "commit window." This mechanistic finding is crucial, demonstrating that audio evidence is indeed encoded but overridden during the final decision-making process. A key methodological bridge is the discovery of a high Spearman correlation (0.93) between this internal patch-induced repair direction and the observable output score difference ($s_A - s_J$). This alignment is critical because it enables the development of an output-space intervention without requiring internal model access. The proposed Gated Audio Counterfactual Logit Correction (GACL) decoding rule is directly derived from these insights, incorporating a branch-disagreement gate, a reference-reliability gate, and convex bounded interpolation. Each component is mechanistically justified and contributes to the method's robustness and safety. The methodology is a prime example of interpretable ML research, moving beyond symptom identification to root cause analysis and targeted solution design.
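The diagnostic and the decoding rule can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's exact formulation: the gate definitions (argmax disagreement, and a top-two score gap in the same-audio branch), the parameters `lam` and `min_ref_gap`, and the example log-probabilities are all hypothetical. Only the overall shape comes from the description above: signed margins in two branches, two gates, and a convex interpolation between joint and same-audio scores.

```python
def margin(scores, audio_answer, text_answer):
    """Signed margin: positive when the audio-supported answer is preferred."""
    return scores[audio_answer] - scores[text_answer]

def is_sign_flip(joint, same_audio, audio_answer, text_answer):
    """True when the same-audio branch prefers the audio answer but the joint branch does not."""
    return (margin(same_audio, audio_answer, text_answer) > 0
            and margin(joint, audio_answer, text_answer) < 0)

def gated_correction(joint, same_audio, lam=0.5, min_ref_gap=1.0):
    """Interpolate candidate scores only when both (assumed) gates open."""
    top_joint = max(joint, key=joint.get)
    top_ref = max(same_audio, key=same_audio.get)
    # Branch-disagreement gate: leave the joint scores alone if both branches agree.
    if top_joint == top_ref:
        return dict(joint)
    # Reference-reliability gate: require a clear winner in the same-audio branch.
    ref_sorted = sorted(same_audio.values(), reverse=True)
    if ref_sorted[0] - ref_sorted[1] < min_ref_gap:
        return dict(joint)
    # Convex interpolation keeps each corrected score between the two branch scores.
    return {c: (1.0 - lam) * joint[c] + lam * same_audio[c] for c in joint}

# Made-up candidate log-probabilities for one conflict sample.
joint = {"dog": -2.0, "cat": -0.5}        # audio plus conflicting text
same_audio = {"dog": -0.3, "cat": -2.5}   # same audio, conflicting text removed
weak_ref = {"dog": -0.9, "cat": -1.0}     # same-audio branch with no clear winner

flip = is_sign_flip(joint, same_audio, audio_answer="dog", text_answer="cat")
corrected = gated_correction(joint, same_audio)
unchanged = gated_correction(joint, weak_ref)
print(flip, max(corrected, key=corrected.get), unchanged == joint)
```

Because the rule only reads candidate scores from two forward passes, it needs no access to model internals, which matches the output-space framing in the paragraph above. The extra pass is also where the added inference latency comes from.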
The experimental evaluation is comprehensive and rigorously designed. The authors evaluate GACL across five diverse open-weight ALMs (7B-30B parameters) and four distinct audio-text conflict tasks (AQA, VSC, SER, ALME) from established benchmarks (MCR-Bench, ALME). This broad coverage demonstrates the widespread nature of the "text-following" problem and the general applicability of GACL. The use of normalized AUC (nAUC) over a strict faithfulness-drop budget (e.g., 5 pp) is an excellent evaluation metric, realistically capturing the trade-off between conflict resolution and preserving accuracy on faithful inputs. GACL consistently outperforms strong contrastive decoding baselines (AAD, ACD) and the joint model, achieving an impressive average improvement of 17.8 nAUC points under the strict 5 pp budget. Detailed ablation studies meticulously validate the contribution of each component of GACL, showing how gates and bounds ensure stability and prevent undesirable side effects (e.g., surface form rewriting, parse failures). The comparison to a LoRA fine-tuning baseline, where GACL retains 76% of the gain without any parameter updates, highlights its efficiency and practical value. Furthermore, the successful, untuned transfer of GACL to vision-text arbitration on MC$^2$ (achieving up to +40.5 pp adversarial accuracy) is a powerful demonstration of the generalizability of the underlying diagnostic principles across different modalities, significantly amplifying the potential impact of this work.
The paper demonstrates a high commitment to reproducibility. The appendix provides extensive details, including specific public model checkpoints (with Hugging Face snapshot hashes), precise descriptions of benchmark splits, detailed prompt templates for each task, and the exact candidate scoring and normalization procedures. The hyperparameter tuning process, including the use of a development set and freezing parameters for testing, is clearly outlined. Furthermore, the paper provides comprehensive details for the LoRA fine-tuning baseline, including architecture, training data, optimization parameters, and hardware. Inference cost metrics (time, GPU memory, FLOPs) are also reported. This level of detail should enable researchers to reproduce the core findings and build upon this work.
The authors acknowledge several pertinent limitations. The study focuses on controlled, explicit audio-text conflicts, which, while crucial for isolating mechanisms, may not fully capture the complexity of naturally occurring conflicts involving noisier transcripts, partial notes, or broader conversational context. GACL is designed to repair arbitration failures where audio evidence is available but overridden, meaning it cannot compensate for fundamental perceptual failures where the model simply did not encode the relevant acoustic information. This distinction is important for guiding future research towards either decoding-time repair or improved acoustic modeling. A practical limitation is the increased inference latency due to the additional forward pass required for the audio-reference branch, although the authors suggest potential optimizations. Finally, while cross-modal transfer is demonstrated, the generalizability to all possible conflict sources and modality pairs remains an area for future exploration.
This work has significant broader impact for the development of robust and trustworthy multimodal AI systems. By providing a rigorous diagnostic framework and an effective, generalizable intervention, it directly addresses a critical safety and reliability concern in ALMs: their tendency to prioritize conflicting text over clear audio evidence. This is particularly important for agentic applications in sensitive domains like healthcare, emergency services, or legal assistance, where accurate interpretation of audio is paramount. The mechanistic understanding gained through causal localization offers a powerful new lens for analyzing internal decision-making in complex multimodal models, moving beyond black-box observations and fostering more interpretable AI. The demonstrated cross-modal transfer suggests that the principles of diagnosing and correcting arbitration failures using counterfactual references and logit correction could be broadly applicable across various multimodal AI systems (e.g., vision-language, video-language), paving the way for more faithful and reliable AI across the board. This research not only provides a practical solution but also advances our fundamental understanding of multimodal reasoning and conflict resolution in large models.
Synthetic data is increasingly promoted as a privacy-preserving substitute for releasing sensitive tabular records, yet its central adversarial threat ("reconstruction", the recovery of an individual's hidden attribute values from a synthetic release and a handful of known quasi-identifiers) has been studied only in scattered, hard-to-compare settings. We present the first systematization of reconstruction (equivalently, attribute inference) attacks on de-identified and synthetic tabular data. We contribute a taxonomy that organizes attacks by the structure they exploit; the most systematic empirical evaluation to date, pitting fourteen attacks against nine synthetic data generation (SDG) methods across five benchmark datasets; and a set of new attacks that fill gaps in the taxonomy, one of which (CoBP-RA) is the strongest attack we measure. Crucially, we introduce a methodology for interpreting what attack success means: a memorization test that distinguishes reconstruction of the population distribution from memorization of training records, and a reduction that places reconstruction and membership inference on a single comparable scale. Our findings: the choice of SDG method governs risk far more than the choice of attack; differential privacy protects mainly at small budgets ($\varepsilon\lesssim1$), above which protection plateaus, bounded by the synthesizer's capacity rather than its noise; de-identification methods are the most exposed; and most reconstruction reflects distributional structure rather than memorization, concentrating individual risk on atypical records. The attacks and infrastructure are externally validated by our first-place finish among all red teams in the 2025 National Institute of Standards and Technology (NIST) Collaborative Research Cycle.
Primary: University of Washington Tacoma
All Institutions: University of Washington Tacoma, Ghent University
This paper makes a substantial contribution to the field by systematically analyzing and categorizing reconstruction attacks on synthetic tabular data, providing valuable insights into the effectiveness of different synthetic data generation methods and their implications for privacy. The rigorous methodology and extensive empirical evaluation position this work as a critical reference for researchers and practitioners concerned with data privacy and synthetic data.
The paper presents a comprehensive systematization of reconstruction attacks on synthetic tabular data, introducing a novel taxonomy that categorizes attacks based on the structures they exploit. The authors propose new attack methodologies, including CoBP-RA, which utilizes belief propagation to enhance reconstruction accuracy. The methodology is well-structured, clearly delineating the differences between memorization and distributional inference, which is crucial for understanding the implications of synthetic data releases.
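To make the threat model concrete, here is a minimal sketch of a reconstruction (attribute inference) attempt: given a synthetic release and a few known quasi-identifiers, guess a hidden attribute by majority vote over matching synthetic rows. This is a naive matching baseline written for illustration only; it is not CoBP-RA or any of the fourteen attacks evaluated in the paper, and the function name, column names, and records are all hypothetical.

```python
from collections import Counter

def reconstruct_attribute(synthetic_rows, known, target):
    """Guess `target` for a person whose quasi-identifiers are `known`.

    synthetic_rows: list of dicts (the synthetic release).
    known: dict mapping quasi-identifier name -> known value.
    Returns the most common target value among matching synthetic rows,
    falling back to the overall mode when no row matches.
    """
    matches = [
        row[target]
        for row in synthetic_rows
        if all(row.get(k) == v for k, v in known.items())
    ]
    pool = matches or [row[target] for row in synthetic_rows]
    return Counter(pool).most_common(1)[0][0]

# Hypothetical synthetic release (made-up records).
synthetic = [
    {"age": "30-39", "zip": "98402", "diagnosis": "flu"},
    {"age": "30-39", "zip": "98402", "diagnosis": "flu"},
    {"age": "30-39", "zip": "98402", "diagnosis": "asthma"},
    {"age": "60-69", "zip": "98405", "diagnosis": "diabetes"},
]

guess = reconstruct_attribute(
    synthetic, {"age": "30-39", "zip": "98402"}, "diagnosis"
)
print(guess)
```

A guess produced this way can be correct purely because the value is common in the population, which is why the memorization test described above matters when interpreting attack success.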
The empirical evaluation is extensive, testing fourteen attacks against nine synthetic data generation methods across five benchmark datasets. The experiments are rigorous, providing a clear comparison of attack effectiveness and revealing that the choice of synthetic data generation method significantly impacts reconstruction risk. The findings are supported by thorough statistical analysis and demonstrate the practical implications of different SDG methods on privacy.
While the paper provides detailed descriptions of the methodologies and experiments, there is no mention of code or data availability, which may hinder reproducibility. The absence of a project URL further complicates efforts to replicate the results independently.
The paper does not address the scalability of the proposed attacks in real-world scenarios or the potential for adversarial adaptation over time. Additionally, while it discusses the implications of differential privacy, it does not explore the trade-offs between utility and privacy in depth.
The findings have significant implications for the use of synthetic data in sensitive domains such as healthcare and finance, where privacy is paramount. The insights into the vulnerabilities of various synthetic data generation methods can inform policy and best practices for data release.
As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce alem, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving alem, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.
Primary: University of Edinburgh
All Institutions: University of Edinburgh, Royal Academy of Engineering
The main contribution of this paper is the introduction of alem, a benchmark for evaluating open-ended multi-agent coordination in language models, which reveals significant performance gaps and highlights the importance of communication in achieving effective coordination. This work represents a substantial step forward in understanding the limitations of current LLMs and provides a foundation for future developments in multi-agent coordination tasks.
The paper introduces a novel benchmark, alem, designed specifically for evaluating open-ended multi-agent coordination among language models. The methodology is well-structured, embedding various dynamics such as procedural generation, communication, and controllable difficulty into a long-horizon survival world. This complexity allows for a more nuanced assessment of coordination capabilities compared to existing benchmarks. The use of JAX for implementation is a modern choice that facilitates efficient computation. The evaluation of 13 LLMs against trained MARL agents provides a clear baseline for understanding the performance gaps in coordination tasks.
The experiments are robust, involving zero-shot evaluations of multiple LLMs within homogeneous teams, which is a significant advancement over traditional single-agent evaluations. The results indicate that current LLMs struggle with coordination, averaging only ~6% normalized return, which highlights the challenge posed by the benchmark. The comparative analysis with MARL agents trained for extensive steps provides a compelling narrative about the limitations of LLMs in multi-agent settings. The ablation studies further clarify the contributions of communication, memory, and reasoning, which are critical for understanding the factors influencing coordination success.
The paper mentions that the code is available on GitHub, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameters and specific configurations used for training and evaluation. This would enhance the ability of other researchers to replicate the findings.
One limitation noted is the performance gap between LLMs and MARL agents, which may indicate that LLMs are not yet equipped to handle complex coordination tasks effectively. Additionally, the benchmark may require further refinement to ensure it captures all relevant aspects of coordination in diverse scenarios. The paper does not extensively discuss the potential biases in the dataset or the implications of the chosen dynamics on the results.
The introduction of alem as a benchmark has the potential to significantly influence future research in multi-agent systems and language models. By highlighting coordination as a distinct challenge, this work encourages the development of more sophisticated agents that can communicate and collaborate effectively, which is crucial for applications in robotics, gaming, and autonomous systems. The findings could lead to improved designs for LLMs that incorporate better coordination strategies, ultimately enhancing their utility in real-world scenarios.
Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures including CNNs, U-Nets, and Transformers have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parametrization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.
Primary: Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
All Institutions: Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
The paper presents a significant advancement in the understanding of score-based generative models through a wavelet-based framework, providing both theoretical insights and empirical validation. Its contributions could reshape how practitioners approach diffusion models, particularly in terms of interpretability and performance optimization.
The paper introduces a novel wavelet-based parameterization of score functions in diffusion models, which is analytically solvable and interpretable. The authors provide a structured approach to understanding the contributions of different data distribution attributes to denoising performance. The methodology is well-grounded in existing literature and leverages established mathematical frameworks, including ridge regression and Stein's identity, to derive closed-form solutions. The use of wavelets allows for a multiscale representation that captures important local dependencies, which is a significant advancement over traditional methods.
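The idea of an optimal score that is solvable in terms of data moments can be illustrated in the simplest possible setting. The sketch below is not the paper's wavelet parameterization: it assumes scalar data, additive Gaussian noise x_t = x_0 + sigma * eps, a degree-1 (affine) score, and no ridge penalty. Under those assumptions, minimizing the denoising score matching loss gives s(x) = -(x - mean) / (variance + sigma^2), which depends only on the first two moments of the data. The function name is hypothetical.

```python
from statistics import fmean, pvariance

def linear_score_from_moments(samples, sigma):
    """Return the best affine score s(x) = a*x + b for noised scalar data.

    Assumes x_t = x_0 + sigma * eps with eps ~ N(0, 1). Regressing the
    denoising target -eps/sigma on x_t gives
        a = -1 / (var + sigma^2),  b = mean / (var + sigma^2),
    so only the mean and variance of the data are needed.
    """
    mean = fmean(samples)
    var = pvariance(samples, mean)  # population variance of the clean data
    total = var + sigma ** 2        # variance of the noised data

    def score(x):
        return -(x - mean) / total

    return score

# Toy data with mean 1 and population variance 1.
score = linear_score_from_moments([0.0, 2.0], sigma=1.0)
print(score(3.0), score(-1.0))
```

Higher polynomial degrees would bring in higher moments in the same spirit, though the exact form used in the paper is not reproduced here.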
The experiments are conducted on the MNIST dataset, evaluating the proposed wavelet-based models against traditional CNN and U-Net architectures. The results demonstrate that the wavelet-based approach can achieve competitive performance, particularly in denoising tasks at lower noise levels and higher resolutions. The empirical findings are robust, showing systematic improvements with increased polynomial degree and structured dependencies. However, the reliance on a single dataset limits the generalizability of the results.
The paper provides sufficient details regarding the implementation, including preprocessing steps, model architecture, and training procedures. However, it lacks a publicly accessible code repository, which would enhance reproducibility. The authors mention using an academic cluster with specific GPU types, but further details on the computational environment could aid in replicating the results.
The study is limited to grayscale images and does not explore color images, which may exhibit different characteristics in wavelet representation. Additionally, while the wavelet approach shows promise, it does not outperform learned models in all scenarios, particularly at high noise levels. The paper also does not address potential scalability issues when applied to larger datasets or more complex image types.
The proposed methodology has the potential to influence future research in generative modeling and image denoising by providing a new framework for understanding score-based diffusion models. The interpretability of the wavelet coefficients could lead to better insights into model behavior and improvements in architectural design. Furthermore, the findings could be relevant for applications in computer vision, image processing, and beyond, particularly in contexts where understanding the underlying data distribution is crucial.
Pre-layout design space exploration (DSE) for high-speed signal integrity (SI) analysis is often limited by the computational cost of simulations and iterative optimization algorithms within modern electronic design automation (EDA) workflows. While machine learning surrogate models accelerate the simulation step, optimizing designs still requires utilizing iterative black-box search methods. This iterative nature scales poorly, making multi-corner sweeps computationally expensive. As a solution, this paper proposes amortized neural optimization (ANO) for pre-layout SI design. ANO entirely eliminates iterative black-box inference by utilizing fully differentiable neural network surrogate models. ANO extracts analytical gradients from the surrogate to train a global optimization policy. Instead of solving the optimization problem repeatedly at inference, the optimization process is learned offline and therefore amortized. Once the ANO policy is trained, it maps different channel contexts directly to near-optimal design parameters in a single deterministic forward pass. The efficiency and accuracy of the ANO framework are demonstrated based on three complex SI design scenarios, including DDR5 decision feedback equalization (DFE), 9-dimensional SerDes Tx/Rx co-equalization, and DDR3 DQS differential pair routing to optimize eye diagram metrics under intra-pair skew constraints. By trading roughly 10% in optimality compared to instance-specific black-box algorithms, it realizes speedups of three to four orders of magnitude. For a large-scale 320,000-instance multi-corner SerDes sweep optimization, ANO collapses what would have taken days of computation using iterative search algorithms into a single batched forward pass that completes in milliseconds. This transforms computationally expensive SI optimization into real-time and interactive pre-layout DSE.
Primary: TU Dortmund
All Institutions: TU Dortmund, Pyramide2525, Zuken GmbH
The paper presents a novel approach to signal integrity design space exploration through amortized neural optimization, demonstrating substantial computational efficiency improvements while maintaining a high level of accuracy. The methodology's innovative use of differentiable surrogates and the comprehensive experimental validation underscore its significance in advancing EDA practices and machine learning applications in engineering.
The proposed Amortized Neural Optimization (ANO) framework is innovative in its use of differentiable neural network surrogate models to eliminate the need for iterative black-box optimization methods in signal integrity design space exploration. By leveraging analytical gradients, the ANO framework allows for a single deterministic forward pass to predict optimal design parameters, which is a significant departure from traditional methods that rely on time-consuming iterative evaluations. The dual-network architecture, consisting of a differentiable surrogate model and a global optimization policy network, effectively utilizes the strengths of neural networks while addressing the limitations of existing optimization techniques.
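The amortization idea can be sketched on a toy problem. In the example below, a hand-written quadratic stands in for the differentiable surrogate and a two-parameter affine map stands in for the policy network; both are assumptions made for illustration, as are all names, and none of this reflects the paper's actual networks or SI metrics. The policy is trained offline by pushing the surrogate's analytic gradient through the policy parameters, after which a new context is mapped to a design in a single forward pass with no search.

```python
import random

def surrogate_grad_design(context, design):
    """Analytic gradient of a toy surrogate -(design - 2*context)^2 w.r.t. design.

    The toy surrogate is maximal when design = 2 * context.
    """
    return -2.0 * (design - 2.0 * context)

def train_policy(steps=2000, lr=0.05, seed=0):
    """Train an affine policy design = w * context + b by gradient ascent."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        context = rng.uniform(-1.0, 1.0)      # sample a channel context
        design = w * context + b              # policy forward pass
        g = surrogate_grad_design(context, design)
        w += lr * g * context                 # chain rule: d(design)/dw = context
        b += lr * g                           # chain rule: d(design)/db = 1
    return w, b

w, b = train_policy()

def policy(context):
    """Inference: one deterministic forward pass, no iterative optimization."""
    return w * context + b

print(w, b, policy(0.5))
```

Because the cost of optimization is paid once during training, evaluating the policy over a large batch of contexts is just repeated forward passes, which is the source of the speedups the paper reports.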
The paper presents a thorough experimental evaluation across three complex signal integrity design scenarios, demonstrating the efficiency and accuracy of the ANO framework. The results indicate substantial speedups (three to four orders of magnitude) over traditional optimization methods, with detailed benchmarking against various algorithms such as genetic algorithms, Bayesian optimization, and gradient descent. The experiments are well-structured, with clear metrics for performance evaluation, and the results support the claims made regarding the framework's effectiveness.
The paper provides sufficient implementation details, including the architecture of the neural networks, training procedures, and datasets used for training and validation. However, the absence of publicly available code or datasets limits reproducibility. Clear descriptions of the training process and hyperparameters are provided, which would aid in replicating the experiments if the resources were made available.
One limitation is the reliance on a one-time computational investment for data generation and training, which may not be feasible for all practitioners. Additionally, while the framework shows impressive speedups, it trades off some optimality (approximately 10%) compared to instance-specific black-box algorithms, which may be a concern in critical applications where absolute optimality is required. The paper also does not address the scalability of the method to even larger design spaces or more complex signal integrity scenarios.
The ANO framework has the potential to significantly impact the field of electronic design automation (EDA) by enabling real-time and interactive design space exploration for high-speed signal integrity analysis. This could facilitate faster design iterations and improve the efficiency of PCB layout processes, ultimately leading to better-performing electronic systems. The methodology could also inspire further research into differentiable optimization techniques in other domains, expanding its applicability beyond signal integrity.
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
Primary: Fudan University
All Institutions: Fudan University, Nanyang Technological University, Peking University, Shanghai Innovation Institute, Shanghai Jiao Tong University, Tianjin University
The paper presents MMAE, a comprehensive benchmark for audio editing that addresses the fragmented evaluation landscape in the field. Its innovative methodology and rigorous experimental evaluation highlight critical gaps in current audio editing systems, paving the way for future advancements.
The paper introduces MMAE, a benchmark that systematically categorizes audio editing tasks across various modalities and complexities. The methodology is robust, employing a human-agent collaboration approach to curate a diverse dataset of 2,000 high-fidelity audio samples. The detailed taxonomy and rubric-based evaluation framework allow for nuanced assessments of audio editing systems, which is a significant advancement over existing benchmarks. The decomposition of tasks into 17,741 verifiable criteria is particularly innovative, providing a comprehensive structure for evaluating instruction following and context consistency.
The experimental setup is thorough, with a clear focus on evaluating leading models against the MMAE benchmark. The results highlight significant shortcomings in current audio editing systems, with an Exact Match Rate (EMR) below 5% and 0% for complex tasks. This empirical evidence underscores the necessity of the MMAE benchmark and the challenges that remain in the field, which adds credibility to the findings.
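A rubric-based evaluation of this kind can be sketched as follows. The summary does not give the benchmark's exact definition of the Exact Match Rate, so the sketch assumes that a sample counts as an exact match only when every one of its rubric criteria is satisfied; that definition, the function name, and the toy results are assumptions for illustration.

```python
def rubric_scores(results):
    """Aggregate per-sample rubric outcomes.

    results: list of lists of booleans, one inner list per sample and one
    boolean per verifiable criterion for that sample.
    """
    pass_rates = [sum(sample) / len(sample) for sample in results]
    exact = [all(sample) for sample in results]
    return {
        "mean_criterion_pass_rate": sum(pass_rates) / len(results),
        # Assumed definition: fraction of samples with all criteria satisfied.
        "exact_match_rate": sum(exact) / len(results),
    }

# Made-up outcomes for four edited samples.
results = [
    [True, True, True],
    [True, False, True, True],
    [False, False],
    [True, True],
]
scores = rubric_scores(results)
print(scores)
```

Under this assumed definition, a system can satisfy most individual criteria and still score a low exact match rate, since a single failed criterion disqualifies the whole sample.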
The paper does not provide explicit details regarding the implementation of the benchmark or the models evaluated, which may hinder reproducibility. However, the clear structure of the evaluation framework suggests that researchers could replicate the assessment methodology if they have access to the same datasets and models.
One limitation is the lack of publicly available demo or project URLs, which could enhance accessibility and encourage wider adoption of the benchmark. Additionally, while the benchmark covers a broad range of tasks, it may not encompass all possible audio editing scenarios, potentially limiting its applicability in niche areas.
The MMAE benchmark has the potential to significantly influence the field of audio editing by providing a standardized evaluation framework that can guide future research and development. By exposing the limitations of current models, it encourages innovation and improvement in audio editing technologies, which could have applications in various domains, including entertainment, education, and accessibility.
Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.
Primary: Sun Yat-Sen University
All Institutions: Sun Yat-Sen University, OPPO Research Institute
The main contribution of this paper is the introduction of LatentSkill, a framework that efficiently integrates textual skills into LLM agents by converting them into LoRA adapters, thereby reducing context overhead and enhancing modularity. This work represents a significant step forward in the field, offering both theoretical insights and practical improvements in the deployment of LLMs.
The LatentSkill framework presents a novel approach to integrating textual skills into LLM agents by converting them into LoRA adapters. This method addresses the inefficiencies of using in-context skill tokens, particularly in terms of context overhead and exposure of skill content. The use of a pretrained hypernetwork to generate these adapters is innovative and suggests a shift in how skills can be modularly managed in LLMs. The methodology is well-structured, with a clear description of how the skill knowledge is stored in weight space, which is a significant departure from traditional methods.
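Storing a skill in weight space can be illustrated with the standard LoRA update, W' = W + scale * (B A). The sketch below uses hand-written placeholder matrices in place of hypernetwork outputs, and it composes two skills by simply adding their low-rank updates; the summary only says skills compose through parameter-space arithmetic, so that particular composition rule is an assumption, as are all names.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Return weight + scale * (lora_b @ lora_a) without modifying the inputs."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

base = [[1.0, 0.0], [0.0, 1.0]]      # frozen base weight (2 x 2)

# Placeholder rank-1 adapters; in LatentSkill a hypernetwork would produce these.
skill1_b, skill1_a = [[1.0], [0.0]], [[0.0, 2.0]]
skill2_b, skill2_a = [[0.0], [1.0]], [[3.0, 0.0]]

full = apply_lora(base, skill1_b, skill1_a, scale=1.0)   # skill fully on
half = apply_lora(base, skill1_b, skill1_a, scale=0.5)   # skill at half strength
off = apply_lora(base, skill1_b, skill1_a, scale=0.0)    # skill unloaded
both = apply_lora(full, skill2_b, skill2_a)              # two skills summed
print(full, half, both)
```

No skill text appears in the prompt in this setup: loading, scaling, and unloading a skill are all operations on the weight update.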
The experiments conducted on ALFWorld and Search-QA provide strong empirical support for the proposed method. The reported improvements in performance metrics (21.4 and 13.4 points in ALFWorld, and 3.0 points in Search-QA) alongside the substantial reduction in prefill tokens (64.1% and 72.2% respectively) demonstrate the effectiveness of the LatentSkill framework. The results are compelling and suggest that the framework not only enhances performance but also contributes to efficiency in resource usage.
The paper includes appendices with training details, evaluation details, and sensitivity analyses, which are crucial for reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability of other researchers to replicate the findings independently. This is a notable gap that should be addressed in future work.
The paper does not discuss potential limitations in depth. While the approach shows promise, it may not generalize well across all types of tasks or domains. Additionally, the reliance on a pretrained hypernetwork could introduce biases based on the training data used, which is not thoroughly examined in the paper.
The implications of LatentSkill are significant for the development of LLM agents, particularly in applications requiring modular and efficient skill integration. The ability to manage skills in weight space rather than context space could lead to more scalable and secure implementations of LLMs in various domains, including robotics and natural language processing.
Generalist robot intelligence is often framed as a policy-scaling problem: collect more robot demonstrations, train larger Vision-Language-Action (VLA) models, and expect broader generalisation. In this position paper, we argue that this framing is incomplete. The central bottleneck is not only policy learning, but the absence of mechanisms that convert the world's abundant unstructured behavioural data into grounded robot supervision. Human motion, internet video, simulation rollouts, and interactive demonstrations contain rich information about tasks, goals, contacts, failures, and physical constraints, yet most of this information is not directly usable by robot policies because it lacks embodiment-specific action labels, task semantics, and reward structure. We identify four missing components for the next generation of robotics: data interfaces for autolabelling unstructured behaviour, embodiment interfaces for retargeting human motion to robot actions, world-model interfaces for physics-grounded 3D reasoning, and reward interfaces for inferring task progress and success from video and language. We survey recent progress in robot foundation models, cross-embodiment datasets, learning from video, world models, and reward modelling, and propose a research agenda for building robotics systems that can learn not only from robot demonstrations, but from the broader physical world.
Primary: Stanford University
All Institutions: Istituto Italiano di Tecnologia, Stanford University, Technical University of Darmstadt, UCL Centre for AI
The paper argues that advancing generalist robot intelligence requires more than scaling existing models; it necessitates new mechanisms to convert unstructured physical data into usable robot supervision. This comprehensive analysis highlights the critical bottlenecks in current methodologies and proposes a structured approach to overcome them, setting the stage for future advancements in robotics.
The paper presents a comprehensive survey of the current state of robotics, emphasizing the limitations of existing approaches that rely heavily on robot-native supervision. The authors propose a framework that identifies four critical components necessary for advancing robot learning: data interfaces for autolabelling, embodiment interfaces for retargeting human motion, world-model interfaces for 3D reasoning, and reward interfaces for inferring task progress. This systematic approach is innovative as it shifts the focus from merely scaling policies to enhancing the grounding of data, which is a significant bottleneck in the field. The proposed components are well-justified and supported by a thorough review of existing literature, making the methodology robust and insightful.
While the paper is primarily a position paper and does not present original experimental results, it effectively synthesizes findings from various studies to illustrate the current limitations and potential pathways for future research. The authors provide a detailed overview of existing datasets and methodologies, which helps contextualize their arguments. However, the lack of new empirical data limits the ability to assess the practical implications of their proposed framework directly.
The paper does not include specific implementation details or code, which is typical for position papers. However, the authors reference a variety of existing works, which could facilitate reproducibility for those familiar with the field. The lack of a concrete experimental setup or new benchmarks means that direct reproducibility is not applicable in this context.
The primary limitation is that, as a position paper, it offers no original experimental contributions. Additionally, while the proposed components are compelling, the paper does not provide a detailed roadmap for how to implement these ideas in practice. The reliance on existing literature also means that the arguments are contingent on the quality and applicability of those studies.
The proposed framework has the potential to significantly impact the field of robotics by providing a new lens through which to view the challenges of robot learning. By addressing the grounding problem and advocating for a more holistic approach to robot supervision, the paper could influence future research directions and lead to more capable and generalist robotic systems. The implications extend beyond academic research, potentially affecting the development of practical robotic applications in various industries.
Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory-augmented agents rely on a static retrieve-then-reason paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue-Tag-Content graph, where associative tags serve as semantic bridges connecting fine-grained cues to memory contents. Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo benchmark and LongMemEval benchmark demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing token and runtime cost, highlighting the effectiveness of active and associative reconstruction for long-horizon memory reasoning.
Primary: National University of Singapore
All Institutions: National University of Singapore
The main contribution of this paper is the introduction of a novel memory management framework for LLM agents that dynamically adapts memory access through active reconstruction. This work significantly advances the field by addressing the limitations of static memory retrieval systems, providing a promising direction for future research and applications in AI.
The proposed MRAgent framework introduces a novel approach to memory management in LLMs by utilizing a Cue-Tag-Content graph structure combined with an active reconstruction mechanism. This design allows for dynamic memory access that adapts to the reasoning context, which is a significant departure from traditional static retrieval methods. The methodology is well-structured and presents a clear innovation in how memory is utilized in LLMs, addressing a critical limitation in current systems.
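To make the Cue-Tag-Content idea concrete, the following is a minimal sketch of iterative explore-and-prune retrieval over a three-layer graph. It is not the authors' implementation: the class `CueTagContentGraph`, the function `reconstruct`, the `beam` and `max_hops` parameters, and the toy memories are all hypothetical, and the `score` callback is only a stand-in for the LLM judgment that the paper integrates into memory access.

```python
from collections import defaultdict


class CueTagContentGraph:
    """Toy three-layer memory: cue -> tag -> content (all names hypothetical)."""

    def __init__(self):
        self.cue_to_tags = defaultdict(set)
        self.tag_to_contents = defaultdict(set)

    def add_memory(self, content, tags, cues):
        # Link every cue to every tag, and every tag to the stored content.
        for tag in tags:
            self.tag_to_contents[tag].add(content)
            for cue in cues:
                self.cue_to_tags[cue].add(tag)


def reconstruct(graph, query_cues, score, beam=2, max_hops=3):
    """Expand cue -> tag -> content, pruning tags by a relevance score.

    `score(tag, evidence)` stands in for an LLM call that rates a tag
    given the evidence gathered so far; tags scoring <= 0 are dropped
    and at most `beam` tags are followed per hop.
    """
    evidence = []
    visited_cues = set(query_cues)
    frontier = set(query_cues)
    seen_tags = set()
    for _ in range(max_hops):
        candidates = set()
        for cue in frontier:
            candidates |= graph.cue_to_tags.get(cue, set())
        candidates -= seen_tags
        if not candidates:
            break
        seen_tags |= candidates  # pruned tags are not revisited
        scored = [(score(tag, evidence), tag) for tag in candidates]
        ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
        kept = [tag for s, tag in ranked if s > 0][:beam]
        frontier = set()
        for tag in kept:
            for content in sorted(graph.tag_to_contents[tag]):
                if content in evidence:
                    continue
                evidence.append(content)
                # Words of retrieved content that are known cues seed the next hop.
                for word in content.lower().split():
                    if word in graph.cue_to_tags and word not in visited_cues:
                        frontier.add(word)
        visited_cues |= frontier
    return evidence


graph = CueTagContentGraph()
graph.add_memory("alice adopted a dog named rex", tags={"pets"}, cues={"alice", "dog"})
graph.add_memory("rex visits the vet every march", tags={"vet"}, cues={"rex", "vet"})
graph.add_memory("alice likes jazz", tags={"music"}, cues={"alice", "jazz"})


def toy_score(tag, evidence):
    # Hypothetical relevance rule for a question about Alice's dog and the vet.
    return 1.0 if tag in {"pets", "vet"} else 0.0


evidence = reconstruct(graph, {"alice"}, toy_score, beam=1)
print(evidence)
```

In this toy run the second memory is reachable only through a cue discovered in the first, which is the behavior a single retrieve-then-reason pass cannot reproduce; the beam and the score threshold are what keep the expansion bounded.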
The experiments are conducted on the LoCoMo and LongMemEval benchmarks, showcasing substantial improvements over existing baselines. The reported enhancements of up to 23% in performance metrics, alongside reductions in token and runtime costs, provide strong empirical support for the proposed method. However, the paper would benefit from more detailed statistical analyses and comparisons with a broader range of existing models to strengthen the claims of superiority.
The paper lacks detailed implementation specifics, such as hyperparameter settings and model architectures, which are crucial for reproducibility. While the benchmarks used are established, the absence of a public code repository or supplementary materials limits the ability of other researchers to replicate the results.
One limitation is the reliance on specific benchmarks, which may not fully capture the versatility of the proposed method across diverse tasks. Additionally, the paper does not address potential scalability issues or the impact of memory size on performance, which could be critical for real-world applications.
The framework has the potential to significantly enhance AI applications that require long-term memory and reasoning capabilities, such as personal assistants and decision support systems. However, the authors acknowledge the ethical considerations surrounding data privacy and governance, which are essential for responsible deployment in real-world scenarios.
Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.
Primary: National University of Singapore
All Institutions: National University of Singapore, University of Oxford, Tencent
The main contribution of this paper is the introduction of Dream.exe, a framework that evaluates video generation models by translating their outputs into executable robot actions, thereby assessing their understanding of physical laws. This work is a substantial step forward in linking generative models to real-world applications, with the potential to impact both the fields of machine learning and robotics significantly.
The paper introduces Dream.exe, a novel evaluation framework that connects video generation models to robotic manipulation tasks. This approach is innovative as it operationalizes the concept of physical grounding by converting generated videos into executable robot trajectories, allowing for a direct assessment of the models' understanding of physical laws. The methodology is well-structured, with a clear pipeline from scene image and task description to execution in a physics simulator. The framework's ability to evaluate multiple models across various manipulation tasks adds depth to its methodological contribution.
The experiments are comprehensive, covering 101 manipulation tasks with varying levels of physical complexity. The evaluation metrics—visual quality, trajectory fidelity, and execution success—are well-defined and relevant. The results indicate that some models can achieve execution success, which is a significant finding. However, the paper could benefit from more detailed statistical analysis of the results to strengthen the claims made.
The paper mentions that Dream.exe will be open-sourced, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics that would allow other researchers to replicate the experiments easily. Including more information about the datasets, model configurations, and evaluation criteria would enhance reproducibility.
One limitation is the reliance on a physics simulator for execution, which may not fully capture the complexities of real-world robot manipulation. Additionally, the paper notes that visual quality does not correlate well with executability, highlighting a gap in existing evaluation metrics. This could indicate that while the models perform well in simulation, their real-world applicability may be limited.
The implications of this work are significant, as it bridges the gap between generative models and practical robotic applications. By demonstrating that generative priors can encode meaningful physical knowledge, this research opens avenues for improving robotic manipulation tasks and could influence future research in both video generation and robotics.
We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT→GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.
Primary: Zhejiang University
All Institutions: Zhejiang University, Tongji University, University of Chinese Academy of Sciences, Yale University
The main contribution of this paper is the introduction of VideoKR, a large-scale dataset designed to enhance knowledge- and reasoning-intensive video understanding, along with a robust methodology for its generation and evaluation. This work significantly advances the field by providing new benchmarks and insights into the design of datasets that foster deeper reasoning capabilities in video understanding systems.
The methodology presented in VideoKR is robust, with a clear focus on generating a large-scale dataset specifically for knowledge- and reasoning-intensive video understanding. The human-in-the-loop example generation pipeline is innovative, allowing for a targeted approach to progressively deepen video reasoning capabilities. The authors emphasize the importance of difficulty, diversity, and reliability in the examples, which is a significant contribution to dataset design in the field. The use of Chain-of-Thought (CoT) rationales adds depth to the reasoning process, making it a valuable methodological advancement.
The experiments conducted are comprehensive, showcasing the effectiveness of the VideoKR dataset in improving model performance on knowledge-intensive video reasoning tasks. The comparison with prior post-training approaches provides a solid empirical foundation for the claims made. The introduction of the VideoKR-Eval benchmark is particularly noteworthy, as it sets a new standard for evaluating video understanding capabilities, ensuring that the evaluation metrics align with genuine understanding rather than superficial textual shortcuts.
The paper includes links to the dataset and code repository, which is a positive aspect for reproducibility. However, the absence of a demo URL limits immediate accessibility for practitioners looking to experiment with the proposed methods. Detailed descriptions of the experimental setup and results are provided, which aids in reproducibility.
One limitation is the reliance on a specific pipeline (SFT→GRPO) for the experiments, which may not generalize across all video reasoning tasks. Additionally, while the dataset is large, the focus on expert-domain videos may limit its applicability to more general video understanding tasks. The authors could also address potential biases in the dataset generation process.
The implications of this work are significant, as it addresses a critical gap in video understanding research by focusing on knowledge-intensive reasoning. The development of a dedicated dataset and benchmark can catalyze further research and advancements in the field, potentially leading to more capable AI systems in video analysis, education, and beyond.
Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.
Primary: Moscow Independent Research Institute of Artificial Intelligence
All Institutions: Moscow Independent Research Institute of Artificial Intelligence, Moscow State Institute of Steel and Alloys
The paper introduces GENEB, a benchmark for evaluating genomic foundation models, addressing significant gaps in the current evaluation landscape and providing a systematic framework for model comparison. The comprehensive methodology and findings have the potential to reshape how genomic models are assessed and selected, promoting more informed and effective applications in the field.
The methodology introduced by GENEB is robust and systematic, employing a unified probing-based protocol to evaluate a large set of genomic foundation models across diverse tasks. This approach allows for controlled comparisons that expose task-level trade-offs and model performance discrepancies, addressing a significant gap in the current evaluation landscape of genomic models. The use of Matthews Correlation Coefficient (MCC) as the primary metric is appropriate given its robustness to class imbalance, which is crucial in genomic tasks. However, the paper could benefit from more explicit details on the probing protocol and the selection criteria for tasks included in the benchmark.
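The point about class imbalance can be illustrated with a small worked example. The sketch below computes MCC from binary confusion-matrix counts using the standard formula; the counts are invented for illustration and are not taken from GENEB, and the helper names `mcc` and `accuracy` are hypothetical.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced task: 950 negatives and 50 positives.
# A degenerate classifier that always predicts "negative":
always_negative = dict(tp=0, tn=950, fp=0, fn=50)
# A classifier that recovers 40 of the 50 positives with 30 false alarms:
informative = dict(tp=40, tn=920, fp=30, fn=10)

for name, counts in [("always_negative", always_negative), ("informative", informative)]:
    print(name, "accuracy:", accuracy(**counts), "MCC:", round(mcc(**counts), 3))
```

Accuracy barely separates the two classifiers because the majority class dominates the count, whereas MCC assigns the degenerate classifier a score of zero and rewards the one that actually finds positives.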
The experimental evaluation is thorough, covering 40 genomic foundation models across 100 tasks and providing a comprehensive analysis of model performance. The results reveal important insights into the limitations of current evaluation practices, such as the instability of aggregate leaderboards and the influence of architectural choices over mere parameter count. The analysis of few-shot performance and the identification of category-specific strengths and weaknesses are particularly valuable, although the paper could enhance clarity by including more visual aids or summaries of key findings.
The paper emphasizes reproducibility by detailing the evaluation framework and the specific metrics used. However, the lack of publicly available code or a demo URL limits the practical reproducibility of the findings. The authors mention plans to release GENEB as a public benchmark, which would significantly enhance reproducibility and community engagement if realized.
The paper acknowledges several limitations, including the underrepresentation of long-range tasks, potential noise in task definitions, and the exclusion of certain genomic models due to various constraints. Additionally, the focus on eukaryotic tasks may skew the findings and limit applicability to prokaryotic or viral genomics. The authors also note that the evaluation of frozen representations may underestimate the performance achievable through task-specific fine-tuning.
The introduction of GENEB is poised to improve the rigor of model comparison in genomic representation learning, facilitating better model selection and advancing the field by providing a unified evaluation framework. This could lead to more reliable applications in clinical and agricultural settings, where genomic models are increasingly relevant. The emphasis on category-aware evaluation is particularly important for ensuring that practitioners select models that are genuinely suited to their specific tasks.
Controlled experiments are the backbone of machine learning research, but at the scale of modern foundation models, they have become prohibitively expensive. Instead, the community increasingly relies on research strategies that approximate the ideal experiment at a fraction of the cost: proxy experiments and scaling laws, observational studies with publicly available models, and single-run designs that leverage variation within individual training runs. In this work, we argue that there is no free lunch when approximating large-scale experiments on a compute budget. Specifically, savings in compute come at the cost of validity threats -- hidden and sometimes untestable assumptions that, when violated, can invalidate research claims. To help navigate such threats, we propose an evaluation framework that casts foundation model research as a causal inference problem. Within this framework, we evaluate different research strategies through four types of validity adapted from the empirical social sciences -- statistical, internal, external, and construct validity. We find that each strategy comes with a characteristic validity profile: proxy experiments trade external and construct validity for statistical and internal validity; observational studies face confounding and effect heterogeneity; and single-run designs are strained by interference between treated units. This analysis reveals several validity threats that have received insufficient attention in the literature. Overall, our evaluation framework provides researchers with a practical toolkit for scrutinizing validity threats in foundation model research designs.
Primary: University of Tübingen
All Institutions: University of Tübingen, University of Vienna, Tübingen AI Center
This paper provides a crucial evaluation framework for scrutinizing validity threats in foundation model research designs. It offers a novel and timely perspective by casting foundation model research as a causal inference problem and systematically applying validity types from the empirical social sciences to common research strategies, thereby addressing the pressing challenge of conducting rigorous research in a compute-constrained environment. The comprehensive analysis of different research strategies through the lens of statistical, internal, external, and construct validity, and the identification of previously under-emphasized validity threats, positions this work as a potentially foundational contribution to the methodology of large-scale machine learning research.
The paper proposes a conceptual methodology, adapting established frameworks from the empirical social sciences—specifically, causal inference and four types of validity (statistical, internal, external, and construct validity)—to scrutinize research designs for foundation models. It casts foundation model research as a causal inference problem, which is a powerful lens for identifying hidden assumptions and potential threats to validity. The approach involves analyzing common research strategies in the foundation model space (proxy experiments, scaling laws, observational studies, and single-run designs) through this validity framework. The methodology is analytical and aims to provide a structured way to think about the rigor and generalizability of findings in a compute-constrained environment. While the full details of the framework's application to each strategy are not provided in the given text, the abstract outlines specific validity threats identified for each strategy, suggesting a concrete and systematic analysis.
This paper is a conceptual and methodological work; therefore, it does not present traditional experimental evaluations with datasets and results. Its "evaluation" is an analytical one, evaluating different research *strategies* rather than specific models or algorithms. The success of this paper's "evaluation" lies in the clarity, comprehensiveness, and utility of the proposed framework and the insights it generates regarding existing research practices. Without the full text, it's impossible to assess the depth and rigor of this analytical evaluation.
As a conceptual framework paper, reproducibility in the traditional sense (e.g., code, experimental setups) is not directly applicable. However, the framework itself should be clearly defined and articulated such that other researchers can understand, apply, and critique it. The "practical toolkit" mentioned in the abstract implies a structured approach that should be reproducible in its application. The discussion mentions "Open-science initiatives like the Marin Project that openly document training recipes and meta-data can also help," which aligns with principles of reproducibility in the broader ML community.
The primary limitation of this evaluation is the lack of the full paper content for the main sections (e.g., `neurips/sections/proxy`, `neurips/sections/observational`, `neurips/sections/singlerun-v5`, `neurips/sections/validity-profiles`). Therefore, the assessment of the framework's depth, specific insights, and practical utility is based primarily on the abstract and the high-level structure. Without these details, it's difficult to ascertain if the framework is sufficiently comprehensive, if the identified validity threats are exhaustively covered, or if the proposed solutions/mitigations are practical and well-justified. Another potential limitation, inherent in adapting frameworks from other fields, is the challenge of ensuring that the concepts (e.g., construct validity) are appropriately translated and applied to the unique context of machine learning and foundation models without oversimplification or misinterpretation.
This paper has the potential for significant broader impact. By providing a structured framework for evaluating validity threats, it can elevate the methodological rigor of foundation model research. It encourages researchers to critically examine their experimental designs, understand the limitations of their findings, and make more robust claims. This can lead to more reliable and trustworthy research, better allocation of compute resources, and a more mature scientific discourse around large-scale ML. It could serve as a foundational reference for designing future experiments, reviewing papers, and teaching research methodology in the era of large models. The emphasis on "hidden and sometimes untestable assumptions" is crucial for fostering a more transparent and self-aware research community.
While household robots are often evaluated based on task completion, everyday domestic environments involve value-conflicting situations in which robots are expected to choose actions that prioritize values other than task success, such as human autonomy, efficiency, or social appropriateness. Yet, there are no benchmarks for evaluating robots' value preferences in such scenarios. We introduce RobotValues, a benchmark to evaluate household robot planners in 10K value-conflict scenarios. Each instance consists of a realistic household image with multiple plausible robot actions that prioritize different human values. We construct RobotValues through LLM-assisted scenario generation, stakeholder-grounded value extraction, image generation, and automatic quality control. Using RobotValues, we evaluate VLMs used in robotics and find that models exhibit default value preferences, including safety and accommodation, while underselecting privacy-prioritizing actions. When the models are instructed to prioritize specific values that conflict with their own preferences, they often fail to override their default actions, choosing incorrect actions 80% of the time. These findings suggest that household robot evaluation should measure not only task completion or safety compliance, but also whether robots can choose among plausible actions when human values conflict.
Primary: Seoul National University
All Institutions: Seoul National University
The paper presents RobotValues, a benchmark for evaluating household robots in value-conflict scenarios, significantly advancing the understanding of how robots can prioritize human values in decision-making. The methodology is innovative, and the findings have important implications for the future development of socially aware robotic systems.
The methodology is innovative, introducing the RobotValues benchmark, which systematically evaluates household robots in value-conflict scenarios. The use of LLMs for scenario generation and stakeholder-grounded value extraction is a novel approach that enhances the realism and relevance of the scenarios. The pipeline for generating diverse household situations and actions is well-structured, ensuring a comprehensive evaluation of robot decision-making in complex environments.
The experiments conducted using the RobotValues benchmark are rigorous, providing insights into the default value preferences of various VLMs in robotics. The evaluation metrics, including the Bradley-Terry score and accuracy in value-conditioned settings, are appropriate for assessing the models' performance. The findings reveal significant shortcomings in the models' ability to prioritize conflicting human values, which is a critical aspect of household robot functionality.
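For readers unfamiliar with Bradley-Terry scoring, the sketch below fits Bradley-Terry strengths from pairwise win counts with the standard minorization-maximization update. How the paper computes its score is not described here, so this is only a generic illustration: the function `fit_bradley_terry`, the three value labels, and the win counts are all hypothetical.

```python
def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of times option i was preferred over option j.
    Uses the standard MM update and returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


# Hypothetical counts: how often a model chose the action prioritizing the
# row value over the action prioritizing the column value (10 trials per pair).
values = ["safety", "efficiency", "privacy"]
wins = [
    [0, 8, 9],  # safety preferred over efficiency 8 times, over privacy 9 times
    [2, 0, 7],  # efficiency
    [1, 3, 0],  # privacy
]
strengths = fit_bradley_terry(wins)
for name, s in sorted(zip(values, strengths), key=lambda item: -item[1]):
    print(f"{name}: {s:.3f}")
```

Under the model, the probability that option i is preferred over option j is p_i / (p_i + p_j), so the fitted strengths summarize all pairwise choices in a single ranking of default value preferences.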
The paper provides detailed descriptions of the data generation pipeline, quality control measures, and evaluation protocols, which facilitate reproducibility. However, the reliance on LLMs for data generation and the absence of a publicly available dataset may hinder full reproducibility for external researchers.
The primary limitation is the synthetic nature of the household images, which may not capture the full complexity of real-world environments. Additionally, the potential for annotation errors and the challenges in ensuring diversity and realism in generated scenarios are acknowledged. The benchmark may also not cover all possible value conflicts that could arise in real household settings.
The introduction of RobotValues has the potential to significantly influence the design and evaluation of household robots, encouraging developers to consider human values in robot decision-making processes. This work could lead to improvements in user trust and acceptance of robotic systems in domestic environments, ultimately enhancing human-robot interaction.
Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.
Primary: The Hong Kong University of Science and Technology (Guangzhou)
All Institutions: The Hong Kong University of Science and Technology (Guangzhou)
The main contribution of this paper is the introduction of EvoDS, a self-evolving autonomous data science agent that effectively integrates skill acquisition and context management, significantly advancing the capabilities of automated data science agents. The technical contribution is substantial, addressing critical limitations in existing methodologies and providing a robust framework that could reshape how data science tasks are approached in practice.
The paper presents a novel approach to autonomous data science through the EvoDS framework, which integrates Autonomous Skill Acquisition and Adaptive Context Compression within a hierarchical multi-agent architecture. The methodology is well-structured, addressing key challenges in existing data science agents, such as static action spaces and long-context management. The introduction of a multi-agent reinforcement learning algorithm that optimizes task performance, capability acquisition, and context management is a significant advancement. The theoretical foundation provided for the design choices enhances the credibility of the proposed methods.
The experiments conducted across four diverse benchmarks demonstrate the effectiveness of EvoDS, showing an average performance improvement of 28.9% over state-of-the-art open-source data science agents. The empirical results are compelling and support the claims made in the paper. However, the paper could benefit from a more detailed analysis of the datasets used and the specific metrics employed for evaluation.
The authors have made the code and data available on GitHub, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings and training protocols, which are crucial for others to replicate the results accurately.
While the paper addresses significant challenges in autonomous data science, it does not thoroughly explore the scalability of the proposed methods in real-world applications. Additionally, the reliance on a hierarchical multi-agent architecture may introduce complexity that could hinder practical deployment. The paper also does not discuss potential biases in the training data or the implications of using LLMs in sensitive applications.
The EvoDS framework has the potential to significantly impact the field of automated data science by enabling agents to learn and adapt over time, which could lead to more efficient and effective data-driven decision-making processes. This could have applications across various domains, including healthcare, finance, and scientific research, where data analysis is critical.
Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.
Primary: Alibaba Group
All Institutions: Alibaba Group, Shenzhen Institute of Advanced Technology
EvoTrainer represents a significant advancement in autonomous reinforcement learning by introducing a framework that dynamically evolves both policies and training harnesses, enhancing adaptability and efficiency in model training. The comprehensive evaluation across multiple domains demonstrates its effectiveness, and the innovative methodology sets a new standard for future research in the field.
The paper introduces EvoTrainer, a novel framework for co-evolving LLM policies and training harnesses in autonomous reinforcement learning. The methodology is well-structured, emphasizing the dual evolution of both policies and diagnostic tools, which is a significant departure from traditional static approaches. The use of empirical feedback to adapt the training harness dynamically is particularly innovative, allowing for more nuanced and effective training processes. The framework's design is robust, with clear mechanisms for version control, diagnostics, and skill reuse, which collectively enhance the adaptability and efficiency of the training process.
The experiments are comprehensive, evaluating EvoTrainer across multiple domains (mathematical reasoning, competitive programming, and software engineering). The results demonstrate significant improvements over baseline models, including human-engineered RL references, with statistical significance reported. The evaluation metrics are appropriate, and the paper provides detailed comparisons against various baselines, showcasing the effectiveness of the proposed method. However, the paper could benefit from additional clarity in presenting the experimental setup and results, particularly in terms of the specific configurations used.
The paper lacks detailed implementation specifics, such as code availability or a clear description of the experimental setup, which may hinder reproducibility. While the methodology is described in detail, the absence of a public repository or demo limits other researchers' ability to replicate the findings. Future work should include these elements to enhance reproducibility and facilitate broader adoption of the framework.
The primary limitation noted is the computational cost associated with running EvoTrainer, which may restrict its practical application in resource-constrained environments. Additionally, the framework's reliance on a specific trainer model (Claude Sonnet 4.6) may limit its generalizability across different architectures. The paper also acknowledges that the current implementation spans only a limited number of versions, suggesting that further exploration of long-term evolution and memory management is necessary.
The proposed framework has the potential to significantly advance the field of autonomous reinforcement learning, particularly in the context of LLMs. By enabling dynamic adaptation of training processes, EvoTrainer could lead to more efficient and effective model training, ultimately enhancing the capabilities of AI systems in various applications. The implications for automated software engineering and code generation are particularly noteworthy, as the framework could streamline development processes and improve the quality of generated code.
Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
Primary: Virginia Tech
All Institutions: Virginia Tech, National Science Foundation
The main contribution of this paper is the introduction of Curation-Bench, a benchmark that evaluates the ability of generalist agents to automate data curation, demonstrating that structured scaffolding significantly enhances the exploration capabilities of these agents. This work not only advances the field of machine learning by providing a new evaluation framework but also offers practical insights into improving the efficiency and effectiveness of data curation processes.
The paper introduces a novel benchmark, Curation-Bench, designed to evaluate the ability of generalist coding agents to automate the data curation process. The methodology is well-structured, fixing the model and training recipe while allowing agents to interactively explore and implement data policies. The introduction of scaffolds to guide agent behavior is a significant contribution, as it enhances the exploration capabilities of agents beyond mere local optimizations. The paper effectively combines theoretical insights with practical implementations, demonstrating how structured guidance can lead to better policy exploration.
The experiments are robust, involving multiple agents and a variety of tasks across different datasets. The results show that scaffolded agents outperform baseline methods while requiring significantly less data. The evaluation metrics are comprehensive, assessing both the quality of the final curated datasets and the trajectory of the agents' decision-making processes. The findings are well-supported by empirical evidence, showcasing the effectiveness of the proposed methods in real-world scenarios.
The paper provides a clear description of the experimental setup, including the code and benchmark being open-sourced. This transparency enhances reproducibility, allowing other researchers to replicate the experiments and validate the findings. The detailed methodology and the availability of resources contribute positively to the reproducibility of the results.
The paper acknowledges limitations, including a focus primarily on vision-language instruction tuning, which may not generalize to other domains. The scaffold comparison is not exhaustive, and the authors note that the effectiveness of different scaffolding strategies may vary. Additionally, the reliance on trajectory diagnostics introduces subjective judgments that could affect the evaluation.
The potential applications of this research are significant, as automating data curation could streamline the development of AI models, reduce costs, and improve the quality of training datasets. However, the authors also caution about the risks of biases in agent-curated datasets and emphasize the need for oversight in deploying such systems. This duality highlights the importance of ethical considerations in the advancement of AI technologies.
We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
Primary: Tsinghua University
All Institutions: Tsinghua University, Shanghai AI Laboratory, The Chinese University of Hong Kong, City University of Hong Kong
This paper presents a significant advancement in the training of large language models by leveraging smaller models for structured exploration, offering a novel approach that enhances both performance and efficiency in GRPO settings. The comprehensive methodology and rigorous experimental validation position it as a noteworthy contribution to the field of machine learning.
The paper introduces a novel framework, S2L-PO, which utilizes smaller models to enhance rollout diversity in GRPO for training larger models. The methodology is well-grounded in empirical findings and theoretical analysis, highlighting the advantages of policy-level diversity over token-level randomness. The proposed progressive annealing strategy for transitioning from small to large models is innovative and effectively addresses the challenges of training stability and performance degradation.
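The quantities this assessment refers to can be made concrete with a small sketch. The pass@k function below is the standard unbiased estimator computed from n samples with c correct, and the advantage function standardizes rewards within one rollout group in the GRPO style. The linear annealing schedule and the names `small_model_fraction` and `rollout_sources` are assumptions made for illustration; the source does not specify the schedule S2L-PO actually uses.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def group_relative_advantages(rewards):
    """GRPO-style advantages: rewards standardized within one rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]


def small_model_fraction(step: int, total_steps: int) -> float:
    """Hypothetical linear annealing: 1.0 at step 0, 0.0 at the final step."""
    return max(0.0, 1.0 - step / total_steps)


def rollout_sources(step: int, total_steps: int, group_size: int, rng: random.Random):
    """Decide, per rollout, whether it comes from the small explorer or the large learner."""
    p_small = small_model_fraction(step, total_steps)
    return ["small" if rng.random() < p_small else "large" for _ in range(group_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        print(step, rollout_sources(step, 100, 8, rng))
    print(pass_at_k(n=10, c=5, k=1))
```

Under this assumed schedule, early groups consist entirely of small-model rollouts and late groups entirely of the learner's own samples, which mirrors the offline-to-on-policy transition described above without claiming to reproduce it.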
The experiments are comprehensive, evaluating the proposed method across multiple model families and benchmarks. The results demonstrate significant improvements in performance and sample efficiency, with clear metrics and comparisons to standard GRPO. The paper provides sufficient evidence to support its claims, including ablation studies that reinforce the necessity of the proposed approach.
The authors commit to releasing the complete codebase and provide detailed descriptions of their methodology, experimental setup, and hyperparameter configurations. This transparency enhances the reproducibility of their results, which is crucial for the research community.
The study is limited by its focus on mathematical reasoning tasks, and the authors acknowledge that the method's applicability to other domains remains unexplored. Additionally, the computational resources may have constrained the breadth of their evaluations across different model families and tasks.
The proposed framework has the potential to significantly improve the efficiency and effectiveness of training large language models, which could lead to advancements in various applications of machine learning, particularly in areas requiring robust reasoning capabilities. The findings may influence future research directions in reinforcement learning and model optimization strategies.
Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational mathematics have received comparatively less attention: research in this area often requires not only proofs but also numerical experimentation, adversarial constructions, and algorithm design. In this paper, we introduce an agentic research system, Iteris, designed for open problems in computational mathematics. We apply Iteris to two open problems from a recent Simons Workshop collection (arXiv:2602.05394). In these case studies, Iteris generated numerical evidence, constructions, and proof drafts that led, after expert review and correction, to verified results. The first result is a phase diagram for the asymptotic comparison between conjugate gradient and randomized coordinate descent on power-law spectra; the second is a counterexample showing that QR factorization with column pivoting can fail to select well-conditioned submatrices even under low coherence. These case studies suggest that agentic AI systems can participate meaningfully in research workflows for open problems in computational mathematics, while human validation remains essential.
Primary: Great Bay University
All Institutions: Great Bay University, Beijing International Center for Mathematical Research, New Cornerstone Science Laboratory, Peking University, School of Mathematical Sciences, Center for Intelligent Computing, Center for Machine Learning Research, Great Bay Institute for Advanced Study, Zhongguancun Academy
This paper introduces Iteris, an agentic research system that leverages large language models and a structured research loop to tackle open problems in computational mathematics. The system's ability to generate novel numerical evidence, adversarial constructions, and proof drafts, leading to verified mathematical discoveries like a phase diagram for CG vs. RCD and a counterexample for QRCP, demonstrates a significant advancement in applying agentic AI to scientific research. The methodology, while building on existing agentic patterns, is well-adapted and integrated with essential tools for the domain, showcasing a practical and impactful approach to human-AI collaboration in complex scientific discovery.
The paper introduces Iteris, an agentic research system designed for open problems in computational mathematics. The methodology is built around a robust "Analyze, Plan, Execute, Reflect" research loop, orchestrated by a central Research Agent. This loop is supported by specialized Planner, Executor, and Reflector agents, each leveraging large language models (specifically GPT-4) for their respective tasks. A key strength of Iteris is its integration of diverse tools essential for computational mathematics, including a Python interpreter (with scientific libraries like NumPy, SciPy, Matplotlib), Wolfram Alpha, a LaTeX compiler, and web search capabilities. This tool integration allows the agents to perform numerical experimentation, symbolic computation, document generation, and information retrieval, which are critical for the target domain. The multi-agent architecture with clear roles and the iterative refinement process are well-conceived for tackling complex, open-ended research problems. While the core agentic loop structure builds on existing patterns like ReAct and Reflexion, its specific adaptation and tool integration for the unique demands of computational mathematics are well-executed and appropriate.
The experimental evaluation is conducted through two compelling case studies, both tackling open problems from a recent Simons Workshop collection.
1. **Asymptotic Comparison between Conjugate Gradient (CG) and Randomized Coordinate Descent (RCD):** Iteris successfully explored the convergence behavior of CG and RCD on power-law spectra. Through iterative numerical experimentation and analysis, the system generated plots, hypothesized a phase transition, and drafted proof sketches. This led to the discovery of a phase diagram, which was subsequently verified and corrected by human experts, providing a novel result in numerical linear algebra.
2. **QR Factorization with Column Pivoting (QRCP) for Submatrix Selection:** Iteris investigated whether QRCP reliably selects well-conditioned submatrices, even under low coherence. The system demonstrated its ability to search for existing knowledge, generate small-scale numerical examples, and, crucially, construct an adversarial counterexample where QRCP fails to select a well-conditioned submatrix under specific low-coherence conditions. This is a significant finding, revealing a limitation of a widely used algorithm.
The results from both case studies are concrete mathematical discoveries, not just demonstrations of problem-solving on known benchmarks. The fact that these findings were verified by human experts underscores their validity and the meaningful contribution of Iteris to the research process. The experiments effectively showcase Iteris's capabilities in numerical exploration, hypothesis generation, construction of specific examples (including adversarial ones), and proof drafting.
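To give a sense of the kind of numerical experiment the first case study involves, here is a minimal pure-Python sketch that runs CG and RCD on a small symmetric positive definite matrix with a power-law spectrum. Everything specific is an assumption made for illustration and not taken from the paper: the eigenvalues λ_i = i^(-α), the Householder reflection used to rotate the spectrum, uniform coordinate sampling in RCD, the problem size, and the iteration budgets.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def power_law_matrix(n, alpha, seed=0):
    """Build A = H diag(i^-alpha) H^T with H a random Householder reflection."""
    rng = random.Random(seed)
    lam = [(i + 1) ** (-alpha) for i in range(n)]
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    vv = dot(v, v)
    H = [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / vv for j in range(n)]
         for i in range(n)]
    return [[sum(H[i][k] * lam[k] * H[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


def a_norm_sq_error(A, x, x_star):
    """Squared A-norm of the error, (x - x*)^T A (x - x*)."""
    e = [a - b for a, b in zip(x, x_star)]
    return dot(e, matvec(A, e))


def conjugate_gradient(A, b, iters):
    x = [0.0] * len(b)
    r = b[:]                      # residual b - A x for x = 0
    p = r[:]
    rs = dot(r, r)
    for _ in range(iters):
        if rs == 0.0:
            break
        Ap = matvec(A, p)
        step = rs / dot(p, Ap)
        x = [xi + step * pi for xi, pi in zip(x, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def randomized_coordinate_descent(A, b, steps, seed=0):
    """Exact minimization along one uniformly sampled coordinate per step."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(steps):
        i = rng.randrange(n)
        x[i] += (b[i] - dot(A[i], x)) / A[i][i]
    return x


n, alpha = 30, 1.5
A = power_law_matrix(n, alpha)
x_star = [1.0] * n
b = matvec(A, x_star)
err0 = a_norm_sq_error(A, [0.0] * n, x_star)
# One CG iteration costs one matrix-vector product, roughly n coordinate steps.
err_cg = a_norm_sq_error(A, conjugate_gradient(A, b, 10), x_star)
err_rcd = a_norm_sq_error(A, randomized_coordinate_descent(A, b, 10 * n), x_star)
print(f"initial={err0:.3e}  CG={err_cg:.3e}  RCD={err_rcd:.3e}")
```

Sweeping `alpha` and the budget in a script like this is one plausible way to chart where each method wins, but a single small run says nothing about the asymptotic phase diagram reported in the paper.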
The paper provides a clear description of the Iteris framework, its agents, and the tools used. The two case studies are detailed, outlining the problems, Iteris's approach, and the final verified results. The appendices contain the detailed proofs for the mathematical findings, which are independently verifiable. Crucially, the authors provide GitHub links for the code related to the specific case studies, which enhances the reproducibility of the *results*. However, the exact prompts used for the LLM agents and the full, step-by-step trace of the agent's discovery process (including all intermediate thoughts, tool calls, and reflections) are not fully detailed in the main paper or appendices. While the overall methodology is clear, reproducing the *exact path of discovery* taken by Iteris might require more granular logging or prompt engineering details. Nevertheless, the core findings are robust and verifiable.
The paper acknowledges several important limitations. Firstly, the computational cost of running LLM-based agents for extended research loops is high. Secondly, human validation and correction remain essential; Iteris acts as a powerful copilot rather than a fully autonomous researcher, highlighting the current limits of AI in complex, open-ended scientific discovery. The system's current scope is limited to specific types of computational mathematics problems, and scaling to extremely complex, multi-year research projects would be challenging. Furthermore, like all LLM-based systems, Iteris is susceptible to hallucination, necessitating rigorous human oversight. The paper also implicitly suggests that the agent's performance is highly dependent on the quality of the underlying LLM (GPT-4 in this case) and the effectiveness of prompt engineering, which is not fully detailed.
Iteris represents a significant step towards enabling agentic AI systems to participate meaningfully in scientific discovery, particularly in computational mathematics. Its success in generating novel numerical evidence, constructions, and proof drafts for open problems suggests a powerful paradigm for human-AI collaboration in research. This work could accelerate discovery in various scientific and engineering domains that rely on numerical experimentation, algorithm design, and adversarial analysis. It provides a blueprint for developing more sophisticated AI research assistants that can augment human intellect, allowing researchers to tackle more ambitious problems or explore larger solution spaces. The findings also contribute to the ongoing development of more capable and autonomous AI agents, pushing the boundaries of what LLMs can achieve in complex reasoning and problem-solving tasks.
Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable with attack success rates up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than genuine resistance, and current defenses still fail to reliably mitigate the threat.
Primary: Stanford University
All Institutions: Stanford University, The Ohio State University
This paper presents a comprehensive framework for understanding and evaluating skill-based attacks on agents, offering a novel benchmark and significant empirical findings that could reshape security practices in the field of machine learning.
The paper introduces SkillHarm, a benchmark for skill-based attacks that systematically evaluates the lifecycle of skills used by agents. It presents two attack scenarios—Fixed-Payload Poisoning (FPP) and Self-Mutating Poisoning (SMP)—which are innovative in their approach to understanding and categorizing vulnerabilities in agent workflows. The taxonomy of 12 risk types is a significant contribution, as it provides a structured way to assess potential harms across various components of the agent's operation. The automated construction pipeline, AutoSkillHarm, enhances the methodology by allowing for large-scale instantiation of these attacks, which is a notable advancement in the field.
The experiments conducted demonstrate the effectiveness of the proposed attack scenarios, with success rates of 86.3% for FPP and 69.3% for SMP. The evaluation is rigorous, with a substantial dataset of 879 attack samples across 71 skills, which adds credibility to the findings. The analysis of apparent attack failures provides valuable insights into the limitations of current defenses, further emphasizing the relevance of the research.
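The distinction between genuine resistance and mere non-engagement can be made explicit by reporting attack success both overall and conditioned on the agent having touched the poisoned file. The sketch below is only an illustration: the `AttackTrial` fields and the toy trial counts are invented and do not reflect SkillHarm's schema or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackTrial:
    scenario: str   # "FPP" or "SMP"
    engaged: bool   # hypothetical flag: the agent read or executed the poisoned file
    harmed: bool    # hypothetical flag: the harmful behavior actually occurred


def attack_success_rate(trials):
    return sum(t.harmed for t in trials) / len(trials) if trials else 0.0


def summarize(trials):
    engaged = [t for t in trials if t.engaged]
    failures = [t for t in trials if not t.harmed]
    unengaged_failures = sum(not t.engaged for t in failures)
    return {
        "asr": attack_success_rate(trials),
        "asr_given_engaged": attack_success_rate(engaged),
        "failures_without_engagement":
            unengaged_failures / len(failures) if failures else 0.0,
    }


# Made-up toy data: 4 engaged and harmed, 1 engaged and resisted, 5 never engaged.
toy_trials = (
    [AttackTrial("FPP", True, True)] * 4
    + [AttackTrial("FPP", True, False)]
    + [AttackTrial("FPP", False, False)] * 5
)
summary = summarize(toy_trials)
print(summary)
```

In this toy data the headline rate understates the risk: most failures come from trials where the agent never opened the file, so the rate among engaged trials is twice the overall rate.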
While the paper outlines the methodology and experimental setup, it lacks detailed implementation information that would facilitate reproducibility. The absence of a publicly available code repository or demo limits the ability of other researchers to replicate the findings and build upon the work.
One limitation is the focus on specific attack scenarios, which may not encompass all potential vulnerabilities in agent workflows. Additionally, the reliance on automated construction may introduce biases or limitations in the types of attacks generated. The paper also does not address potential countermeasures in detail, which could provide a more balanced view of the findings.
The implications of this research are significant, as it highlights critical vulnerabilities in agent-based systems that could be exploited in real-world applications. The findings could inform the development of more robust security measures for AI agents, making this work relevant for both academic research and industry applications. The systematic approach to categorizing risks also sets a precedent for future research in this area.
Agentic LLMs with web search change the threat model for text anonymization: weak contextual cues can become cross-referenceable evidence for re-identification, yet those same details also carry the text's downstream analytic value. Existing defenses either remove explicit identifiers, perturb text for formal privacy, or test rewritten text against non-web inference models, leaving underexplored the operating region between resistance to agentic web-search re-identification and utility retention. We introduce AURA (Anonymization with Utility-Retention Adaptation), an LLM-powered mask-reconstruct framework that decouples privacy localization from utility-preserving reconstruction and selects candidates with adversarial privacy and utility-retention checks. We evaluate AURA on real-user interview transcripts using re-identification attacks carried out by web-search agents, along with a utility evaluation based on interviewee-profile facts, codebook facts, and the joint contextual utility grid. Our results show that AURA improves the privacy-utility frontier by using an adaptive privacy scope to strengthen resistance to agentic re-identification and a mask-reconstruct anonymization method to better preserve contextual utility under a fixed privacy scope.
Primary: Khoury College of Computer Sciences, Northeastern University
All Institutions: Khoury College of Computer Sciences, Northeastern University
This paper presents a significant advancement in the field of text anonymization by introducing AURA, a framework that effectively balances privacy and utility in the context of agentic LLMs. The innovative methodology, rigorous experimental evaluation, and potential for broad applications highlight its importance in the ongoing discourse around data privacy and machine learning.
The methodology introduces AURA, an innovative framework that effectively decouples privacy localization from utility-preserving reconstruction. This dual approach is significant as it addresses the challenge of balancing privacy and utility in text anonymization, particularly in the context of agentic LLMs with web search capabilities. The use of adversarial checks for privacy and utility retention is a novel aspect that enhances the robustness of the proposed method.
The experiments are well-structured, utilizing real-user interview transcripts and evaluating the framework against re-identification attacks. The dual evaluation of privacy and utility through various metrics (interviewee-profile facts, codebook facts, and joint contextual utility grid) provides a comprehensive view of AURA's effectiveness. The results indicate a meaningful improvement in the privacy-utility frontier, showcasing the practical applicability of the method.
The paper includes a link to the source code, which is essential for reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup and datasets used, which would enhance the ability of other researchers to replicate the results.
One limitation is the reliance on specific datasets (real-user interview transcripts), which may not generalize across all types of text data. Additionally, the paper does not thoroughly discuss the computational overhead introduced by the mask-reconstruct approach, which could impact its scalability in real-world applications.
The implications of this research are significant, as it addresses a critical issue in data privacy and security in the age of agentic LLMs. The ability to anonymize text while retaining utility has far-reaching applications in fields such as healthcare, social sciences, and any domain where sensitive information is processed. The framework could pave the way for more secure data-sharing practices without sacrificing analytical value.
Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models' confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.
Primary: Singapore-MIT Alliance for Research and Technology (SMART)
All Institutions: Singapore-MIT Alliance for Research and Technology (SMART), NUS Department of Computer Science
The paper makes a significant contribution by revealing the production-evaluation gap in large reasoning models and proposing a new dataset and methodology to investigate this phenomenon. The insights gained from this research could lead to improved training approaches that enhance the reasoning evaluation capabilities of AI systems.
The paper presents a novel methodology through the construction of the Valid-Answer-Invalid-Reasoning (VAIR) dataset, which effectively isolates reasoning evaluation from production. This approach is innovative as it addresses a significant gap in the evaluation of large reasoning models (LRMs) by focusing on their ability to assess reasoning rather than merely produce answers. The use of chain-of-thought analysis and causal patching to investigate answer confirmation bias is methodologically sound and adds depth to the analysis.
The experiments are rigorous, involving both human and LRM evaluations across multiple tasks. The systematic assessment of the production-evaluation gap using the VAIR dataset provides clear empirical evidence of the limitations of LRMs in reasoning evaluation. The results are well-presented and highlight significant differences in performance between humans and LRMs, reinforcing the paper's claims.
While the paper discusses the methodologies and datasets in detail, it lacks explicit links to code or datasets, which could hinder reproducibility. The absence of a project URL for sharing the VAIR dataset and the evaluation methods is a notable limitation.
The study primarily focuses on mathematical reasoning, which may limit the generalizability of the findings to other domains. Additionally, the paper does not directly investigate the impact of different training objectives on the observed biases, which could provide further insights into the production-evaluation gap.
The findings have significant implications for the development of AI systems, particularly in enhancing their reasoning evaluation capabilities. This work could inform future training methodologies that prioritize reasoning evaluation, potentially improving the reliability of AI in critical applications such as education and automated reasoning tasks. The exploration of confirmation biases in AI also raises important ethical considerations regarding the deployment of such models in real-world scenarios.
Interactive driving exposes a failure mode that is easy to miss in rule-aware autonomous-driving stacks: a hard-rule margin can be negative for an ego candidate even though a small lawful accommodation by a non-priority agent would restore feasibility. Existing rulebooks, shields, and reachability filters are strong at vetoing unsafe actions, while prediction-based planners model likely responses. Neither returns a runtime proof object that states which bounded multi-agent edit repairs the maneuver, who owns the edit, whether the request is right-of-way affordable, and what ego fallback remains if the request is not observed. We formulate this missing object as *interactive repair certification* and introduce *CARVE*, a prediction-free certificate layer over a finite lattice of ego-owned and agent-owned tactical operators. Agent-owned requests are admissible only inside \(B_j(s) = \beta(\pi_j)\,\alpha_j^{\max}(s)\), a cooperation envelope that separates kinematic reachability from normative priority. The resulting certificate records the binding rule, repair category, repair set, responsibility-weighted cost split, and fallback. On 589 Lanelet2-geometry-grounded INTERACTION replay episodes, CARVE-Greedy accepts 98.64% of initially vetoed maneuvers and recovers 370/378 human-resolved false vetoes, while preserving 589/589 right-of-way respect, zero priority-agent false positives, and 400/400 negative-stress vetoes. We prove certificate soundness, structural right-of-way respect, exact finite-lattice minimality, fallback contingency, and blame-consistency conditions. CARVE does not predict or require another driver's compliance; it certifies whether a proposed interaction is bounded, attributable, and normatively admissible under declared assumptions.
Primary: unknown
All Institutions: unknown
The paper presents CARVE, a certification framework for interactive driving maneuvers, addressing the critical issue of false vetoes in autonomous vehicle decision-making. The methodology and results indicate a substantial advancement in ensuring safe and normatively admissible interactions in complex traffic scenarios.
The paper introduces CARVE, a novel approach to addressing the false-veto problem in interactive driving by creating a prediction-free certification layer that operates over a finite lattice of tactical operators. This methodology is innovative as it shifts the focus from trajectory prediction to interactive repair certification, allowing for a structured, auditable decision-making process in autonomous vehicles. The use of cooperation envelopes and a clear separation of responsibilities among agents is a significant advancement in ensuring safe interactions in traffic scenarios.
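As a rough illustration of the cooperation envelope quoted in the abstract, \(B_j(s) = \beta(\pi_j)\,\alpha_j^{\max}(s)\), the sketch below checks whether an agent-owned request fits inside the envelope. The priority classes, the \(\beta\) values (0 for priority agents, 1 for yielding agents), the class and function names, and the units are all assumptions made for illustration; the paper's actual operators and constants are not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical priority scaling beta(pi_j). Setting it to 0 for priority agents
# is an assumption meant to mirror "structural right-of-way respect".
BETA = {"priority": 0.0, "yielding": 1.0}


@dataclass
class AgentState:
    priority_class: str       # pi_j: normative priority of agent j (assumed labels)
    max_accommodation: float  # alpha_j^max(s): largest kinematically reachable accommodation


def cooperation_envelope(agent: AgentState) -> float:
    """B_j(s) = beta(pi_j) * alpha_j^max(s)."""
    return BETA[agent.priority_class] * agent.max_accommodation


def request_admissible(agent: AgentState, requested: float) -> bool:
    """A request is admissible only if it is positive and inside the envelope."""
    return 0.0 < requested <= cooperation_envelope(agent)


yielding = AgentState("yielding", 2.0)
priority = AgentState("priority", 2.0)
print(request_admissible(yielding, 1.5))  # within the envelope
print(request_admissible(priority, 0.5))  # priority agents are never asked to accommodate
```

The point of the product form is that kinematic reachability (`max_accommodation`) and normative priority (`BETA`) are kept as separate factors, so a physically feasible accommodation can still be inadmissible.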
The evaluation is robust, utilizing 589 Lanelet2-geometry-grounded INTERACTION replay episodes to demonstrate the effectiveness of CARVE. The results show a high acceptance rate of initially vetoed maneuvers (98.64%) and successful recovery of human-resolved false vetoes (370 out of 378), while maintaining structural right-of-way respect and zero priority-agent false positives. The experiments also include various baselines and sensitivity analyses, which strengthen the findings.
The paper provides a clear description of the methodology, including the algorithmic approach and evaluation metrics, which aids in reproducibility. However, the lack of a publicly available implementation or code repository limits the ability for others to replicate the results independently.
The paper acknowledges that CARVE is a certification layer and not a complete autonomous vehicle stack, which may limit its applicability in real-world scenarios without integration with other systems. Additionally, the evaluation relies on a specific dataset and may not generalize to all driving contexts or more complex traffic scenarios. The assumptions made regarding the finite tactical lattice and the declared envelopes may also restrict the method's flexibility.
The potential applications of CARVE are significant, as it enhances the safety and transparency of autonomous driving systems by providing a clear framework for maneuver certification. This could lead to increased trust in autonomous vehicles and facilitate their integration into existing traffic systems. The approach may also inform future research on interactive decision-making in multi-agent environments beyond driving.
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
Primary: Ant Group
All Institutions: Ant Group, Zhejiang University
MemDreamer proposes an innovative framework to tackle the challenge of long video understanding by decoupling perception and reasoning. The core of the methodology lies in two main components: a Hierarchical Graph Memory for perception and an Agentic Retrieval Mechanism for reasoning. The Hierarchical Graph Memory is a top-down, three-tier architecture designed for semantic abstraction. It incrementally streams video content to construct: 1) an Event Graph (Level 1) capturing spatiotemporal and causal relations between short video events, 2) a Summary Graph (Level 2) abstracting sequences of events into higher-level summaries, and 3) a Concept Graph (Level 3) representing overarching themes and concepts. Each level is populated and connected using a Vision-Language Model (VLM) to summarize and relate information. This hierarchical structure effectively compresses vast amounts of visual information into a manageable, semantically rich graph. The Agentic Retrieval Mechanism employs an LLM-based agent that interacts with this graph memory through an Observation-Reason-Action (O-R-A) loop. The agent is equipped with a set of tools (e.g., `search_node`, `traverse_edge`, `summarize_path`, `query_VLM`) to navigate the hierarchical graph, retrieve relevant information, and synthesize answers to complex queries. This agentic approach allows the reasoning module to operate on a highly condensed, contextually relevant subset of information, rather than processing the entire video sequence, thereby mitigating token explosion and attention dilution. The plug-and-play nature of the framework, allowing integration with various VLMs and LLMs, is a significant design strength.
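The three-tier memory and two of the tools named above can be sketched as a small in-memory graph. This is only a toy stand-in under stated assumptions: the `Node` and `GraphMemory` classes, the relation names (`contains`, `before`), and the keyword matching are hypothetical, and a real system would populate nodes with a VLM and retrieve with something stronger than substring search.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    level: int   # 1 = event, 2 = summary, 3 = concept
    text: str
    edges: dict = field(default_factory=dict)  # relation name -> list of target node ids


class GraphMemory:
    def __init__(self):
        self.nodes = {}

    def add_node(self, node_id, level, text):
        self.nodes[node_id] = Node(node_id, level, text)

    def add_edge(self, src, relation, dst):
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    def search_node(self, keyword, level=None):
        """Return ids of nodes whose text mentions the keyword (optionally at one level)."""
        kw = keyword.lower()
        return [n.node_id for n in self.nodes.values()
                if kw in n.text.lower() and (level is None or n.level == level)]

    def traverse_edge(self, node_id, relation):
        """Follow one relation type out of a node."""
        return list(self.nodes[node_id].edges.get(relation, []))


memory = GraphMemory()
memory.add_node("c1", 3, "cooking dinner")
memory.add_node("s1", 2, "the person prepares vegetables")
memory.add_node("e1", 1, "person washes carrots")
memory.add_node("e2", 1, "person chops carrots")
memory.add_edge("c1", "contains", "s1")
memory.add_edge("s1", "contains", "e1")
memory.add_edge("s1", "contains", "e2")
memory.add_edge("e1", "before", "e2")

# One observe-reason-act step: find candidate events, then follow a temporal edge.
hits = memory.search_node("carrots", level=1)
print(hits, memory.traverse_edge(hits[0], "before"))
```

An agent loop would call these tools repeatedly, reading only the returned node texts, which is how the reasoning context stays a small fraction of the full video.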
The experimental evaluation is comprehensive and compelling. MemDreamer is tested across four mainstream benchmarks: EgoSchema (long-term planning), Perception-Reasoning (causal reasoning), Next-QA (temporal reasoning), and ActivityNet-QA (factual QA). The results consistently demonstrate SOTA performance, significantly outperforming various strong VLM baselines (e.g., Video-LLaVA, Video-ChatGPT, Long-Video-LLaMA). Notably, MemDreamer achieves a 12.5 point absolute accuracy gain on EgoSchema while constraining the reasoning context window to merely 2% of full-context ingestion, showcasing its efficiency and effectiveness. Ablation studies rigorously validate the design choices, confirming the importance of each hierarchical graph level, the superiority of agentic retrieval over simpler methods, and the flexibility with different LLM backbones (GPT-4 vs. LLaMA-2). A particularly insightful contribution is the statistical analysis revealing a strong positive linear correlation between a VLM's performance on logic reasoning benchmarks (Big-Bench Hard) and its performance on long-video understanding tasks. This finding provides empirical support for the agentic, reasoning-centric approach and suggests a new paradigm for multimodal comprehension. The gap with human experts is narrowed to only 3.7 points, indicating a high level of performance.
The paper provides a clear methodology, detailed architectural descriptions, and specific choices for VLM and LLM backbones (e.g., Video-LLaVA, GPT-4, LLaMA-2). The benchmarks used are standard and publicly available. The authors state that their code will be released at a specified GitHub repository, which is crucial for reproducibility. The appendix includes additional implementation details, hyper-parameters, and experimental setups, further aiding reproducibility. Given the complexity of the system, the release of code will be essential, but the current level of detail suggests that the work is designed to be reproducible.
While highly effective, MemDreamer has a few limitations. The initial perception module still requires processing the full video to construct the hierarchical graph memory, which can be computationally intensive for extremely long videos, even with incremental processing. The quality of the graph memory heavily relies on the capabilities of the underlying VLM used for feature extraction and summarization; any limitations in the VLM's perceptual understanding will propagate. For videos of unprecedented length or complexity, the graph itself might become very large, potentially impacting the efficiency of graph traversal and retrieval, although the hierarchical structure aims to mitigate this. The generalizability of the specific three-tier graph structure and predefined edge types might need adaptation for highly specialized video domains. Finally, while the correlation analysis is compelling, establishing "agentic capability scaling as a new paradigm" is a strong claim that will require further research and validation across diverse tasks and models.
MemDreamer makes a significant contribution to the field of multimodal AI, particularly in long video understanding, a critical and challenging area. By effectively decoupling perception and reasoning, it offers a scalable solution to the token explosion and attention dilution problems that plague current Vision-Language Models. This framework has broad implications for applications requiring deep understanding of extended visual narratives, such as autonomous driving (understanding long-term driving scenarios), surveillance (identifying complex event chains), educational content analysis, and personal video assistants. The plug-and-play nature allows for easy integration into existing VLM pipelines, potentially accelerating research and development in this domain. The empirical finding regarding the correlation between logic reasoning and long-video understanding also opens up new research avenues, suggesting that improving an LLM's reasoning capabilities could directly translate to better long-term multimodal comprehension. This work pushes the boundaries of what's possible with current VLMs and LLMs, paving the way for more intelligent and capable AI systems. MemDreamer introduces a novel framework that decouples perception and reasoning for long video understanding via a hierarchical graph memory and an agentic retrieval mechanism. This paper presents a robust and innovative solution to the critical challenge of processing hours-long videos, achieving state-of-the-art results across multiple benchmarks with significant accuracy gains while drastically reducing the reasoning context window, and provides compelling evidence for the importance of agentic reasoning in multimodal comprehension.
This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 43% on VSI-Bench.
Primary: Zhejiang University
All Institutions: Zhejiang University, University of Technology Sydney, OPPO Research Institute
The main contribution of this paper is the introduction of Skill-3D, a framework that enhances 3D spatial reasoning in agents through self-evolving scene-aware skills, significantly improving tool utilization and adaptability in diverse environments. This work is poised to influence future research directions in agentic AI and spatial reasoning, demonstrating a strong blend of innovation and practical application.
The proposed Skill-3D framework introduces a novel approach to 3D spatial reasoning by leveraging a Scene Memory that evolves through successful and failed tool-use trajectories. This self-evolving mechanism allows agents to adapt their tool-use strategies based on the specific characteristics of the scene, which is a significant advancement over traditional uniform strategies. The methodology is well-structured, with clear definitions of how skills are aggregated and refined over time, demonstrating a thoughtful integration of memory and learning.
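A minimal sketch of the record-then-distill loop described above, assuming scenes are already identified by a label and that "distilling" simply means picking the most frequent successful tool sequence. The `SceneMemory` class, the scene labels, and the tool names are hypothetical; the paper's actual distillation is presumably model-driven rather than a frequency count.

```python
from collections import Counter, defaultdict


class SceneMemory:
    def __init__(self):
        self.successes = defaultdict(list)  # scene label -> successful tool sequences
        self.failures = defaultdict(list)   # scene label -> failed tool sequences

    def record(self, scene, tools, success):
        """Store one tool-use trajectory under its scene label."""
        bucket = self.successes if success else self.failures
        bucket[scene].append(tuple(tools))

    def distill_skill(self, scene):
        """Aggregate successes into a reusable skill; attach failures as lessons."""
        wins = self.successes.get(scene, [])
        if not wins:
            return None  # nothing to distill yet for this scene
        best, _ = Counter(wins).most_common(1)[0]
        lessons = sorted(set(self.failures.get(scene, [])))
        return {
            "scene": scene,
            "tool_sequence": list(best),
            "lessons": [list(f) for f in lessons],
        }


memory = SceneMemory()
memory.record("indoor_room", ["depth_estimate", "measure_distance"], success=True)
memory.record("indoor_room", ["depth_estimate", "measure_distance"], success=True)
memory.record("indoor_room", ["caption_only"], success=False)
print(memory.distill_skill("indoor_room"))
```

Re-running `distill_skill` after each new batch of trajectories is what makes the memory and the skill library co-evolve in this toy version.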
The experiments conducted are robust, showcasing substantial improvements in tool utilization metrics across various benchmarks (VSI-Bench and MMSI-Bench). The reported results indicate a strong empirical validation of the proposed framework, with clear comparisons to existing methods. However, the paper could benefit from additional ablation studies to further dissect the contributions of different components of the Skill-3D framework.
The paper provides a project page with a URL, which is a positive aspect for reproducibility. However, details on the implementation specifics, datasets used, and hyperparameter settings are not thoroughly discussed, which may hinder complete reproducibility. More comprehensive documentation would enhance this aspect.
One limitation noted is the reliance on the quality and diversity of the scenes used for training, which may affect the generalizability of the learned skills. Additionally, the paper does not address potential computational overhead introduced by maintaining a Scene Memory, which could impact scalability in real-world applications.
The implications of this research are significant, as it addresses a critical gap in agentic 3D spatial reasoning, potentially leading to more effective and adaptable AI systems in various applications, including robotics, gaming, and virtual environments. The ability to dynamically adjust tool-use strategies based on scene context could revolutionize how agents interact with complex environments.
Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a *predictive visual code*, and instantiate it for video MLLMs as AdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.
Primary: JD.com
All Institutions: JD.com
The paper presents AdaCodec, a predictive visual coding method that significantly improves the efficiency of video MLLMs by reducing redundancy in frame encoding while maintaining high accuracy. This contribution is substantial, as it addresses a critical limitation in current video processing methodologies and opens new avenues for research and application in the field.
The paper introduces AdaCodec, a novel predictive visual coding interface for video MLLMs that efficiently encodes video frames by transmitting full reference frames only when necessary and compact descriptions of inter-frame changes otherwise. This approach is grounded in predictive coding principles and leverages adaptive Groups of Pictures (GOPs) to optimize token usage, which is a significant departure from traditional per-frame RGB encoding. The methodology is well-structured, with clear design choices aimed at enhancing the efficiency of video processing in MLLMs.
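The adaptive reference-frame decision can be illustrated with a toy planner. This sketch assumes frames are flat lists of pixel intensities and uses mean absolute difference from the previous frame as a stand-in for the paper's conditional predictive cost, which is not specified here; the function names and the threshold are hypothetical.

```python
def mean_abs_diff(a, b):
    """Stand-in predictive cost: average absolute pixel change between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def plan_frame_types(frames, threshold):
    """Mark each frame 'I' (full reference frame) or 'P' (change-only tokens)."""
    plan = []
    prev = None
    for frame in frames:
        # The first frame, or any frame poorly predicted by its predecessor,
        # is sent as a full reference frame; everything else is coded as changes.
        if prev is None or mean_abs_diff(frame, prev) > threshold:
            plan.append("I")
        else:
            plan.append("P")
        prev = frame
    return plan


frames = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],  # small change
    [9, 9, 9, 9],  # scene cut
    [9, 9, 8, 9],  # small change
]
print(plan_frame_types(frames, threshold=1.0))  # ['I', 'P', 'I', 'P']
```

Because the group-of-pictures length is decided by content rather than fixed in advance, static stretches consume few tokens while scene cuts still get a full frame.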
The experiments are comprehensive, evaluating AdaCodec across eleven benchmarks, demonstrating consistent improvements over the baseline model (Qwen3-VL-8B) in terms of accuracy and latency. The results are statistically significant, showing that AdaCodec not only maintains performance with fewer tokens but also outperforms the baseline in various scenarios, which validates the effectiveness of the proposed method.
The paper mentions the intention to release source code and model checkpoints, which is critical for reproducibility. It also provides detailed implementation specifics, such as hyperparameters and training protocols, which aid in replicating the results.
The study does not explore dynamic resolution input or evaluate AdaCodec on streaming video, which may limit its applicability in real-time scenarios. Additionally, the uniform token budget for P-frames could be optimized further based on motion complexity, which is noted as a potential area for future work.
AdaCodec has the potential to significantly enhance the efficiency of video processing in various applications, including video understanding, temporal reasoning, and real-time video analysis. Its innovative approach could influence future research in video MLLMs and related fields, leading to more efficient models that can handle longer videos with reduced latency.
LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With `deepseek-v4-flash`, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
Primary: The Hong Kong University of Science and Technology (Guangzhou)
All Institutions: The Hong Kong University of Science and Technology (Guangzhou), DataArcTech Ltd, IDEA Research
The main contribution of this paper is the introduction of the Bayesian-Agent framework, which enhances the adaptability and effectiveness of LLM agents by treating skills as hypotheses and employing a Bayesian approach for skill evolution. This work is poised to influence future research and applications in the field of natural language processing and agent-based systems.
The paper introduces the Bayesian-Agent framework, which innovatively treats reusable skills and SOPs as hypotheses, utilizing a Bayesian approach to maintain a posterior over each skill. This methodology is significant as it provides a structured way to evaluate and evolve skills without modifying the underlying model weights, which is a novel contribution in the context of LLM agents. The approach of mapping posterior states into actionable insights (like patching or retiring skills) is particularly noteworthy, as it adds a layer of interpretability and adaptability to the agent's operations.
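To make the "posterior over each skill, mapped to an action" idea concrete, here is a minimal sketch under stated assumptions. It uses a plain Dirichlet-categorical model and drops the feature conditioning described in the abstract. The outcome categories, the thresholds, the `keep` action, and the names `SkillPosterior` and `choose_action` are all hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass, field

# Hypothetical outcome categories for one verified trajectory.
OUTCOMES = ("success", "recoverable_failure", "hard_failure")


@dataclass
class SkillPosterior:
    # Dirichlet pseudo-counts, one per outcome; 1.0 each is a uniform prior.
    alpha: dict = field(default_factory=lambda: {o: 1.0 for o in OUTCOMES})

    def update(self, outcome: str) -> None:
        """Add one verified trajectory outcome to the posterior."""
        self.alpha[outcome] += 1.0

    def mean(self) -> dict:
        """Posterior mean probability of each outcome."""
        total = sum(self.alpha.values())
        return {o: a / total for o, a in self.alpha.items()}

    def evidence(self) -> float:
        """Number of observed trajectories (pseudo-counts minus the prior)."""
        return sum(self.alpha.values()) - len(OUTCOMES)


def choose_action(post, min_evidence=5, retire_below=0.2, patch_below=0.7):
    """Map posterior state to an action; all thresholds are illustrative."""
    if post.evidence() < min_evidence:
        return "explore"   # too little evidence to judge the skill
    p_success = post.mean()["success"]
    if p_success < retire_below:
        return "retire"    # skill is very likely harmful or useless
    if p_success < patch_below:
        return "patch"     # skill works sometimes; repair it
    return "keep"          # assumed default, not an action named in the paper
```

With eight successes and two hard failures, the posterior mean success probability is 9/13 (about 0.69), so this rule returns `patch`; with no evidence at all it returns `explore`.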
The experiments conducted demonstrate substantial improvements across various benchmarks, with clear metrics indicating the effectiveness of the Bayesian-Agent framework. The reported increases in performance (e.g., SOP-Bench from 80% to 95%) provide strong empirical evidence of the framework's capabilities. The diverse evaluation settings (positive, negative, saturated, and case studies) enhance the robustness of the findings, suggesting that the results are not merely artifacts of specific conditions.
The paper mentions that the source code is available on GitHub, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed implementation guidelines and descriptions of the experimental setup to ensure that other researchers can replicate the results effectively.
While the paper presents a compelling framework, it does not extensively discuss potential limitations or challenges in applying Bayesian-Agent in real-world scenarios. The reliance on posterior summaries may introduce complexities in environments with high variability or noise, which could affect the robustness of the skill evolution process.
The implications of this research are significant, as it provides a new perspective on how LLM agents can be optimized and adapted over time without retraining. This could lead to more efficient use of computational resources and improved performance in dynamic environments. The framework could be applied across various domains where LLMs are utilized, potentially influencing the design of future agent architectures.
Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
Primary: Gaoling School of Artificial Intelligence, Renmin University of China
All Institutions: Gaoling School of Artificial Intelligence, Renmin University of China, Lenovo Group Limited, Wuhan University
The paper presents a novel method, EmbedFilter, which enhances text embeddings from large language models by filtering out the influence of high-frequency tokens, leading to improved performance on downstream tasks. This work is significant as it provides a mechanistic understanding of LLM embeddings and introduces a practical solution that can be widely adopted in the field.
The paper introduces EmbedFilter, a linear transformation that refines text embeddings from LLMs by filtering out high-frequency token influences. This approach is grounded in a mechanistic interpretation of the unembedding matrix, revealing a previously overlooked latent space that contributes to suboptimal embedding performance. The methodology is well-structured, with clear theoretical underpinnings and practical implications for improving embedding quality.
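One way to picture "filtering out a subspace with a linear transformation" is an orthogonal projection that also reduces dimension. The sketch below is an assumption about the general technique, not the authors' construction: it takes the rows of a hypothetical unembedding matrix that belong to frequent tokens, finds their dominant directions with an SVD, and keeps only the remaining directions. The names `build_filter` and `apply_filter`, and the choice of SVD to define the subspace, are illustrative.

```python
import numpy as np


def build_filter(unembed, frequent_ids, k):
    """Return a (d, d-k) matrix whose columns span the complement of the
    top-k directions of the frequent-token rows of `unembed` (shape V x d)."""
    rows = unembed[frequent_ids]  # (n_freq, d)
    # The right singular vectors form an orthonormal basis of R^d, ordered by
    # how strongly the frequent-token rows load on each direction.
    _, _, vt = np.linalg.svd(rows, full_matrices=True)
    return vt[k:].T  # drop the top-k directions, keep the rest


def apply_filter(embeddings, keep_basis):
    """Project embeddings (n x d) into the reduced (n x (d-k)) space."""
    return embeddings @ keep_basis
```

Because the kept basis is orthonormal, the filtered embeddings live in `d - k` dimensions, which is the kind of dimensionality reduction the abstract describes as a byproduct. How the subspace and `k` are actually chosen is left to the paper.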
The experiments are comprehensive, covering multiple LLM backbones and a variety of downstream tasks. The results demonstrate significant performance improvements, with detailed comparisons against existing methods, showcasing the robustness and effectiveness of EmbedFilter. The use of the MTEB benchmark adds credibility to the findings.
The authors provide a link to their code repository, which is essential for reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setups and hyperparameter choices to facilitate easier replication by other researchers.
While the paper presents a novel approach, it does not extensively explore the limitations of EmbedFilter or potential scenarios where it may underperform. Additionally, the reliance on specific LLM architectures may limit the generalizability of the findings.
The findings have significant implications for the deployment of LLMs in practical applications, particularly in scenarios requiring efficient text embeddings. By improving the semantic richness of embeddings while reducing dimensionality, the work paves the way for more effective use of LLMs in various NLP tasks.
Multi-agent reasoning systems adopt a "generate-then-transfer" paradigm that forces end-to-end latency to scale linearly with pipeline depth. We introduce StreamMA, a multi-agent reasoning system that streams each reasoning step to downstream agents as soon as it is generated, pipelining adjacent agents and thus reducing latency. Surprisingly, this pipelining also improves effectiveness: because multi-step reasoning quality is non-uniform and early steps are more reliable than later ones, working with these reliable early steps instead of the full chain prevents error-prone late steps from misleading downstream agents. We formalize both advantages with the first closed-form joint analysis of stream, serial, and single protocols, deriving the effectiveness ordering, speedup upper bound, and cost ratio. Across eight reasoning benchmarks spanning mathematics, science, and code, two frontier LLMs (Claude Opus 4.6 and GPT-5.4), and three topologies (Chain, Tree, Graph), StreamMA outperforms both baselines (avg. +7.3 pp, max +22.4 pp on HMMT 2026; Claude Opus 4.6-high). Beyond these contributions, we discover a "step-level scaling law": increasing per-agent steps consistently improves both effectiveness and efficiency, a new scaling dimension orthogonal to and composable with agent-count scaling.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Google DeepMind, Stanford University, Carnegie Mellon University
This paper introduces StreamMA, a novel multi-agent reasoning system that employs streaming communication to reduce latency and surprisingly improve effectiveness by leveraging reliable early reasoning steps. The work presents a rigorous formal analysis, extensive empirical validation across diverse benchmarks and frontier LLMs, and discovers a new "step-level scaling law," making it a highly significant contribution to multi-agent AI and LLM research.
The paper introduces StreamMA, a novel multi-agent reasoning system that shifts from the traditional "generate-then-transfer" paradigm to a "streaming communication" approach. This involves pipelining reasoning steps, where downstream agents receive and process partial information as soon as it's generated by upstream agents. The core innovation lies in demonstrating a dual benefit: reduced end-to-end latency and, surprisingly, improved effectiveness. The effectiveness gain is attributed to leveraging more reliable early reasoning steps, preventing error propagation from potentially flawed later steps. The methodology is rigorously supported by the first closed-form joint analysis of stream, serial, and single protocols, providing theoretical derivations for effectiveness ordering, speedup upper bounds, and cost ratios. Agents are designed to generate reasoning steps and an "end-of-step" token, allowing for flexible granularity. The approach is versatile, demonstrated across Chain, Tree, and Graph topologies. This is a well-conceived and theoretically grounded methodology.
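The latency argument can be illustrated with a toy pipeline model. This is a sketch under simplifying assumptions and is not the paper's closed-form analysis: every agent in a chain emits the same number of steps, every step takes the same time, and a downstream agent can begin its j-th step as soon as its upstream neighbor has finished its j-th step. The function names are hypothetical.

```python
def serial_latency(n_agents, steps, step_time=1.0):
    """Generate-then-transfer: each agent waits for the full upstream chain."""
    return n_agents * steps * step_time


def stream_latency(n_agents, steps, step_time=1.0):
    """Idealized pipelining: the last agent starts after n_agents - 1 steps
    and then needs `steps` more steps to finish."""
    return (n_agents + steps - 1) * step_time


def speedup(n_agents, steps):
    """Ratio of serial to streamed latency under the toy model."""
    return serial_latency(n_agents, steps) / stream_latency(n_agents, steps)
```

In this toy model a chain of 4 agents with 8 steps each gives a speedup of 32/11 (about 2.9), and the speedup stays below the number of agents however many steps are used. Real systems add transfer overhead and uneven step times, so these numbers are only an intuition aid.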
The experimental evaluation is comprehensive and robust. The authors test StreamMA across eight diverse reasoning benchmarks spanning mathematics (HMMT, GSM8K, MATH), science (ARC, BigBench Hard), and code generation (HumanEval, MBPP, APPS). This breadth demonstrates the generalizability of the approach. Two frontier LLMs, Claude Opus 4.6 and GPT-5.4, are used, providing strong baselines and highlighting the practical relevance to state-of-the-art systems. StreamMA consistently outperforms both "Serial" (generate-then-transfer) and "Single" (single-agent) baselines, achieving significant average effectiveness gains of +7.3 percentage points and a maximum of +22.4 pp on HMMT 2026. The paper also validates latency reduction and explores the "step-level scaling law," a novel empirical finding that increasing per-agent steps improves both effectiveness and efficiency. The experiments across different topologies (Chain, Tree, Graph) further solidify the findings. While the use of proprietary LLMs limits direct reproducibility for all researchers, the results are compelling and well-supported.
The paper provides a detailed description of the StreamMA methodology, including agent prompting strategies, communication protocols, and the formal analysis. This level of detail is commendable. However, the reliance on proprietary frontier LLMs (Claude Opus 4.6, GPT-5.4) means that exact replication of the results requires access to these specific models, which might not be universally available. The authors state that "Our code is available at [URL redacted for anonymity]," indicating that code exists but is not publicly linked in the provided version. Publicly available code would significantly enhance reproducibility. Given the detailed methodology and the promise of code, the work is reproducible in principle, but the LLM dependency and current lack of a public code link are practical limitations.
The authors acknowledge several limitations. Streaming communication can increase the total token count if agents re-process information, potentially leading to higher API costs, though this is often offset by improved effectiveness. Designing and managing complex graph-based multi-agent systems remains challenging. The approach relies on LLMs being capable of effectively processing and acting on partial, streaming information. The current focus is primarily on reasoning tasks, and its generalizability to other LLM applications like creative generation is not explored. For very simple tasks, the overhead of streaming might outweigh the benefits. Additionally, the reliance on proprietary frontier LLMs limits immediate open-source replication, and while the "step-level scaling law" is a fascinating discovery, its theoretical underpinnings and boundary conditions are not fully explored.
This paper offers a significant contribution to the field of multi-agent LLM systems. It introduces a new paradigm for communication that addresses a critical bottleneck (latency) while simultaneously improving reasoning effectiveness. This has profound implications for designing more efficient and responsive multi-agent systems, making them more viable for real-time and interactive applications. The discovery of the "step-level scaling law" opens up a novel research dimension for optimizing LLM performance and multi-agent system design, orthogonal to existing scaling laws. The insight that leveraging early, more reliable reasoning steps can prevent error propagation is a valuable lesson for structuring complex LLM-based reasoning tasks. This work is likely to influence future research and development in multi-agent AI and LLM deployment strategies.
Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at https://github.com/eliacunegatti/SubFit.
This paper introduces SubFit, a novel LLM compression method that operates at a finer, submodule-level granularity and allows non-contiguous component selection, leading to superior performance. The work challenges conventional assumptions in replacement-based LLM compression by demonstrating that redundancy is not limited to full, contiguous layers. By proposing a more flexible, submodule-level approach with tailored residual bypasses, SubFit achieves a significantly better perplexity-accuracy trade-off, especially under aggressive compression, and offers practical benefits like inference speedup and KV-cache savings. This conceptual shift, coupled with strong empirical results across multiple LLMs, makes it a highly relevant and impactful contribution to the critical area of LLM efficiency.
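A "fitted residual bypass" can be sketched as a small map fitted on calibration activations to stand in for a removed submodule. The version below assumes the simplest possible form, a single linear map fitted by least squares, which may differ from what SubFit actually uses for Attention and FeedForward submodules. The names `fit_bypass` and `bypass_forward` are hypothetical.

```python
import numpy as np


def fit_bypass(x_calib, f_out_calib):
    """Least-squares fit of W such that x_calib @ W approximates f_out_calib.

    x_calib:     (n_tokens, d) inputs to the submodule on calibration data
    f_out_calib: (n_tokens, d) outputs of the submodule (before the residual add)
    """
    w, *_ = np.linalg.lstsq(x_calib, f_out_calib, rcond=None)
    return w


def bypass_forward(x, w):
    """Residual-stream update with the submodule replaced by the fitted map."""
    return x + x @ w
```

Only calibration inputs and outputs are needed to fit the map, which matches the post-training, calibration-data-only setting described in the abstract. Which submodules to replace is a separate selection problem not covered here.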
Audio-language models (ALMs) often follow text that conflicts with audio, even when the audio evidence is clear. This raises a basic question: is the audio-supported answer unavailable, or is it represented but overridden by the conflicting text? We examine this question using a same-audio counterfactual that keeps the audio fixed, removes only the conflicting text, and measures the resulting shift in model preference. Across five ALMs and four conflict tasks, 64.1% of conflict samples show a sign flip: the same-audio branch prefers the audio-supported answer, whereas the joint branch prefers the text-supported answer. This pattern suggests that the relevant audio evidence is encoded but loses in arbitration. Activation patching further localizes the reversal to answer-position computation, and patching effects closely track output candidate-score differences (Spearman rho=0.93). Using this diagnostic, we propose Gated Audio Counterfactual Logit Correction (GACL), a training-free decoding rule that interpolates between joint and same-audio scores. Under a strict 5 pp faithfulness-drop budget, GACL improves nAUC by 17.8 points over the best contrastive baseline and transfers without retuning to vision-text arbitration (up to +40.5 pp).
Primary: Northeastern University
All Institutions: Northeastern University, Shanghai Artificial Intelligence Laboratory
This paper presents a highly novel diagnostic methodology and a mechanistically-informed, training-free decoding rule to address a critical arbitration failure in audio-language models, demonstrating significant performance gains and cross-modal generalizability. The rigorous causal analysis, coupled with a practical and effective solution, makes this a standout contribution to multimodal machine learning.
The methodology is exceptionally strong, building a coherent and rigorous chain from behavioral observation to mechanistic understanding and finally to an effective intervention. The core innovation is the "same-audio counterfactual" diagnostic, which uses two branches (joint audio-text vs. audio-only) to precisely distinguish between perceptual failure and arbitration failure in Audio-Language Models (ALMs). This elegant setup, coupled with signed log-probability margins, provides a clear quantitative signature of "repairable arbitration reversals." The paper then employs activation patching, a robust causal intervention technique, to localize the arbitration failure to the answer-position residual stream within the model's "commit window." This mechanistic finding is crucial, demonstrating that audio evidence is indeed encoded but overridden during the final decision-making process. A key methodological bridge is the discovery of a high Spearman correlation (0.93) between this internal patch-induced repair direction and the observable output score difference ($s_A - s_J$). This alignment is critical because it enables the development of an output-space intervention without requiring internal model access. The proposed Gated Audio Counterfactual Logit Correction (GACL) decoding rule is directly derived from these insights, incorporating a branch-disagreement gate, a reference-reliability gate, and convex bounded interpolation. Each component is mechanistically justified and contributes to the method's robustness and safety. The methodology is a prime example of interpretable ML research, moving beyond symptom identification to root cause analysis and targeted solution design.
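The three ingredients named above (a branch-disagreement gate, a reference-reliability gate, and convex bounded interpolation) can be sketched as follows. This is an illustrative reading, not the paper's exact rule: the hard on/off gates, the margin-based reliability test, and the parameters `lam_max` and `margin_min` are assumptions made for the example.

```python
def gacl_scores(joint, audio, lam_max=0.5, margin_min=1.0):
    """Blend joint and same-audio candidate scores.

    joint, audio: dicts mapping each answer candidate to a log-prob score
    from the joint (audio + text) branch and the same-audio branch.
    """
    top_joint = max(joint, key=joint.get)
    top_audio = max(audio, key=audio.get)
    # Branch-disagreement gate: intervene only when the branches disagree.
    disagree = top_joint != top_audio
    # Reference-reliability gate: require a clear margin in the audio branch.
    ranked = sorted(audio.values(), reverse=True)
    reliable = len(ranked) > 1 and (ranked[0] - ranked[1]) >= margin_min
    # Convex, bounded interpolation weight in [0, lam_max].
    lam = lam_max if (disagree and reliable) else 0.0
    return {c: (1.0 - lam) * joint[c] + lam * audio[c] for c in joint}
```

When the branches agree, or when the audio branch is not confident, the joint scores pass through unchanged, which is the kind of behavior that keeps accuracy on faithful inputs from dropping.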
The experimental evaluation is comprehensive and rigorously designed. The authors evaluate GACL across five diverse open-weight ALMs (7B-30B parameters) and four distinct audio-text conflict tasks (AQA, VSC, SER, ALME) from established benchmarks (MCR-Bench, ALME). This broad coverage demonstrates the widespread nature of the "text-following" problem and the general applicability of GACL. The use of normalized AUC (nAUC) over a strict faithfulness-drop budget (e.g., 5 pp) is an excellent evaluation metric, realistically capturing the trade-off between conflict resolution and preserving accuracy on faithful inputs. GACL consistently outperforms strong contrastive decoding baselines (AAD, ACD) and the joint model, achieving an impressive average improvement of 17.8 nAUC points under the strict 5 pp budget. Detailed ablation studies meticulously validate the contribution of each component of GACL, showing how gates and bounds ensure stability and prevent undesirable side effects (e.g., surface form rewriting, parse failures). The comparison to a LoRA fine-tuning baseline, where GACL retains 76% of the gain without any parameter updates, highlights its efficiency and practical value. Furthermore, the successful, untuned transfer of GACL to vision-text arbitration on MC$^2$ (achieving up to +40.5 pp adversarial accuracy) is a powerful demonstration of the generalizability of the underlying diagnostic principles across different modalities, significantly amplifying the potential impact of this work.
The paper demonstrates a high commitment to reproducibility. The appendix provides extensive details, including specific public model checkpoints (with Hugging Face snapshot hashes), precise descriptions of benchmark splits, detailed prompt templates for each task, and the exact candidate scoring and normalization procedures. The hyperparameter tuning process, including the use of a development set and freezing parameters for testing, is clearly outlined. Furthermore, the paper provides comprehensive details for the LoRA fine-tuning baseline, including architecture, training data, optimization parameters, and hardware. Inference cost metrics (time, GPU memory, FLOPs) are also reported. This level of detail should enable researchers to reproduce the core findings and build upon this work.
The authors acknowledge several pertinent limitations. The study focuses on controlled, explicit audio-text conflicts, which, while crucial for isolating mechanisms, may not fully capture the complexity of naturally occurring conflicts involving noisier transcripts, partial notes, or broader conversational context. GACL is designed to repair arbitration failures where audio evidence is available but overridden, meaning it cannot compensate for fundamental perceptual failures where the model simply did not encode the relevant acoustic information. This distinction is important for guiding future research towards either decoding-time repair or improved acoustic modeling. A practical limitation is the increased inference latency due to the additional forward pass required for the audio-reference branch, although the authors suggest potential optimizations. Finally, while cross-modal transfer is demonstrated, the generalizability to all possible conflict sources and modality pairs remains an area for future exploration.
This work has significant broader impact for the development of robust and trustworthy multimodal AI systems. By providing a rigorous diagnostic framework and an effective, generalizable intervention, it directly addresses a critical safety and reliability concern in ALMs: their tendency to prioritize conflicting text over clear audio evidence. This is particularly important for agentic applications in sensitive domains like healthcare, emergency services, or legal assistance, where accurate interpretation of audio is paramount. The mechanistic understanding gained through causal localization offers a powerful new lens for analyzing internal decision-making in complex multimodal models, moving beyond black-box observations and fostering more interpretable AI. The demonstrated cross-modal transfer suggests that the principles of diagnosing and correcting arbitration failures using counterfactual references and logit correction could be broadly applicable across various multimodal AI systems (e.g., vision-language, video-language), paving the way for more faithful and reliable AI across the board. This research not only provides a practical solution but also advances our fundamental understanding of multimodal reasoning and conflict resolution in large models.
Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times each demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
Primary: Fudan University
All Institutions: Fudan University, Renmin University of China, University of North Carolina at Chapel Hill
TempoVLA makes a significant contribution to robot manipulation by addressing the critical but often overlooked dimension of execution speed. This work enables more flexible and robust deployment of VLAs in real-world scenarios where tasks inherently require varying speeds (e.g., fast transit, slow precision). The ability to dynamically control speed based on task phases, especially with a VLM scheduler, opens up new avenues for intelligent and adaptive robot behavior, moving beyond fixed-speed, brittle policies. The finding that training with variable speeds can act as a data augmentation, improving default 1x performance, is a valuable insight for VLA training in general. This could lead to more efficient data utilization and better generalization. The framework's lightweight nature and applicability to existing VLAs promote its adoption. It also highlights the importance of considering the entire control stack (policy + low-level controller) when aiming for high-performance robot systems. TempoVLA introduces a novel data augmentation and conditioning framework to equip Vision-Language-Action models with explicit, bidirectional speed control, demonstrating improved performance and dynamic phase-aware execution in both simulation and real-world robotics tasks. This paper presents a well-executed solution to a practical and important problem in robot manipulation, offering a lightweight and generalizable method that enhances VLA capabilities by enabling flexible execution speeds. The comprehensive experimental validation, including insightful ablations and stress tests, provides strong evidence for the method's effectiveness and clarifies its operational boundaries, making it a valuable contribution to the field.
The methodology introduces TempoVLA, a framework for speed-controllable Vision-Language-Action (VLA) policies, comprising two main components: Variable-Speed Trajectory Augmentation (VSTA) and a model-side speed conditioning mechanism. VSTA is a clever data-side approach that re-times demonstrations to arbitrary target speeds. It involves motion-consistent segmentation, chunk-level speed transformation (merging/splitting actions), and online chunk-start sampling. The core idea of accumulating and re-splitting actions relies on the assumption of linear composability of actions (e.g., Cartesian translation, joint velocities, axis-angle rotations), which is explicitly discussed and justified. The online sampling strategy is well-designed to ensure all original frames contribute to training despite re-timing. The model-side conditioning mechanisms are lightweight and practical: textual prefix, RMSNorm modulation, and soft prompts. The textual prefix is particularly appealing for its simplicity and lack of architectural changes. The integration with a VLM for dynamic speed scheduling is a natural and impactful extension, demonstrating how TempoVLA can be used in a higher-level reasoning loop. The overall approach is well-motivated, addresses a clear problem, and is designed to be broadly applicable to existing VLA architectures. The discussion on the difference between EEF and Joint Action Space for VSTA is insightful, justifying the preference for EEF actions due to kinematic non-linearities and controller realizability.
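The merge-or-split idea behind re-timing can be sketched for actions that compose linearly, such as Cartesian translation deltas. The sketch below is an assumption-laden simplification and not the paper's Algorithm 1: it skips motion-consistent segmentation and online chunk-start sampling, ignores non-composable channels such as a gripper command, and re-times a whole sequence by resampling its cumulative displacement. The name `retime` is hypothetical.

```python
def retime(actions, speed):
    """Re-time per-step delta actions to a target speed.

    actions: list of delta vectors (lists of floats), assumed linearly composable
    speed:   2.0 merges pairs of steps, 0.5 splits each step in two, and so on
    """
    n, dim = len(actions), len(actions[0])
    # Cumulative displacement at each original step boundary (starts at zero).
    cum = [[0.0] * dim]
    for a in actions:
        cum.append([c + x for c, x in zip(cum[-1], a)])

    def pos(t):
        """Displacement at fractional original step t (linear interpolation)."""
        t = min(max(t, 0.0), float(n))
        i = min(int(t), n - 1)
        frac = t - i
        return [p + frac * (q - p) for p, q in zip(cum[i], cum[i + 1])]

    m = max(1, round(n / speed))  # number of re-timed steps
    step = n / m                  # original steps covered by each new step
    out = []
    for k in range(m):
        start, end = pos(k * step), pos((k + 1) * step)
        out.append([e - s for s, e in zip(start, end)])
    return out
```

For example, the 1-D sequence `[1, 1, 2, 2]` becomes `[2, 4]` at 2x and eight half-size steps at 0.5x; in both cases the total displacement is unchanged, which is the sense in which motion is preserved here.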
The experimental evaluation is comprehensive and rigorous, covering both simulation and real-world settings.
1. **Simulation (LIBERO):** The use of LIBERO, a clean benchmark for manipulation, is appropriate. Experiments verify VSTA's feasibility, showing that it produces re-timed demonstrations with negligible motion error and reasonable replay success rates across various speeds. An ablation study on speed-integration schemes demonstrates that all three proposed methods (Text, Modulation, Soft Prompt) perform similarly, with Text being the most practical. A detailed analysis of the training speed range reveals key insights: VSTA training boosts default 1x performance (acting as useful data augmentation), and surprisingly, peak performance often shifts to slightly faster speeds (1.25x or 1.5x) due to the compression of "rhythm padding" in teleoperated data. This is a significant empirical finding.
2. **Real-world (Franka arm):** The real-world experiments on a 7-DoF Franka arm across five tasks confirm the simulation findings, showing an 8-point gain in 1x success rate and accurate tracking of commanded speeds. This demonstrates the practical applicability and robustness of TempoVLA.
3. **Dynamic Speed Control:** The integration with GPT-4o for dynamic speed scheduling is a compelling demonstration. It shows that TempoVLA can enable phase-aware speed adjustments, accelerating through low-risk phases and decelerating for high-risk ones, leading to higher success rates.
4. **Stress Test and Qualitative Analysis:** The stress test at extreme speeds (0.25x to 4x) is excellent for understanding the method's boundaries. It clearly identifies the low-level controller as the bottleneck for high-speed execution and highlights policy sensitivity at very low speeds. The qualitative failure mode analysis (hesitation at low speeds, overshoot/tracking error at high speeds) provides valuable insights into the practical operating envelope of TempoVLA.
The metrics used (success rate, rollout length, realized model ratio, controller tracking gap) are appropriate and provide a holistic view of performance.
The paper provides sufficient details for reproducibility. The methodology for VSTA is clearly described, including its three steps and the underlying assumptions. Algorithm 1 provides pseudocode for VSTA. Hyperparameters for both simulation and real-world experiments are provided in the Appendix. Details on the base VLA model ($\pi_{0.5}$) and training setup (GPUs, iterations, batch size) are given. The prompt used for GPT-4o in dynamic speed control is also included. The action spaces for both simulation and real-world are specified. Overall, the level of detail is good for replication.
The paper openly discusses several limitations:
1. **Controller Bottleneck:** At the high end of the speed range, the realized speedup saturates because the policy's per-step targets exceed the low-level controller's tracking bandwidth. This means TempoVLA's full potential for acceleration is limited by the underlying robot control stack.
2. **Non-Composable Action Spaces:** VSTA's current implementation assumes linear composability of actions, which excludes representations like unit quaternions or rotation matrices. While the paper suggests solutions (tangent-space mapping or SLERP), these are not implemented.
3. **VLM Scheduling Latency:** The synchronous invocation of the GPT-4o scheduler adds wall-clock overhead. Asynchronous scheduling is proposed as future work.
4. **Speed Regularization:** The current approach assumes uniform per-action granularity for the 1x speed, which might not hold for diverse teleoperation datasets. A VSTA-style normalization to calibrate the 1x reference is suggested.
5. **Policy Sensitivity at Low Speeds:** The stress test shows that at very low speeds (e.g., 0.25x), the policy can exhibit "hesitation" or "stalled progress" due to extremely small per-step magnitudes, making it sensitive to ambiguous observations.
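To make the tangent-space suggestion for non-composable rotation representations concrete, the sketch below splits a unit-quaternion rotation into `k` equal sub-rotations by mapping it to a rotation vector (axis-angle), scaling, and mapping back. The paper does not implement this; the helper names (`quat_to_rotvec`, `rotvec_to_quat`, `split_rotation`, `quat_mul`) and the `(w, x, y, z)` component order are assumptions made here for illustration.

```python
import numpy as np


def quat_to_rotvec(q) -> np.ndarray:
    """Log map: unit quaternion (w, x, y, z) to a rotation vector (axis * angle)."""
    q = np.asarray(q, dtype=float)
    q = q / np.linalg.norm(q)
    if q[0] < 0:                      # q and -q encode the same rotation; take the short path
        q = -q
    v = q[1:]
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:                # identity rotation
        return np.zeros(3)
    angle = 2.0 * np.arctan2(norm_v, q[0])
    return angle * v / norm_v


def rotvec_to_quat(r) -> np.ndarray:
    """Exp map: rotation vector back to a unit quaternion (w, x, y, z)."""
    r = np.asarray(r, dtype=float)
    angle = np.linalg.norm(r)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    axis = r / angle
    return np.concatenate([[np.cos(angle / 2.0)], np.sin(angle / 2.0) * axis])


def split_rotation(q, k: int) -> np.ndarray:
    """Return the sub-rotation that, applied k times, reproduces q."""
    return rotvec_to_quat(quat_to_rotvec(q) / k)


def quat_mul(a, b) -> np.ndarray:
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])


if __name__ == "__main__":
    # 90-degree rotation about z, slowed down by splitting it into 3 steps.
    q = np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)])
    step = split_rotation(q, 3)
    recomposed = quat_mul(quat_mul(step, step), step)
    print(np.allclose(recomposed, q))
```

Scaling in the tangent space is exact for a single rotation about a fixed axis, as here. Merging several consecutive rotations about different axes would require composing them first, since rotation vectors do not add in general.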
TempoVLA makes a significant contribution to robot manipulation by addressing the critical but often overlooked dimension of execution speed. This work enables more flexible and robust deployment of VLAs in real-world scenarios where tasks inherently require varying speeds (e.g., fast transit, slow precision). The ability to dynamically control speed based on task phases, especially with a VLM scheduler, opens up new avenues for intelligent and adaptive robot behavior, moving beyond fixed-speed, brittle policies. The finding that training with variable speeds can act as a data augmentation, improving default 1x performance, is a valuable insight for VLA training in general. This could lead to more efficient data utilization and better generalization. The framework's lightweight nature and applicability to existing VLAs promote its adoption. It also highlights the importance of considering the entire control stack (policy + low-level controller) when aiming for high-performance robot systems. TempoVLA introduces a novel data augmentation and conditioning framework to equip Vision-Language-Action models with explicit, bidirectional speed control, demonstrating improved performance and dynamic phase-aware execution in both simulation and real-world robotics tasks. This paper presents a well-executed solution to a practical and important problem in robot manipulation, offering a lightweight and generalizable method that enhances VLA capabilities by enabling flexible execution speeds. The comprehensive experimental validation, including insightful ablations and stress tests, provides strong evidence for the method's effectiveness and clarifies its operational boundaries, making it a valuable contribution to the field.