Week of April 05 – April 12, 2026
Numerical simulation of wave propagation and run-up is a cornerstone of coastal engineering and tsunami hazard assessment. However, applying these forward models to inverse problems, such as bathymetry estimation, source inversion, and structural optimization, remains notoriously difficult due to the rigidity and high computational cost of deriving discrete adjoints. In this paper, we introduce AegirJAX, a fully differentiable hydrodynamic solver based on the depth-integrated, non-hydrostatic shallow-water equations. By implementing the solver entirely within a reverse-mode automatic differentiation framework, AegirJAX treats the time-marching physics loop as a continuous computational graph. We demonstrate the framework's versatility across a suite of scientific machine learning tasks: (1) discovering regime-specific neural corrections for model misspecifications in highly dispersive wave propagation; (2) performing continuous topology optimization for breakwater design; (3) training recurrent neural networks in-the-loop for active wave cancellation; and (4) inverting hidden bathymetry and submarine landslide kinematics directly from downstream sensor data. The proposed differentiable paradigm fundamentally blurs the line between forward simulation and inverse optimization, offering a unified, end-to-end framework for coastal hydrodynamics.
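The core idea, gradients of a sensor loss flowing backward through the time-marching loop, can be illustrated without any of the paper's machinery. Everything below is our own toy invention: a one-cell oscillator stands in for the shallow-water physics, and a central finite difference stands in for the reverse-mode autodiff that AegirJAX actually uses.

```python
# Toy sketch only: differentiate a sensor loss through a time-marching
# loop. AegirJAX does this with reverse-mode AD over the full PDE
# solver; here a single-cell oscillator stands in for the physics and
# a central finite difference stands in for autodiff.

def simulate(depth, n_steps=50, dt=0.01):
    """Toy 'wave gauge': elevation oscillating at the long-wave
    speed c^2 = g * depth, marched forward in time."""
    g = 9.81
    eta, vel = 1.0, 0.0          # initial elevation and velocity
    c2 = g * depth
    for _ in range(n_steps):     # the time-marching physics loop
        vel -= dt * c2 * eta     # momentum update
        eta += dt * vel          # continuity update
    return eta                   # final 'sensor' reading

def loss(depth, target=0.0):
    """Mismatch between simulated and observed sensor data."""
    return (simulate(depth) - target) ** 2

def fd_grad(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

grad = fd_grad(loss, 10.0)       # sensitivity of loss to "bathymetry"
print(f"dLoss/dDepth at depth = 10 m: {grad:.4f}")
```

With this gradient in hand, the hidden depth could be recovered by ordinary gradient descent, which is the bathymetry-inversion pattern the abstract describes at full scale.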
Primary: Memorial University of Newfoundland
All Institutions: Memorial University of Newfoundland
The main contribution of this paper is the introduction of AegirJAX, a fully differentiable hydrodynamic solver that integrates machine learning techniques with classical numerical methods to address complex inverse problems in coastal hydrodynamics. This work represents a significant advancement in the field, providing a versatile framework that enhances the capabilities of traditional simulation methods and opens new avenues for research and application in coastal engineering.
The paper introduces AegirJAX, a fully differentiable hydrodynamic solver that leverages reverse-mode automatic differentiation to treat the time-marching physics loop as a continuous computational graph. This approach allows for seamless integration of machine learning techniques with classical numerical methods, enabling the solver to address complex inverse problems in coastal hydrodynamics. The methodology is innovative, particularly in its ability to bypass traditional adjoint methods and provide exact gradients for optimization tasks, which is a significant advancement in the field of scientific machine learning.
The authors demonstrate the capabilities of AegirJAX across multiple scientific machine learning tasks, including model discovery, topology optimization, active control, and parameter estimation. The experiments are well-structured, showcasing the framework's versatility and effectiveness in real-world scenarios. The benchmarks used are relevant and highlight the improvements made over traditional methods, providing strong empirical evidence for the claims made in the paper.
While the paper provides a detailed description of the methodology and experiments, it lacks specific implementation details that would facilitate reproducibility. The absence of a publicly accessible code repository or demo limits the ability of other researchers to replicate the findings. However, the clarity of the methodology may allow for independent implementation.
The paper acknowledges that data scarcity in certain benchmarks limits the neural network's ability to learn universally applicable corrections. Additionally, the reliance on synthetic data for some experiments may not fully capture the complexities of real-world scenarios. The paper would benefit from a more thorough discussion of the challenges in applying the framework to diverse coastal environments.

The proposed framework has significant implications for coastal engineering, tsunami hazard assessment, and environmental monitoring. By integrating machine learning with traditional numerical methods, AegirJAX can enhance the accuracy and efficiency of simulations, leading to better-informed decision-making in coastal management and disaster preparedness.
An important recurring pattern in scientific breakthroughs is a two-stage process: an initial phase of undirected experimentation that yields an unexpected finding, followed by a retrospective phase that explains why the finding works and situates it within existing theory. We present ResearchEVO, an end-to-end framework that computationally instantiates this discover-then-explain paradigm. The Evolution Phase employs LLM-guided bi-dimensional co-evolution -- simultaneously optimizing both algorithmic logic and overall architecture -- to search the space of code implementations purely by fitness, without requiring any understanding of the solutions it produces. The Writing Phase then takes the best-performing algorithm and autonomously generates a complete, publication-ready research paper through sentence-level retrieval-augmented generation with explicit anti-hallucination verification and automated experiment design. To our knowledge, ResearchEVO is the first system to cover this full pipeline end to end: no prior work jointly performs principled algorithm evolution and literature-grounded scientific documentation. We validate the framework on two cross-disciplinary scientific problems -- Quantum Error Correction using real Google quantum hardware data, and Physics-Informed Neural Networks -- where the Evolution Phase discovered human-interpretable algorithmic mechanisms that had not been previously proposed in the respective domain literatures. In both cases, the Writing Phase autonomously produced compilable LaTeX manuscripts that correctly grounded these blind discoveries in existing theory via RAG, with zero fabricated citations.
Primary: City University of Hong Kong
All Institutions: City University of Hong Kong
The main contribution of ResearchEVO is its innovative end-to-end framework that automates both the discovery of novel algorithms and the generation of scientifically rigorous documentation, addressing a critical gap in automated scientific research. This work represents a significant advancement in the field of machine learning and automated research, with implications for various scientific domains.
The methodology presented in ResearchEVO is innovative, combining bi-dimensional co-evolution for algorithm discovery with a structured writing phase that generates publication-ready papers. The separation of discovery and explanation phases allows for a more focused optimization process, which is a significant advancement over existing systems that typically only address one aspect. The use of LLM-guided evolution without predefined templates and the integration of literature-grounded explanations through a retrieval-augmented generation (RAG) pipeline are particularly noteworthy. The framework's ability to autonomously generate scientifically valid papers based on evolved algorithms demonstrates a sophisticated understanding of both algorithmic and scientific processes.
The experimental evaluation is robust, showcasing the framework's application to two significant scientific problems: Quantum Error Correction and Physics-Informed Neural Networks. The results indicate that the evolved algorithms not only outperform existing solutions but also provide human-interpretable insights into their mechanisms. The use of real data from Google quantum hardware and rigorous statistical methods, including bootstrap confidence intervals and paired sign tests, strengthens the validity of the findings. The comprehensive nature of the experiments, including ablation studies, further supports the framework's effectiveness.
While the paper outlines a detailed methodology, there is no explicit mention of a public code repository or supplementary materials that would facilitate reproducibility. The lack of a demo or project URL is a significant limitation, as it hinders other researchers from validating the findings or building upon the work. However, the clear description of the processes involved in both the Evolution and Writing phases provides a solid foundation for future implementations.
The paper acknowledges several limitations, including the computational cost associated with LLM-guided evolutionary approaches, potential mis-citations or over-generalizations in generated papers, and the need for human review before submission. Additionally, the sequential nature of the pipeline may limit the feedback loop between discovery and explanation, and the framework has only been validated on two specific domains, suggesting that broader applicability remains to be tested.
ResearchEVO has the potential to democratize scientific research by enabling researchers from various fields to generate initial research directions without deep expertise in adjacent domains. The framework's structural safeguards against fabrication enhance its reliability, making it a valuable tool for scientific inquiry. By bridging the gap between algorithm discovery and scientific documentation, it could significantly impact how research is conducted and communicated across disciplines.
Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on document or image inputs, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
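One plausible reading of the abstract's reliability metrics, sketched here with invented task names and outcomes (not the paper's grading code): over k trials per task, Pass@k credits any success, while Pass^k requires success on every trial, so only Pass^k penalizes inconsistency.

```python
# Illustrative only: Pass@k vs Pass^k over k = 3 trials per task.
# Task names and trial outcomes are invented for the example.

def pass_at_k(trials):
    """At least one of the k trials succeeded."""
    return any(trials)

def pass_hat_k(trials):
    """All k trials succeeded -- a consistency requirement."""
    return all(trials)

trials_per_task = {
    "orchestrate_service": [True, True, True],    # consistently solved
    "parse_video":         [True, False, True],   # flaky
    "audit_dialogue":      [False, False, False], # consistently failed
}

def rate(metric):
    return sum(metric(t) for t in trials_per_task.values()) / len(trials_per_task)

print(f"Pass@3 = {rate(pass_at_k):.2f}, Pass^3 = {rate(pass_hat_k):.2f}")
```

The flaky task lifts Pass@3 to 0.67 but leaves Pass^3 at 0.33, which is exactly the gap the abstract's error-injection finding exploits.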
Primary: The University of Hong Kong
All Institutions: The University of Hong Kong
The paper presents Claw-Eval, a comprehensive evaluation suite for autonomous agents that addresses critical gaps in existing benchmarks and offers a robust framework for assessing agent performance across multiple dimensions. The innovative methodology and significant empirical findings position this work as a valuable contribution to the field of machine learning, particularly in the context of evaluating autonomous systems.
The methodology introduced in Claw-Eval is robust and comprehensive, addressing critical gaps in existing evaluation frameworks for autonomous agents. The three-phase execution lifecycle ensures that every action taken by the agent is auditable, which is a significant improvement over trajectory-opaque evaluations. The integration of multi-dimensional scoring that evaluates completion, safety, and robustness simultaneously is innovative and necessary for real-world applications. The use of controlled error injection to assess robustness under realistic conditions adds depth to the evaluation process.
The experiments conducted on 14 frontier models provide a thorough analysis of the proposed evaluation framework. The findings reveal substantial insights into the performance of these models across various tasks and modalities, highlighting the unreliability of trajectory-opaque evaluations and the importance of multi-trial assessments. The results are statistically significant and demonstrate the practical implications of the proposed framework.
The paper lacks explicit details regarding the implementation of Claw-Eval, such as code availability or specific datasets used, which could hinder reproducibility. While the methodology is well-documented, the absence of a public repository or demo limits the ability of other researchers to replicate the findings.
One limitation is the reliance on human-verified tasks, which may introduce biases in the evaluation process. Additionally, the framework may require significant computational resources for execution, which could limit its accessibility for smaller research groups. The paper does not address potential scalability issues when applied to larger models or more complex tasks.
The implications of Claw-Eval extend beyond academic research; it has the potential to influence the deployment of autonomous agents in various industries, including customer service, healthcare, and automation. By providing a more trustworthy evaluation framework, it can enhance the reliability and safety of AI systems in real-world applications.
We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revision in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves tokens per forward (TPF) on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 tokens per second (TPS) at batch size 1. Code is available at: https://github.com/czg1225/DMax
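The "interpolation between the predicted token embedding and the mask embedding" admits a very small sketch. Everything below (the dimension, the embeddings, the confidence-as-alpha schedule) is invented for illustration; the actual Soft Parallel Decoding mechanism lives inside the model's decoding loop.

```python
# Sketch of a soft intermediate decoding state: a convex blend of the
# mask embedding and a predicted token embedding. Values are invented.

def soft_state(mask_emb, token_emb, alpha):
    """alpha = 0 -> pure mask; alpha = 1 -> fully committed token.
    Intermediate alpha keeps the position revisable on later passes."""
    assert 0.0 <= alpha <= 1.0
    return [(1 - alpha) * m + alpha * t
            for m, t in zip(mask_emb, token_emb)]

mask_emb  = [0.0, 0.0, 0.0]    # stand-in for a learned [MASK] embedding
token_emb = [0.2, -0.5, 1.0]   # stand-in for a predicted token embedding

# E.g., drive alpha by prediction confidence: low-confidence positions
# stay near the mask embedding instead of committing a possibly wrong
# token, which is how self-revision avoids locking in early errors.
for confidence in (0.0, 0.5, 1.0):
    print(confidence, soft_state(mask_emb, token_emb, confidence))
```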
Primary: National University of Singapore
All Institutions: National University of Singapore
The main contribution of this paper is the introduction of DMax, a novel paradigm for efficient diffusion language models that mitigates error accumulation in parallel decoding through innovative training and decoding strategies, significantly improving performance and establishing a new baseline for future research.
The methodology introduces two significant innovations: On-Policy Uniform Training (OPUT) and Soft Parallel Decoding (SPD). OPUT effectively bridges the training-inference gap by sampling noisy sequences based on the model's own predictions, which enhances the model's ability to self-correct errors. SPD allows for a soft representation of intermediate states, which helps in maintaining predictive uncertainty and improving robustness during parallel decoding. The combination of these methods addresses the critical issue of error accumulation in diffusion language models, marking a substantial advancement in the field.
The experimental evaluation is extensive, covering multiple benchmarks such as GSM8K and MBPP, demonstrating significant improvements in tokens per forward (TPF) while maintaining accuracy. The results show that DMax outperforms existing models in terms of both efficiency and accuracy, establishing a new baseline for parallel decoding in diffusion language models. The experiments are well-structured and provide a clear comparison against various baselines, showcasing the effectiveness of the proposed methods.
The paper provides sufficient implementation details, including training procedures, hyperparameters, and the datasets used. The availability of the code on GitHub enhances reproducibility, allowing other researchers to validate the findings and build upon the work.
One limitation is the reliance on a specific base model (LLaDA-2.0-mini), which may limit the generalizability of the findings to other diffusion language models. Additionally, while the proposed methods mitigate error accumulation, they may not completely eliminate it under extreme parallel decoding conditions.
The advancements presented in this paper have the potential to significantly enhance the efficiency of diffusion language models, making them more applicable in real-time applications such as conversational agents, code generation, and other text generation tasks. By improving parallel decoding capabilities, this work could lead to faster and more accurate language models, influencing future research and applications in natural language processing.
Federated Learning (FL) enables collaborative model training while preserving data privacy, but its practical deployment is hampered by system and statistical heterogeneity. While federated network pruning offers a path to mitigate these issues, existing methods face a critical dilemma: server-side pruning lacks personalization, whereas client-side pruning is computationally prohibitive for resource-constrained devices. Furthermore, the pruning process itself induces significant parametric divergence among heterogeneous submodels, destabilizing training and hindering global convergence. To address these challenges, we propose SubFLOT, a novel framework for server-side personalized federated pruning. SubFLOT introduces an Optimal Transport-enhanced Pruning (OTP) module that treats historical client models as proxies for local data distributions, formulating the pruning task as a Wasserstein distance minimization problem to generate customized submodels without accessing raw data. Concurrently, to counteract parametric divergence, our Scaling-based Adaptive Regularization (SAR) module adaptively penalizes a submodel's deviation from the global model, with the penalty's strength scaled by the client's pruning rate. Comprehensive experiments demonstrate that SubFLOT consistently and substantially outperforms state-of-the-art methods, underscoring its potential for deploying efficient and personalized models on resource-constrained edge devices.
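To make the objective concrete: the quantity SubFLOT minimizes is a Wasserstein distance between distributions. In the special case of two equal-size 1-D samples, the 1-Wasserstein distance reduces to the mean absolute difference of sorted values, a textbook identity far simpler than the paper's model-space formulation; the sample values below are invented.

```python
# Textbook special case: the 1-Wasserstein distance between two
# equal-size 1-D empirical distributions is the mean gap between
# sorted samples. (SubFLOT's OTP module works on model parameters,
# not on toy scalars like these.)

def wasserstein_1d(xs, ys):
    assert len(xs) == len(ys)
    return sum(abs(a - b)
               for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

global_stats = [0.1, 0.4, 0.8, 1.2]   # invented "global model" sample
client_stats = [0.0, 0.5, 0.9, 1.0]   # invented "client proxy" sample
print(wasserstein_1d(global_stats, client_stats))  # ~0.125
```

A candidate submodel can then be scored by how little it increases this distance to the client proxy, which is the spirit of formulating pruning as Wasserstein minimization.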
Primary: Tsinghua University
All Institutions: Tsinghua University, Beijing University of Technology
This paper presents a significant contribution to the field of federated learning by introducing SubFLOT, a framework that enhances model personalization and efficiency through innovative methodologies. The combination of Optimal Transport and adaptive regularization addresses critical challenges in the deployment of federated learning systems, making it a noteworthy advancement in the field.
The proposed SubFLOT framework introduces a novel approach to federated learning by leveraging Optimal Transport for server-side pruning, which is a significant advancement over existing methods that either lack personalization or are computationally expensive. The use of Wasserstein distance minimization to align client models with local data distributions is innovative and addresses the critical challenges of parametric divergence. The Scaling-based Adaptive Regularization (SAR) module further enhances the framework by adaptively penalizing deviations from the global model, which is a thoughtful addition that could improve convergence stability. Overall, the methodology is well-structured and addresses key issues in federated learning.
The experiments conducted demonstrate a thorough evaluation of SubFLOT against state-of-the-art methods. The authors provide comprehensive results across various scenarios, highlighting the performance improvements in terms of both efficiency and personalization. However, the paper could benefit from more extensive ablation studies to dissect the contributions of individual components within the framework. The datasets used appear relevant, but further details on their diversity and representativeness would strengthen the evaluation.
The paper lacks specific implementation details that would facilitate reproducibility. While the methodology is described, the absence of a publicly available code repository or supplementary materials limits the ability of other researchers to replicate the results. Providing a GitHub link or similar resource would significantly enhance the paper's impact and usability.
One limitation noted is the reliance on server-side pruning, which may not fully address the needs of all client devices, particularly those with highly variable computational resources. Additionally, while the framework shows promise in improving convergence, the long-term stability and performance across diverse federated learning environments remain to be thoroughly evaluated. The paper could also explore the trade-offs between personalization and model complexity in more depth.
The implications of SubFLOT are significant, particularly for applications in resource-constrained environments such as mobile devices and IoT systems. By enabling more efficient and personalized federated learning, this work could enhance privacy-preserving machine learning applications across various domains, including healthcare, finance, and smart cities. The framework's potential to improve model deployment in real-world scenarios could lead to broader adoption of federated learning techniques.
Scaling laws describe how language model capabilities grow with compute and data, but say nothing about how long a model matters once released. We provide the first large-scale empirical account of how scientists adopt and abandon language models over time. We track 62 LLMs across over 108k citing papers (2018-2025), each with at least three years of post-release data, and classify every citation as active adoption or background reference to construct per-model adoption trajectories that raw citation counts cannot resolve. We find three regularities. First, scientific adoption follows an inverted-U trajectory: usage rises after release, peaks, and declines as newer models appear, a pattern we term the \textit{scientific adoption curve}. Second, this curve is compressing: each additional release year is associated with a 27\% reduction in time-to-peak adoption ($p < 0.001$), robust to minimum-age thresholds and controls for model size. Third, release timing dominates model-level attributes as a predictor of lifecycle dynamics. Release year explains both time-to-peak and scientific lifespan more strongly than architecture, openness, or scale, though model size and access modality retain modest predictive power for total adoption volume. Together, these findings complement scaling laws with adoption-side regularities and suggest that the forces driving rapid capability progress may be the same forces compressing scientific relevance.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology
The main contribution of this paper is the empirical characterization of the lifecycle of language models in scientific research, revealing that the adoption of LLMs follows a compressing inverted-U trajectory, with significant implications for the future of scientific practice. The comprehensive analysis of adoption dynamics provides valuable insights into the interplay between model development and scientific relevance, highlighting the need for a demand-side perspective in the ongoing discourse around language models.
The paper employs a robust methodology by tracking the adoption of 62 language models across over 108,000 citing papers, classifying citations into active adoption or background references. The use of zero-shot prompting with GPT-4 for classification, combined with a Bayesian correction for misclassification, adds rigor to the approach. The statistical models employed to analyze the adoption trajectories, including quadratic regressions and fixed effects, are appropriate for the research questions posed. However, the reliance on citation data from Semantic Scholar and S2ORC may introduce biases, particularly in fields with lower open-access rates.
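The quadratic-regression step is easy to make concrete: fit y = a·t² + b·t + c to per-year adoption counts and read time-to-peak off the vertex at -b/(2a). The fitting code and the adoption counts below are our own toy illustration, not the paper's data or pipeline.

```python
# Toy illustration of extracting "time-to-peak" from an inverted-U
# adoption curve: least-squares quadratic fit, vertex at -b/(2a).
# Adoption counts are invented; the paper fits real citation data.

def quad_fit(ts, ys):
    """Fit y = a*t^2 + b*t + c by solving the 3x3 normal equations."""
    n = len(ts)
    s = lambda p: sum(t ** p for t in ts)
    sy = lambda p: sum(y * t ** p for t, y in zip(ts, ys))
    A = [[s(4), s(3), s(2)],
         [s(3), s(2), s(1)],
         [s(2), s(1), float(n)]]
    rhs = [sy(2), sy(1), sy(0)]
    for i in range(3):                     # elimination with pivoting
        p = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[p], rhs[i], rhs[p] = A[p], A[i], rhs[p], rhs[i]
        for r in range(i + 1, 3):
            f = A[r][i] / A[i][i]
            A[r] = [arc - f * aic for arc, aic in zip(A[r], A[i])]
            rhs[r] -= f * rhs[i]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                    # back-substitution
        x[i] = (rhs[i] - sum(A[i][j] * x[j]
                             for j in range(i + 1, 3))) / A[i][i]
    return x                               # [a, b, c]

years = [0, 1, 2, 3, 4]                    # years since release
usage = [5, 30, 42, 35, 12]                # active-adoption counts (toy)
a, b, c = quad_fit(years, usage)
print(f"time-to-peak ~ {-b / (2 * a):.2f} years after release")
```

The compression finding then amounts to this vertex arriving earlier, by roughly 27% per release year, for later-released models.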
The empirical findings are well-supported by the data, revealing three key regularities in the adoption of LLMs: the inverted-U trajectory, the compression of the adoption curve, and the dominance of release timing over model characteristics. The analysis is thorough, with a clear presentation of results that demonstrate the trends in adoption and the implications for scientific practice. The findings are statistically significant and robust across various models and specifications.
The paper provides sufficient details regarding the methods and statistical analyses employed, allowing for reproducibility. However, the classification of citations using a zero-shot model may introduce variability that is not fully accounted for, potentially affecting reproducibility. The authors acknowledge the limitations of their approach, which is a positive aspect.
The study has several limitations, including potential biases in the citation data sources, the reliance on a zero-shot classifier for citation context classification, and the limited sample size for subgroup analyses. Additionally, the focus on formal citations may overlook informal adoption channels, which could provide a more comprehensive view of LLM usage in scientific practice.
The findings have significant implications for the scientific community, particularly regarding the rapid turnover of language models and its impact on reproducibility and knowledge accumulation. The paper highlights the challenges faced by researchers in adapting to new models and the potential costs associated with frequent updates. It calls for a reevaluation of how models are released and maintained to support long-term scientific workflows.
RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.
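The "reward variance as a lightweight proxy" rule reduces to a few lines. The prompt names, rollout rewards, and top-k interface below are invented for illustration; the paper's filtering operates inside the RL training loop.

```python
# Sketch of SNR-aware prompt selection: rank prompts by reward variance
# across rollouts and keep the top k. All-success and all-failure
# prompts have zero variance, hence essentially no task gradient,
# letting regularization dominate -- the template-collapse mechanism.
from statistics import pvariance

def snr_filter(prompt_rewards, k):
    """prompt_rewards: {prompt_id: [reward per rollout]}."""
    ranked = sorted(prompt_rewards,
                    key=lambda pid: pvariance(prompt_rewards[pid]),
                    reverse=True)
    return ranked[:k]

rollouts = {
    "p1": [1, 1, 1, 1],   # always solved: no signal
    "p2": [0, 1, 0, 1],   # maximally informative
    "p3": [0, 0, 0, 0],   # never solved: no signal
    "p4": [1, 0, 1, 1],   # moderately informative
}
print(snr_filter(rollouts, 2))  # -> ['p2', 'p4']
```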
Primary: Imperial College London
All Institutions: Imperial College London, Northwestern University, Oxford University, University of Washington, Stanford University, Microsoft
This paper presents a significant advancement in understanding and mitigating reasoning collapse in multi-turn RL agents, providing both theoretical insights and practical methodologies that could reshape training practices in the field.
The paper introduces a novel approach to diagnosing and mitigating a specific failure mode in reinforcement learning (RL) for multi-turn large language models (LLMs) called "template collapse." The authors propose a mutual information (MI) proxy to assess input dependence in reasoning outputs, which is shown to correlate more strongly with task performance than traditional entropy metrics. They also introduce SNR-Aware Filtering, which prioritizes prompts based on reward variance to enhance training efficiency. This dual approach is innovative and addresses a critical gap in the current understanding of RL training dynamics.
The experimental setup is robust, covering a diverse range of tasks and environments, including planning, mathematical reasoning, and web navigation. The results demonstrate consistent improvements in task performance and input dependence when using SNR-Aware Filtering. The paper provides thorough empirical validation of the proposed methods across multiple algorithms and model scales, showcasing the effectiveness of the MI proxy in diagnosing template collapse.
The paper includes detailed descriptions of the experimental setup, algorithms used, and the metrics for evaluation. However, it lacks a publicly available code repository or demo to facilitate direct reproducibility by other researchers. The absence of a project URL limits the ability for others to replicate the findings easily.
The authors acknowledge that the SNR decomposition may not account for all complexities in gradient dynamics and that their method may not generalize well in multi-agent settings. Additionally, the reliance on reward variance as a proxy for signal quality may not hold in all environments, particularly those with sparse or noisy rewards. There is also a risk of overfitting to the filtering criterion, which could lead to suboptimal exploration.
The findings have significant implications for the development of more reliable and efficient RL agents, particularly in multi-turn contexts where reasoning quality is crucial. By addressing template collapse, the work could enhance the performance of LLMs in various applications, including conversational agents, automated reasoning systems, and interactive AI tools. The proposed methods could also inspire further research into improving training stability in RL frameworks.
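The SNR-Aware Filtering idea described above is simple enough to sketch: per iteration, score each prompt by the variance of its rollout rewards and keep only the high-variance ("high-signal") prompts. This is an illustrative sketch, not the authors' implementation; the function name, the `keep_fraction` parameter, and the data layout are assumptions.

```python
import numpy as np

def snr_filter(prompt_rewards, keep_fraction=0.5):
    """Select high-signal prompts by per-prompt reward variance.

    prompt_rewards: dict mapping prompt_id -> list of rollout rewards
    sampled for that prompt in the current iteration.
    Returns the prompt ids with the highest reward variance, used as a
    cheap proxy for gradient signal-to-noise ratio.
    """
    variances = {pid: np.var(rs) for pid, rs in prompt_rewards.items()}
    k = max(1, int(len(variances) * keep_fraction))
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:k]

rewards = {
    "p1": [1.0, 1.0, 1.0, 1.0],   # all rollouts identical: zero variance
    "p2": [0.0, 1.0, 0.0, 1.0],   # mixed outcomes: high variance
    "p3": [0.0, 0.0, 0.0, 0.0],   # zero variance
    "p4": [1.0, 0.0, 1.0, 1.0],   # mixed outcomes
}
print(snr_filter(rewards, keep_fraction=0.5))  # ['p2', 'p4']
```

Prompts whose rollouts all succeed or all fail contribute no task gradient under group-relative baselines, which is exactly the regime where, per the paper's SNR argument, regularization terms dominate.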
Conformal prediction provides distribution-free prediction intervals with finite-sample coverage guarantees, and recent work by Snell & Griffiths reframes it as Bayesian Quadrature (BQ-CP), yielding powerful data-conditional guarantees via Dirichlet posteriors over thresholds. However, BQ-CP fundamentally requires the i.i.d. assumption, a limitation the authors themselves identify. Meanwhile, weighted conformal prediction handles distribution shift via importance weights but remains frequentist, producing only point-estimate thresholds. We propose Weighted Bayesian Conformal Prediction (WBCP), which generalizes BQ-CP to arbitrary importance-weighted settings by replacing the uniform Dirichlet Dir(1, ..., 1) with a weighted Dirichlet Dir(n_eff · w̃_1, ..., n_eff · w̃_n), where n_eff is Kish's effective sample size. We prove four theoretical results: (1) n_eff is the unique concentration parameter matching frequentist and Bayesian variances; (2) posterior standard deviation decays as O(1/√n_eff); (3) BQ-CP's stochastic dominance guarantee extends to per-weight-profile data-conditional guarantees; (4) the HPD threshold provides O(1/√n_eff) improvement in conditional coverage. We instantiate WBCP for spatial prediction as Geographical BQ-CP, where kernel-based spatial weights yield per-location posteriors with interpretable diagnostics. Experiments on synthetic and real-world spatial datasets demonstrate that WBCP maintains coverage guarantees while providing substantially richer uncertainty information.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology, Technical University of Munich
The paper presents a significant advancement in uncertainty quantification through the introduction of Weighted Bayesian Conformal Prediction (WBCP), which effectively addresses the limitations of existing methods in the presence of distribution shifts. The comprehensive theoretical foundation, coupled with robust experimental validation, positions WBCP as a valuable tool for practitioners seeking reliable prediction intervals in complex real-world scenarios.
The paper introduces Weighted Bayesian Conformal Prediction (WBCP), a novel framework that combines Bayesian Quadrature with weighted conformal prediction to address the limitations of existing methods under distribution shifts. The methodology is well-structured, providing a clear theoretical foundation with four significant results that enhance the understanding of meta-uncertainty in prediction intervals. The use of a weighted Dirichlet model to replace the uniform Dirichlet in BQ-CP is a key innovation that allows for per-weight-profile posteriors, which is a meaningful advancement in the field.
The experiments conducted on both synthetic and real-world datasets demonstrate the effectiveness of WBCP in maintaining coverage guarantees while providing richer uncertainty information. The results are compelling, showing that WBCP outperforms standard CP and weighted CP in terms of both coverage and interpretability. The inclusion of spatial diagnostics enhances the practical applicability of the method, particularly in spatial prediction contexts.
The paper lacks specific implementation details or a publicly available code repository, which raises concerns about reproducibility. While the methodology is theoretically sound, the absence of a demo or project URL limits the ability of other researchers to validate the findings independently.
The authors acknowledge certain limitations, including the parametric nature of the Dirichlet model and the potential for inflated effective sample sizes in the presence of spatially correlated residuals. These limitations suggest areas for future research, such as exploring nonparametric alternatives and improving computational efficiency for large-scale applications.
WBCP has significant implications for various applications where uncertainty quantification is critical, such as in finance, healthcare, and environmental modeling. By providing a principled way to quantify meta-uncertainty, WBCP can enhance decision-making processes in high-stakes scenarios. The method's adaptability to covariate shifts and its spatial instantiation make it particularly relevant in contemporary machine learning challenges.
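The core construction in the abstract above, Kish's effective sample size feeding a weighted Dirichlet posterior over the conformal threshold, can be sketched numerically. This is a minimal sketch under assumptions (the function names and the sampling-based quantile posterior are illustrative); with uniform weights, n_eff = n and the concentration reduces to BQ-CP's Dir(1, ..., 1).

```python
import numpy as np

def kish_neff(w):
    """Kish's effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def wbcp_threshold_posterior(scores, weights, alpha=0.1, draws=10_000, seed=0):
    """Sketch of a WBCP-style posterior over the conformal threshold.

    Places a Dir(n_eff * w_tilde) distribution over the probability mass
    of the sorted calibration scores and returns posterior draws of the
    (1 - alpha)-quantile threshold.
    """
    rng = np.random.default_rng(seed)
    order = np.argsort(scores)
    s = np.asarray(scores, float)[order]
    w = np.asarray(weights, float)[order]
    w_tilde = w / w.sum()
    conc = kish_neff(w) * w_tilde          # weighted Dirichlet concentration
    p = rng.dirichlet(conc, size=draws)    # posterior mass over score atoms
    cdf = np.cumsum(p, axis=1)
    idx = (cdf < 1 - alpha).sum(axis=1).clip(max=len(s) - 1)
    return s[idx]                          # posterior threshold samples
```

The spread of the returned threshold samples is the "meta-uncertainty" the paper emphasizes: heavily concentrated weights shrink n_eff and widen the posterior, matching the O(1/√n_eff) decay result.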
LLM agents increasingly draft messages on behalf of users, yet users routinely overshare sensitive information and disagree on what counts as private. Existing systems support only suppression (omitting sensitive information) and generalization (replacing information with an abstraction), and are typically evaluated on single isolated messages, leaving both the strategy space and evaluation setting incomplete. We formalize privacy-preserving LLM communication as an Information Sufficiency (IS) task, introduce free-text pseudonymization as a third strategy that replaces sensitive attributes with functionally equivalent alternatives, and propose a conversational evaluation protocol that assesses strategies under realistic multi-turn follow-up pressure. Across 792 scenarios spanning three power-relation types (institutional, peer, intimate) and three sensitivity categories (discrimination risk, social cost, boundary), we evaluate seven frontier LLMs on privacy at two granularities, covertness, and utility. Pseudonymization yields the strongest privacy–utility tradeoff overall, and single-message evaluation systematically underestimates leakage, with generalization losing up to 16.3 percentage points of privacy under follow-up.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology
This paper makes a significant contribution by formalizing privacy-preserving communication as an Information Sufficiency task and introducing free-text pseudonymization, providing a robust alternative to traditional privacy strategies in LLMs. The comprehensive evaluation framework and empirical findings have the potential to reshape how privacy is approached in machine learning applications.
The paper introduces a novel approach to privacy-preserving communication for LLMs through the concept of Information Sufficiency (IS) and free-text pseudonymization. This methodology is innovative as it expands the existing strategies of suppression and generalization by providing a third option that maintains functional equivalence while enhancing privacy. The conversational evaluation protocol is a significant advancement, allowing for a more realistic assessment of privacy strategies in multi-turn interactions. The formalization of the IS task is well-articulated and sets a new standard for evaluating privacy in LLM communications.
The authors conducted extensive experiments across 792 scenarios, evaluating seven frontier LLMs under different power-relation types and sensitivity categories. This comprehensive dataset allows for robust comparisons between the proposed pseudonymization strategy and traditional methods. The results indicate that pseudonymization offers a superior privacy-utility tradeoff, which is a critical finding for the field. However, the reliance on LLM judges for privacy assessments may introduce biases, and the lack of adversarial testing could limit the generalizability of the findings.
The paper lacks detailed implementation specifics, such as code or datasets, which are crucial for reproducibility. While the methodology is clearly described, the absence of a publicly available project URL or demo limits the ability of other researchers to validate the findings independently. Providing access to the experimental setup and data would significantly enhance the reproducibility of the results.
The paper acknowledges several limitations, including the potential for LLM judges to miss subtle norm violations and the non-adversarial nature of the receiver simulator. Additionally, the scenarios are derived from U.S.-centric privacy norms, which may not generalize to other cultural contexts. The authors also note that while pseudonymization excels in intimate contexts, its effectiveness diminishes in institutional settings where sensitive attributes are closely tied to functional requests.
The implications of this work are substantial, as it addresses a critical gap in privacy-preserving communication for LLMs. By enabling users to manage their disclosures without compromising the utility of the information shared, this research could lead to more ethical and user-centric AI applications. The findings challenge existing norms in privacy strategies and could influence future designs of AI communication systems, promoting a more nuanced understanding of privacy in digital interactions.
The advent of agentic multimodal models has empowered systems to actively interact with external environments. However, current agents suffer from a profound meta-cognitive deficit: they struggle to arbitrate between leveraging internal knowledge and querying external utilities. Consequently, they frequently fall into blind tool invocation, resorting to reflexive tool execution even when queries are resolvable from the raw visual context. This behavior creates severe latency bottlenecks and injects extraneous noise that derails sound reasoning. Existing reinforcement learning protocols attempt to mitigate this via a scalarized reward that penalizes tool usage. Yet this coupled formulation creates an irreconcilable optimization dilemma: an aggressive penalty suppresses essential tool use, whereas a mild penalty is entirely subsumed by the variance of the accuracy reward during advantage normalization, leaving it ineffective against tool overuse. To overcome this bottleneck, we propose HDPO, a framework that reframes tool efficiency from a competing scalar objective to a strictly conditional one. By eschewing reward scalarization, HDPO maintains two orthogonal optimization channels: an accuracy channel that maximizes task correctness, and an efficiency channel that enforces execution economy exclusively within accurate trajectories via conditional advantage estimation. This decoupled architecture naturally induces a cognitive curriculum, compelling the agent to first master task resolution before refining its self-reliance. Extensive evaluations demonstrate that our resulting model, Metis, reduces tool invocations by orders of magnitude while simultaneously improving reasoning accuracy.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology, Alibaba Group
This paper presents a significant advancement in the field of multimodal machine learning by addressing the critical issue of tool invocation in agentic systems. The proposed HDPO framework not only enhances the efficiency and accuracy of multimodal agents but also sets a new standard for future research in this domain.
The paper introduces Hierarchical Decoupled Policy Optimization (HDPO), a novel framework that decouples accuracy and efficiency in reinforcement learning for multimodal agents. This approach addresses the critical issue of blind tool invocation by establishing two independent optimization channels, allowing agents to first master task resolution before refining tool usage. The methodology is well-articulated and presents a clear theoretical foundation for the proposed changes, with rigorous mathematical analysis of the limitations of existing methods.
The experiments are extensive, demonstrating the effectiveness of the proposed model (Metis) across various benchmarks for perception, document understanding, and mathematical reasoning. The results show significant improvements in both accuracy and efficiency, with the model achieving state-of-the-art performance while drastically reducing tool invocations. The use of ablation studies further strengthens the evaluation by isolating the contributions of the proposed changes.
The paper provides detailed implementation information, including the architecture, training data curation, and hyperparameter settings. However, the lack of a public demo or clear reproducibility guidelines may hinder broader adoption of the methodology. The GitHub repository link is provided, which may contain additional resources for implementation.
While the proposed method shows promising results, the paper does not address potential limitations in terms of scalability to more complex environments or the generalizability of the results across different task types. Additionally, the reliance on specific datasets may limit the applicability of the findings.
The work has significant implications for the development of more intelligent and efficient multimodal agents, particularly in applications requiring complex reasoning and tool use. By improving meta-cognitive capabilities, this research could lead to advancements in various fields, including robotics, autonomous systems, and human-computer interaction.
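The "two orthogonal channels" idea in the abstract can be sketched at the advantage level: a standard group-normalized accuracy advantage, plus an efficiency advantage computed only over trajectories that answered correctly. This is a sketch of the conditional structure as described, not the authors' code; the function name and group-normalization details are assumptions.

```python
import numpy as np

def decoupled_advantages(correct, tool_calls, eps=1e-8):
    """Sketch of a decoupled, conditional advantage in the spirit of HDPO.

    correct: binary array (1 = trajectory answered correctly) per rollout
    tool_calls: number of tool invocations per rollout
    Returns (accuracy_adv, efficiency_adv). The efficiency channel is
    computed only over correct trajectories, so economy is never rewarded
    at the expense of correctness.
    """
    correct = np.asarray(correct, float)
    tools = np.asarray(tool_calls, float)

    acc_adv = (correct - correct.mean()) / (correct.std() + eps)

    eff_adv = np.zeros_like(tools)
    mask = correct == 1.0
    if mask.sum() >= 2:                       # need a group to normalize over
        t = tools[mask]
        # fewer tool calls => higher advantage, within correct rollouts only
        eff_adv[mask] = -(t - t.mean()) / (t.std() + eps)
    return acc_adv, eff_adv
```

Because the efficiency advantage is zero until a group contains correct trajectories, the agent is first pushed toward correctness and only later toward tool economy, which is the "cognitive curriculum" the abstract describes.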
Multimodal reasoning models (MRMs) trained with reinforcement learning with verifiable rewards (RLVR) show improved accuracy on visual reasoning benchmarks. However, we observe that accuracy gains often come at the cost of reasoning quality: generated Chain-of-Thought (CoT) traces are frequently inconsistent with the final answer and poorly grounded in the visual evidence. We systematically study this phenomenon across seven challenging real-world spatial reasoning benchmarks and find that it affects contemporary MRMs such as ViGoRL-Spatial, TreeVGR as well as our own models trained with standard Group Relative Policy Optimization (GRPO). We characterize CoT reasoning quality along two complementary axes: "logical consistency" (does the CoT entail the final answer?) and "visual grounding" (does each reasoning step accurately describe objects, attributes, and spatial relationships in the image?). To address this, we propose Faithful GRPO (FGRPO), a variant of GRPO that enforces consistency and grounding as constraints via Lagrangian dual ascent. FGRPO incorporates batch-level consistency and grounding constraints into the advantage computation within a group, adaptively adjusting the relative importance of constraints during optimization. We evaluate FGRPO on Qwen2.5-VL-7B and 3B backbones across seven spatial datasets. Our results show that FGRPO substantially improves reasoning quality, reducing the inconsistency rate from 24.5% to 1.7% and improving visual grounding scores by +13%. It also improves final answer accuracy over simple GRPO, demonstrating that faithful reasoning enables better answers.
Primary: Microsoft Research
All Institutions: Microsoft Research
The main contribution of this paper is the introduction of Faithful GRPO, a novel method that significantly enhances the reasoning quality of multimodal language models by enforcing logical consistency and visual grounding, thereby improving accuracy on spatial reasoning tasks. This work represents a meaningful advancement in the field of multimodal reasoning, with the potential to influence future research and applications significantly.
The proposed Faithful GRPO (FGRPO) method introduces a novel approach to enhance the reasoning quality of multimodal reasoning models by enforcing logical consistency and visual grounding through Lagrangian dual ascent. This methodology is innovative as it integrates constraints into the optimization process, which is not commonly seen in existing reinforcement learning frameworks. The systematic characterization of reasoning quality along two axes (logical consistency and visual grounding) adds depth to the methodology, allowing for a comprehensive understanding of the issues at hand.
The paper presents a robust experimental evaluation across seven challenging spatial reasoning benchmarks, which is commendable. The authors provide quantitative results demonstrating significant improvements in reasoning quality and accuracy, with a notable reduction in inconsistency rates and enhanced visual grounding scores. The use of established models (Qwen2.5-VL-7B and 3B backbones) for evaluation adds credibility to the findings. However, the paper could benefit from a more detailed comparison with other state-of-the-art methods to contextualize the improvements.
The paper lacks explicit details regarding the implementation of the proposed FGRPO method, which could hinder reproducibility. While the results are promising, the absence of a publicly available code repository or detailed algorithmic steps limits the ability of other researchers to replicate the findings. Including such details would significantly enhance the paper's impact and utility.
One limitation noted in the paper is the potential trade-off between the enforcement of constraints and the overall performance of the model in other areas not covered by the constraints. Additionally, the focus on only seven benchmarks may not fully capture the generalizability of the proposed method across diverse multimodal reasoning tasks.
The implications of this research are significant, as improving visual spatial reasoning in multimodal models can enhance applications in areas such as robotics, autonomous systems, and human-computer interaction. By providing a framework that ensures more reliable reasoning, this work could lead to more trustworthy AI systems capable of complex decision-making based on visual inputs.
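FGRPO's constraint handling, consistency and grounding scores folded into the group advantage with multipliers updated by dual ascent, can be sketched as follows. This is an illustrative reading of the mechanism the abstract describes; the step size, target levels, and update rule details are assumptions, not the paper's exact formulation.

```python
import numpy as np

def fgrpo_step(rewards, consistency, grounding, lam, targets, lr=0.05, eps=1e-8):
    """Sketch of FGRPO-style constrained advantages with dual ascent.

    rewards, consistency, grounding: per-rollout scores within one group
    (consistency/grounding in [0, 1]); lam: dict of Lagrange multipliers;
    targets: dict of constraint levels the batch should satisfy.
    Returns group-normalized advantages and updated multipliers.
    """
    r = np.asarray(rewards, float)
    c = np.asarray(consistency, float)
    g = np.asarray(grounding, float)

    # Lagrangian: task reward plus multiplier-weighted constraint scores.
    total = r + lam["consistency"] * c + lam["grounding"] * g
    adv = (total - total.mean()) / (total.std() + eps)

    # Dual ascent: grow a multiplier while its constraint is violated,
    # shrink it (down to 0) once the batch satisfies the target.
    new_lam = {
        "consistency": max(0.0, lam["consistency"] + lr * (targets["consistency"] - c.mean())),
        "grounding": max(0.0, lam["grounding"] + lr * (targets["grounding"] - g.mean())),
    }
    return adv, new_lam
```

The adaptive multipliers are what the abstract calls "adaptively adjusting the relative importance of constraints": a constraint that the batch already satisfies stops competing with the accuracy reward.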
To extend the reinforcement learning post-training paradigm to omni-modal models for concurrently bolstering video-audio understanding and collaborative reasoning, we propose OmniJigsaw, a generic self-supervised framework built upon a temporal reordering proxy task. Centered on the chronological reconstruction of shuffled audio-visual clips, this paradigm strategically orchestrates visual and auditory signals to compel cross-modal integration through three distinct strategies: Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking. Recognizing that the efficacy of such proxy tasks is fundamentally tied to puzzle quality, we design a two-stage coarse-to-fine data filtering pipeline, which facilitates the efficient adaptation of OmniJigsaw to massive unannotated omni-modal data. Our analysis reveals a "bi-modal shortcut phenomenon" in joint modality integration and demonstrates that fine-grained clip-level modality masking mitigates this issue while outperforming sample-level modality selection. Extensive evaluations on 15 benchmarks show substantial gains in video, audio, and collaborative reasoning, validating OmniJigsaw as a scalable paradigm for self-supervised omni-modal learning.
Primary: Zhejiang University
All Institutions: Zhejiang University, Xiaomi Inc.
The paper presents a significant advancement in omni-modal reasoning through the OmniJigsaw framework, which leverages self-supervised learning and innovative modality orchestration strategies to enhance reasoning capabilities across multiple modalities. The comprehensive methodology, extensive evaluation, and practical implications position this work as a valuable contribution to the field of machine learning.
The paper introduces the OmniJigsaw framework, which innovatively applies a self-supervised learning paradigm to enhance omni-modal reasoning through a temporal reordering proxy task. The methodology is well-structured, utilizing three distinct strategies (Joint Modality Integration, Sample-level Modality Selection, and Clip-level Modality Masking) to address the challenges of cross-modal integration. The authors also implement a two-stage data filtering pipeline to ensure the quality of the training data, which is critical for the success of their approach. The identification of the "bi-modal shortcut phenomenon" is a significant insight that informs their design choices, demonstrating a deep understanding of the underlying challenges in omni-modal reasoning.
The experiments are extensive, covering 15 benchmarks across video, audio, and collaborative reasoning tasks. The results show substantial performance improvements over existing methods, particularly with the Clip-level Modality Masking strategy. The authors provide detailed ablation studies that validate their design choices and demonstrate the sensitivity of their method to data quality and reward design. This rigorous evaluation strengthens the credibility of their findings and showcases the practical applicability of their framework.
The paper includes detailed implementation details, including the training data preparation, preprocessing steps, and hyperparameter settings. However, the actual code and data are not provided, which could hinder full reproducibility. The authors mention using specific models and configurations, but without access to the code or datasets, independent verification of results may be challenging.
One limitation noted is the reliance on high-quality annotated data for training, which may not be readily available for all applications. Additionally, while the paper addresses the "bi-modal shortcut phenomenon," it does not explore the implications of this phenomenon in depth, nor does it provide a comprehensive analysis of potential failure cases in real-world scenarios. The scalability of the method to larger datasets or more complex tasks remains to be fully evaluated.
The OmniJigsaw framework has the potential to significantly advance the field of omni-modal reasoning by providing a scalable, self-supervised approach that does not rely on extensive manual annotation. This could democratize access to advanced reasoning capabilities in AI systems, enabling applications in various domains such as robotics, autonomous systems, and multimedia content understanding. The insights gained from this research could inform future work in multi-modal learning and AI reasoning.
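The proxy task itself is easy to sketch: shuffle clips, optionally mask one modality per clip (the clip-level masking strategy that counters the bi-modal shortcut), and score reconstructions with a verifiable reordering reward. This is a structural sketch only; the masking probability, data layout, and reward definition are illustrative assumptions, not the paper's implementation.

```python
import random

def make_jigsaw_puzzle(num_clips=4, mask_prob=0.3, seed=0):
    """Sketch of OmniJigsaw-style puzzle construction with clip-level
    modality masking. Each clip carries (video, audio) features; per clip
    one modality may be masked so the model cannot lean on a single
    shortcut modality to recover the chronological order.
    """
    rng = random.Random(seed)
    shuffled = list(range(num_clips))
    rng.shuffle(shuffled)

    clips = []
    for i in shuffled:
        keep_video, keep_audio = True, True
        if rng.random() < mask_prob:
            # mask exactly one modality, never both
            if rng.random() < 0.5:
                keep_video = False
            else:
                keep_audio = False
        clips.append({"true_index": i, "video": keep_video, "audio": keep_audio})
    return clips

def reorder_reward(predicted, clips):
    """Verifiable reward: fraction of positions restored to chronological order."""
    truth = sorted(c["true_index"] for c in clips)
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)
```

The reward is verifiable without any annotation, which is what makes the task compatible with RLVR-style post-training on unannotated omni-modal data.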
Photon-counting CT (PCCT) provides superior image quality with higher spatial resolution and lower noise compared to conventional energy-integrating CT (EICT), but its limited clinical availability restricts large-scale research and clinical deployment. To bridge this gap, we propose SUMI, a simulated degradation-to-enhancement method that learns to reverse realistic acquisition artifacts in low-quality EICT by leveraging high-quality PCCT as reference. Our central insight is to explicitly model realistic acquisition degradations, transforming PCCT into clinically plausible lower-quality counterparts and learning to invert this process. The simulated degradations were validated for clinical realism by board-certified radiologists, enabling faithful supervision without requiring paired acquisitions at scale. As outcomes of this technical contribution, we: (1) train a latent diffusion model on 1,046 PCCTs, using an autoencoder first pre-trained on both these PCCTs and 405,379 EICTs from 145 hospitals to extract general CT latent features that we release for reuse in other generative medical imaging tasks; (2) construct a large-scale dataset of over 17,316 publicly available EICTs enhanced to PCCT-like quality, with radiologist-validated voxel-wise annotations of airway trees, arteries, veins, lungs, and lobes; and (3) demonstrate substantial improvements: across external data, SUMI outperforms state-of-the-art image translation methods by 15% in SSIM and 20% in PSNR, improves radiologist-rated clinical utility in reader studies, and enhances downstream top-ranking lesion detection performance, increasing sensitivity by up to 15% and F1 score by up to 10%. Our results suggest that emerging imaging advances can be systematically distilled into routine EICT using limited high-quality scans as reference.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, University of California, San Francisco, Harvard Medical School, Emory University, Nvidia, Johns Hopkins Medicine
The paper introduces a transformative method for enhancing routine chest CT scans to PCCT-like quality through simulated degradation modeling. This innovative approach not only bridges a critical gap in medical imaging but also has the potential to significantly impact clinical practices and research in the field.
The paper presents a novel simulated degradation-to-enhancement method (SUMI) that effectively addresses the gap between photon-counting CT (PCCT) and energy-integrating CT (EICT) by modeling realistic acquisition degradations. This approach allows for the enhancement of low-quality EICT images to a quality comparable to high-quality PCCT images without the need for paired acquisitions, which is a significant advancement in medical imaging. The use of a latent diffusion model trained on a large dataset, combined with a continual-learning autoencoder, demonstrates a sophisticated methodology that is both innovative and practical for clinical applications.
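The degradation-to-enhancement recipe above amounts to: apply a plausible degradation operator to a high-quality PCCT image and train the enhancement model on the resulting (degraded, clean) pairs. A minimal sketch, assuming a blur-plus-noise operator stands in for the paper's radiologist-validated degradation model (the real pipeline is more elaborate; all parameters here are illustrative):

```python
import numpy as np

def simulate_degradation(pcct_slice, blur_sigma=1.2, noise_std=0.02, seed=0):
    """Sketch of a SUMI-style degradation operator: blur a high-quality
    PCCT slice to mimic lower EICT spatial resolution, then add Gaussian
    noise to mimic higher EICT noise. Returns a (degraded, clean) pair
    for supervised enhancement training.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(pcct_slice, float)

    # separable Gaussian blur via a truncated, normalized 1D kernel
    radius = max(1, int(3 * blur_sigma))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t ** 2 / (2 * blur_sigma ** 2))
    k /= k.sum()
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, x)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)

    degraded = blurred + rng.normal(0.0, noise_std, size=x.shape)
    return degraded, x
```

The key design point is that supervision comes from the clean image the degradation started from, so no paired PCCT/EICT acquisitions of the same patient are ever needed.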
The authors provide a comprehensive evaluation of their method across multiple datasets, demonstrating substantial improvements in image quality metrics (SSIM and PSNR) and clinical utility as rated by radiologists. The experiments include comparisons with state-of-the-art image translation methods and show clear gains in downstream lesion detection performance. The use of external datasets for validation adds robustness to the findings, although specific details on dataset splits and evaluation protocols could enhance transparency.
The paper mentions the release of datasets and models, which is a positive step towards reproducibility. However, the lack of specific URLs for the project and demo limits accessibility for other researchers. Clear documentation on the implementation details and training processes would further improve reproducibility.
The study is limited to chest CT imaging, which may restrict the generalizability of the findings to other anatomical regions. Additionally, the current implementation focuses on 2D slices rather than full 3D volumes, which presents a challenge for practical clinical applications. Future work is needed to address these limitations and extend the methodology's applicability.
This work has the potential to democratize access to high-quality imaging by enabling hospitals with EICT systems to achieve PCCT-like quality without the need for expensive hardware upgrades. The implications for improving diagnostic accuracy and enhancing patient care are significant, particularly in underserved communities where access to advanced imaging technologies is limited.
We introduce HY-Embodied-0.5, a family of foundation models specifically designed for real-world embodied agents. To bridge the gap between general Vision-Language Models (VLMs) and the demands of embodied agents, our models are developed to enhance the core capabilities required by embodied intelligence: spatial and temporal visual perception, alongside advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite comprises two primary variants: an efficient model with 2B activated parameters designed for edge deployment, and a powerful model with 32B activated parameters targeted for complex reasoning. To support the fine-grained visual perception essential for embodied tasks, we adopt a Mixture-of-Transformers (MoT) architecture to enable modality-specific computing. By incorporating latent tokens, this design effectively enhances the perceptual representation of the models. To improve reasoning capabilities, we introduce an iterative, self-evolving post-training paradigm. Furthermore, we employ on-policy distillation to transfer the advanced capabilities of the large model to the smaller variant, thereby maximizing the performance potential of the compact model. Extensive evaluations across 22 benchmarks, spanning visual perception, spatial reasoning, and embodied understanding, demonstrate the effectiveness of our approach. Our MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks, while the 32B variant achieves performance comparable to frontier models such as Gemini 3.0 Pro. In downstream robot control experiments, we leverage our robust VLM foundation to train an effective Vision-Language-Action (VLA) model, achieving compelling results in real-world physical evaluations. Code and models are open-sourced at https://github.com/Tencent-Hunyuan/HY-Embodied.
Primary: Tencent Robotics X
All Institutions: Tencent Robotics X, HY Vision Team
The main contribution of this paper is the introduction of HY-Embodied-0.5, a family of foundation models designed to enhance the capabilities of real-world embodied agents through innovative architectural choices and training paradigms. This work presents a comprehensive approach that bridges the gap between general VLMs and the specific demands of embodied intelligence, showcasing significant potential for future applications and research in the field.
The methodology presented in HY-Embodied-0.5 is innovative, particularly with the introduction of the Mixture-of-Transformers (MoT) architecture, which allows for modality-specific computing. This design choice is significant as it enhances the perceptual representation critical for embodied tasks. The iterative, self-evolving post-training paradigm is another noteworthy aspect, as it suggests a novel approach to improving reasoning capabilities. The on-policy distillation technique for transferring knowledge from a larger model to a smaller one is also a valuable contribution, as it addresses the practical need for efficient models in real-world applications. Overall, the methodology is well-structured and presents several novel components that could influence future research in embodied intelligence.
The paper provides extensive evaluations across 22 benchmarks, which is commendable. The results showing that the MoT-2B model outperforms similarly sized state-of-the-art models on 16 benchmarks and that the 32B variant achieves performance comparable to leading models like Gemini 3.0 Pro indicate a thorough experimental design. The inclusion of downstream robot control experiments further strengthens the evaluation, demonstrating real-world applicability. However, details on the specific datasets used and the statistical significance of the results would enhance the robustness of the evaluation.
The authors have made their code and models open-sourced, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics that would allow other researchers to replicate the experiments easily. Providing more comprehensive documentation and guidelines for reproducing the results would significantly improve the reproducibility aspect.
One limitation of the study is that while the models show strong performance on various benchmarks, the paper does not address potential scenarios where the models may underperform or fail. Additionally, the focus on specific benchmarks may not fully capture the generalizability of the models across diverse real-world tasks. The reliance on large models for performance improvements may also raise concerns regarding computational resources and accessibility for smaller research teams.
The development of foundation models for embodied agents has significant implications for robotics, autonomous systems, and AI applications in real-world environments. By enhancing spatial and temporal visual perception and reasoning capabilities, these models could lead to advancements in various fields, including healthcare, logistics, and personal assistance. The open-sourcing of the models also promotes collaboration and innovation within the research community, potentially accelerating progress in embodied intelligence.
World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules, or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
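The pixel-grounding step behind action images amounts to projecting the 3D end-effector trajectory into each camera view. The sketch below shows that projection for a single pinhole camera; the function and parameter names are illustrative (not from the paper's code), and the gripper-state and rotation channels of the full 7-DoF action are omitted for brevity.

```python
import numpy as np

def actions_to_pixel_tracks(ee_positions, K, world_to_cam):
    """Project a 3D end-effector trajectory into 2D pixel coordinates for
    one camera view -- the kind of pixel-grounded track that an action
    image rasterizes into a multi-view action video.

    ee_positions: (T, 3) world-frame positions.
    K: (3, 3) camera intrinsics.
    world_to_cam: (4, 4) extrinsic matrix.
    """
    T = ee_positions.shape[0]
    homo = np.hstack([ee_positions, np.ones((T, 1))])   # (T, 4) homogeneous
    cam = (world_to_cam @ homo.T).T[:, :3]              # (T, 3) camera frame
    pix = (K @ cam.T).T                                 # (T, 3) unnormalized
    return pix[:, :2] / pix[:, 2:3]                     # (T, 2) pixel coords
```

Repeating this per calibrated view yields the multi-view tracks; because the representation lives in pixels, the same video backbone that generates frames can also generate (or read off) the action.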
Primary: UMass Amherst
All Institutions: UMass Amherst, UTokyo, NVIDIA, Harvard University, Genesis AI
The main contribution of this paper is the introduction of Action Images, a novel approach to policy learning that utilizes pixel-grounded action representations through multiview video generation, significantly advancing the state of the art in robot control and action representation. The methodology is innovative, the experimental results are promising, and the potential impact on the field is substantial, warranting high scores in all evaluated categories.
The proposed methodology of translating 7-DoF robot actions into pixel-grounded action images is innovative, as it integrates policy learning with multiview video generation in a unified framework. This approach allows for the direct use of video models as zero-shot policies, which is a significant departure from traditional methods that rely on separate action modules. The formulation of actions as interpretable action images enhances the understanding of robot motion and control, making it a noteworthy advancement in the field.
The experiments conducted on RLBench and real-world evaluations demonstrate the effectiveness of the proposed model. The reported results indicate that the model achieves the strongest zero-shot success rates and improves the quality of video-action joint generation compared to existing models. However, the paper could benefit from more comprehensive comparisons with a wider range of baselines to further validate the claims.
The paper provides a clear description of the methodology and experiments, which aids in reproducibility. However, details regarding the implementation specifics, hyperparameter settings, and training procedures are somewhat limited. Including these details would enhance the ability of other researchers to replicate the results.
One limitation is the reliance on specific environments (RLBench and real-world scenarios) for evaluation, which may not generalize to all robotic tasks. Additionally, while the concept of action images is promising, the paper does not extensively discuss potential challenges in scaling this approach to more complex environments or tasks.
The implications of this work are significant, as it opens new avenues for robot policy learning and action representation. The ability to leverage video models for zero-shot policy learning could lead to more adaptable and intelligent robotic systems. Furthermore, the unified approach to action representation may inspire future research in related fields, such as video understanding and robotics.
Realistic digital avatars require expressive and dynamic hair motion; however, most existing head avatar methods assume rigid hair movement. These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. At the core is a 3D Gaussian-based layered representation of the head. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods. However, these animation capabilities also require new training schemes. In particular, we propose the use of VLM-based models to generate the appearance of regions that are occluded in the dynamic training sequences. In quantitative and qualitative studies, we demonstrate the capabilities of the proposed model and compare it with existing baselines. We show that our method can synthesize physically plausible hair motion in addition to supporting expression and camera control.
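The coupling of Gaussian primitives to simulated hair strands can be sketched as follows: one anisotropic Gaussian per strand segment, with its mean at the segment midpoint and its principal axis along the segment direction, so the primitives deform with the physics-driven strand. This is a minimal illustration under assumed names (`gaussians_on_strand`, `sigma_radial`), not PhysHead's actual parameterization.

```python
import numpy as np

def gaussians_on_strand(strand_pts, sigma_radial=0.001):
    """Place one anisotropic Gaussian per hair-strand segment.

    strand_pts: (S+1, 3) ordered points along one simulated strand.
    Returns per-segment means, principal axes, and (length, radial, radial)
    scales, so the Gaussians follow the strand as it moves.
    """
    a, b = strand_pts[:-1], strand_pts[1:]          # segment endpoints
    means = 0.5 * (a + b)                           # (S, 3) midpoints
    seg = b - a
    lengths = np.linalg.norm(seg, axis=1, keepdims=True)
    dirs = seg / np.clip(lengths, 1e-9, None)       # unit principal axes
    # elongated along the hair, thin in the two radial directions
    scales = np.hstack([0.5 * lengths,
                        np.full((len(seg), 2), sigma_radial)])
    return means, dirs, scales
```

Because the Gaussian parameters are simple functions of the strand geometry, re-evaluating them after each physics step keeps appearance and simulation in sync.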
Primary: Max Planck Institute for Intelligent Systems
All Institutions: Max Planck Institute for Intelligent Systems, Max Planck Institute for Informatics, Technical University of Darmstadt, Tübingen AI Center, University of Tübingen
The paper presents PhysHead, a novel method for creating photorealistic head avatars with dynamic hair, significantly advancing the field of avatar representation in computer graphics. The combination of innovative methodologies and strong technical contributions positions this work as a potential cornerstone for future developments in realistic avatar animation and simulation.
The methodology proposed in this paper introduces a hybrid representation for animatable head avatars that combines a 3D parametric mesh for the head with a strand-based hair model. This approach is innovative as it allows for realistic hair dynamics, which is a significant advancement over traditional methods that treat hair as a rigid structure. The use of Gaussian primitives for appearance modeling and the integration of physics engines for hair simulation are noteworthy contributions. The introduction of VLM-based models to address occluded regions during training is also a clever solution that enhances the model's robustness.
The paper presents both quantitative and qualitative evaluations, comparing their method with existing baselines. The results demonstrate significant improvements in synthesizing realistic hair motion and facial expressions. However, the paper could benefit from more extensive benchmarking against a wider range of existing methods to further validate its claims.
The paper does not provide extensive implementation details, which may hinder reproducibility. While the project page offers a demo, the lack of a public code repository limits the ability for other researchers to reproduce the results fully.
One limitation is the reliance on multi-view video data for training, which may not be readily available for all applications. Additionally, the complexity of the model may pose challenges for real-time applications, and the paper does not address potential computational overhead.
The ability to create photorealistic avatars with dynamic hair opens up significant possibilities in various fields, including gaming, virtual reality, and film. This technology could enhance user experience and engagement in digital environments, making it a valuable contribution to the field of computer graphics and machine learning.
Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
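The confidence-thresholding knob that lets MARS trade speed for quality at serving time can be illustrated with a minimal acceptance rule: of the tokens proposed in one forward pass, accept the longest prefix whose per-token confidence stays above a threshold. The names here are hypothetical and the paper's actual acceptance criterion may differ.

```python
def accept_tokens(token_confidences, threshold):
    """Accept the longest prefix of a multi-token draft whose per-token
    confidence stays above `threshold`; always emit at least one token,
    which recovers ordinary one-token-per-pass AR decoding as the
    threshold approaches 1.

    token_confidences: confidences for the tokens proposed in one pass.
    """
    accepted = 0
    for conf in token_confidences:
        if conf < threshold:
            break
        accepted += 1
    return max(accepted, 1)
```

Raising the threshold under low load keeps quality at the AR baseline; lowering it under high request load accepts more tokens per pass and raises throughput, with no model swap or restart.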
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Uppsala University, Singapore Management University
The paper presents MARS, a lightweight fine-tuning method that enhances autoregressive models' capabilities for multi-token generation while maintaining their original performance. This contribution is significant as it addresses a critical limitation in autoregressive models, providing a practical solution that can be widely adopted in the field.
The methodology introduces MARS, a novel fine-tuning approach that enables autoregressive models to generate multiple tokens per forward pass without architectural changes or additional parameters. The authors effectively analyze the shortcomings of existing multi-token generation methods and propose a solution that preserves the autoregressive nature of the model while enhancing throughput. The use of a structured attention mask and a dual-stream training approach is innovative and addresses key challenges in multi-token generation.
The experiments are thorough, utilizing multiple benchmarks to validate the claims made about MARS. The paper demonstrates that MARS matches or exceeds the performance of standard autoregressive models while significantly improving throughput. The results are well-presented, providing a clear comparison against existing methods and showing the benefits of the proposed approach across different model sizes.
The paper provides sufficient implementation details and includes a GitHub repository for code access, which enhances reproducibility. However, the training process is computationally intensive, which may limit accessibility for some researchers.
The paper acknowledges that the training process doubles the effective sequence length, leading to increased training costs. Additionally, the quality degradation at aggressive thresholds indicates that further refinement of the acceptance strategy is needed. The block-level KV caching strategy may also impose limitations on throughput at larger batch sizes.
The proposed method has the potential to significantly improve the efficiency of autoregressive language models in real-time applications, making it highly relevant for deployment in production settings. The ability to control the speed-quality tradeoff dynamically could lead to broader adoption in various NLP tasks, enhancing user experience in applications requiring rapid responses.
SANDO is a safe trajectory planner for 3D dynamic unknown environments, where obstacle locations and motions are unknown a priori and a collision-free plan can become unsafe at any moment, requiring fast replanning. Existing soft-constraint planners are fast but cannot guarantee collision-free paths, while hard-constraint methods ensure safety at the cost of longer computation. SANDO addresses this trade-off through three contributions. First, a heat map-based A* global planner steers paths away from high-risk regions using soft costs, and a spatiotemporal safe flight corridor (STSFC) generator produces time-layered polytopes that inflate obstacles only by their worst-case reachable set at each time layer, rather than by the worst case over the entire horizon. Second, trajectory optimization is formulated as a Mixed-Integer Quadratic Program (MIQP) with hard collision-avoidance constraints, and a variable elimination technique reduces the number of decision variables, enabling fast computation. Third, a formal safety analysis establishes collision-free guarantees under explicit velocity-bound and estimation-error assumptions. Ablation studies show that variable elimination yields up to 7.4x speedup in optimization time, and that STSFCs are critical for feasibility in dense dynamic environments. Benchmark simulations against state-of-the-art methods across standardized static benchmarks, obstacle-rich static forests, and dynamic environments show that SANDO consistently achieves the highest success rate with no constraint violations across all difficulty levels; perception-only experiments without ground truth obstacle information confirm robust performance under realistic sensing. Hardware experiments on a UAV with fully onboard planning, perception, and localization demonstrate six safe flights in static environments and ten safe flights among dynamic obstacles.
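The heat-map-based global planner's idea of soft risk costs can be shown with a 2D toy: A* where every cell adds a penalty proportional to its heat-map risk, so high-risk regions are avoided but never outright forbidden. This is an illustrative sketch only; SANDO's planner operates in 3D and its cost model and interfaces differ.

```python
import heapq
import itertools

def heatmap_astar(risk, start, goal, risk_weight=5.0):
    """A* on a 2D grid whose cells carry soft risk costs from an obstacle
    heat map: each step costs 1 plus risk_weight times the risk of the
    cell entered, steering paths away from high-risk regions."""
    rows, cols = len(risk), len(risk[0])

    def h(c):  # Manhattan heuristic; admissible since every step costs >= 1
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    tie = itertools.count()  # tiebreaker so the heap never compares cells
    frontier = [(h(start), next(tie), 0.0, start, None)]
    parent, best_g = {}, {start: 0.0}
    while frontier:
        _, _, g, cur, prev = heapq.heappop(frontier)
        if cur in parent:
            continue                         # already expanded
        parent[cur] = prev
        if cur == goal:                      # reconstruct the path
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols:
                ng = g + 1.0 + risk_weight * risk[nxt[0]][nxt[1]]
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(
                        frontier, (ng + h(nxt), next(tie), ng, nxt, cur))
    return None
```

The soft penalty is what keeps the global stage fast; the hard collision-avoidance guarantees come later, from the MIQP over the time-layered corridors.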
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology, Comillas ICAI
SANDO introduces a novel safe trajectory planning method for dynamic unknown environments, significantly advancing the field of robotics. The combination of innovative methodologies, rigorous experimental validation, and practical applicability positions this work as a strong candidate for high impact within the machine learning community.
The methodology proposed in SANDO is innovative, combining a heat map-based A* global planner with a spatiotemporal safe flight corridor (STSFC) generator. The use of Mixed-Integer Quadratic Programming (MIQP) for trajectory optimization with variable elimination techniques is a significant contribution, allowing for faster computations while maintaining safety guarantees. The formal safety analysis under specific assumptions adds rigor to the approach, addressing a critical need in dynamic environments.
The experimental evaluation is robust, featuring comprehensive benchmark simulations against state-of-the-art methods across various environments, including static and dynamic scenarios. The results demonstrate a high success rate and no constraint violations, indicating the effectiveness of the proposed methods. The hardware experiments with a UAV further validate the approach in real-world conditions, showcasing practical applicability.
The paper provides sufficient details regarding the implementation and methodology, including links to supplementary materials and code repositories. This enhances the reproducibility of the results, allowing other researchers to build upon the work.
While the paper presents significant advancements, it does not extensively address the scalability of the approach in highly complex environments or the potential computational overhead in extremely dynamic scenarios. Additionally, the reliance on specific assumptions for safety guarantees may limit applicability in more unpredictable settings.
The implications of SANDO are substantial, particularly for autonomous systems operating in dynamic environments, such as drones and robotic vehicles. The ability to guarantee safety while maintaining efficiency could lead to broader adoption of autonomous technologies in various industries, enhancing operational safety and effectiveness.