Week of June 14 – June 21, 2026
Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
Primary: Max Planck Institute for Human Development
All Institutions: Max Planck Institute for Human Development, University of Konstanz, Barcelona Supercomputing Center, University of Basel
This paper provides a critical psychometric analysis demonstrating that apparent psychological profiles in LLMs are largely artifacts of response bias rather than genuine traits, fundamentally challenging current evaluation practices and calling for more rigorous, orthogonal measurement frameworks in AI research.
The paper employs a rigorous psychometric framework to deconstruct the measurement of LLM "personalities." By modeling responses as a function of latent trait and response bias, and utilizing the concept of "response orthogonality" (the proportion of reverse-keyed items), the authors provide a mathematically sound method to separate genuine trait variance from systematic response artifacts. This approach is theoretically robust and directly addresses a fundamental confound in current LLM evaluation methodologies.
The experimental design is comprehensive, testing 56 instruction-tuned LLMs against a battery of 29 instruments (personality and risk preference) and comparing them against large human reference samples. The results are striking and consistent: LLMs show positive forward-reverse correlations (indicating bias dominance) whereas humans show negative correlations (indicating trait dominance). The variance decomposition showing 81-90% of LLM variation is bias-driven is a powerful empirical finding. The robustness checks across prompting conditions, model sizes, and elicitation methods strengthen the validity of the conclusions.
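The sign flip in forward-reverse correlations can be reproduced with a toy simulation. The sketch below is not the authors' measurement model; it assumes a simple additive form in which forward-keyed items load on `trait + bias` and reverse-keyed items on `-trait + bias`, and the variance shares (0.85 and 0.12) are hypothetical values picked to fall inside the ranges quoted above.

```python
import random

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def forward_reverse_corr(bias_share, n_resp=500, n_items=10, noise_sd=0.5, seed=0):
    """Correlate raw (unrecoded) forward and reverse subscale means across respondents.

    Assumed toy model (not taken from the paper):
      forward item response =  trait + bias + noise
      reverse item response = -trait + bias + noise
    bias_share is the hypothetical fraction of between-respondent variance due to bias.
    """
    rng = random.Random(seed)
    trait_sd = (1.0 - bias_share) ** 0.5
    bias_sd = bias_share ** 0.5
    fwd, rev = [], []
    for _ in range(n_resp):
        trait = rng.gauss(0.0, trait_sd)
        bias = rng.gauss(0.0, bias_sd)
        fwd.append(sum(trait + bias + rng.gauss(0.0, noise_sd) for _ in range(n_items)) / n_items)
        rev.append(sum(-trait + bias + rng.gauss(0.0, noise_sd) for _ in range(n_items)) / n_items)
    return pearson(fwd, rev)

r_llm_like = forward_reverse_corr(bias_share=0.85)    # bias-dominated respondents
r_human_like = forward_reverse_corr(bias_share=0.12)  # trait-dominated respondents
print(f"bias-dominated:  r = {r_llm_like:+.2f}")
print(f"trait-dominated: r = {r_human_like:+.2f}")
```

Under this model the covariance between the two subscale means is the bias variance minus the trait variance, so its sign directly indicates which component dominates.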
The paper provides detailed methodological descriptions, including the specific instruments used, the prompting strategies, and the mathematical derivations for the measurement model. The code repository is explicitly linked, ensuring high reproducibility. The use of standard psychometric instruments and clear definitions of variables enhances the clarity of the experimental setup.
The study focuses primarily on post-trained models and treats each model as a single respondent, which may not capture within-model variability or the nuances of different prompting strategies (e.g., persona adoption). The proprietary model sample size is small (N=10), limiting statistical power for those specific comparisons. Additionally, the study is limited to personality and risk domains; while the authors argue for broader applicability, empirical validation in other domains (e.g., moral reasoning, cognitive biases) is left for future work.
This paper has significant implications for the field of AI safety, alignment, and the use of LLMs as proxies for human participants in research. It challenges the validity of current LLM profiling practices and calls for a re-evaluation of how we measure and interpret LLM behaviors. The concept of response orthogonality offers a new standard for designing valid evaluation instruments for AI systems.
Autoformalization, translating natural-language mathematics into formal proof assistants, is bottlenecked not by translation fluency but by *faithfulness*: a formal statement can typecheck and be provable, yet still encode a different theorem than the source intended. We introduce *Bidirectional Provability Fingerprinting* (BPF), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement. We further introduce four novel components: (i) *Counterfactual Probe Generation* (CPG), a contrastive procedure that synthesizes probes targeting specific drift directions; (ii) the *Equivalence Spectrum*, a continuous faithfulness score that replaces brittle binary verdicts; (iii) *Adaptive Probe Budget Allocation* (APBA), an information-theoretic budget router; and (iv) *Faithfulness-Guided Decoding* (FGD), which uses BPF signals as a reward during autoformalization. We prove a *drift detection theorem* and a *PAC-faithfulness* result establishing that the equivalence class of a natural language statement is learnable from $\mathcal{O}(\log(1/\delta)/\varepsilon)$ probes under mild assumptions. We release driftbench, a benchmark of $2{,}183$ NL/Lean 4 pairs with controlled drift labels across six subfields of mathlib4. BPF + CPG detects $89.6\%$ of drifted formalizations at a $3.0\%$ false-positive rate, against $41.2\%$ for typecheck and $63.3\%$ for LLM-judge baselines, and FGD reduces the rate at which a state-of-the-art autoformalizer emits drifted statements by $47\%$. https://pmlrbd.github.io/BPF/
Primary: Istanbul Technical University
All Institutions: Istanbul Technical University, Jashore University of Science and Technology
This paper introduces a significant advancement in the field of autoformalization by proposing a framework that certifies semantic equivalence between natural-language and formal mathematical statements. The comprehensive methodology and rigorous experimental evaluation demonstrate its potential to improve the reliability of AI-assisted formal mathematics, making it a highly impactful contribution to machine learning research.
The paper presents a novel framework, Bidirectional Provability Fingerprinting (BPF), which addresses the critical issue of faithfulness in autoformalization. The methodology is robust, introducing several innovative components such as Counterfactual Probe Generation and Adaptive Probe Budget Allocation. The framework's reliance on consequence neighborhoods to certify semantic equivalence is a significant conceptual advancement, moving beyond traditional typechecking and provability metrics. The introduction of continuous faithfulness scores through the Equivalence Spectrum adds a layer of sophistication to the evaluation of formalizations.
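To make the notion of drift concrete, here is a small Lean 4 illustration. It is a hand-made example and not the BPF procedure: the statement, the names `Faithful` and `Drifted`, and the choice of probe are all hypothetical. Both candidates typecheck and are provable, yet only one matches the informal source, and a single instantiation acts as a simple forward-consequence probe that separates them.

```lean
-- Informal source: "for every positive natural number n, (n - 1) + 1 = n".

-- Faithful candidate.
def Faithful : Prop := ∀ n : Nat, 0 < n → n - 1 + 1 = n

-- Drifted candidate: the hypothesis was silently strengthened to 5 < n.
def Drifted : Prop := ∀ n : Nat, 5 < n → n - 1 + 1 = n

-- Both candidates typecheck and are provable, so neither check detects the drift.
theorem faithful_holds : Faithful := fun n h => by omega
theorem drifted_holds : Drifted := fun n h => by omega

-- Probe derived from the informal statement: the instance n = 3.
-- It follows from Faithful by direct instantiation ...
example (h : Faithful) : 3 - 1 + 1 = 3 := h 3 (by decide)

-- ... whereas Drifted cannot be instantiated at 3, because its hypothesis
-- 5 < 3 is false. The faithful statement does imply the drifted one:
example (h : Faithful) : Drifted := fun n hn => h n (by omega)
```

The probe is trivially true on its own, so a real certifier would need a bounded or structural notion of consequence rather than plain provability; the example only shows why typechecking and provability alone cannot separate the two candidates.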
The experiments are thorough, utilizing a well-constructed benchmark dataset (driftbench) with 2,183 natural language/Lean 4 pairs. The results demonstrate a substantial improvement in detecting drifted formalizations compared to existing methods, with empirical evidence supporting the effectiveness of the proposed techniques. The paper also includes rigorous evaluations against multiple baselines, showcasing the practical benefits of the BPF framework.
The paper provides sufficient details on the methodology and experimental setup, including the algorithms used and the nature of the datasets. However, the lack of a publicly available code repository or demo limits the ease of reproducibility. The authors do release a benchmark, which is a positive aspect for future research.
The paper acknowledges limitations such as the inability to detect convention drift and the reliance on an entailment oracle, which may not always be complete. Additionally, the potential for residual drift in certified statements is highlighted, emphasizing the need for expert review in practical applications.
The framework has the potential to significantly enhance the reliability of AI-assisted formal mathematics, addressing a critical bottleneck in the field. However, the authors caution against uncritical use of the certifier, as it could propagate subtle errors into formal libraries if not properly validated by experts.
Aligning neural activity across subjects offers the promise of discovering shared computational principles and generalizable decoders. However, traditional alignment methods require shared stimuli across subjects, a constraint that limits applicability to naturalistic paradigms with limited or non-overlapping data. We introduce a Multi-Encoder-Decoder Variational Autoencoder (MED-VAE) that achieves cross-subject alignment without shared stimuli by anchoring representations to a common scaffold provided by a pretrained ANN. Using the Natural Scenes Dataset, we show that MED-VAE creates common latent spaces with superior semantic organisation, achieving higher cross-subject alignment than common methods while maintaining robust generalisation to held-out stimuli where traditional methods degrade. Reconstructing from these common spaces back to each subject's original neural space, MED-VAE preserves equal stimulus-driven signal in its cross-subject latent space. Finally, we show that this superior alignment directly enables cross-subject neural prediction, as demonstrated via cross-subject image decoding. In summary, we introduce a framework to identify generalisable common subspaces for cross-subject predictions and downstream tasks, demonstrated here for visual cortex responses to static images.
Primary: University of Oxford
All Institutions: Big Data Institute, Centre for Neural Circuits and Behaviour, Department of Medicine, Department of Physiology Anatomy and Genetics, University of Oxford
The main contribution of this paper is the introduction of MED-VAE, a novel architecture that enables cross-subject neural alignment without shared stimuli, demonstrating superior semantic organization and generalization capabilities. This work significantly advances the field of systems neuroscience by providing a robust method for aligning neural data across individuals, paving the way for more comprehensive analyses of brain function and individual differences.
The proposed Multi-Encoder-Decoder Variational Autoencoder (MED-VAE) introduces a novel architecture that bypasses the need for shared stimuli in cross-subject neural alignment by leveraging a pretrained artificial neural network (ANN) as a scaffold. This method allows for implicit alignment pressure, creating a common latent space that captures shared computational structures across subjects. The architecture's design is well-justified, with a clear explanation of how the various components interact to achieve alignment, and the use of a variational autoencoder framework is appropriate for the task. The integration of multiple encoders and decoders tailored to individual subjects while maintaining a shared latent space is a significant methodological advancement.
The experiments conducted using the Natural Scenes Dataset are robust, with a clear focus on evaluating the performance of MED-VAE against traditional methods such as Shared Response Models and Procrustes analysis. The metrics employed, including component-wise correlation, Representational Similarity Analysis (RSA), and cross-trial evaluations, provide a comprehensive assessment of alignment quality and generalization capabilities. The results demonstrate that MED-VAE achieves superior performance in preserving semantic organization and cross-subject alignment, with rigorous statistical analysis supporting the findings.
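As a reminder of what an RSA-style metric measures, the sketch below compares two sets of latent vectors through their pairwise-distance structure. It is a generic illustration rather than the paper's evaluation code: the data are made up, Euclidean distance and Pearson correlation are simplifying assumptions (rank correlations are also common), and the function names are hypothetical.

```python
import math

def rdm(X):
    """Upper triangle of the pairwise Euclidean distance matrix over the rows of X."""
    n = len(X)
    return [math.dist(X[i], X[j]) for i in range(n) for j in range(i + 1, n)]

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def rsa_score(X, Y):
    """Similarity of two representations of the same stimuli, row i = stimulus i."""
    return pearson(rdm(X), rdm(Y))

def rotate(X, theta):
    """Rotate 2-D points by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * x - s * y, s * x + c * y] for x, y in X]

# Hypothetical 2-D latents for five stimuli in "subject A".
latents_a = [[0.0, 0.0], [1.0, 0.2], [0.3, 1.5], [2.0, 2.0], [-1.0, 0.7]]
latents_b = rotate(latents_a, math.pi / 6)           # same geometry, rotated axes
latents_c = [latents_a[i] for i in (2, 0, 4, 1, 3)]  # stimulus correspondence scrambled

score_rotated = rsa_score(latents_a, latents_b)
score_scrambled = rsa_score(latents_a, latents_c)
print(f"rotated copy:    {score_rotated:.3f}")   # close to 1
print(f"scrambled order: {score_scrambled:.3f}")
```

Because distances are unchanged by rotation, the score ignores subject-specific axis orientation and responds only to whether the same stimuli sit in the same relative positions.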
The paper includes a URL to the code repository, which enhances the reproducibility of the results. The methodology is described in sufficient detail, allowing other researchers to replicate the experiments. However, the reliance on specific datasets and pretrained models may limit broader applicability without further validation on diverse datasets.
The study is limited by its focus on a single dataset, which may restrict the generalizability of the findings. Additionally, while the architecture is designed to accommodate heterogeneous inputs, the actual implementation has only been tested on subjects from the same dataset. Future work should explore cross-study alignment with varying acquisition parameters. The quality of alignment is also dependent on the pretrained ANN's representational capabilities, which may not be universally applicable across different domains.
The implications of this work are significant for population-level neuroscience, enabling the analysis of neural data across subjects without the need for shared stimuli. This could facilitate new insights into universal coding mechanisms and individual differences in neural processing. The framework has potential applications in brain-computer interfaces and clinical settings, where aligning neural responses from different individuals can enhance predictive modeling and data augmentation strategies.
Graph convolutional networks (GCNs) have demonstrated significant success in capturing complex user-item relationships for collaborative filtering (CF). However, due to their reliance on extensive model training, training-free graph filtering (GF)-based CF methods have emerged as a promising alternative, offering computational efficiency by smoothing graph signals via matrix operations. In particular, polynomial GF-based approaches demonstrate improved accuracy through their ability to design more expressive and flexible filtering functions. Despite these advantages, existing GF methods suffer from a critical memory bottleneck: they necessitate storing the full item similarity graph, incurring prohibitive memory costs for large-scale datasets, which limits their practical applicability. To tackle this challenge, we propose Mem-GF (Memory-efficient GF), a new GF-based CF method that departs from conventional designs by principally leveraging the structure of Krylov subspaces as a core mechanism for approximating polynomial graph filters without explicitly storing the item similarity graph. We theoretically analyze the minimum Krylov subspace size that guarantees lossless approximation. Through extensive experiments, we demonstrate that Mem-GF achieves up to 5.74$\times$ lower memory usage and 4.38$\times$ speedup in runtime, while consistently exceeding the recommendation accuracy of state-of-the-art GF and GCN-based methods. Mem-GF robustly scales to datasets with tens of millions of interactions, establishing itself as a practically viable and theoretically grounded solution for efficient CF.
Primary: Yonsei University
All Institutions: Yonsei University
Mem-GF proposes a novel, memory-efficient, and training-free graph filtering method for collaborative filtering that leverages Krylov subspaces to approximate high-order polynomial graph filters without explicitly storing the full item similarity graph. This paper makes a substantial technical contribution by elegantly solving a critical memory bottleneck in graph filtering-based collaborative filtering, enabling scalable and high-accuracy recommendations on large datasets. The method's theoretical grounding, combined with comprehensive experimental validation demonstrating significant memory savings, speedups, and state-of-the-art accuracy, establishes Mem-GF as a practically viable and theoretically sound solution that will likely influence future research and deployment of graph-based recommender systems.
The paper effectively identifies a critical memory bottleneck in existing Graph Filtering (GF)-based Collaborative Filtering (CF) methods, which stems from the need to explicitly store a full item similarity graph $P$ of size $|I| \times |I|$. The proposed Mem-GF method offers an elegant and principled solution by leveraging Krylov subspaces. Instead of forming and storing $P$, Mem-GF approximates polynomial graph filters $f(P)r_u$ by projecting $P$ onto a user-specific Krylov subspace $\mathcal{K}_K(P, r_u)$, generated by the user's interaction vector $r_u$. This projection is efficiently computed using the Lanczos algorithm, which yields an orthonormal basis $Q_u$ and a much smaller tridiagonal matrix $T_u$. The filtering operation is then performed in this reduced space as $\|r_u\|_2 Q_u f(T_u) e_1$. A key methodological strength is that the matrix-vector product $Pq_{u,j}$ required by Lanczos is computed as $R^T(Rq_{u,j})$, completely bypassing the explicit construction of $P$. The theoretical analysis provides a clear and strong guarantee: for a polynomial filter of degree $N$, setting the Krylov subspace size $K > N$ ensures lossless approximation under exact arithmetic. This theoretical foundation is crucial for understanding and applying the method. Furthermore, the ability to operate within a low-dimensional subspace grants Mem-GF the flexibility to design and utilize high-order polynomial filters (e.g., approximating a Gaussian filter), which are typically infeasible for conventional methods due to memory constraints, thereby enhancing filter expressiveness and accuracy. The "training-free" nature aligns with the paper's goal of computational efficiency.
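The Krylov-subspace identity described above can be checked on a toy problem. The following is a minimal pure-Python sketch, not the Mem-GF implementation: it assumes an unnormalized similarity graph $P = R^T R$, a made-up interaction matrix and polynomial coefficients, and hypothetical function names. It compares $\|r_u\|_2 Q_u f(T_u) e_1$ against a direct evaluation of $f(P) r_u$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec_P(R, q):
    """P q = R^T (R q), computed without ever forming P = R^T R."""
    Rq = [dot(row, q) for row in R]                       # one entry per user
    return [sum(R[u][i] * Rq[u] for u in range(len(R)))   # one entry per item
            for i in range(len(q))]

def lanczos(R, r, K):
    """Run up to K Lanczos steps on P started from r; returns basis and tridiagonal entries."""
    norm_r = math.sqrt(dot(r, r))
    Q = [[x / norm_r for x in r]]
    alphas, betas = [], []
    for k in range(K):
        w = matvec_P(R, Q[k])
        alphas.append(dot(w, Q[k]))
        for q in Q:  # full reorthogonalization, for numerical safety
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        beta = math.sqrt(dot(w, w))
        if k == K - 1 or beta < 1e-12:  # finished, or an invariant subspace was found
            break
        betas.append(beta)
        Q.append([wi / beta for wi in w])
    return Q, alphas, betas, norm_r

def tridiag_matvec(alphas, betas, v):
    """Multiply the symmetric tridiagonal matrix T by v."""
    out = [a * x for a, x in zip(alphas, v)]
    for i, b in enumerate(betas):
        out[i] += b * v[i + 1]
        out[i + 1] += b * v[i]
    return out

def krylov_filter(R, r, coeffs, K):
    """Approximate f(P) r as ||r|| Q f(T) e_1, where f(x) = sum_n coeffs[n] * x**n."""
    Q, alphas, betas, norm_r = lanczos(R, r, K)
    m = len(alphas)
    g = [coeffs[-1]] + [0.0] * (m - 1)
    for c in reversed(coeffs[:-1]):  # Horner's rule in the small m-dimensional space
        g = tridiag_matvec(alphas, betas, g)
        g[0] += c
    return [norm_r * sum(Q[k][i] * g[k] for k in range(m)) for i in range(len(r))]

def direct_filter(R, r, coeffs):
    """Reference value of f(P) r via Horner's rule with full-size matrix-vector products."""
    y = [coeffs[-1] * x for x in r]
    for c in reversed(coeffs[:-1]):
        y = [yi + c * ri for yi, ri in zip(matvec_P(R, y), r)]
    return y

# Hypothetical 4-user x 5-item interaction matrix and a degree-3 filter (N = 3).
R = [[1.0, 0.0, 1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0, 1.0, 1.0]]
r_u = R[0]
coeffs = [0.0, 1.0, -0.2, 0.01]
reference = direct_filter(R, r_u, coeffs)

def max_err(K):
    return max(abs(a - b) for a, b in zip(krylov_filter(R, r_u, coeffs, K), reference))

err_exact = max_err(K=4)  # K > N: agrees with the reference up to rounding
err_small = max_err(K=2)  # K <= N: generally only an approximation
print(err_exact, err_small)
```

The only full-size objects held in memory are `R`, the basis vectors in `Q`, and the working vector; the polynomial itself is evaluated on the small tridiagonal matrix.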
The experimental evaluation is exceptionally comprehensive and provides strong empirical evidence for all claims. Experiments are conducted on three widely used CF benchmark datasets: Yelp, Amazon-book, and the large-scale MovieLens-20M, covering diverse scales and characteristics. A broad range of 21 baselines is included, encompassing various CF categories (MF, Autoencoder, GCN, Generative, LinkProp), with a particular focus on other GF-based methods. Key metrics such as memory usage (VRAM, RAM), runtime (preprocessing and inference), and recommendation accuracy (Recall@K, NDCG@K) are rigorously evaluated. The results are highly impactful: Mem-GF achieves up to 5.74x lower memory usage and 4.38x speedup during preprocessing, and a remarkable 26.2x speedup during inference. Crucially, these significant efficiency gains are accompanied by state-of-the-art recommendation accuracy, consistently outperforming both GF and GCN-based methods across most datasets and metrics. The scalability analysis on synthetic datasets further validates the method's linear complexity with respect to the number of users, items, and interactions, confirming its practical applicability for real-world, large-scale deployments. The empirical validation of the theoretical condition ($K > N$), along with analyses of different polynomial filters and hyperparameter sensitivity, adds to the robustness and thoroughness of the evaluation.
The paper demonstrates a strong commitment to reproducibility. A GitHub link to the source code (`https://github.com/jindeok/Mem-GF`) is provided, which is a critical component for enabling replication. Detailed hyperparameters for Mem-GF are explicitly stated for each dataset. Furthermore, the paper outlines the data splitting, evaluation protocols, hardware specifications (CPU, GPU, RAM), and software environment (PyTorch), along with the method for generating synthetic datasets. These comprehensive details provide sufficient information for researchers to reproduce the reported results.
While Mem-GF's "training-free" nature offers efficiency, it inherently implies less flexibility compared to learnable GCNs that can adapt their filters through end-to-end optimization. The polynomial coefficients are found by approximating a target frequency response, which is still a predefined approach rather than a fully learned one. The theoretical guarantee of lossless approximation holds under *exact arithmetic* and when the polynomial degree $N$ is less than the Krylov subspace size $K$. While the paper mentions that finite-precision arithmetic or $N \ge K$ might lead to instability, a deeper exploration of these practical implications beyond empirical observation would be beneficial. The method still requires tuning of hyperparameters such as $s$ (Hadamard power) and $\delta$ (damping factor for the Gaussian filter). Although Mem-GF enables user-specific filtering in the Krylov subspace, the underlying polynomial filter itself is still globally defined, rather than being truly personalized to each user's unique spectral characteristics.
Mem-GF has a significant broader impact on the field of recommender systems and potentially other areas of graph machine learning. By effectively addressing the memory bottleneck, it makes high-accuracy, polynomial graph filtering techniques practically viable for large-scale collaborative filtering, a critical requirement for modern online platforms. This enables faster preprocessing, low-latency real-time inference, and superior recommendation quality on standard hardware, democratizing access to advanced graph-based CF. The principled use of Krylov subspaces as a core filtering mechanism, rather than merely a computational shortcut, could inspire similar memory-efficient approaches in other graph signal processing or graph machine learning contexts where large, implicitly defined matrices are a challenge. The strong theoretical grounding further enhances the trustworthiness and potential for generalization of this methodology.
We study first-order methods for solving monotone variational inequalities arising in min-max optimization. Classical approaches such as the extragradient method rely on two gradient queries per iteration, which limits their analysis and applicability in the online and stochastic settings. We propose a family of Generalized Optimistic Methods with Anchoring (GOMA), which combine two-time-scale optimistic updates with an anchoring term inspired by Halpern iteration. In the deterministic setting, GOMA achieves the optimal accelerated last-iterate rate $O(1/k^2)$ on the squared gradient norm for monotone Lipschitz operators. In the stochastic setting with unbounded variance, a simplified single-call variant of GOMA achieves a last-iterate convergence rate of $O(1/\sqrt{k})$ on the squared gradient norm. To the best of our knowledge, this is the first such guarantee for stochastic monotone Lipschitz variational inequalities in the unconstrained setting without variance reduction or growing batches.
Primary: Université de Montréal
All Institutions: Université de Montréal, Mila - Quebec AI Institute, Mohammed Bin Zayed University of Artificial Intelligence, CIFAR AI Chair
This paper introduces Generalized Optimistic Method with Anchoring (GOMA), a novel first-order method for monotone variational inequalities that achieves optimal $O(1/k^2)$ last-iterate convergence in the deterministic setting and, critically, provides the first last-iterate $O(1/N)$ convergence guarantee on the expected squared operator norm for stochastic monotone Lipschitz VIs in the unconstrained setting without variance reduction, growing batches, or bounded variance assumptions. The work makes a significant theoretical advancement by demonstrating that strong last-iterate guarantees are compatible with single-sample online models under highly challenging noise conditions, supported by empirical evidence on synthetic problems.
The paper proposes Generalized Optimistic Methods with Anchoring (GOMA) for solving monotone variational inequalities (VIs) in min-max optimization. GOMA combines two key ideas: two-time-scale optimistic updates (from generalized optimistic methods) and an anchoring term (inspired by Halpern iteration). The method is presented in a general form (Eq. 7) with separate step sizes for exploration and update, and an anchoring coefficient. In the deterministic setting, GOMA is analyzed under two parameter setups (larger update step or larger exploration step), both achieving the optimal accelerated last-iterate rate of $O(1/k^2)$ on the squared gradient norm for monotone Lipschitz operators. The proof relies on a potential-based analysis, which is a standard and robust technique. A notable aspect is the claim of a "pseudo fixed-step size scheme" that simplifies hyperparameter tuning compared to some prior methods. For the stochastic setting, the paper introduces a simplified single-call variant of GOMA (Eq. 16) by setting the optimistic update coefficient to zero, effectively replacing extrapolation with anchoring to the initial point. This variant is analyzed under state-dependent noise (Assumption 1), where the variance can grow with the squared norm of the operator, a challenging setting. The proof strategy involves comparing noisy iterates to a deterministic reference trajectory and bounding the mean-square deviation. Theorem 3.1 establishes a last-iterate convergence rate of $O(1/N)$ on the expected squared operator norm $E\|G(x_N)\|^2$. This is a significant theoretical contribution, as the paper claims it is the first such guarantee for stochastic monotone Lipschitz VIs in the unconstrained setting without variance reduction or growing batches, and under unbounded variance. A critical issue is the inconsistency in reporting the stochastic convergence rate. The abstract states $O(1/\sqrt{k})$ on the squared gradient norm, which implies $O(1/k^{1/4})$ on the gradient norm. Theorem 3.1, however, states $E\|G(x_N)\|^2 = O(1/N)$, which implies $O(1/\sqrt{N})$ on the gradient norm. The comparison table (Table 1) and parts of the discussion add to the confusion, sometimes stating $O(1/k^{1/4})$ on $E\|G(x_k)\|$ and sometimes $O(1/k)$ on $E\|G(x_k)\|^2$, which are inconsistent with each other. Assuming Theorem 3.1 is the most accurate statement of the result, the rate is $O(1/N)$ on $E\|G(x_N)\|^2$, which is a strong result given the challenging assumptions.
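To give a feel for what a Halpern-style anchoring term does, the sketch below runs a related, well-known method on a bilinear game. It is explicitly not GOMA, whose update (Eq. 7) is not reproduced in this review: it is an extragradient step with an anchor pull toward the starting point, in the style of the extra-anchored gradient method, and it uses two operator calls per iteration. The step size $\alpha = 1/(8L)$ and anchor weight $\beta_k = 1/(k+2)$ are the constants assumed for that method, and all function names are hypothetical.

```python
def G(z):
    """Operator of the bilinear game min_x max_y x*y: G(x, y) = (y, -x), Lipschitz with L = 1."""
    x, y = z
    return (y, -x)

def sq_norm(v):
    return sum(c * c for c in v)

def extra_anchored_gradient(z0, steps, alpha=0.125):
    """Extragradient with a Halpern-style pull toward the anchor z0.

    Assumed constants: alpha = 1/(8L) with L = 1, and beta_k = 1/(k + 2).
    Returns the last iterate and the history of ||G(z_k)||^2.
    """
    z = z0
    history = [sq_norm(G(z))]
    for k in range(steps):
        beta = 1.0 / (k + 2)
        # Anchored base point: move a fraction beta of the way back toward z0.
        anchored = tuple(zi + beta * (ai - zi) for zi, ai in zip(z, z0))
        # Exploration (half) step, then the update step using the operator at the half point.
        z_half = tuple(a - alpha * g for a, g in zip(anchored, G(z)))
        z = tuple(a - alpha * g for a, g in zip(anchored, G(z_half)))
        history.append(sq_norm(G(z)))
    return z, history

z_last, hist = extra_anchored_gradient((1.0, 1.0), steps=2000)
print(f"initial ||G||^2 = {hist[0]:.3e}, final ||G||^2 = {hist[-1]:.3e}")
```

Plain simultaneous gradient descent-ascent spirals outward on this game, so the example also shows why the last iterate, rather than an average, is the quantity of interest in the rates discussed above.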
The experimental evaluation is conducted on toy problems, which is common for theoretical optimization papers.
1. **Negative-Comonotone Quadratic Saddle Point (Deterministic)**: This experiment uses a problem instance outside the theoretical scope (negative comonotonicity vs. monotonicity), but it's a standard benchmark for comparing VI algorithms. GOMA and FEG show accelerated convergence, while others diverge. GOMA empirically achieves a better constant factor than FEG.
2. **Stochastic Bilinear Game (Bounded Variance)**: On a low-dimensional bilinear game with additive Gaussian noise ($=1$), GOMA significantly outperforms baselines (DSEG, FEG, E-Halpern, RAIN++, Nesterov), achieving the fastest convergence and a residual an order of magnitude smaller. This supports the claim of robustness without variance reduction.
3. **Finite-Sum Saddle-Point Problem (State-Dependent Variance)**: On a higher-dimensional finite-sum problem with multiplicative noise ($>1$), GOMA and RAIN++ show convergence, while DSEG stagnates. This experiment directly validates GOMA's ability to handle state-dependent, unbounded variance, a key theoretical claim.
Overall, the experiments, despite being on synthetic problems, effectively demonstrate the empirical advantages of GOMA, particularly in stochastic settings with challenging noise characteristics, aligning well with the theoretical claims.
The paper provides algorithmic details, step size choices, and parameter schedules for GOMA. For baselines, it refers to existing implementations or settings from prior work. However, specific hyperparameters for all methods are deferred to the appendix, and no code repository is provided. While the theoretical derivations are detailed, the lack of a public code release or highly detailed hyperparameter tuning instructions (beyond the appendix reference) might hinder direct reproducibility for practitioners.
1. **Stochastic Rate Inconsistency**: As noted, there is a significant discrepancy in the reported stochastic convergence rates across the abstract, main text, theorem statement, and comparison table. This undermines the clarity and rigor of the paper's central stochastic contribution. Assuming the theorem ($O(1/N)$ on $E\|G(x_N)\|^2$) is correct, the other statements are misleading. 2. **Gap to the Optimal Rate**: The paper acknowledges that GOMA's stochastic rate does not match the optimal $O(1/N)$ rate on $E\|G(x_N)\|^2$ achieved by methods using variance reduction or growing batches. That acknowledgment is only coherent if GOMA's rate is slower than the $O(1/N)$ of Theorem 3.1 (for example, the $O(1/\sqrt{k})$ given in the abstract), so it compounds the inconsistency in item 1. Closing this gap without such mechanisms remains an open question. 3. **Toy Experiments**: The empirical validation is limited to synthetic and relatively low-dimensional problems. Scaling GOMA to large-scale deep learning applications (e.g., adversarial training) and demonstrating its practical benefits there would strengthen the work. 4. **Unconstrained Setting**: The analysis is restricted to unconstrained VIs. Extending it to constrained settings, where the convergence measure often shifts to the gap function, is an open direction. 5. **Monotonicity Assumption**: The theoretical guarantees rely on the monotonicity of the operator, which is a strong assumption not always met in practical deep learning min-max problems.
This paper contributes to the fundamental understanding and development of optimization algorithms for variational inequalities and min-max optimization, which are crucial in various machine learning applications like adversarial training, GANs, and multi-agent reinforcement learning. By providing a method that offers last-iterate convergence in challenging stochastic settings (single-call, no variance reduction, no growing batches, unbounded variance), GOMA could enable more efficient and stable training of models in online or resource-constrained environments. The explicit acknowledgment of AI assistant use in proof development is also a noteworthy aspect regarding research methodology. The impact statement correctly identifies the potential for more efficient use of computing resources but also cautions about the Jevons paradox. This paper introduces Generalized Optimistic Method with Anchoring (GOMA), a novel first-order method for monotone variational inequalities that achieves optimal $O(1/k^2)$ last-iterate convergence in the deterministic setting and, critically, provides the first last-iterate $O(1/N)$ convergence guarantee on the expected squared operator norm for stochastic monotone Lipschitz VIs in the unconstrained setting without variance reduction, growing batches, or bounded variance assumptions. The work makes a significant theoretical advancement by demonstrating that strong last-iterate guarantees are compatible with single-sample online models under highly challenging noise conditions, supported by empirical evidence on synthetic problems.
Recently, Muon has gained substantial attention as an appealing alternative to Adam-like optimizers, with many works highlighting its advantages through spectral normalization and improved conditioning. Yet this positive theoretical narrative contrasts with its empirical performance in large language model (LLM) training, where Muon's gains over Adam/AdamW are often mixed, schedule-sensitive, and not uniformly superior. To address this gap, we develop a trajectory-level theory characterizing both the strengths and limitations of Muon. We introduce a mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components, capturing a mixture of anisotropic structure and long-tail information reminiscent of LLM training. On top of it, we adopt a river-valley perspective in which we view the landscape as composed of a river direction flowing to the desired solution and hill directions encoding nuisance or task-irrelevant information. In the momentum-free setting, we show that Muon moves faster along the information-bearing river direction during early optimization, but can converge much more slowly near the river bottom than gradient descent. We then extend the river-valley perspective to general nonconvex objectives with momentum by studying points on the spectral river. There, while Muon converges faster early on, its orthogonalized update removes residual scale information, making it prone to overshooting and oscillation near the target solution. Together, these results suggest that our characterizations extend beyond spiked matrix sensing and motivate switching to GD-like refinement optimizers in the final phase, rather than relying only on a fixed learning-rate schedule for Muon. We also provide preliminary evidence supporting this two-stage approach in language model training experiments.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania, City University of Hong Kong, Shanghai University of Finance and Economics
This paper provides a rigorous theoretical framework explaining the strengths and limitations of the Muon optimizer, proposing a two-stage optimization strategy supported by preliminary LLM experiments. The work introduces a novel mixed-spiked matrix sensing model and leverages a river-valley perspective to characterize Muon's fast early exploration and late-stage convergence difficulties, offering valuable insights for the design of more effective training schedules for large language models.
The paper develops a sophisticated theoretical framework to analyze the Muon optimizer, particularly its behavior in anisotropic landscapes reminiscent of LLM training. The core methodology involves introducing a novel "mixed-spiked matrix sensing (MS) model" where the sensing operator decomposes into signal, spike, and bulk components. This model is well-motivated by empirical observations of covariance spectra in deep learning. The authors then adopt a "river-valley perspective," a geometric view that decomposes the optimization landscape into a "river" direction (aligned with meaningful progress) and "hill" directions (nuisance information). This perspective is applied to both a simplified, momentum-free Muon and extended to generalized nonconvex objectives with momentum. The analysis uses invariant manifolds to reduce matrix-valued dynamics to low-dimensional scalar systems, enabling tractable analysis of continuous and discrete-time dynamics for both vanilla GD and simplified Muon. Key theoretical results (Theorems 1, 2, 3) rigorously characterize Muon's early-stage fast exploration and late-stage convergence difficulties (overshooting, oscillation) compared to GD. The extension to generalized settings using a "spectral river" further strengthens the broader applicability of their insights. The mathematical derivations are thorough and provide a deep understanding of the underlying mechanisms.
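The claim that an orthogonalized update discards scale information can be checked on a small matrix. The sketch below approximates the polar factor $UV^\top$ of a gradient with the classical cubic Newton-Schulz iteration. This is an illustrative assumption: it is not the paper's simplified Muon, it omits momentum, and practical Muon implementations may use a differently tuned iteration. The helper names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate the polar factor U V^T of g via X <- 1.5 X - 0.5 X X^T X.

    Dividing by the Frobenius norm first puts all singular values in (0, 1],
    where the iteration drives each of them toward 1.
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


grad = [[0.0, 2.0], [-1.0, 0.0]]                 # singular values 2 and 1
tiny_grad = [[v * 1e-3 for v in row] for row in grad]
update = orthogonalize(grad)
tiny_update = orthogonalize(tiny_grad)
```

Both `update` and `tiny_update` come out close to `[[0, 1], [-1, 0]]`: shrinking the gradient by a factor of a thousand, as happens near a minimum, leaves the step unchanged. That is the mechanism behind the overshooting and oscillation the analysis attributes to the late phase.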
The experimental evaluation, while described as "preliminary," provides valuable empirical evidence supporting the theoretical claims. The authors train a 250M-parameter LLaMA-style decoder-only Transformer from scratch on OpenWebText2, a relevant and challenging setting for LLM research. They compare Muon-only baselines with various learning rate schedules against a proposed two-stage hybrid approach (Muon followed by AdamW). The results demonstrate that constant-LR Muon indeed exhibits the fastest initial loss decrease, consistent with its early-stage exploratory power. Crucially, the "Muon -> AdamW" hybrid strategy leads to more stable loss trajectories and achieves lower final validation loss compared to Muon-only baselines, even with tuned schedules. This directly supports the theoretical recommendation of using Muon for early exploration and switching to a GD-like optimizer for late-stage refinement. The inclusion of experiments with different switching times and post-switch AdamW LR schedules further strengthens the robustness of their findings. While the scale of the model (250M) is not "large" by today's cutting-edge LLM standards, it is sufficiently large to demonstrate the practical relevance of the theoretical insights.
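The intuition behind the two-stage recipe can be illustrated with a scalar toy problem. For a 1x1 matrix the orthogonalized update reduces to the sign of the gradient, so the sketch below uses sign steps as a stand-in for Muon and plain gradient descent as a stand-in for the refinement optimizer. The objective, learning rates, and switching point are all assumptions chosen for illustration; nothing here reproduces the paper's 250M-parameter experiments or AdamW.

```python
def sign(v):
    return (v > 0) - (v < 0)


def run(w, schedule, lr_sign=1.0, lr_gd=0.1):
    """Minimize f(w) = 0.5 * w**2 (gradient is w) following a list of step types."""
    for kind in schedule:
        grad = w
        if kind == "sign":
            # scalar analogue of an orthogonalized update: fixed step size
            w -= lr_sign * sign(grad)
        else:
            # gradient descent: step shrinks with the gradient
            w -= lr_gd * grad
    return w


w0 = 10.3
sign_only = run(w0, ["sign"] * 20)                 # fast approach, then oscillation
gd_only = run(w0, ["gd"] * 20)                     # slow but steadily contracting
two_stage = run(w0, ["sign"] * 10 + ["gd"] * 10)   # switch once near the minimum
```

In this toy, the sign-only run reaches the neighborhood of the minimum quickly but then bounces around it at a distance set by the learning rate, gradient descent alone is still far away after the same number of steps, and the two-stage run ends closest to zero.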
The paper provides a project website (https://muon-river-valley.github.io/); such sites typically host code and experimental details, which would enhance reproducibility. The experimental setup is reasonably well described, including model architecture (LLaMA-style decoder-only Transformer), parameter count (250M), tokenizer (GPT-2), dataset (OpenWebText2), and training iterations (4k). Learning rate schedules (cosine, linear, cos_inf) and switching points are also mentioned. While not all hyperparameter details are in the main text, the appendix and project website are expected to fill these gaps. The theoretical derivations are detailed in the appendix, allowing for verification.
The primary theoretical analysis relies on a simplified, momentum-free Muon and a specific mixed-spiked MS model, although the paper attempts to generalize these insights to more complex settings. The empirical evidence, while supportive, is explicitly stated as "preliminary" and conducted on a 250M-parameter model, which is modest compared to state-of-the-art LLMs. Further large-scale experiments on diverse architectures and tasks would strengthen the practical implications. The paper also acknowledges that the river-valley decomposition is only one lens and suggests integrating it with other phenomena like edge-of-stability behavior as future work, indicating a limitation in the current scope of analysis.
This paper significantly advances the theoretical understanding of spectral optimizers like Muon, which have gained attention but lacked a comprehensive explanation for their mixed empirical performance. The "river-valley perspective" and the mixed-spiked MS model provide valuable tools for analyzing optimization landscapes in deep learning, particularly in the context of anisotropic gradients observed in LLMs. The practical implication of a two-stage optimization strategy (Muon for exploration, GD-like for refinement) could lead to more efficient and stable training schedules for large models, reducing the need for extensive learning rate tuning. This work has the potential to influence the design and application of future optimizers and contribute to a more principled approach to deep learning training.
Model-free reinforcement learning algorithms such as Proximal Policy Optimization (PPO) treat the environment as a black box, estimating policy gradients from sampled rewards; this process demands millions of interactions and relies on high-variance advantage estimates. When environment dynamics are differentiable, the return is an end-to-end differentiable function of the policy parameters, enabling exact gradient computation via backpropagation through simulation. We term this approach Analytic Policy Gradients (APG) and evaluate it against PPO on four continuous control tasks of increasing dynamical complexity: a one-dimensional point-mass target-reaching task, a 2D point-mass navigation task with obstacle avoidance, a 2D rigid-body T-block pushing task, and a 7-DOF Franka FR3 end-effector reaching task. Both algorithms share identical model architectures, observation normalization, and optimizer settings. To decouple sample efficiency from compute efficiency, we design a multi-axis evaluation protocol that records performance against environment steps and gradient steps. We report a segmented backpropagation scheme with MC and critic-based bootstrap modes that mitigates gradient degradation on long-horizon tasks, and present ablations over segment length and bootstrap strategy.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
This paper rigorously validates the benefits of Analytic Policy Gradients in differentiable continuous control. It provides a robust benchmarking framework, introduces a practical gradient bridge for complex physics engines, and offers valuable insights into segmented backpropagation strategies, significantly advancing the practical applicability and understanding of differentiable reinforcement learning.
The paper presents Analytic Policy Gradients (APG) as a method for continuous control, leveraging differentiable environment dynamics to compute exact policy gradients via backpropagation through simulation. While the core concept of APG is not new, the paper's strength lies in its meticulous methodological contributions and rigorous implementation. A unified benchmarking harness is developed, allowing for a highly controlled comparison between APG and PPO by ensuring identical actor-critic architectures, observation normalization, and optimizer settings. This standardization is crucial for drawing fair conclusions about the gradient source's impact. The paper adopts a segmented backpropagation scheme to address vanishing/exploding gradients in long-horizon tasks. A key methodological contribution is the detailed exploration and comparison of two bootstrap modes for these segments: Monte Carlo (MC) bootstrap and critic-based bootstrap. The MC bootstrap, which pre-computes future returns from detached rewards, is shown to be a more robust option for shorter segment lengths, providing valuable practical guidance. A significant engineering contribution is the custom `torch.autograd.Function` that bridges NVIDIA Warp/Newton's tape-based autodiff with PyTorch's autograd. This "gradient bridge" enables APG to be applied to complex, GPU-accelerated physics engines that do not natively expose PyTorch-compatible derivatives, thereby expanding the practical applicability of differentiable RL to more realistic and complex robotic tasks like the 7-DOF Franka arm. The use of the reparameterization trick for action sampling ensures proper gradient flow through stochastic policies. Overall, the methodology is sound, well-explained, and effectively tackles practical challenges in implementing differentiable RL.
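The core idea of differentiating the return through the simulator can be sketched on a 1D point mass. The example below is a minimal illustration under stated assumptions: a hypothetical two-gain feedback policy, a quadratic tracking cost, and semi-implicit Euler dynamics. It propagates exact derivatives forward through the rollout by hand instead of using reverse-mode autodiff, so it yields the same kind of exact gradient without depending on PyTorch or Warp. It is not the paper's implementation.

```python
def rollout_with_grad(theta, x0=0.0, target=1.0, dt=0.1, horizon=50):
    """Simulate a 1D point mass and return (cost, d cost / d theta).

    theta = (kp, kd) are the gains of the policy a = -kp * (x - target) - kd * v.
    dx and dv hold the derivatives of position and velocity w.r.t. (kp, kd).
    """
    kp, kd = theta
    x, v = x0, 0.0
    dx, dv = [0.0, 0.0], [0.0, 0.0]
    cost, dcost = 0.0, [0.0, 0.0]
    for _ in range(horizon):
        err = x - target
        a = -kp * err - kd * v
        # chain rule: direct dependence on the gains plus dependence via the state
        da = [-err - kp * dx[0] - kd * dv[0],
              -v - kp * dx[1] - kd * dv[1]]
        v = v + dt * a
        dv = [dv[i] + dt * da[i] for i in range(2)]
        x = x + dt * v
        dx = [dx[i] + dt * dv[i] for i in range(2)]
        err = x - target
        cost += dt * err * err
        dcost = [dcost[i] + 2.0 * dt * err * dx[i] for i in range(2)]
    return cost, dcost


theta = [1.0, 0.5]                      # hypothetical initial gains
for _ in range(20):                     # plain gradient descent on the exact gradient
    cost, grad = rollout_with_grad(theta)
    theta = [theta[i] - 0.05 * grad[i] for i in range(2)]
```

Each update uses a single deterministic rollout and no advantage estimate, which is the source of the lower gradient variance relative to a sampled policy-gradient estimator. The long product of state derivatives carried through the loop is also what segmented backpropagation is meant to keep under control on longer horizons.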
The experimental evaluation is exceptionally thorough and well-designed. The authors evaluate APG against PPO on four continuous control tasks of increasing dynamical complexity: a 1D point-mass, a 2D point-mass navigation with obstacles, a 2D rigid-body T-block pushing task, and a 7-DOF Franka FR3 end-effector reaching task. This diverse suite effectively demonstrates APG's performance across various scenarios. A key strength of the evaluation is the multi-axis logging protocol, which records performance against both environment steps (measuring sample efficiency) and gradient steps (measuring compute efficiency). This approach is critical for a fair comparison, as APG and PPO consume these resources at different rates. Results are reported as mean ± standard deviation over multiple random seeds, enhancing statistical reliability. Performance thresholds and success rates are clearly defined, providing a comprehensive view of agent capabilities beyond just episodic return. The results consistently show that APG achieves higher final episodic returns and often higher success rates than PPO, particularly on simpler tasks. More importantly, APG demonstrates substantial efficiency gains on both axes, requiring significantly fewer gradient steps (up to 15.9x fewer on FrankaReach) and fewer environment steps to reach comparable performance thresholds. This strongly validates the benefit of lower-variance analytic gradients. The ablation studies on PointMassNavigate are particularly insightful. They clearly demonstrate that MC bootstrap is robust across varying segment lengths, degrading gracefully even at very short horizons. In contrast, critic bootstrap is highly sensitive to segment length, collapsing entirely at short lengths due to unstable value targets and only becoming competitive at longer segments. This finding provides crucial practical guidance for practitioners. The successful application of the Warp-PyTorch gradient bridge on the FrankaReach task further validates its feasibility and impact.
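The MC bootstrap mode is described as pre-computing future returns from detached rewards. A minimal sketch of that pre-computation follows, assuming a single episode, a fixed discount, and hypothetical function names. How the paper combines these targets with the differentiable in-segment return is not specified in this summary, so the sketch stops at producing one bootstrap value per segment.

```python
def reward_to_go(rewards, gamma):
    """Discounted sum of future rewards from each step; the extra last entry is 0."""
    out = [0.0] * (len(rewards) + 1)
    for t in reversed(range(len(rewards))):
        out[t] = rewards[t] + gamma * out[t + 1]
    return out


def mc_bootstrap_targets(rewards, segment_len, gamma=0.99):
    """Split an episode into segments and attach to each the Monte Carlo return
    that follows it. Rewards are plain floats here, i.e. already detached."""
    rtg = reward_to_go(rewards, gamma)
    segments = []
    for start in range(0, len(rewards), segment_len):
        end = min(start + segment_len, len(rewards))
        segments.append((start, end, rtg[end]))
    return segments


targets = mc_bootstrap_targets([1.0, 1.0, 1.0, 1.0], segment_len=2, gamma=0.5)
# targets == [(0, 2, 1.5), (2, 4, 0.0)]
```

Because these targets come from observed rewards and not from a learned value function, they do not depend on the critic being accurate, which is consistent with the reported robustness of MC bootstrap at short segment lengths.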
Reproducibility is a standout feature of this paper. The authors have made their entire implementation, including environment definitions, training scripts, and plotting utilities, open-source on GitHub. They provide detailed instructions, a `requirements.txt` for dependencies, and explicit commands to reproduce each figure and table presented in the paper. The unified benchmarking harness itself contributes significantly to reproducibility by standardizing the comparison between algorithms. This commitment to open science is exemplary and greatly enhances the credibility and utility of the research.
The paper transparently discusses several important limitations inherent to the Analytic Policy Gradients approach: 1. **Environment Differentiability**: APG fundamentally requires the environment dynamics and reward function to be differentiable. This restricts its application to specific simulators and excludes real-world training or environments with non-differentiable elements (e.g., discrete contact events, complex procedural generation). 2. **Gradient Chain Length Issues**: Despite the use of segmented backpropagation, long episodes can still lead to vanishing or exploding gradients. The effectiveness of APG remains sensitive to the choice of segment length and bootstrap strategy, as demonstrated by the ablation studies. 3. **Compute Overhead**: Maintaining the full computation graph during environment rollouts incurs higher memory and computational overhead compared to model-free methods like PPO, which use detached rollouts. This can be a practical concern for very complex environments or extremely long horizons. 4. **Model Bias (for future work)**: While the current work uses ground-truth differentiable dynamics, the authors acknowledge that extending APG to learned differentiable world models would introduce model bias, which could potentially counteract the variance reduction benefits.
This paper has a significant positive broader impact, primarily within the differentiable reinforcement learning and robotics communities. 1. **Advancing Differentiable RL**: It provides compelling empirical evidence for the sample efficiency benefits of leveraging differentiable environment dynamics, addressing a critical bottleneck in real-world RL applications. 2. **Practical Tools and Enablement**: The Warp-PyTorch gradient bridge is a crucial engineering contribution that makes complex, GPU-accelerated physics engines (like NVIDIA Newton/Warp) more accessible for differentiable RL research within the widely used PyTorch framework. This can accelerate progress in areas such as robot manipulation and locomotion. 3. **Improved Evaluation Standards**: The unified benchmarking harness and multi-axis evaluation protocol set a higher standard for comparing model-free and differentiable RL algorithms, promoting more rigorous and fair assessments across the field. 4. **Guidance for Practitioners**: The detailed ablation studies on bootstrap strategies and segment lengths offer practical, actionable advice for researchers and engineers designing differentiable RL systems, helping them make more informed choices for robust training. 5. **Open Science Contribution**: The release of the full codebase and environment suite fosters open research, enabling others to reproduce, verify, and build upon this work, accelerating collective progress in the field.
Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
Primary: Max Planck Institute for Human Development
All Institutions: Max Planck Institute for Human Development, University of Konstanz, Barcelona Supercomputing Center, University of Basel
This paper provides a critical psychometric analysis demonstrating that apparent psychological profiles in LLMs are largely artifacts of response bias rather than genuine traits, fundamentally challenging current evaluation practices and calling for more rigorous, orthogonal measurement frameworks in AI research.
The paper employs a rigorous psychometric framework to deconstruct the measurement of LLM "personalities." By modeling responses as a function of latent trait and response bias, and utilizing the concept of "response orthogonality" (the proportion of reverse-keyed items), the authors provide a mathematically sound method to separate genuine trait variance from systematic response artifacts. This approach is theoretically robust and directly addresses a fundamental confound in current LLM evaluation methodologies.
The experimental design is comprehensive, testing 56 instruction-tuned LLMs against a battery of 29 instruments (personality and risk preference) and comparing them against large human reference samples. The results are striking and consistent: LLMs show positive forward-reverse correlations (indicating bias dominance) whereas humans show negative correlations (indicating trait dominance). The variance decomposition, which attributes 81-90% of between-model variation to response bias against 9-16% in humans, is a powerful empirical finding. The robustness checks across prompting conditions, model sizes, and elicitation methods strengthen the validity of the conclusions.
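The sign of the forward-reverse correlation follows from a simple additive response model. The sketch below is a simulation under assumed conditions: each raw, unrecoded response is the sum of a trait term (sign-flipped on reverse-keyed items), a directional bias, and noise. The model, parameter values, and function names are illustrative and are not the paper's measurement model or data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def forward_reverse_correlation(trait_sd, bias_sd, n_respondents=500,
                                n_items=10, noise_sd=0.5, seed=0):
    """Correlate mean raw scores on forward-keyed and reverse-keyed items."""
    rng = random.Random(seed)
    fwd, rev = [], []
    for _ in range(n_respondents):
        trait = rng.gauss(0.0, trait_sd)
        bias = rng.gauss(0.0, bias_sd)
        # forward items rise with the trait, reverse items fall; bias shifts both
        f = [trait + bias + rng.gauss(0.0, noise_sd) for _ in range(n_items)]
        r = [-trait + bias + rng.gauss(0.0, noise_sd) for _ in range(n_items)]
        fwd.append(sum(f) / n_items)
        rev.append(sum(r) / n_items)
    return pearson(fwd, rev)


def response_orthogonality(keys):
    """Share of reverse-keyed items (-1) in a keying vector of +1 / -1 entries."""
    return sum(1 for k in keys if k < 0) / len(keys)


bias_dominated = forward_reverse_correlation(trait_sd=0.3, bias_sd=1.0)
trait_dominated = forward_reverse_correlation(trait_sd=1.0, bias_sd=0.3)
```

Under this model the covariance between the two item sets is the bias variance minus the trait variance, so `bias_dominated` comes out positive and `trait_dominated` negative, matching the LLM and human patterns described above. An instrument whose `response_orthogonality` is 0 contains no reverse-keyed items and therefore cannot separate the two sources at all.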
The paper provides detailed methodological descriptions, including the specific instruments used, the prompting strategies, and the mathematical derivations for the measurement model. The code repository is explicitly linked, ensuring high reproducibility. The use of standard psychometric instruments and clear definitions of variables enhances the clarity of the experimental setup.
The study focuses primarily on post-trained models and treats each model as a single respondent, which may not capture within-model variability or the nuances of different prompting strategies (e.g., persona adoption). The proprietary model sample size is small (N=10), limiting statistical power for those specific comparisons. Additionally, the study is limited to personality and risk domains; while the authors argue for broader applicability, empirical validation in other domains (e.g., moral reasoning, cognitive biases) is left for future work.
This paper has significant implications for the field of AI safety, alignment, and the use of LLMs as proxies for human participants in research. It challenges the validity of current LLM profiling practices and calls for a re-evaluation of how we measure and interpret LLM behaviors. The concept of response orthogonality offers a new standard for designing valid evaluation instruments for AI systems.
Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.
Primary: Zhejiang University
All Institutions: Zhejiang University, Huawei, Harbin Institute of Technology, Shenzhen
SafeSpec has a significant positive broader impact by enabling the deployment of safer and more efficient large language models. By addressing the fundamental incompatibility between speculative inference and existing safety defenses, it paves the way for LLMs to be used in more sensitive and performance-critical applications. This framework can help mitigate the risks of harmful content generation, thereby increasing public trust and responsible AI deployment. The approach of recovering safe continuations rather than hard refusal also improves user experience and model helpfulness. The paper does not highlight any specific negative societal consequences beyond the general risks associated with LLMs, which its work aims to mitigate. SafeSpec introduces a novel safety-aware speculative inference framework that elegantly integrates risk estimation and recovery directly into the LLM decoding process. This work makes a substantial technical contribution by demonstrating a superior safety-efficiency trade-off through a lightweight latent safety head and a dynamic rollback-and-reflect multi-sampling mechanism, offering a practical and impactful solution for deploying safer and faster LLMs in real-world applications.
The methodology of SafeSpec is well-conceived and addresses a critical gap in LLM deployment: integrating safety guarantees into speculative inference without negating its acceleration benefits. The core innovation lies in its dual-head verification mechanism. By attaching a lightweight, boundary-aligned latent safety head to the target model, SafeSpec enables simultaneous assessment of semantic validity and safety in a single forward pass. This design is elegant as it leverages the target model's existing computation for quality scoring, incurring negligible additional overhead for safety checks. The boundary-aligned extraction of hidden states for the safety head is a clever detail, preventing interference from the quality scoring prompt. The training methodology for the safety head, using step-wise prefix construction and a guard model for labeling, is sound for aligning the head with the inference process. The "rollback-and-reflect" mechanism, coupled with safety-guided multi-sampling, is a significant departure from traditional hard refusal strategies. Framing jailbreak attacks as distributional shifts where harmful continuations become more probable but safe ones are not entirely eliminated provides a strong theoretical underpinning for the multi-sampling approach. The rollback to a previous, potentially "cleaner" state, combined with a reflection prompt, effectively reshapes the sampling space, increasing the probability of finding a safe continuation. This soft intervention strategy is crucial for maintaining utility and helpfulness, avoiding the common pitfall of over-refusal. The probabilistic view of multi-sampling is clearly articulated, demonstrating how increasing sample size $K$ improves the chance of recovery.
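The effect of the sample size $K$ on recovery can be made concrete with a small calculation. The sketch assumes that, after rollback and the reflection prompt, each of the $K$ samples is independently safe with probability `p_safe`. Both the independence assumption and the function names are illustrative and not taken from the paper.

```python
def recovery_probability(p_safe, k):
    """Chance that at least one of k independent samples is a safe continuation."""
    return 1.0 - (1.0 - p_safe) ** k


def samples_needed(p_safe, target):
    """Smallest k whose recovery probability reaches the target level."""
    if not (0.0 < p_safe <= 1.0 and 0.0 <= target < 1.0):
        raise ValueError("need 0 < p_safe <= 1 and 0 <= target < 1")
    k = 1
    while recovery_probability(p_safe, k) < target:
        k += 1
    return k


# Even when an attack makes safe continuations a coin flip, a few samples suffice.
p_three = recovery_probability(0.5, 3)   # 0.875
k_for_90 = samples_needed(0.5, 0.9)      # 4
```

The failure probability shrinks geometrically in $K$, while each extra sample adds decoding cost on flagged inputs. That cost is the other side of the trade-off explored in the paper's hyperparameter analysis over sample size.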
The experimental evaluation is comprehensive and rigorous. The authors use two distinct model families (Qwen3-32B and DeepSeek-R1-Distill-Llama-70B) with appropriate draft models, demonstrating the framework's scalability and versatility. Evaluation metrics cover three critical dimensions: defense against seven advanced adversarial attacks (ASR), over-refusal rates (XSTest), and general capabilities/efficiency (GSM8K, MATH, GPQA-diamond, and inference speedup). This multi-faceted evaluation provides a holistic view of SafeSpec's performance. SafeSpec consistently achieves state-of-the-art defense performance, significantly reducing ASR (e.g., by 15% on Qwen3-32B) while preserving substantial inference speedups on benign workloads (2.06x on Qwen3-32B, 1.76x on DeepSeek-R1-Distill-Llama-70B). Crucially, it maintains low over-refusal rates and negligible accuracy degradation on general reasoning tasks, showcasing a superior safety-efficiency trade-off compared to strong baselines like SafeDecoding and SecDecoding. The ablation studies are well-designed, clearly demonstrating the necessity and synergistic effect of both the reflection prompt and multi-sampling. The comparison with a hard refusal strategy effectively highlights the benefits of SafeSpec's recovery mechanism. Hyperparameter analysis provides valuable insights into the trade-offs involved with sample size, safety threshold, and quality threshold. The detailed latency breakdown in the appendix is particularly insightful, transparently explaining the performance characteristics on benign vs. adversarial inputs and justifying the reduced throughput on jailbreak inputs as a feature of the defense. The comparison with a standalone guard model further validates SafeSpec's efficiency and user experience advantages.
The paper demonstrates good reproducibility. Code is made available on GitHub. The appendix provides detailed information on evaluation datasets, jailbreak prompt construction, quality scoring prompt, safety head configurations (architecture, parameter counts), and training setup (data sources, sampling, hyperparameters, data isolation). Layer choice ablation and per-benchmark sensitivity analysis for quality threshold are also included, providing further confidence in the design choices. The use of a fixed random seed is also mentioned.
1. **Reliance on Guard Model for Labeling**: The training data for the safety head is labeled using Qwen3Guard-Gen-8B. The performance and biases of this external guard model could implicitly limit the safety head's effectiveness and generalization, especially if the guard model itself is imperfect or susceptible to certain attacks.
2. **Heuristic Nature of Reflection Prompt**: While effective, the reflection prompt is a handcrafted heuristic. Its optimal design might be sensitive to the target model or specific attack types, and its generalizability across all future attacks is not guaranteed.
3. **Performance on Adversarial Inputs**: Although justified as a necessary cost for safety, the significant slowdown on jailbreak inputs (throughput below 1x) means that if an attacker can consistently trigger Safety Mode, they can effectively degrade the system's performance, even if they don't get a harmful response. This could be a denial-of-service vector.
4. **Adversarial Attacks on Safety Head**: As the safety head is a lightweight classifier, it might be susceptible to direct adversarial attacks designed to bypass it, rather than just the main LLM. The paper does not explore this.
5. **Fixed Rollback State**: The rollback mechanism reverts to the "previous state." For deeply embedded or multi-turn attacks, a single step rollback might not always be sufficient to reach a truly "clean" context, potentially requiring more sophisticated context recovery.
SafeSpec has a significant positive broader impact by enabling the deployment of safer and more efficient large language models. By addressing the fundamental incompatibility between speculative inference and existing safety defenses, it paves the way for LLMs to be used in more sensitive and performance-critical applications. This framework can help mitigate the risks of harmful content generation, thereby increasing public trust and responsible AI deployment. The approach of recovering safe continuations rather than hard refusal also improves user experience and model helpfulness. The paper does not highlight any specific negative societal consequences beyond the general risks associated with LLMs, which its work aims to mitigate. SafeSpec introduces a novel safety-aware speculative inference framework that elegantly integrates risk estimation and recovery directly into the LLM decoding process. This work makes a substantial technical contribution by demonstrating a superior safety-efficiency trade-off through a lightweight latent safety head and a dynamic rollback-and-reflect multi-sampling mechanism, offering a practical and impactful solution for deploying safer and faster LLMs in real-world applications.
Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge. Our analysis shows that conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search. We then examine detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR. We evaluate a proof-of-concept realization of this strategy through Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
Primary: Keysight Technologies Inc.
All Institutions: Keysight Technologies Inc.
This work has significant broader impact for the security and robustness of agentic AI systems.

1. **Paradigm Shift in Defense**: It proposes a fundamental shift from reactive blocking to proactive misdirection, offering a new conceptual framework for designing defenses against automated adversarial attacks. This could inspire new research directions in active defense strategies for LLMs.
2. **Enhanced Agentic AI Security**: As agentic AI systems become more prevalent, their susceptibility to automated attacks is a critical concern. The detect-and-misdirect strategy provides a practical and effective method to improve the resilience of these systems, making them safer for deployment in real-world applications.
3. **Improved Red Teaming**: The insights into how automated judges can be misled can inform the development of more robust red teaming methodologies, pushing attackers to develop more sophisticated (and costly) verification mechanisms.
4. **Understanding LLM Limitations**: The work highlights and leverages specific limitations of LLM-based judges, particularly their reliance on heuristic cues. This contributes to a deeper understanding of how LLMs interpret and evaluate text, which is valuable for both defense and general LLM development.
5. **Deterrence**: By making apparent successes less trustworthy, misdirection can serve as a psychological and operational deterrent, increasing the cost and uncertainty for attackers.

This paper introduces a theoretically grounded and empirically validated "detect-and-misdirect" defense strategy that significantly reduces the success rate of model-guided automated attacks on agentic AI systems. Through a probabilistic model, the authors demonstrate the inherent vulnerability of conventional detect-and-block defenses to iterative search and show how misdirection, by inducing false positives in the attacker's judge, can bound asymptotic attack success. The practical instantiation, Contextual Misdirection via Progressive Engagement (CMPE), is shown to be highly effective in end-to-end evaluations against state-of-the-art attack frameworks, nearly eliminating verified attack success and causing premature termination, thereby offering a crucial new paradigm for enhancing the security and robustness of autonomous AI.
The paper introduces a novel "detect-and-misdirect" defense strategy against model-guided automated attacks on agentic AI systems, contrasting it with conventional "detect-and-block" approaches. The methodology is robust, starting with a probabilistic model of the attack-defense setting. This model rigorously demonstrates a fundamental limitation of detect-and-block defenses: predictable refusals provide useful feedback to automated search, allowing attacker success rate (ASR) to approach one as the query budget grows. This theoretical insight is crucial. The proposed detect-and-misdirect strategy is then formalized, showing that by introducing misdirection-induced false positives (MI-FP) in the attacker's automated judge, the positive predictive value (PPV) of attacker-selected candidates is reduced, leading to a bounded asymptotic ASR. This theoretical underpinning is a significant strength. The paper then instantiates this strategy with Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational mechanism. CMPE comprises three components: a positive-intent preamble, safe context expansion via prompt-reshaping (token-level transformations, lexical injection, shuffling), and a follow-up question. This design specifically aims to appear cooperative and semantically plausible to an LLM-based judge, while containing no genuinely harmful content, thus exploiting known limitations of such judges that often rely on heuristic cues (tone, structure, perceived intent) rather than strict semantic correctness. The CMPE algorithm is clearly described, enhancing its practicality and reproducibility. The methodology effectively bridges theoretical analysis with a concrete, implementable defense.
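The contrast between the two defenses can be made concrete with a toy calculation. This is a minimal sketch under simplifying assumptions that are not taken from the paper: queries are independent, each one truly bypasses the defense with probability `p_bypass`, the attacker's judge flags true bypasses with rate `judge_tpr` and flags non-bypasses with rate `judge_fpr`, and the attacker stops at the first flagged candidate. All names and numbers are hypothetical.

```python
def asr_detect_and_block(p_bypass: float, budget: int) -> float:
    """ASR when refusals are trivially recognisable by the attacker.

    With no false positives, the attacker simply keeps querying, so
    success is the chance that any of `budget` queries truly bypasses.
    """
    return 1.0 - (1.0 - p_bypass) ** budget


def asr_detect_and_misdirect(p_bypass: float, judge_tpr: float,
                             judge_fpr: float, budget: int) -> float:
    """ASR when misdirection induces false positives in the judge.

    The attacker stops at the first candidate the judge flags; the
    attack only succeeds if that flagged candidate is a true bypass.
    """
    p_true_pos = p_bypass * judge_tpr            # flagged and harmful
    p_false_pos = (1.0 - p_bypass) * judge_fpr   # flagged but harmless
    p_stop = p_true_pos + p_false_pos
    if p_stop == 0.0:
        return 0.0
    ppv = p_true_pos / p_stop                    # positive predictive value
    p_stopped_within_budget = 1.0 - (1.0 - p_stop) ** budget
    return ppv * p_stopped_within_budget


if __name__ == "__main__":
    for budget in (10, 100, 1000):
        print(budget,
              round(asr_detect_and_block(0.01, budget), 4),
              round(asr_detect_and_misdirect(0.01, 0.9, 0.5, budget), 4))
```

In this toy model the block curve tends to one as the budget grows, while the misdirect curve saturates at the PPV of the attacker's judge, which mirrors the qualitative behaviour described above. Setting `judge_fpr` to zero and `judge_tpr` to one recovers the detect-and-block case.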
The experimental evaluation is comprehensive and compelling, validating the proposed strategy through both simulated ASR bounds and end-to-end attack runs.

1. **Judge Error Rate Estimation**: The authors evaluate CMPE's ability to induce misdirection-induced false positives (MI-FP) across a diverse set of six modern automated judge models (including rubric-based LLM judges like StrongREJECT and PAIR, and classifier judges like HB-FT-LLaMA2-13B, GPTFuzz-RoBERTa, and Llama-Guard-3-8B). This is performed on 500 high-risk jailbreak prompts from the AdvBench dataset. Each prompt-response pair is evaluated 10 times to account for stochasticity. The results clearly show that CMPE responses consistently receive high harmfulness scores from judges, demonstrating its effectiveness in generating MI-FPs.
2. **Simulated ASR Evaluation**: Using these per-sample judge error estimates, the paper computes simulated maximum ASR upper bounds for various attacker-defender judge configurations. The results show that CMPE substantially reduces the estimated ASR upper bound, often by one to two orders of magnitude, compared to the detect-and-block baseline. This directly supports the theoretical prediction that misdirection degrades the attacker's PPV and bounds their success.
3. **End-to-End Attack Framework Evaluation**: This is the most impactful part of the evaluation. CMPE is tested against two representative model-guided attack frameworks, GPTFuzz and PAIR, using both an aligned victim model (Vicuna-13b-v1.5) and a refusal-suppressed model (NeuralDaredevil-8B-abliterated). The experiments emulate a realistic agentic security setting. The results are striking: CMPE nearly eliminates verified attack success (reducing ASR from 10-20% to 0-2%) and causes the automated attack frameworks to terminate prematurely due to accepting misdirection responses as successful. This demonstrates that CMPE effectively disrupts the attack loop by making apparent successes untrustworthy.

The use of manual validation with a secondary LLM judge for final verification adds credibility to the reported true positive rates. The experimental setup is well-controlled, with local defense and attack components hosted on separate systems.
The paper provides sufficient details to facilitate reproducibility. The probabilistic model is clearly defined with equations. The CMPE algorithm is presented in detail, including its three components and an example. Specific models used for response generation (NeuralDaredevil-8B-abliterated) and judging (various LLM and classifier judges, including their backend models) are named, along with the dataset (AdvBench) and its source. The experimental setup for both simulations and end-to-end runs (number of prompts, iterations, victim models, attacker models, defense models, validation judges, hardware) is described. URLs for the AdvBench dataset and the NeuralDaredevil model are provided. This level of detail is commendable and supports the reproducibility of the work.
1. **Attacker Adaptation**: While the paper discusses potential attacker adaptations (e.g., judge ensembling, stricter calibration), it acknowledges that these introduce trade-offs (e.g., increased false negatives for the attacker). However, the arms race between attackers and defenders is continuous, and more sophisticated misdirection detection methods might emerge.
2. **Generality of Misdirection**: CMPE is a specific instantiation of the detect-and-misdirect strategy. While effective, developing equally lightweight and effective misdirection for other types of prompt injection or agentic attacks might require different approaches.
3. **Complexity of Misdirection Generation**: Although CMPE is described as lightweight, generating consistently plausible and misleading responses without inadvertently triggering harmful behavior or being easily detectable as non-operational could become challenging for more complex attack scenarios or highly sophisticated attackers.
4. **Focus on Jailbreak**: The evaluation primarily focuses on jailbreak attacks. While the theoretical framework is general, the CMPE instantiation and empirical validation are specific to jailbreaking. Its effectiveness against other prompt injection variants (e.g., data exfiltration, tool misuse) would need further investigation.
5. **Human Oversight in Validation**: The final validation of true positives still requires manual inspection and a secondary LLM judge, highlighting the inherent difficulty in fully automating the verification of malicious intent, even for the defender.
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology, Unknown Institution 2, Unknown Institution 3
The ICT framework offers a principled solution to a fundamental instability in RLVR, a critical technique for aligning LLMs with objective correctness in complex domains like mathematics and programming. By enabling more stable and effective exploration, ICT can lead to:

1. **Improved LLM Reasoning**: The demonstrated gains in Pass@4 suggest that models trained with ICT can explore more diverse and correct reasoning paths, leading to higher-quality solutions.
2. **Enhanced Training Efficiency**: The "less is more" finding, where updating only 10% of tokens yields superior results, implies potential for significant computational savings during RL fine-tuning, making it more accessible and scalable.
3. **Deeper Understanding of LLM Dynamics**: The shift from scalar entropy to distributional properties provides a new lens for understanding how LLMs make decisions and explore, potentially inspiring further research into information-theoretic approaches for policy optimization.
4. **Foundation for Future RLVR Algorithms**: The token-centric foundation established by this work could serve as a building block for the next generation of RLVR algorithms, moving beyond uniform updates to more intelligent, selective gradient application.

This paper introduces the Independent Combinatorial Tokens (ICT) framework, a novel approach to stabilize Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning by focusing sparse updates on "unique tokens" identified via Jensen-Shannon divergence of logit distributions. The work provides a strong theoretical foundation, meticulously analyzing entropy dynamics with second-order Rényi entropy, and demonstrates significant, consistent empirical gains across multiple LLM scales and diverse reasoning benchmarks, offering a principled and efficient method for enhancing LLM exploration and reasoning capabilities.
The paper introduces the Independent Combinatorial Tokens (ICT) framework to address the fundamental optimization instability (entropy collapse/explosion) in Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning. The core idea is to shift the optimization focus from scalar uncertainty (Shannon Entropy) to the distributional properties of token logits. Specifically, ICT leverages Jensen-Shannon (JS) divergence to identify "unique tokens" whose logit distributions significantly deviate from the sequence-level average distribution. These unique tokens are posited as critical branching points for guiding effective exploration. The theoretical analysis is a major strength. It rigorously grounds the ICT framework in both Shannon and second-order Rényi entropy ($H_2$). The paper meticulously derives the gradient dynamics of $H_2$ and introduces the concept of "strategy purity" (collision probability) to formalize entropy bifurcation into regimes of collapse (updating high-probability tokens) and explosion (updating low-probability tokens). It proves that selectively updating unique tokens, which are shown to reside near the strategy purity threshold, regulates policy concentration by reducing overall Shannon entropy while controlling probability concentration via $H_2$. The appendix provides extensive derivations, including the homogeneity of $H_1$ and $H_2$ gradients under certain conditions, and a formal bridge connecting JS-unique tokens to these critical branching points. The ICT framework then integrates this insight into a sparse policy gradient estimator built upon Group Relative Policy Optimization (GRPO). An ICT distributional selector constructs a binary mask, retaining only the top-k percentile of tokens based on their JS uniqueness scores. This sparse mask is applied to the GRPO objective, ensuring that optimization resources are focused on high-information learning signals. 
The methodology is well-articulated, providing a principled approach to stabilize RLVR training.
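The selection step described above can be sketched in a few lines of NumPy. This is an illustrative reading of the description, not the authors' implementation: it assumes each token's next-token distribution is compared against the mean distribution over the sequence, and that the top fraction by JS divergence is kept. The function names and the toy shapes are hypothetical, and the GRPO objective that the mask would be applied to is not shown.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Jensen-Shannon divergence (natural log) along the last axis."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


def unique_token_mask(logits: np.ndarray, top_fraction: float = 0.10):
    """Binary mask over positions whose distribution deviates most.

    logits: array of shape (seq_len, vocab_size).
    Returns (mask, scores), where scores are per-position JS divergences
    to the sequence-average distribution (an assumed reference).
    """
    probs = softmax(logits)                          # (T, V)
    mean_dist = probs.mean(axis=0, keepdims=True)    # (1, V)
    scores = js_divergence(probs, mean_dist)         # (T,)
    k = max(1, int(round(top_fraction * len(scores))))
    mask = np.zeros(len(scores), dtype=bool)
    mask[np.argsort(scores)[-k:]] = True             # keep k highest scores
    return mask, scores


def renyi2_entropy(p: np.ndarray) -> np.ndarray:
    """Second-order Renyi entropy: -log of the collision probability."""
    return -np.log(np.sum(p ** 2, axis=-1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_logits = rng.normal(size=(20, 8))            # toy sequence, toy vocab
    mask, scores = unique_token_mask(toy_logits)
    print("selected positions:", np.flatnonzero(mask))
```

In a training loop the boolean mask would multiply the per-token loss so that only the selected positions contribute gradients. `renyi2_entropy` is included only to show the collision-probability quantity that the theoretical analysis refers to.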
The experimental evaluation is comprehensive and robust. The authors evaluate ICT on the Qwen2.5 series of models (0.5B, 1.5B, 7B), demonstrating scalability across different model sizes. Seven benchmarks are used, spanning diverse reasoning tasks including math (GSM8K, Math500, AIME23/24/25), commonsense (GPQA), and general knowledge (MMLU-Stem). This broad evaluation scope strengthens the generalization claims. ICT is compared against strong baselines: GRPO (the backbone RLVR algorithm), 20-Entropy, and STAPO. The results consistently show that ICT achieves the highest average Pass@1, Pass@4, and Total scores across all model scales and benchmarks. The average Pass@4 improvement of 4.58% (with a maximum gain of 14.9%) over baselines is significant, especially considering that only the top 10% of unique tokens are updated. This "less is more" finding is compelling. A key empirical finding is the differential improvement in Pass@4 versus Pass@1, indicating enhanced exploration capacity. Ablation studies further validate the theory:

1. **Update Ratios**: Comparing different sparsity ratios (10%, 20%, 50%, 100%) shows that updating only the top 10% of unique tokens yields the best performance, aligning with the hypothesis that focusing on critical branching points is optimal.
2. **Composition of Unique Tokens**: The analysis reveals that the ratio of high-entropy to low-entropy tokens among the selected unique tokens is approximately 1:1 (1.03 on GSM8K, 0.99 on MATH). This empirically confirms the theoretical prediction that unique tokens are drawn from both entropy collapse (Regime H) and entropy explosion (Regime L) regimes, thereby maintaining balanced entropy dynamics.

The experiments are well-designed to support the theoretical claims and demonstrate practical efficacy.
The paper states that the training pipeline is built upon VeRL and closely follows the GRPO training recipe, with the only difference being the sparse updates. It mentions using the mean across 5 independent random seeds and provides details on datasets and baselines. More implementation details are said to be in the Appendix, which does provide extensive theoretical derivations but not explicit code or hyperparameter tables. While the methodology is clearly described, the absence of a public code repository or a detailed hyperparameter table in the main paper or appendix makes full reproducibility challenging without significant effort to replicate the VeRL/GRPO setup and then implement ICT.
1. **Strictness of H1/H2 Homogeneity Condition**: The paper acknowledges that the condition for co-directional Shannon entropy gradients (the probability of token $a$ exceeding $e^{-1} \approx 0.37$) is restrictive for typical LLM token probabilities. While the $H_2$ analysis is unconditionally valid and an extended homogeneity for top-k tokens is argued, this still highlights a nuance in the theoretical claims.
2. **First-Order Approximation**: The theoretical derivations rely on a first-order Taylor approximation for entropy change, which assumes infinitesimally small step sizes. While justified for small learning rates, aggressive learning rates or large advantage spikes could lead to non-negligible higher-order effects.
3. **Computational Overhead**: Although the paper claims negligible computational overhead for JS divergence computation due to parallel batch processing, it still represents an additional step in the training loop. The primary savings come from the sparse backward pass, but the forward pass still involves this calculation.
4. **Generalizability Beyond Reasoning**: While the paper demonstrates strong results on reasoning tasks, the applicability of "unique tokens" identified via JS divergence to other LLM tasks (e.g., creative writing, summarization, dialogue) is not explored.
5. **Code Availability**: The lack of a publicly available code repository is a common limitation for arXiv papers and hinders direct reproducibility and adoption by the community.
A/B testing has become the gold standard for data-driven decision-making in large-scale online experimentation, providing critical guidance for feature launch, pricing optimization, and user experience enhancement. To maximize statistical sensitivity, many technology companies routinely employ Controlled-experiment Using Pre-Experiment Data (CUPED), a technique that achieves substantial variance reduction while preserving the unbiasedness of estimating the average treatment effect. Despite its widespread adoption, several critical methodological and practical nuances of CUPED remain underexplored. This paper systematically addresses five frequently encountered yet overlooked questions regarding the application of CUPED. First, we provide a comparative analysis of various post-CUPED estimators to identify the optimal adjustment specification. Second, we evaluate the validity of regression-based adjustments and delineate robust variance estimation methods tailored for such frameworks. Finally, we extend our investigation to complex but common scenarios, including multi-arm experiments and two-stage sampling designs. Our findings reveal that in these settings, naive reliance on standard variance estimators can lead to severely misleading inferences. By offering rigorous theoretical insights and extensive experimental validation, this work deepens the conceptual understanding of CUPED. Notably, the recommended methodologies have been successfully deployed and integrated into ByteDance's experimentation platform.
Primary: Zhongtai Securities Institute for Financial Studies
All Institutions: Zhongtai Securities Institute for Financial Studies, Shandong University, ByteDance
This paper has a significant broader impact on the practice of online experimentation across the technology industry. By systematically addressing critical, yet often misunderstood, aspects of CUPED, it provides a robust framework for ensuring trustworthy A/B testing.

1. **Improved Decision-Making**: By preventing inflated Type I errors and increasing statistical power, the methodologies will lead to more reliable causal inferences, enabling companies to make better, data-driven decisions on feature launches, pricing, and user experience.
2. **Enhanced Efficiency**: The recommendations for optimal adjustment specifications and variance reduction in complex settings will allow experiments to detect smaller effects with fewer users or in shorter durations, accelerating product iteration and reducing opportunity costs.
3. **Standardization of Best Practices**: The paper's clear guidelines and rigorous validations can help standardize CUPED implementation across the industry, moving away from ad-hoc heuristics towards statistically sound practices.
4. **Educational Value**: It serves as a valuable resource for practitioners and researchers, deepening their conceptual understanding of CUPED and its nuances in various experimental designs.

The successful integration of these methodologies into ByteDance's platform demonstrates their immediate and large-scale utility, paving the way for wider adoption and a higher standard of rigor in online experimentation.

This paper provides a rigorous and highly practical investigation into the nuances of CUPED for online A/B testing, offering critical insights and actionable recommendations for ensuring trustworthy inference in complex experimental designs. Through a combination of robust theoretical proofs and extensive empirical validation using real-world data from ByteDance, the authors systematically address common pitfalls in variance estimation and adjustment specification, particularly in multi-arm and two-stage sampling scenarios, thereby significantly enhancing the reliability and efficiency of large-scale experimentation.
The paper adopts a design-based framework under a completely randomized design, which is standard and robust for A/B testing. It systematically addresses five key questions regarding CUPED, focusing on practical nuances often overlooked in industry. The methodology involves a rigorous comparative analysis of various CUPED estimators: full-sample, pooled-sample, split-sample, and regression-based (with and without interaction). A significant strength is the detailed evaluation of inferential validity, particularly concerning variance estimation. The paper highlights the critical need for robust variance estimators (sandwich estimators) in regression-based CUPED, especially under heteroscedasticity and imbalanced group sizes, directly addressing Freedman's critique. The core methodological contributions lie in extending CUPED analysis to complex, yet common, scenarios: multi-arm experiments and two-stage sampling designs. For multi-arm experiments, the paper theoretically proves that using the full-sample covariate mean for adjustment is more efficient than local (pairwise) adjustments and derives a necessary variance correction for the local approach to maintain inferential validity. For two-stage sampling, it demonstrates that split-sample CUPED retains its efficiency advantages but requires a variance correction to account for the compounded randomness from the initial sampling stage. The theoretical results are presented as theorems with detailed proofs provided in the appendix, demonstrating a high level of mathematical rigor. The paper also provides a nuanced discussion on the choice between model-free and regression-based approaches, emphasizing computational efficiency and metric structure over perceived robustness differences.
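For readers less familiar with the adjustment itself, the following is a minimal sketch of one common model-free CUPED form: a single coefficient $\theta = \mathrm{Cov}(X, Y) / \mathrm{Var}(X)$ estimated on all units, with the pre-experiment covariate centred at the mean over all units. It is not claimed to match any specific estimator compared in the paper; the function name and the simulated data in the usage block are hypothetical.

```python
import numpy as np


def cuped_ate(y: np.ndarray, x: np.ndarray, treat: np.ndarray):
    """Difference in means of the CUPED-adjusted outcome.

    y:     in-experiment metric per unit.
    x:     pre-experiment covariate per unit (unaffected by treatment).
    treat: 0/1 assignment indicator per unit.
    Returns (ate_estimate, theta, adjusted_outcomes).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    treat = np.asarray(treat)
    # Coefficient that minimises the variance of the adjusted outcome.
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    # Centre the covariate at the mean over all units.
    y_adj = y - theta * (x - x.mean())
    ate = y_adj[treat == 1].mean() - y_adj[treat == 0].mean()
    return ate, theta, y_adj


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 20_000
    x = rng.normal(size=n)                       # pre-period metric
    treat = rng.integers(0, 2, size=n)           # random assignment
    y = 2.0 * x + rng.normal(size=n) + 0.5 * treat
    ate, theta, y_adj = cuped_ate(y, x, treat)
    print("ATE estimate:", round(ate, 3))
    print("variance ratio:", round(np.var(y_adj) / np.var(y), 3))
```

Because the covariate is measured before assignment, subtracting $\theta (X - \bar{X})$ leaves the expected treatment effect unchanged while shrinking the outcome variance by roughly the squared correlation between $X$ and $Y$.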
The experimental evaluation is comprehensive and highly relevant to real-world applications. It combines simulation studies with validation using real-world data from ByteDance's experimentation platform. 1. **Type I Error Rate and Power Simulations**: For regression-based CUPED, simulations demonstrate the unreliability of standard OLS variance estimators in imbalanced scenarios, showing inflated Type I errors or excessive conservatism. In contrast, the sandwich estimator consistently maintains stable, conservative control. Similar simulations are conducted for multi-arm and two-stage sampling, clearly illustrating the under-coverage of confidence intervals by naive variance estimators and the effectiveness of the proposed corrections. 2. **Real-World Data Validation**: The paper uses proprietary data from ByteDance's core business metrics (e.g., GMV, user feedback) to empirically validate the theoretical findings. For multi-arm experiments, it shows that the full-sample covariate mean consistently achieves higher variance reduction compared to the corrected local approach. For two-stage sampling, the empirical coverage rates confirm the theoretical predictions regarding the necessity of variance corrections. The experiments are well-designed, covering various allocation schemes and sampling probabilities, which enhances the generalizability of the findings within industrial settings. The use of a large number of replications ($10^4$ to $10^5$) ensures statistical reliability of the simulation results. The direct application and integration into ByteDance's platform serve as a strong testament to the practical utility and validity of the proposed methodologies.
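The gap between the classical OLS variance and a heteroscedasticity-robust one under imbalance can be seen in the simplest case, a regression of the outcome on the treatment indicator alone. In that case the HC2 sandwich variance reduces to the familiar `s1^2/n1 + s0^2/n0`. The sketch below is illustrative only: it omits the CUPED covariate, and the data are made up.

```python
def sample_var(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def classical_ols_var(y1, y0):
    """Homoscedastic OLS variance of the difference in means.

    Uses one pooled residual variance for both groups.
    """
    n1, n0 = len(y1), len(y0)
    pooled = ((n1 - 1) * sample_var(y1) + (n0 - 1) * sample_var(y0)) / (n1 + n0 - 2)
    return pooled * (1.0 / n1 + 1.0 / n0)


def robust_var(y1, y0):
    """Heteroscedasticity-robust variance s1^2/n1 + s0^2/n0.

    For a regression on the treatment indicator alone this matches HC2.
    """
    return sample_var(y1) / len(y1) + sample_var(y0) / len(y0)


# Made-up imbalanced data: a small, noisy treatment group and a
# large, quiet control group.
treated = [0.0, 10.0, 0.0, 10.0]
control = [0.0, 1.0] * 20

v_ols = classical_ols_var(treated, control)
v_robust = robust_var(treated, control)
print(v_ols, v_robust)
```

When the smaller group is the noisier one, the pooled variance is dominated by the large quiet group, so the classical standard error is too small and the test over-rejects. Reversing the imbalance makes the classical estimate too large instead, which corresponds to the excessive conservatism mentioned above.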
The paper provides a good level of detail for reproducibility of its theoretical claims. All theorems are accompanied by formal proofs in the appendix, allowing independent verification. For the simulation experiments, the data generation process is explicitly described, including distributions and parameters, which should enable reproduction of the simulation results. While the real-world data from ByteDance is proprietary and thus not directly reproducible by external researchers, the methodologies applied to this data are clearly articulated. The overall clarity of the methodological descriptions and the theoretical backing contribute positively to reproducibility.
1. **Focus on Mean-Based Metrics**: While the paper acknowledges ratio metrics and suggests model-free approaches with the delta method, a deeper dive into the specific challenges and optimal CUPED strategies for various complex ratio metrics (e.g., conversion rates, revenue per user) could be beneficial, as these are prevalent in online experimentation. 2. **Generalizability of ByteDance's Scenarios**: While the use of ByteDance's data is a strength, the specific characteristics of their platform and user behavior might influence the magnitude of the observed effects (e.g., variance reduction ratios). While the theoretical results are general, the empirical gains might vary across different platforms. 3. **Computational Cost of Sandwich Estimators**: While sandwich estimators are theoretically robust, their computational cost can be higher than standard OLS estimators, especially with very large datasets or complex models. The paper doesn't extensively discuss the practical implications of this trade-off for real-time A/B testing platforms. 4. **Assumptions of Design-Based Framework**: The paper operates under a design-based framework. While robust, exploring the implications or comparisons with super-population inference frameworks, especially for the two-stage sampling where `p` approaches 0, could offer a more complete picture.
This paper has a significant broader impact on the practice of online experimentation across the technology industry. By systematically addressing critical, yet often misunderstood, aspects of CUPED, it provides a robust framework for ensuring trustworthy A/B testing. 1. **Improved Decision-Making**: By preventing inflated Type I errors and increasing statistical power, the methodologies will lead to more reliable causal inferences, enabling companies to make better, data-driven decisions on feature launches, pricing, and user experience. 2. **Enhanced Efficiency**: The recommendations for optimal adjustment specifications and variance reduction in complex settings will allow experiments to detect smaller effects with fewer users or in shorter durations, accelerating product iteration and reducing opportunity costs. 3. **Standardization of Best Practices**: The paper's clear guidelines and rigorous validations can help standardize CUPED implementation across the industry, moving away from ad-hoc heuristics towards statistically sound practices. 4. **Educational Value**: It serves as a valuable resource for practitioners and researchers, deepening their conceptual understanding of CUPED and its nuances in various experimental designs. The successful integration of these methodologies into ByteDance's platform demonstrates their immediate and large-scale utility, paving the way for wider adoption and a higher standard of rigor in online experimentation.
Testing conditional independence is fundamental yet intrinsically difficult: without additional assumptions, Type I error control is impossible in general. The "Model-X" paradigm addresses this difficulty by assuming exact knowledge of a relevant conditional distribution. While small deviations from this assumption can sometimes be tolerated in classical one-shot testing, existing sequential conditional independence tests typically require the Model-X conditional to be known exactly, making them fragile when it must instead be estimated. We propose a new approach that is substantially more robust to such estimation error. Our method applies testing-by-betting to an adaptively optimized Kernel Conditional Independence statistic, together with a normalization scheme and a truncate-and-shift calibration strategy. These modifications greatly reduce Type I error inflation while preserving high power across high-dimensional synthetic benchmarks and real-world fairness tasks, outperforming existing sequential Model-X approaches. Code is available at https://github.com/he-zh/SKCI.
Primary: University of British Columbia
All Institutions: University of British Columbia, Alberta Machine Intelligence Institute
This paper introduces SKCI, a robust sequential conditional independence test using adaptive betting and kernel methods that maintains valid Type I error control even when the conditional distribution is estimated online. The work represents a significant advancement in sequential hypothesis testing, particularly for conditional independence, a notoriously difficult problem. By integrating testing-by-betting with adaptive kernel methods and rigorous calibration techniques, the authors address a critical gap in the literature where existing methods fail under distributional estimation error. The theoretical guarantees and strong empirical performance make this a valuable contribution to the statistical learning and machine learning communities, offering a practical tool for real-world applications requiring reliable, anytime-valid inference.
The paper proposes a novel sequential testing framework for Conditional Independence (CI) called SKCI, addressing the fragility of existing Model-X based sequential tests when the conditional distribution $P(A|C)$ must be estimated rather than known. The core innovation lies in combining testing-by-betting with an adaptively optimized Kernel Conditional Independence (KCI) statistic. Key methodological contributions include: 1) A self-normalized payoff function using a cross-U-statistic structure to handle scale invariance; 2) A "truncate-and-shift" mechanism to ensure the wealth process remains a valid supermartingale (or close to it) despite estimation errors in the conditional mean embeddings; 3) A Gaussian approximation strategy to estimate the necessary shift parameter for calibration; and 4) An adaptive optimization loop for kernel hyperparameters and betting fractions using empirical log-wealth proxies. The theoretical analysis provides a finite-sample bound on Type I error inflation, decomposing the drift into Gaussian approximation error and calibration mismatch, which is a rigorous and non-trivial theoretical contribution.
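The generic testing-by-betting mechanism that SKCI builds on can be sketched in a few lines. This is not the SKCI payoff, normalization, or calibration. It assumes payoffs already bounded in [-1, 1] with non-positive conditional mean under the null, uses a fixed betting fraction rather than an adaptive one, and the function name is hypothetical.

```python
def betting_test(payoffs, bet_fraction=0.5, alpha=0.05):
    """Return the 1-based step at which H0 is rejected, or None.

    payoffs: iterable of values in [-1, 1] whose conditional mean is
             <= 0 under the null (assumed here, not checked).
    bet_fraction: fixed fraction in [0, 1) of wealth staked each round.
    """
    if not 0.0 <= bet_fraction < 1.0:
        raise ValueError("bet_fraction must lie in [0, 1)")
    wealth = 1.0
    threshold = 1.0 / alpha  # Ville's inequality: P(sup wealth >= 1/alpha) <= alpha
    for step, payoff in enumerate(payoffs, start=1):
        if not -1.0 <= payoff <= 1.0:
            raise ValueError("payoffs must lie in [-1, 1]")
        wealth *= 1.0 + bet_fraction * payoff
        if wealth >= threshold:
            return step
    return None


# Made-up payoff streams: one with persistent signal, one without.
with_signal = [0.5] * 30
no_signal = [0.5, -0.5] * 15

stop_signal = betting_test(with_signal)
stop_null = betting_test(no_signal)
print(stop_signal, stop_null)
```

Because the wealth is a nonnegative supermartingale under the null, the test can be stopped at any time without inflating Type I error. If the payoffs are computed from an estimated conditional, their null mean can drift above zero, and the wealth then grows even when the null holds. That drift is the failure mode the calibration described above is meant to contain.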
The experimental evaluation is comprehensive and convincing. The authors test SKCI against strong baselines (e-CRT, DAVT, EC2ST) across multiple regimes: Oracle (known conditional), Pretrained (offline estimated conditional), and Online (sequential estimation). They use challenging synthetic benchmarks (Gaussian, CI Hardness, RatInABox) and real-world applications (dSprites, Car Insurance Discrimination). The results demonstrate that SKCI significantly outperforms baselines in terms of Type I error control in the Online and Pretrained settings, where other methods suffer from severe inflation or loss of power. The inclusion of fairness and biological data adds practical relevance. The ablation studies and sensitivity analysis support the theoretical claims regarding batch size and regularization.
The paper provides a clear algorithm description, detailed theoretical proofs in the appendix, and a link to the source code. The experimental setup is well-described, including data splits and hyperparameter selection strategies. The code availability ensures high reproducibility.
The method relies on kernel ridge regression for conditional mean embeddings, which can scale poorly with very large datasets ($O(N^3)$ or $O(N^2)$ depending on implementation). The Gaussian approximation for the shift parameter is an assumption that may not hold perfectly in finite samples with heavy-tailed distributions, although the theory bounds this error. The adaptive optimization of kernel parameters adds computational overhead per batch compared to fixed-kernel methods.
Conditional independence testing is fundamental for causal discovery, fairness auditing, and robust machine learning. By providing a robust sequential test that works with estimated conditionals, this work enables more reliable and flexible inference in online settings, such as real-time fairness monitoring or adaptive experimental design. This has positive societal implications by improving the reliability of automated decision-making systems.
Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic curricula, and verifier training. We show this proxy has a persistent blind spot on its hardest stratum: on the eight free-form math cells we test (GSM8K and MATH across four open-weight models), 10.3-22.9% of the examples that no sampling seed solves in six tries are instead solved at matched compute by a six-chain deterministic regime. These are greedy decoding plus five cheap residual-stream perturbations applied via activation grafting, while greedy alone solves at most 6% on these math cells. Recovery scales with the additional budget, across perturbations whose mechanistic distinctness we verify across all twelve cells (cross-kind fix-set Jaccard <= 0.47 in every setup). Activation grafting is used as an intervention on internal representations, not a decoding method; we use it purely as a diagnostic and diversification tool, and our recovered items show that the pass@k=0 stratum is structurally identifiable in the residual stream rather than that the unmodified model reaches them under ordinary inference.
Primary: Sapienza University of Rome
All Institutions: Sapienza University of Rome, Not Diamond, Paradigma
This paper has significant broader impact across several areas of ML research and practice: 1. **LLM Evaluation**: It fundamentally challenges the reliance on pass@k as the sole or primary signal for per-example difficulty, especially in reasoning tasks. This could lead to more nuanced and robust difficulty estimation methods for benchmarks. 2. **Data Curation and Synthetic Curricula**: Pipelines that filter out or downweight problems based on pass@k=0 (e.g., for RL with verifiable rewards, math data curation, synthetic curricula) are shown to discard a non-trivial fraction of problems that the model *can* solve. This implies wasted compute and potentially biased datasets, leading to less effective training. 3. **Verifier and Reward Model Training**: Datasets for training verifiers and reward models, built from sampled-chain correctness, will inherit this blind spot. Items that are solvable deterministically but missed by sampling contribute only negative examples, potentially misguiding the verifier. 4. **Interpretability and Mechanistic Understanding**: The use of activation grafting as a diagnostic tool provides a concrete method for probing the "reachability" of solutions within the residual stream, offering insights into how internal representations can be perturbed to unlock different behaviors. 5. **Resource Efficiency**: By identifying that a fraction of "hard" problems are merely "unreached," the paper suggests that auditing these items with cheap deterministic perturbations can improve data quality and reduce the need for generating more samples or discarding valuable data. The impact statement correctly notes that this is a diagnostic study, not a new inference method, and does not pose dual-use risks. Its primary benefit is to improve the rigor and efficiency of LLM development and evaluation. This paper presents a critical diagnostic study revealing a "sampling blind spot" in pass@k-based difficulty estimation for math reasoning tasks. 
Through rigorous experimentation across multiple LLMs and benchmarks, it demonstrates that a significant fraction of the problems deemed "hardest" by sampling (10-29% across all twelve model-benchmark cells, 10.3-22.9% on the eight free-form math cells) are, in fact, solvable by the same model under a matched-compute deterministic regime using activation grafting, challenging a fundamental assumption in LLM evaluation and data curation.
The paper introduces a novel diagnostic methodology to investigate the "sampling blind spot" in pass@k-based difficulty estimation for math reasoning tasks. The core idea is to use activation grafting as an intervention on internal representations, rather than a decoding method, to explore deterministic trajectories that are distinct from standard stochastic sampling. The methodology is sound and well-justified: 1. **Problem Framing**: Clearly defines the pass@k=0 stratum as the target for investigation, representing examples deemed "hardest" by sampling-only methods. 2. **Activation Grafting as Diagnostic**: This is a key methodological strength. By replacing the last prompt-token hidden state with various cheap synthetic vectors (zero, random, BOS-token, etc.), the authors create distinct deterministic decoding paths. This allows them to probe whether the model *can* solve these problems if its internal state is slightly perturbed, without changing the model's weights or the decoding algorithm (greedy). 3. **Matched Compute**: The comparison is rigorously set up by matching the number of forward passes between the sampling regime (k samples) and the deterministic regime (greedy + k-1 grafts). This ensures a fair comparison of "compute budget." 4. **Mechanistic Distinctness**: The authors meticulously verify that the different graft types indeed lead to mechanistically distinct trajectories, as evidenced by low cross-kind fix-set Jaccard similarity and analysis of hidden-state divergence. This is crucial for arguing that the deterministic regime explores genuinely different solution paths, not just noisy variations of one. 5. **Robustness Checks**: The methodology includes checks for robustness against sampling temperature and layer selection for grafting, further strengthening the claims.
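The fix-set bookkeeping behind the distinctness and recovery claims can be sketched as follows. The graft names, example ids, and stratum below are made up for illustration, and the helper names are hypothetical. A "fix set" here means the pass@k=0 examples that a given graft solves.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|, defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_pairwise_jaccard(fix_sets):
    """Largest Jaccard similarity over all pairs of graft kinds."""
    return max(
        jaccard(fix_sets[p], fix_sets[q])
        for p, q in combinations(sorted(fix_sets), 2)
    )


def recovery_rate(fix_sets, stratum):
    """Fraction of the pass@k=0 stratum solved by at least one graft."""
    recovered = set().union(*fix_sets.values()) & set(stratum)
    return len(recovered) / len(stratum)


# Made-up example: ids of pass@k=0 items that each graft kind solves.
stratum = range(20)
fix_sets = {
    "zero": {1, 2, 3},
    "random": {3, 4},
    "bos": {5},
}

overlap = max_pairwise_jaccard(fix_sets)
rate = recovery_rate(fix_sets, stratum)
print(overlap, rate)
```

Low pairwise overlap means each graft kind contributes mostly new recovered items, which is what lets recovery grow as more deterministic chains are added rather than plateau after the first.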
The experimental evaluation is comprehensive and robust, covering a good range of models and benchmarks. 1. **Models and Benchmarks**: Experiments are conducted on four open-weight instruction-tuned models (Qwen-2.5-3B, Llama-3.2-3B, Llama-3.1-8B, Mistral-Nemo-12B) and three reasoning benchmarks (GSM8K, MATH, MMLU-Pro). This breadth demonstrates the generality of the findings across different model sizes and reasoning domains. The focus on free-form math (GSM8K, MATH) where the effect is largest is appropriate. 2. **Key Findings**: * **Greedy Competitiveness**: Shows that greedy decoding can be competitive or even better than single-sample accuracy, challenging a common assumption. * **Persistent Blind Spot**: Demonstrates that the pass@k=0 stratum is substantial (5.1-43.5% of prompts at k=6) and persists even with additional samples, indicating it's not an artifact of undersampling. * **Deterministic Recovery**: The central finding is that a six-chain deterministic regime (greedy + five grafts) recovers 10.3-22.9% of the pass@k=0 examples on free-form math cells (10-29% across all 12 cells). This is a significant fraction of items previously deemed "unsolvable." * **Scaling and Diversity**: Recovery scales with the deterministic budget, and the distinctness of grafts (low Jaccard index) confirms that different grafts probe different subsets of the problem space. 3. **Mechanistic Analysis**: Detailed analysis of hidden-state divergence and attention weight changes provides strong evidence that grafts inject content vectors that propagate through the residual stream, rather than merely rerouting attention. This supports the claim of distinct mechanistic axes. 4. **Practical Utility**: Two deployable recipes are presented: a matched-cost substitution (replacing one sample with an `avg` graft for better coverage) and a label-free curation flag (using chain disagreement to identify recoverable items). These demonstrate direct applicability of the diagnostic insights. 
5. **Quantitative Rigor**: Results are presented with clear percentages and absolute counts, and statistical significance is implicitly supported by the consistent trends across many setups. The Jaccard similarity metric is well-chosen to quantify fix-set diversity.
The paper provides a good level of detail for reproducibility: * **Models and Benchmarks**: Specific models and benchmarks are named. * **Decoding Parameters**: Sampling temperature (T=0.7, p_top=0.9) and max_new_tokens are specified. * **Grafting Details**: The layer (26) and position (last prompt token) for grafting are fixed, and the types of graft vectors (zero, random, BOS-token, etc.) are described. The process of applying grafts via `register_forward_hook` is mentioned. * **Compute Matching**: The definition of "matched compute" is clear. * **Worked Examples**: Concrete examples of recovered items with explanations of where sampling failed and grafts succeeded are provided, aiding understanding. While the exact code for activation grafting and evaluation is not provided in the paper text, the methodological descriptions are sufficiently detailed for an experienced researcher to reimplement the experiments.
The authors acknowledge several limitations, and the review aligns with them: 1. **Scope of Models/Benchmarks**: The study covers 3B-12B open-weight models and three reasoning benchmarks. While substantial, it doesn't cover larger frontier models or other domains (e.g., code generation, creative writing). 2. **Unreached vs. Intrinsically Hard**: Even with 8 deterministic chains, 66-88% of the pass@k=0 stratum remains unreached. These could be genuinely hard or reachable by other diversity axes not probed. The paper is careful not to overstate the "easy" claim for all unreached items. 3. **Point Estimates**: The recovery rates are point estimates without bootstrap confidence intervals or repeated-seed-set variance, which means per-cell rates should be interpreted with some uncertainty, though the qualitative direction holds. 4. **Label-Free Identification Precision**: The label-free probe for identifying recoverable items works well on free-form math but degrades on multiple-choice benchmarks due to chance agreement. A calibrated precision guarantee would require a small labeled dev set. 5. **Small Strata Noise**: For very small pass@k=0 strata (e.g., 51-58 examples), absolute recovery counts are small, leading to a higher noise floor in recovery rates. 6. **Reverse Asymmetry**: The paper explicitly states it does not quantify items the deterministic regime misses but sampling reaches, focusing only on the direction relevant to auditing current pass@k practices. This is a reasonable scope choice but still a limitation for a complete picture of decoding regime differences.
Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary -- where successes and failures are roughly balanced -- contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.
Primary: Ant Group
All Institutions: Ant Group, Inclusion AI, Shanghai Innovation Institute, Westlake University, Zhejiang University
This paper presents a highly impactful and efficient method for online data synthesis in multi-turn tool-use agents, demonstrating that strategic, variance-driven data curation can drastically reduce the sample complexity of RL training while maintaining or improving performance.
The paper introduces RODS (Reward-Driven Online Data Synthesis), a method designed to address the "sample depletion" problem in multi-turn tool-use reinforcement learning (RL). The core theoretical insight leverages Popoviciu’s inequality to argue that GRPO (Group Relative Policy Optimization) gradients are dominated by samples with high reward variance—specifically, those near the agent's current capability boundary. RODS operationalizes this by using the variance of progress rewards from existing rollouts as a zero-cost proxy for identifying these "boundary" samples. It then synthesizes new, structurally similar multi-turn trajectories to replenish the training pool. The methodology is elegant in its simplicity: it repurposes existing rollout data for data curation without requiring additional expensive inference passes or complex reward models. The approach effectively closes the loop between policy improvement and data generation, maintaining a dynamic replay buffer that co-evolves with the policy.
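The variance-based boundary detection can be sketched from rewards that the rollouts already produce. Popoviciu's inequality bounds the variance of any reward confined to `[low, high]` by `(high - low)^2 / 4`, and for binary rewards the bound is attained when successes and failures are evenly split. The reward range of [0, 1], the cutoff fraction, the task names, and the function names below are assumptions made for illustration, not details taken from RODS.

```python
def population_variance(rewards):
    m = sum(rewards) / len(rewards)
    return sum((r - m) ** 2 for r in rewards) / len(rewards)


def popoviciu_bound(low, high):
    """Upper bound on the variance of any variable supported on [low, high]."""
    return (high - low) ** 2 / 4.0


def boundary_tasks(rollout_rewards, min_fraction_of_bound=0.9, low=0.0, high=1.0):
    """Tasks whose rollout reward variance is close to the Popoviciu bound.

    rollout_rewards: {task_id: [reward per rollout]}, reusing the rollouts
    generated for training, so no extra inference is needed.
    """
    cutoff = min_fraction_of_bound * popoviciu_bound(low, high)
    return [
        task
        for task, rewards in rollout_rewards.items()
        if population_variance(rewards) >= cutoff
    ]


# Made-up rollout rewards for four tasks, four rollouts each.
rollout_rewards = {
    "already_solved": [1.0, 1.0, 1.0, 1.0],
    "boundary": [1.0, 0.0, 1.0, 0.0],
    "mostly_solved": [1.0, 1.0, 1.0, 0.0],
    "too_hard": [0.0, 0.0, 0.0, 0.0],
}

selected = boundary_tasks(rollout_rewards)
print(selected)
```

Tasks that are always solved or never solved have zero variance, so their group-centered advantages vanish and they contribute no gradient. As the policy improves, tasks drift from the middle of this range toward the always-solved end, which is the depletion that the replenishment step targets.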
The empirical evaluation demonstrates that RODS achieves performance comparable to a large-scale (17K sample) offline training pipeline while using only ~800 active samples and 400 human seeds. This represents a significant efficiency gain (roughly 20x fewer trajectories). The paper compares RODS against fixed-data RL baselines and environment augmentation techniques, showing consistent improvements. The results suggest that the quality and strategic selection of training data (via boundary detection) are more critical than sheer volume in the context of tool-use agents. The controlled setting validates the hypothesis that static datasets become uninformative as the policy improves, and that online synthesis of boundary samples mitigates this degradation.
The authors provide a GitHub repository link (https://github.com/inclusionAI/AWorld-RL/tree/main/RODS) and model weights (HuggingFace), which strongly supports reproducibility. The method relies on standard RL components (GRPO, rollouts) and a clear algorithmic step for data synthesis, making it relatively straightforward to implement for other researchers. The use of open-source models (Qwen3-4B) further aids in independent verification.
The primary limitation is the dependency on the quality of the "skill-aligned resampling pipeline." If the mechanism for synthesizing new variants fails to preserve the structural complexity or semantic validity of the original boundary samples, the benefits may diminish. Additionally, the approach assumes that reward variance is a reliable indicator of the capability boundary, which might not hold in all environments with sparse or noisy rewards. The evaluation is currently limited to specific tool-use benchmarks; generalization to other multi-turn decision-making tasks (e.g., complex reasoning without tools) is not fully explored. The "zero-cost" claim is relative; while it doesn't require extra *inference* for reward modeling, the synthesis step does require computational resources.
This work has significant implications for making RL for LLMs more scalable and cost-effective. By reducing the dependency on massive static datasets and expensive data collection pipelines, RODS lowers the barrier to entry for training capable agents. It shifts the focus from data quantity to data *stratification* and *dynamism*. This could accelerate the development of autonomous agents in resource-constrained settings. However, as with all RL methods, there are risks related to reward hacking or over-optimization on specific synthetic patterns, which should be monitored in broader deployments.
Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain--cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped baseline across code generation, code reasoning, agentic software engineering, and tool-use benchmarks, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. In contrast, variants with three or more loops regress, revealing a strongly non-monotonic loop-count effect. Our diagnostics show that loop 2 provides the main productive refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity. Because the CLP-induced mismatch remains roughly fixed as refinement gains shrink, the offset cost increasingly dominates. This gain--cost trade-off explains PLT's saturation at two loops and provides diagnostics for loop-count selection.
Primary: Unknown
All Institutions: Unknown
LoopCoder-v2 establishes a "gain-cost" framework for Parallel Loop Transformers, showing empirically that, for the 7B coders studied, two loops give the best trade-off between representational refinement and positional mismatch, with substantial gains on code generation and agentic benchmarks.
The paper introduces LoopCoder-v2, a family of 7B-parameter Parallel Loop Transformers (PLT) designed to scale latent computation efficiently. The core methodological contribution is a systematic study of loop-count selection within the PLT architecture. Unlike sequentially looped Transformers, whose latency and KV-cache memory grow with the loop count, PLT applies its shared blocks with cross-loop position offsets (CLP) and shared-KV gated sliding-window attention to keep those costs down. The authors train variants with loop counts of 1, 2, 3, and 4 from scratch on 18T tokens. The methodology is rigorous in its analysis of the "gain-cost" trade-off: additional loops can refine representations, but the roughly fixed positional mismatch introduced by CLP becomes increasingly detrimental as refinement gains diminish. The paper provides a unified forward-pass algorithm and detailed architectural configurations, offering a clear technical framework for understanding how parallel looping affects representation learning in large language models.
The experimental evaluation is comprehensive and compelling. The authors demonstrate that the two-loop variant (LoopCoder-v2-2L) delivers broad gains over the non-looped baseline (1L) across multiple benchmarks, including code generation, code reasoning, agentic software engineering, and tool-use. Notably, SWE-bench Verified scores improved from 43.0 to 64.4, and Multi-SWE from 14.0 to 31.0. Crucially, the paper identifies a strongly non-monotonic effect: variants with three or more loops regress in performance. This finding is significant because it challenges the assumption that "more computation is always better" and provides empirical evidence for the saturation point of PLT architectures. The diagnostics linking performance drops to reduced representational diversity and oscillatory updates add depth to the analysis. Training every variant from scratch on 18T tokens, followed by matched instruction tuning, keeps the loop-count comparison controlled.
The paper provides the forward-pass pseudocode and mentions the data composition (1:1 text-to-code ratio, breakdown of programming languages). However, as an arXiv preprint with "Unknown" institution and no provided project URL or code link in the text, full reproducibility is currently limited by the lack of accessible code and exact hyperparameter settings for the 18T token training run. The description of the G-SWA fusion and CLP mechanism is sufficiently detailed for a competent researcher to implement, but the sheer scale of training (18T tokens) makes independent verification of the pretraining phase difficult without significant resources.
The primary limitation is the non-monotonic performance curve, which caps the utility of PLT at two loops for this specific architecture and training regime. The paper does not explore whether different CLP strategies or attention mechanisms could mitigate the positional mismatch in deeper loops. Additionally, the evaluation is heavily focused on code-centric tasks; while general language capabilities are likely affected, the specific impact on non-code reasoning tasks is less emphasized. The "Unknown" institution and lack of code availability also limit immediate community adoption and verification.
This work has significant implications for efficient LLM training and inference. By demonstrating that parallel looping can scale latent computation with lower latency and memory overhead than sequential looping, it offers a viable alternative to increasing model depth. The finding that performance peaks at two loops helps practitioners avoid spending compute on unproductive additional loops. The improvements in agentic software engineering benchmarks suggest that these architectural changes can lead to more capable AI assistants for complex coding tasks, with potential downstream applications in automated software development and debugging.
Autoformalization, translating natural-language mathematics into formal proof assistants, is bottlenecked not by translation fluency but by faithfulness: a formal statement can typecheck and be provable, yet still encode a different theorem than the source intended. We introduce Bidirectional Provability Fingerprinting (BPF), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement. We further introduce four novel components: (i) Counterfactual Probe Generation (CPG), a contrastive procedure that synthesizes probes targeting specific drift directions; (ii) the Equivalence Spectrum, a continuous faithfulness score that replaces brittle binary verdicts; (iii) Adaptive Probe Budget Allocation (APBA), an information-theoretic budget router; and (iv) Faithfulness-Guided Decoding (FGD), which uses BPF signals as a reward during autoformalization. We prove a drift detection theorem and a PAC-faithfulness result establishing that the equivalence class of a natural language statement is learnable from $\mathcal{O}(\log(1/\delta)/\varepsilon)$ probes under mild assumptions. We release driftbench, a benchmark of 2,183 NL/Lean 4 pairs with controlled drift labels across six subfields of mathlib4. BPF + CPG detects 89.6% of drifted formalizations at a 3.0% false-positive rate, against 41.2% for typecheck and 63.3% for LLM-judge baselines, and FGD reduces the rate at which a state-of-the-art autoformalizer emits drifted statements by 47%. https://pmlrbd.github.io/BPF/
Primary: Istanbul Technical University
All Institutions: Istanbul Technical University, Jashore University of Science and Technology
This paper introduces a significant advancement in the field of autoformalization by proposing a framework that certifies semantic equivalence between natural-language and formal mathematical statements. The comprehensive methodology and rigorous experimental evaluation demonstrate its potential to improve the reliability of AI-assisted formal mathematics, making it a highly impactful contribution to machine learning research.
The paper presents a novel framework, Bidirectional Provability Fingerprinting (BPF), which addresses the critical issue of faithfulness in autoformalization. The methodology is robust, introducing several innovative components such as Counterfactual Probe Generation and Adaptive Probe Budget Allocation. The framework's reliance on consequence neighborhoods to certify semantic equivalence is a significant conceptual advancement, moving beyond traditional typechecking and provability metrics. The introduction of continuous faithfulness scores through the Equivalence Spectrum adds a layer of sophistication to the evaluation of formalizations.
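For intuition about the $\mathcal{O}(\log(1/\delta)/\varepsilon)$ probe count quoted in the abstract, the textbook realizable-case argument can be written out as below. This is a sketch under an assumed independence condition, not the paper's proof; the probe count $m$ and the per-probe detection assumption are introduced here purely for illustration.

```latex
% Illustrative assumption: each independently drawn probe exposes a
% drifted candidate with probability at least \varepsilon.
\Pr\bigl[\text{all } m \text{ probes miss the drift}\bigr]
  \;\le\; (1-\varepsilon)^{m}
  \;\le\; e^{-\varepsilon m}
  \;\le\; \delta
\qquad\text{whenever}\qquad
m \;\ge\; \frac{1}{\varepsilon}\,\ln\frac{1}{\delta}.
```

As a worked instance under the same assumption, $\varepsilon = 0.05$ and $\delta = 0.01$ give $m \ge 20 \ln 100 \approx 92.1$, so 93 probes would suffice.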
The experiments are thorough, utilizing a well-constructed benchmark dataset (driftbench) with 2,183 natural language/Lean 4 pairs. The results demonstrate a substantial improvement in detecting drifted formalizations compared to existing methods, with empirical evidence supporting the effectiveness of the proposed techniques. The paper also includes rigorous evaluations against multiple baselines, showcasing the practical benefits of the BPF framework.
The paper provides sufficient details on the methodology and experimental setup, including the algorithms used and the nature of the datasets. However, beyond the project page linked in the abstract (https://pmlrbd.github.io/BPF/), no public code repository or demo is identified, which limits the ease of reproducibility. The authors do release a benchmark, which is a positive aspect for future research.
The paper acknowledges limitations such as the inability to detect convention drift and the reliance on an entailment oracle, which may not always be complete. Additionally, the potential for residual drift in certified statements is highlighted, emphasizing the need for expert review in practical applications.
The framework has the potential to significantly enhance the reliability of AI-assisted formal mathematics, addressing a critical bottleneck in the field. However, the authors caution against uncritical use of the certifier, as it could propagate subtle errors into formal libraries if not properly validated by experts.
Many societal decisions are settled by contests of persuasion. Conversational AI is a powerful new entrant in these contests, but whether it can out-persuade skilled and highly incentivized humans has remained unclear. Here, in a series of four preregistered experiments (n = 18,978 conversations from 6,923 people), we pitted AI systems against a range of human persuaders, including laypeople, winners of a separately preregistered four-round online persuasion tournament, professional canvassers, and world championship debaters. We found that AI systems were reliably more persuasive than expert humans, even when expert humans chose their issues, researched in advance, underwent hours of live, structured practice, and were incentivized with £1,000 cash bonuses. In a follow-up study, AI's advantage persisted after experts received a coaching tool that let them practice against the AI that beat them, review their performance history, and see what AI would have said at key moments. We found converging evidence that AI's advantage stemmed from rapidly deploying larger quantities of information: after coaching, expert humans could tie an AI constrained to respond at human speeds and with human-length messages. In a final study, we show that AI's advantage extends to consequential real-world behavior: AI was nearly 3x more effective than professional canvassers from a UK fundraising firm at raising real-money donations to Save the Children. Together, these results establish that frontier AI systems out-persuade expert humans in conversation, with significant implications for political communication.
Primary: Oxford Internet Institute
All Institutions: Oxford Internet Institute, UK AI Security Institute
This paper establishes that advanced AI systems can out-persuade expert humans in conversational contexts, raising important questions about the future of persuasion in society. The comprehensive methodology and significant findings contribute to our understanding of AI's capabilities and the ethical implications of its use in persuasive communication.
The methodology employed in this research is robust, utilizing a series of four preregistered experiments with a large sample size (n = 18,978 conversations). The design effectively compares AI systems with various classes of human persuaders, including elite debaters and professional canvassers. The use of real-time randomization and a custom web application enhances the experimental rigor. Additionally, the study employs a comprehensive set of measures to assess persuasive effectiveness, including pre- and post-conversation attitude ratings and detailed analyses of conversational content. The inclusion of coaching interventions and throughput constraints adds depth to the exploration of AI's persuasive capabilities.
The experiments are well-structured, with clear research questions and hypotheses. Each study builds on the previous one, allowing for a thorough investigation of AI's persuasive advantage. The findings demonstrate that AI consistently outperforms expert human persuaders across different contexts, including real-world fundraising scenarios. The statistical analyses are appropriate, and the results are presented with confidence intervals and significance levels, enhancing the credibility of the findings. However, the reliance on a specific AI model (Claude Opus) raises questions about generalizability to other models.
The paper provides a clear description of the experimental procedures, participant recruitment, and data analysis methods, which are essential for reproducibility. The authors have made their data and code available on GitHub, allowing other researchers to replicate the findings. However, the specific configurations and prompts used for the AI models could be better detailed to ensure full transparency in replicating the AI's performance.
While the study presents compelling evidence of AI's persuasive capabilities, it has limitations. The experiments are conducted in controlled environments that may not fully capture the complexities of real-world persuasion. The study also focuses on text-based interactions, which may not generalize to other modalities such as audio or video. Additionally, the potential for AI-generated misinformation or manipulation in persuasive contexts is not thoroughly addressed, raising ethical concerns.
The implications of this research are significant, as it suggests that AI can surpass expert human persuaders in various contexts, including political communication and fundraising. This capability could lead to a shift in power dynamics, favoring those with access to advanced AI technologies. While AI could enhance advocacy for under-resourced actors, it also raises ethical concerns about the potential for manipulation and misinformation. The findings call for careful consideration of the societal impacts of AI in persuasive roles.
Biophysical neuron models link measurements of neural activity to underlying cellular mechanisms. Yet, a central challenge is that the kinetics of many ion channels are poorly characterized, and practical simplifications -- omitting channels or reducing morphological detail -- introduce systematic gaps between model and biology. Bridging these gaps requires approaches that can flexibly discover unmodeled dynamics while preserving mechanistic interpretability. Here, we introduce a hybrid modeling framework that embeds neural ordinary differential equations into conductance-based biophysical models to capture unknown currents or mis-specified channel kinetics. By parameterizing the neural ODE in terms of voltage-dependent steady-state and time-constant functions, we recover interpretable gating dynamics directly from voltage recordings without assuming a functional form. We show that the hybrid model fits the gating kinetics of 2400 ion channel models and recovers unknown gating dynamics from single current-clamp recordings, generalizing to out-of-distribution stimulus regimes under realistic inputs and parameter misspecification. We also use our method to reduce a multicompartment model of a cortical neuron into a single-compartment hybrid model with a learned axial current, yielding up to an order of magnitude lower computational cost. Together, our results establish a plug-and-play framework for selectively replacing unknown components of conductance-based models with neural ODEs while preserving their mechanistic structure.
Primary: Max Planck Institute for Biological Intelligence
All Institutions: Max Planck Institute for Biological Intelligence, Department Empirical Inference, Excellence Cluster Machine Learning, Hertie Institute for AI in Brain Health, Machine Learning in Science, Max Planck Institute for Intelligent Systems, Tübingen AI Center, University of Tübingen
The paper presents a novel hybrid modeling framework that integrates neural ODEs into biophysical neuron models, enabling the recovery of unknown ionic and axial current dynamics from voltage recordings while preserving mechanistic interpretability. This work represents a significant advancement in computational neuroscience, offering a practical tool for mechanism discovery and model simplification.
The paper introduces a hybrid modeling framework that integrates neural ordinary differential equations (ODEs) with conductance-based biophysical neuron models. This approach allows for the flexible discovery of unmodeled dynamics while maintaining mechanistic interpretability. The methodology effectively addresses the challenge of poorly characterized ion channel kinetics and morphological simplifications by enabling the recovery of gating dynamics directly from voltage recordings without assuming a specific functional form. The use of neural ODEs enhances the expressiveness of the model while preserving the underlying biophysical structure, which is a significant advancement in the field of computational neuroscience.
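The steady-state and time-constant parameterization mentioned in the abstract corresponds to the standard first-order gating equation dx/dt = (x_inf(V) - x) / tau(V). The sketch below integrates that equation for a single gate. It is a minimal illustration, not the authors' implementation: `x_inf` and `tau_ms` are hypothetical closed-form stand-ins for the functions that the hybrid model would learn with neural networks, and all constants are made up for the example.

```python
import math


def x_inf(v_mv: float) -> float:
    """Hypothetical steady-state activation: a sigmoid in membrane voltage (mV)."""
    return 1.0 / (1.0 + math.exp(-(v_mv + 40.0) / 5.0))


def tau_ms(v_mv: float) -> float:
    """Hypothetical time constant in ms; strictly positive by construction."""
    return 1.0 + 4.0 * math.exp(-(((v_mv + 40.0) / 30.0) ** 2))


def step_gate(x: float, v_mv: float, dt_ms: float) -> float:
    """One exponential-Euler step of dx/dt = (x_inf(V) - x) / tau(V).

    The update is exact when V is constant over the step and keeps x in [0, 1].
    """
    target = x_inf(v_mv)
    return target + (x - target) * math.exp(-dt_ms / tau_ms(v_mv))


def simulate_gate(voltages_mv, dt_ms=0.025, x0=0.0):
    """Integrate the gate along a voltage trace; returns the gate trajectory."""
    xs = [x0]
    for v in voltages_mv:
        xs.append(step_gate(xs[-1], v, dt_ms))
    return xs


# Voltage clamp at -20 mV for 100 ms: the gate relaxes to x_inf(-20).
trace = simulate_gate([-20.0] * 4000)
```

Because the learned quantities are x_inf(V) and tau(V) themselves, they can be plotted against voltage and compared with known channel kinetics, which is where the interpretability claim comes from.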
The experiments conducted in the paper are rigorous and comprehensive, demonstrating the effectiveness of the proposed hybrid model across various scenarios. The authors validate their approach by fitting the gating kinetics of 2400 ion channel models and recovering unknown gating dynamics from single current-clamp recordings. The model's performance is evaluated under realistic conditions, including noise and parameter misspecification, showcasing its robustness and generalization capabilities. The reduction of a multicompartment model to a single-compartment hybrid model with significant computational savings further emphasizes the practical implications of the work.
The authors provide a detailed description of their methodology, including network architecture, training procedures, and experimental setups, which enhances reproducibility. The code and instructions for running the experiments are available on GitHub, facilitating further exploration and validation by the research community. However, the reliance on simulated data for validation may limit immediate applicability to real-world scenarios, necessitating additional work to address challenges in real electrophysiological recordings.
One limitation identified in the study is that recovering gating dynamics from voltage alone is inherently underconstrained, particularly for channels with multiple gates. The authors acknowledge that, because gates interact multiplicatively, different combinations of gating functions can produce similar currents unless additional data are available. Additionally, the framework's performance on real-world data remains to be fully validated, as the current work primarily focuses on simulated datasets.
The hybrid modeling framework has the potential to significantly impact the field of computational neuroscience by providing a tool for mechanism discovery in biophysical neuron models. Its ability to recover interpretable gating dynamics and reduce model complexity can facilitate more efficient simulations and analyses in neuroscience research. Furthermore, the approach may inspire similar hybrid methodologies in other scientific domains where mechanistic models are combined with data-driven techniques. The implications extend to applications in brain health, neuroinformatics, and the development of more accurate neural models for various research and clinical purposes.
Aligning neural activity across subjects offers the promise of discovering shared computational principles and generalizable decoders. However, traditional alignment methods require shared stimuli across subjects, a constraint that limits applicability to naturalistic paradigms with limited or non-overlapping data. We introduce a Multi-Encoder-Decoder Variational Autoencoder (MED-VAE) that achieves cross-subject alignment without shared stimuli by anchoring representations to a common scaffold provided by a pretrained ANN. Using the Natural Scenes Dataset, we show that MED-VAE creates common latent spaces with superior semantic organisation, achieving higher cross-subject alignment than common methods while maintaining robust generalisation to held-out stimuli where traditional methods degrade. Reconstructing from these common spaces back to each subject's original neural space, MED-VAE preserves equal stimulus-driven signal in its cross-subject latent space. Finally, we show that this superior alignment directly enables cross-subject neural prediction, as demonstrated via cross-subject image decoding. In summary, we introduce a framework to identify generalisable common subspaces for cross-subject predictions and downstream tasks, demonstrated here for visual cortex responses to static images.
Primary: University of Oxford
All Institutions: Big Data Institute, Centre for Neural Circuits and Behaviour, Department of Medicine, Department of Physiology Anatomy and Genetics, University of Oxford
The main contribution of this paper is the introduction of MED-VAE, a novel architecture that enables cross-subject neural alignment without shared stimuli, demonstrating superior semantic organization and generalization capabilities. This work significantly advances the field of systems neuroscience by providing a robust method for aligning neural data across individuals, paving the way for more comprehensive analyses of brain function and individual differences.
The proposed Multi-Encoder-Decoder Variational Autoencoder (MED-VAE) introduces a novel architecture that bypasses the need for shared stimuli in cross-subject neural alignment by leveraging a pretrained artificial neural network (ANN) as a scaffold. This method allows for implicit alignment pressure, creating a common latent space that captures shared computational structures across subjects. The architecture's design is well-justified, with a clear explanation of how the various components interact to achieve alignment, and the use of a variational autoencoder framework is appropriate for the task. The integration of multiple encoders and decoders tailored to individual subjects while maintaining a shared latent space is a significant methodological advancement.
The experiments conducted using the Natural Scenes Dataset are robust, with a clear focus on evaluating the performance of MED-VAE against traditional methods such as Shared Response Models and Procrustes analysis. The metrics employed, including component-wise correlation, Representational Similarity Analysis (RSA), and cross-trial evaluations, provide a comprehensive assessment of alignment quality and generalization capabilities. The results demonstrate that MED-VAE achieves superior performance in preserving semantic organization and cross-subject alignment, with rigorous statistical analysis supporting the findings.
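Of the metrics listed above, Representational Similarity Analysis is the easiest to illustrate in a few lines. The sketch below is a generic RSA computation, assuming correlation-distance dissimilarity matrices compared with Pearson correlation. The paper's exact RSA variant is not specified in this review, and the function names and toy data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def rdm_upper(responses):
    """Upper triangle of the representational dissimilarity matrix.

    `responses[i]` is the response pattern (one value per unit) to stimulus i;
    dissimilarity is 1 - Pearson r between two stimulus patterns.
    """
    n = len(responses)
    return [1.0 - pearson(responses[i], responses[j])
            for i in range(n) for j in range(i + 1, n)]


def rsa_score(responses_a, responses_b):
    """Correlate two subjects' RDMs computed over the same stimuli."""
    return pearson(rdm_upper(responses_a), rdm_upper(responses_b))


# Toy data: 4 stimuli x 4 units. Subject B is subject A with units permuted
# and responses rescaled, so the two representational geometries match.
subject_a = [[1, 2, 3, 5], [2, 1, 0, 1], [0, 3, 1, 4], [4, 0, 2, 2]]
subject_b = [[2 * row[i] + 1 for i in (2, 0, 3, 1)] for row in subject_a]
score = rsa_score(subject_a, subject_b)
```

RSA compares stimulus-by-stimulus geometry, so it needs no correspondence between units across subjects, but this plain form does require both subjects' responses to the same set of stimuli.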
The paper includes a URL to the code repository, which enhances the reproducibility of the results. The methodology is described in sufficient detail, allowing other researchers to replicate the experiments. However, the reliance on specific datasets and pretrained models may limit broader applicability without further validation on diverse datasets.
The study is limited by its focus on a single dataset, which may restrict the generalizability of the findings. Additionally, while the architecture is designed to accommodate heterogeneous inputs, the actual implementation has only been tested on subjects from the same dataset. Future work should explore cross-study alignment with varying acquisition parameters. The quality of alignment is also dependent on the pretrained ANN's representational capabilities, which may not be universally applicable across different domains.
The implications of this work are significant for population-level neuroscience, enabling the analysis of neural data across subjects without the need for shared stimuli. This could facilitate new insights into universal coding mechanisms and individual differences in neural processing. The framework has potential applications in brain-computer interfaces and clinical settings, where aligning neural responses from different individuals can enhance predictive modeling and data augmentation strategies.
In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the service process itself creates future capacity pressure. When memory capacity is exceeded, systems evict active requests, discarding cached state and restarting them later, which wastes computation and reduces throughput. We develop a discrete-time dynamical model of memory-constrained LLM inference that captures admission, memory growth, and eviction under continuous batching. In the saturated-input regime, the system admits both eviction-free fixed points and limit cycles with evictions. For homogeneous workloads, we show that the eviction-free equilibrium is unstable and that, except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set, with throughput losses as large as 50%. For heterogeneous workloads, we prove a stability criterion in the two-class common-input setting and explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous-input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability. These results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, we identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology, Columbia Business School, Columbia University, Peking University
The main contribution of this paper is the development of a discrete-time dynamical model that characterizes memory-constrained LLM inference, providing insights into service-induced congestion and guiding scheduling design principles for high throughput. This work is significant as it addresses a critical challenge in LLM deployment, offering a theoretical foundation that could lead to practical improvements in system performance.
The paper introduces a discrete-time dynamical model to analyze memory-constrained LLM inference, focusing on the interactions between request admission, memory growth, and eviction under continuous batching. The methodology is rigorous, employing mathematical constructs to derive stability criteria and characterize system behavior under different workload conditions. The use of concepts like eviction-free fixed points and limit cycles is innovative, particularly in the context of service-induced congestion, which is a relatively unexplored area in LLM serving.
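The interaction of admission, memory growth, and eviction can be made concrete with a toy simulation. The sketch below is not the paper's model: it assumes a homogeneous workload, greedy admission whenever a prompt fits, and newest-first eviction. The function names, the optional `max_active` cap, and all parameter values are hypothetical.

```python
def kv_memory(active, prompt_len):
    """Total KV-cache tokens held by the active requests."""
    return sum(prompt_len + generated for generated in active)


def simulate(capacity, prompt_len, decode_len, steps, max_active=None):
    """Toy continuous-batching loop with saturated input (a request is always waiting).

    `active` stores tokens generated so far per request, oldest first.
    `max_active` is a hypothetical concurrency cap; None means greedy admission.
    """
    if prompt_len < 1:
        raise ValueError("prompt_len must be at least 1")
    active = []
    completed = evicted = wasted_tokens = peak_memory = 0
    for _ in range(steps):
        # Admission: admit while a new prompt fits and the cap allows it.
        while ((max_active is None or len(active) < max_active)
               and kv_memory(active, prompt_len) + prompt_len <= capacity):
            active.append(0)
        # Decode: every active request appends one token to its KV cache.
        active = [generated + 1 for generated in active]
        # Completion: finished requests release their memory.
        completed += sum(1 for generated in active if generated >= decode_len)
        active = [generated for generated in active if generated < decode_len]
        # Eviction: drop the newest requests until the batch fits again;
        # their generated tokens are discarded and would need recomputing.
        while kv_memory(active, prompt_len) > capacity:
            wasted_tokens += active.pop()
            evicted += 1
        peak_memory = max(peak_memory, kv_memory(active, prompt_len))
    return {"completed": completed, "evicted": evicted,
            "wasted_tokens": wasted_tokens, "peak_memory": peak_memory}


# Greedy admission fills memory with prompts, then decoding overflows it.
greedy = simulate(capacity=1000, prompt_len=10, decode_len=50, steps=2000)
# Capping concurrency at capacity // (prompt_len + decode_len) avoids evictions.
capped = simulate(capacity=1000, prompt_len=10, decode_len=50, steps=2000,
                  max_active=1000 // (10 + 50))
```

In the greedy run, memory that looked free at admission time is consumed by the admitted requests' own decoding, which is the service-induced pressure the paper analyzes. The capped run reserves each request's full footprint up front and never evicts. The toy makes no claim about which policy yields higher throughput, nor does it reproduce the paper's stability results.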
The paper includes a thorough experimental evaluation that supports the theoretical findings. It discusses the implications of homogeneous versus heterogeneous workloads and demonstrates the impact of different batching strategies on system stability and throughput. The results indicate significant throughput losses under certain conditions, which are quantitatively backed by the model's predictions. However, the paper could benefit from more extensive empirical validation with real-world LLM serving scenarios.
While the theoretical framework is well-defined, the paper lacks detailed implementation specifics that would facilitate reproducibility. There are no provided code repositories or datasets, which limits the ability of other researchers to replicate the experiments or build upon this work.
One limitation is the focus on theoretical modeling without extensive empirical validation in real-world settings. Additionally, the paper primarily addresses specific workload conditions, which may not generalize across all LLM serving scenarios. The reliance on mathematical constructs may also limit accessibility for practitioners who are less familiar with advanced dynamical systems.
The findings have significant implications for the deployment of LLMs in resource-constrained environments, particularly in cloud services where memory management is crucial. The insights into service-induced congestion could lead to improved scheduling and resource allocation strategies, enhancing the efficiency of LLM serving systems. This work could influence future research directions in optimizing LLM performance under varying workload conditions.
Protein language models (pLMs) can generate novel protein sequences with properties beyond those observed in nature, yet the mechanisms underlying protein generation remain poorly understood. Existing mechanistic interpretability methods based on sparse autoencoders and transcoders primarily focus on protein representation learning models and do not capture the computation required for autoregressive generation. Here, we introduce ProGenMech, a mechanistic interpretability framework for generative protein language models that extends cross-layer transcoders (CLTs) to ProGen3, a sparse Mixture-of-Experts model trained for both causal generation and span infilling. Unlike per-layer approaches, CLTs reconstruct each layer using sparse latent variables from all preceding layers, enabling faithful recovery of inter-layer generative computation. We further develop a zero-shot circuit discovery framework to identify sparse latent circuits responsible for protein generation and fitness prediction. In causal generation and zero-shot fitness estimation tasks, ProGenMech outperforms local transcoder baselines in recovering ProGen3's probability distribution and functional scoring behavior, while matching the original model's generative distribution in span infilling tasks. Moreover, the recovered circuits reveal biologically meaningful motifs and functional regions associated with conserved sequence patterns and protein fitness landscapes, establishing a foundation for interpretable and steerable protein generation.
Primary: Georgia Institute of Technology
All Institutions: Georgia Institute of Technology
This paper presents a significant contribution to the field of machine learning by advancing the interpretability of protein language models through the introduction of the ProGenMech framework. The methodology is innovative, and the experimental results are promising, suggesting that this work could have a lasting impact on both machine learning and biological research.
The proposed ProGenMech framework introduces a novel approach to mechanistic interpretability in protein language models by utilizing cross-layer transcoders (CLTs) that leverage sparse latent variables from all preceding layers. This contrasts with existing methods that typically analyze individual layers in isolation. The methodology is well-structured, providing a clear rationale for the use of CLTs and the zero-shot circuit discovery framework. The integration of these components allows for a more comprehensive understanding of the generative processes within autoregressive models, which is a significant advancement in the field.
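To make the cross-layer idea above concrete, here is a minimal pure-Python sketch of how a cross-layer transcoder could assemble each layer's reconstruction from sparse latents of every source layer up to and including the target. It is an illustration under stated assumptions, not ProGenMech's implementation: the top-k ReLU sparsity rule, the function names (`matvec`, `top_k_relu`, `clt_reconstruct`), the matrix shapes, and the toy numbers are all hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def top_k_relu(v, k):
    """Keep the k largest entries (clamped at zero) and zero out the rest.
    Assumption: top-k ReLU is only one possible sparsity mechanism."""
    keep = sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k]
    return [max(v[i], 0.0) if i in keep else 0.0 for i in range(len(v))]

def clt_reconstruct(inputs, encoders, decoders, k):
    """inputs[l]: activation read at layer l.
    encoders[l]: latent_dim x d_model matrix for layer l.
    decoders[(src, dst)]: d_model x latent_dim matrix, defined for src <= dst.
    Returns (latents, recon), one entry per layer."""
    latents = [top_k_relu(matvec(encoders[l], x), k) for l, x in enumerate(inputs)]
    recon = []
    for dst in range(len(inputs)):
        out = [0.0] * len(decoders[(0, dst)])
        # Sum decoder contributions from every source layer up to the target.
        for src in range(dst + 1):
            contrib = matvec(decoders[(src, dst)], latents[src])
            out = [o + c for o, c in zip(out, contrib)]
        recon.append(out)
    return latents, recon

# Toy setup: 2 layers, d_model = 2, latent_dim = 3, one active latent per layer.
inputs = [[1.0, -1.0], [0.5, 2.0]]
encoders = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
]
decoders = {
    (0, 0): [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    (0, 1): [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0]],
    (1, 1): [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
latents, recon = clt_reconstruct(inputs, encoders, decoders, k=1)
```

In this toy setup the layer-1 reconstruction receives a contribution from the layer-0 latent through `decoders[(0, 1)]`, which a per-layer transcoder would not have access to.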
The experiments conducted demonstrate the effectiveness of ProGenMech in various tasks, including causal generation and fitness estimation. The results indicate that ProGenMech outperforms local transcoder baselines, which is a strong indicator of its utility. The paper provides sufficient detail on the experimental setup, including the datasets used and the metrics for evaluation, which enhances the credibility of the findings.
While the paper outlines the methodology and results, it lacks specific implementation details and code availability, which are critical for reproducibility. The absence of a project URL or demo limits the ability of other researchers to validate the findings independently. This is a significant drawback, as reproducibility is crucial in machine learning research.
One limitation of the study is the reliance on a specific model (ProGen3) for evaluation, which may not generalize to all protein language models. Additionally, while the framework shows promise in identifying biologically meaningful motifs, the biological validation of these findings is not discussed in detail. The paper could benefit from a more thorough exploration of potential limitations in the interpretability of the discovered circuits.
The implications of this research are substantial, particularly in the fields of bioengineering and synthetic biology. By enhancing the interpretability of protein language models, the proposed framework could facilitate advancements in protein design and the understanding of biological mechanisms. The potential applications in discovering new proteins with desirable properties could lead to significant breakthroughs in medicine and biotechnology.
Recent theoretical progress has established conditions under which machine learning models can efficiently predict ground-state properties of gapped local Hamiltonians when trained on quantum-generated data. Previous experimental demonstrations in this paradigm, however, have largely been limited to small systems or highly structured states, due to the difficulty of preparing many-body ground states on quantum processors. In this work, we demonstrate learning from experimental quantum data generated from approximate ground states of the two-dimensional Heisenberg XXZ model with system sizes up to 115 qubits. We construct a dataset of single-site expectation values, two-point correlations, and 12-body loop correlations across the antiferromagnetic phase. We then train neural networks on this data and show that they can accurately predict spatially resolved observables for previously unseen Hamiltonian parameters, both within the training distribution and in an out-of-distribution regime approaching the phase boundary. Our results demonstrate the practical realization of learning from quantum data for an interacting two-dimensional many-body system at scale, motivating a path toward regimes where quantum processors could provide training data beyond the reach of classical approximation methods.
Primary: University of Oxford
All Institutions: University of Oxford, IBM Quantum, IBM Research, IBM Research Europe, IBM Research Europe — Zurich, T. J. Watson Research Center, The Hartree Centre
This paper makes a substantial contribution to the intersection of machine learning and quantum computing by demonstrating a practical workflow for learning ground state observables from quantum data, thereby advancing our understanding of many-body systems and paving the way for future research in quantum machine learning.
The paper presents a novel methodology that combines sample-based Krylov diagonalization with a basis optimization procedure to generate approximate ground states of the Heisenberg XXZ model on quantum hardware. This approach is significant as it addresses the challenges of preparing accurate many-body states and estimating observables on noisy quantum processors. The integration of machine learning to predict ground-state observables from quantum-generated data is a compelling advancement in the field, especially given the scale of the systems studied (up to 115 qubits). The use of neural networks to learn from this data and the demonstration of generalization capabilities both within and outside the training distribution are noteworthy contributions.
The experiments are well-designed, utilizing quantum processors to generate data and rigorously comparing the results against classical benchmarks (DMRG). The authors construct a comprehensive dataset of observables and demonstrate the predictive capabilities of their trained models across various Hamiltonian parameters. The results indicate a high level of accuracy in predictions, even in out-of-distribution scenarios, which is a strong empirical finding that supports the proposed methodology.
The paper provides sufficient methodological detail to enable reproducibility, including descriptions of the quantum workflows, neural network architectures, and training procedures. However, the lack of publicly available code or data limits the ease of reproducibility. The authors mention that the code and data can be requested, which is a positive step but does not fully satisfy the reproducibility criterion.
One limitation is the reliance on quantum hardware, which may introduce noise and errors that affect the quality of the generated data. Additionally, while the paper demonstrates generalization capabilities, the performance near phase boundaries could be further explored, as this is a critical area for many-body physics. The authors also acknowledge that further optimization of the parameters used in their quantum circuits could enhance the quality of the generated states.
This work has significant implications for the fields of condensed matter physics, quantum chemistry, and quantum information science. By demonstrating that machine learning can effectively leverage quantum-generated data, the research opens pathways for future studies that could utilize quantum processors to explore complex many-body systems that are currently intractable with classical methods. The potential to learn from quantum data at scale could revolutionize how researchers approach problems in these domains.
Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
Primary: Tsinghua University
All Institutions: Tsinghua University
This paper presents a compelling data-centric solution to long-context reinforcement learning, demonstrating that a carefully curated mixture of retrieval, synthesis, and reasoning tasks can significantly enhance model performance without complex reward engineering. The rigorous evaluation and transfer to agentic tasks make it a valuable contribution to the field, with high potential for adoption and further research.
The paper proposes a data-centric approach to long-context reinforcement learning (RL), arguing that diverse, high-quality training data is more critical than complex reward engineering. The core methodological contribution is a curated mixture of eight datasets (~14K examples) spanning three complementary task families: Retrieval (FuzzyNeedle, MultiNeedle), Multi-evidence Synthesis (CrossEntity, WebSearch, MultiQuery, KeyChain, LongDocQA), and Reasoning (LongMath). The authors employ a minimal outcome-based GRPO setup, demonstrating that this specific data recipe yields significant gains without auxiliary process rewards. The methodology is sound, leveraging synthetic data generation guided by LLMs to create "hard" samples that target specific failure modes of current long-context models (e.g., lexical shortcuts, incomplete coverage). The ablation studies effectively isolate the contribution of each task family, providing strong empirical support for the hypothesis that these three abilities are complementary and necessary for robust long-context reasoning.
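For readers unfamiliar with the "minimal outcome-based GRPO setup" mentioned above, the group-relative advantage at its core can be sketched as follows. This is a generic illustration of GRPO-style reward normalization, not the authors' training code: the function name, the `eps` constant, the use of the population standard deviation (implementations differ on this), and the example rewards are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of rollouts sampled
    for the same prompt: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt with a binary outcome reward (1 = correct answer).
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout gets the same reward, all advantages are zero and the
# prompt contributes no gradient signal.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Because the signal comes only from reward differences within a group, prompts that are uniformly solved or uniformly failed contribute nothing, which is one reason the difficulty mix of the training data matters in an outcome-only setup.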
The experimental evaluation is comprehensive and rigorous. The authors test their method on three Qwen3 model variants (4B, 8B, 30B-A3B) across seven long-context benchmarks, including multi-hop QA, holistic reasoning, and synthetic reasoning tasks. The results show consistent improvements over base models and prior RL training sets (DocQA-RL, KeyChain). Notably, the gains transfer to agentic tasks (GAIA, BrowseComp), suggesting broader utility. The evaluation also includes an analysis of generalization to contexts longer than the training distribution (up to 230K tokens), which is a crucial and impressive finding. The ablation studies on task balancing and reward design further strengthen the validity of the claims. The use of LLM-as-a-judge for certain metrics is noted, but the consistency of results across different evaluation protocols mitigates concerns.
The paper provides detailed descriptions of the data construction pipelines, including the specific datasets used, the synthetic generation prompts (implied by the description), and the RL training hyperparameters (GRPO, batch sizes, learning rates). The authors commit to releasing the datasets, which is a significant positive factor for reproducibility. The training setup is described in sufficient detail for replication by other researchers with similar computational resources. The use of standard frameworks (Miles, Megatron-LM, SGLang) also aids reproducibility.
The primary limitation is the scale of the training data (~14K examples), which is small compared to the pre-training data of the base models. While effective for RL fine-tuning, the generalizability to even larger models or different model families is not fully explored. The synthetic nature of most datasets raises questions about their alignment with real-world long-context distributions, although the transfer to agentic tasks suggests some degree of realism. The reliance on LLM-as-a-judge for evaluation and reward calculation introduces potential biases, although the authors attempt to mitigate this with rule-based extraction where possible. The computational cost, while manageable for a research lab, is non-trivial.
This work has significant implications for the development of long-context LLMs and autonomous agents. By demonstrating that a simple data recipe can outperform complex reward engineering, it shifts the focus of the community towards data curation and quality. The release of the datasets will facilitate further research in this area. The transfer to agentic tasks highlights the potential for improving real-world AI systems that operate in long-context environments. However, the reliance on synthetic data and LLM judges warrants careful consideration regarding bias and robustness.
Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.
Primary: ByteDance Inc
All Institutions: ByteDance Inc, NUS-Lab
OPD-Evolver introduces a novel slow-fast co-evolution framework that trains agents to holistically manage their memory lifecycle through on-policy self-distillation with privileged hindsight, achieving state-of-the-art performance on self-evolving benchmarks with significantly smaller models.
The paper proposes OPD-Evolver, a framework for "self-evolving" agents that integrates a four-level memory hierarchy (trajectories, tips, skills, tools) with a slow-fast co-evolution loop. The fast loop handles online interaction and memory management, while the slow loop uses outcome-calibrated memory attribution and privileged hindsight distillation to train the agent's selection, execution, writing, and maintenance capabilities via on-policy distillation. The methodology is technically sound and addresses a critical gap in current agentic systems: the disconnect between storing experience and learning to evolve from it. The introduction of "privileged hindsight" where the teacher has access to future utility and repository diagnostics is a novel twist on standard on-policy distillation, allowing the student to learn not just task execution but meta-cognitive skills for memory management.
The experimental evaluation is comprehensive, covering multiple benchmarks (LifelongAgentBench, MemoryArena, AMA-Bench, InterCode, MiniHack) and comparing against strong memory-augmented baselines (ReasoningBank, EvolveR, MemEvolve) and training-based methods (Skill0, GRPO). The results show significant improvements, particularly on the 9B model which challenges much larger 196B+ models. The ablation studies effectively isolate the contribution of each component (selection, writing, maintenance, attribution), providing strong evidence for the design choices. The comparison with giant counterparts is impressive and highlights the efficiency of the approach.
The paper provides detailed descriptions of the training data sources (AWM, Nemotron-Terminal-Corpus, EnvScaler), hyperparameters, and evaluation protocols. The use of standard backbones (Qwen3-4B, Qwen3.5-9B) and open benchmarks enhances reproducibility. The algorithm and loss functions are clearly defined.
The paper relies on a specific four-level memory hierarchy which may not generalize perfectly to all domains without adaptation. The "privileged hindsight" in the slow loop requires logging detailed interaction histories and future outcomes, which might be computationally expensive or infeasible in some real-time, non-interactive settings. The evaluation is primarily on coding and reasoning tasks; performance on open-ended creative or social tasks is less clear.
This work advances the field of self-evolving agents by providing a practical framework for internalizing memory management skills. It reduces the reliance on large context windows and external memory systems, potentially lowering inference costs and improving privacy. It sets a new standard for what constitutes a "qualified" agent evolver, moving beyond simple retrieval-augmented generation to holistic lifecycle management of experience.
Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime interactions must jointly produce coherent gameplay. We formalize end-to-end game generation as the problem of producing a complete game artifact that realizes a specification through observable player-game interaction in a target environment. We argue that evaluating this setting requires three desiderata: Engine Grounding, Artifact Completeness, and Interactive Verification. We propose an interaction-grounded evaluation framework that assesses executable gameplay through replayed demonstrations and rubric-guided multimodal judging. We instantiate this framework as GameCraft-Bench, a benchmark comprising 140 Godot tasks across 15 game families. Evaluations of frontier coding agents show that end-to-end game generation remains highly challenging: the strongest agent achieves only 41.46%, and most agents score below 40%. Further analysis reveals that while agents often implement recognizable mechanics, they struggle to deliver complete games with sufficient content, functional visual feedback, and coherent presentation. See https://tongxuluo.github.io/gamecraft-bench-website for demos, code, and data.
Primary: Shenzhen Loop Area Institute
All Institutions: Shenzhen Loop Area Institute, The Chinese University of Hong Kong
GameCraft-Bench establishes a new standard for evaluating coding agents by requiring them to produce complete, playable games in a real engine, revealing that current frontier models remain far from solving the complex, multi-faceted challenge of end-to-end interactive system generation.
The paper introduces a rigorous evaluation framework for end-to-end game generation, addressing a critical gap in current coding agent benchmarks. By defining three desiderata—Engine Grounding, Artifact Completeness, and Interactive Verification—it shifts the focus from static code correctness to dynamic, interactive system behavior. The methodology involves creating a benchmark (GameCraft-Bench) with 140 tasks in the Godot engine, requiring agents to produce complete, launchable projects along with replayable interaction traces. The evaluation uses a multimodal judge to score gameplay videos against hidden rubrics, ensuring that the generated artifacts are not just runnable but playable and coherent. This approach is novel in its specific focus on the full lifecycle of game creation within a real engine, rather than isolated mechanics or web-based prototypes.
The authors evaluate seven frontier coding agents (including Claude Code, Codex, Kimi Code, and Code Buddy) on the benchmark. The results show that even the strongest agent scores only 41.46%, highlighting the significant challenge of end-to-end game generation. The analysis breaks down performance by categories (Core Mechanics, Content Depth, Functional Visuals, Art and Presentation), revealing that agents struggle more with content depth and presentation than with basic mechanics. The paper also provides diagnostic insights, such as the correlation between visual debugging and success, and the decoupling of tool usage volume from task quality. The experiments are well-designed, with clear metrics and a robust evaluation pipeline.
The paper provides detailed implementation details, including the Godot version, runtime environment, and submission format. The benchmark tasks are structured with clear specifications and rubrics. The use of replayable traces ensures that the evaluation is deterministic and reproducible. The code and data are made available via the project website, facilitating replication. The multimodal judge's stability is also tested, showing consistent scores across repeated evaluations.
The benchmark is limited to 2D games in the Godot engine, excluding 3D games and other major engines like Unity or Unreal. The evaluation relies on visual evidence, so audio-dependent aspects are not directly assessed. The multimodal judge may have biases or limitations in visual understanding, although the paper attempts to calibrate this with human annotators. The benchmark does not measure subjective fun or creativity, focusing instead on specification adherence and playability.
This work has significant implications for the development of autonomous coding agents, particularly in creative and interactive domains. By establishing a rigorous benchmark for end-to-end game generation, it provides a clear target for future research and helps identify specific weaknesses in current models. The findings suggest that while agents can produce functional code, they struggle with the holistic integration of mechanics, content, and presentation required for a complete game. This insight can guide the development of more robust and capable agents. The benchmark itself serves as a valuable resource for the community, enabling standardized comparison of different approaches to game generation.
On-policy self-distillation (OPSD) has proven effective for post-training large language models (LLMs), yet its application to diffusion LLMs (dLLMs) remains unexplored. Existing OPSD methods are inherently autoregressive-centric. They inject privileged information via left-to-right prefix conditioning with token-level divergence supervision, a design that fundamentally conflicts with the arbitrary-order generation of dLLMs. We introduce d-OPSD, the first OPSD framework tailored for dLLMs. Our approach makes two core contributions. First, we reframe self-teacher construction by using self-generated answers as suffix conditioning, enabling the student model to learn from "self future-experience" rather than privileged prefixes. Second, we shift supervision from token-level to step-level, aligning training with the iterative denoising process of dLLMs. Experiments across four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines with superior sample efficiency, requiring only around 10% of the optimization steps required by RLVR and opening a promising pathway for dLLM post-training. The code is available at https://github.com/xingzhejun/d-OPSD.
Primary: unknown
All Institutions: unknown
This paper presents d-OPSD, the first on-policy self-distillation framework for diffusion LLMs, which adapts teacher construction to suffix conditioning and supervision to step-level divergence, achieving superior sample efficiency and reasoning performance over RLVR baselines.
The paper proposes d-OPSD, a novel on-policy self-distillation framework tailored for diffusion LLMs (dLLMs). The core methodological innovation lies in two key adaptations to the dLLM architecture: (1) reframing the self-teacher construction to use self-generated answers as *suffix* conditioning (leveraging the bidirectional/arbitrary-order nature of dLLMs) rather than the standard autoregressive *prefix* conditioning, and (2) shifting divergence supervision from token-level to step-level to align with the iterative denoising process. This is a theoretically sound and logically consistent adaptation of on-policy distillation to a non-autoregressive paradigm. The insight that suffix conditioning allows the student to learn from its own "future experience" is compelling and distinct from existing AR-based OPSD methods.
The experimental evaluation is robust within the specific domain of dLLM post-training. The authors compare d-OPSD against strong RLVR baselines (diffu-GRPO, VRPO) and SFT baselines across four reasoning benchmarks (GSM8K, MATH500, Countdown, Sudoku). The results demonstrate consistent performance improvements over baselines, with a particularly striking claim of superior sample efficiency (converging in ~10% of the optimization steps required by RLVR). The ablation studies are thorough, covering retaining ratios, top-k subset selection, divergence objectives (KL vs Reverse KL), and clipping strategies. The "toy verification" effectively validates the self-teacher's capability. However, the evaluation is limited to a single base model (LLaDA-8B-Instruct), which restricts the generalizability of the findings to other dLLM architectures.
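The ablation over divergence objectives (KL vs Reverse KL) mentioned above hinges on the fact that KL divergence is asymmetric. The following minimal sketch shows the two directions on a toy categorical distribution. It applies to any pair of discrete distributions and does not reproduce how d-OPSD aggregates divergences per denoising step; the distributions, the function name `kl`, and the convention that "forward" means KL(teacher || student) are assumptions made for illustration.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary (hypothetical values).
teacher = [0.70, 0.20, 0.10]
student = [0.40, 0.40, 0.20]

forward_kl = kl(teacher, student)  # penalizes the student for missing teacher mass
reverse_kl = kl(student, teacher)  # penalizes student mass where the teacher has little
```

The two quantities generally differ, so choosing one or the other changes which mismatches the student is pushed hardest to correct.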
The paper provides a GitHub repository link and detailed implementation notes in the appendix, including LoRA settings, optimizer details, and specific engineering tricks like input concatenation to manage memory. The description of the step-level divergence and teacher construction is mathematically precise. The code availability significantly enhances reproducibility.
The primary limitation is the lack of evaluation on multiple dLLM backbones. Since dLLMs are a nascent field with varying architectures (block-diffusion, continuous diffusion, etc.), validating d-OPSD on only one model (LLaDA) is a risk. Additionally, the paper acknowledges a failure mode of policy collapse, which is common in RL/RLVR and suggests that stability might still be an issue requiring careful hyperparameter tuning (e.g., clipping). The claim of being the "first" OPSD framework for dLLMs is a strong one in a fast-moving field; the paper does mention related off-policy self-distillation works (d3LLM, Cd4LM) and clearly distinguishes its on-policy nature from them.
This work opens a new pathway for post-training dLLMs, potentially making them more efficient to train by reducing the need for expensive RLVR rollouts. By demonstrating that dLLMs can effectively leverage on-policy self-distillation, it bridges a gap between autoregressive and diffusion-based language modeling paradigms. The efficiency gains (10% steps) could lower the carbon footprint and cost of training high-performance reasoning models.
Knowledge distillation transfers a teacher's competence to a small student but is brittle in the small-student regime: forcing the student to imitate logits from a much larger teacher concentrates it on the teacher's sharpest modes, hurting generalization on benchmark families beyond the training corpus. Reinforcement learning (RL) avoids logit imitation by training on the student's own rollouts. However, on questions where every rollout fails (yielding zero advantage and being silently discarded), injecting a stronger teacher's response into the policy gradient breaks the on-policy assumption and induces drift. We introduce Zone of Proximal Policy Optimization (ZPPO), inspired by Vygotsky's zone of proximal development, which keeps the teacher inside the prompt rather than the policy gradient. On hard questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates the student must discriminate, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts into a single prompt to surface their shared failure modes. A prompt replay buffer recirculates each hard question until it either graduates (the student's mean rollout accuracy on it reaches half) or is FIFO-evicted under finite capacity, amplifying BCQ and NCQ inside the student's current zone of proximal development. On the Qwen3.5 family at four student scales (0.8B-9B) with a 27B teacher, post-trained as vision-language models and evaluated on a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), ZPPO outperforms off/on-policy distillation and GRPO, with the largest gains at the smallest scale.
Primary: unknown
All Institutions: unknown
This paper introduces a novel prompt-based reinforcement learning framework, ZPPO, that effectively leverages teacher knowledge through structured candidate discrimination and failure aggregation, demonstrating significant improvements in small model performance across diverse multimodal benchmarks.
The paper proposes Zone of Proximal Policy Optimization (ZPPO), a method designed to address the brittleness of knowledge distillation in small student models and the on-policy violation issues in standard RL fine-tuning. The core innovation lies in keeping the teacher's influence within the prompt context rather than the gradient update. Specifically, it constructs two types of reformulated prompts for "hard" questions: Binary Candidate-included Questions (BCQ), which force the student to discriminate between a correct teacher response and an incorrect student rollout, and Negative Candidate-included Questions (NCQ), which aggregate student failures to highlight shared error modes. A prompt replay buffer is used to recirculate these hard questions until the student improves or they are evicted. The methodology is theoretically grounded in Vygotsky's Zone of Proximal Development, offering a psychologically inspired mechanism for curriculum learning within RLHF/RLAIF frameworks. The approach is novel in its specific construction of candidate pairs and negative aggregation, aiming to stabilize training where advantage estimates are noisy or zero.
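The replay-buffer behavior described above (recirculate a hard question until it graduates at a mean rollout accuracy of one half, or evict it first-in-first-out when capacity is exceeded) can be sketched roughly as follows. This is a hypothetical illustration, not ZPPO's code: the class name, method names, and the choice to check graduation on each reported batch of rollouts are assumptions, and BCQ/NCQ prompt construction is omitted.

```python
from collections import OrderedDict

class PromptReplayBuffer:
    """Hypothetical FIFO buffer of hard questions (all names are illustrative)."""

    def __init__(self, capacity, graduate_at=0.5):
        self.capacity = capacity
        self.graduate_at = graduate_at
        self._items = OrderedDict()  # question_id -> question, oldest first

    def add(self, qid, question):
        """Insert a hard question; return the id evicted to make room, if any."""
        if qid in self._items:
            return None
        self._items[qid] = question
        if len(self._items) > self.capacity:
            evicted_id, _ = self._items.popitem(last=False)  # drop the oldest
            return evicted_id
        return None

    def report(self, qid, rollout_correct):
        """Record rollout outcomes; remove the question and return True if it graduates."""
        if qid not in self._items or not rollout_correct:
            return False
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.graduate_at:
            del self._items[qid]
            return True
        return False

    def pending(self):
        """Question ids still being recirculated, oldest first."""
        return list(self._items)

buf = PromptReplayBuffer(capacity=2)
buf.add("q1", "hard question 1")
buf.add("q2", "hard question 2")
evicted = buf.add("q3", "hard question 3")                 # over capacity: oldest leaves
graduated = buf.report("q2", [True, False, True, False])   # mean accuracy 0.5
still_hard = buf.report("q3", [False, False, False, True]) # mean accuracy 0.25
```

Here `q1` is evicted by capacity, `q2` graduates, and `q3` stays in the buffer for further recirculation.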
The evaluation covers the Qwen3.5 family at four scales (0.8B to 9B) using a 27B teacher model. The benchmark suite is comprehensive, including 16 VLM, 10 LLM, and 5 Video benchmarks, totaling 31 tasks. The results indicate that ZPPO outperforms off-policy distillation, on-policy distillation, and GRPO (Group Relative Policy Optimization). The gains are most pronounced at the smallest scale (0.8B), which aligns with the paper's premise that small students benefit most from structured teacher guidance. The inclusion of vision-language and video benchmarks adds significant breadth, demonstrating the method's applicability beyond pure text. However, the specific baseline comparisons could be more detailed regarding computational costs and convergence speed, which are critical for RL methods.
The paper provides an algorithm description and mentions hyperparameters in the appendix. The use of a replay buffer and specific prompt construction methods (BCQ/NCQ) is clearly defined. However, the exact implementation details of the "Negative Candidate-included Question" aggregation and the specific thresholds for the replay buffer eviction (FIFO capacity) need to be clearly documented for full reproducibility. The reliance on a specific teacher model (Qwen3.5 27B) and dataset composition is noted, but the exact data splits and preprocessing steps for the multimodal benchmarks should be explicitly detailed to ensure other researchers can replicate the setup.
The paper acknowledges limitations in Section 7, likely related to the computational overhead of generating multiple candidate prompts and the potential for the replay buffer to become a bottleneck if not managed efficiently. The method's effectiveness is tied to the quality of the teacher; if the teacher is not significantly better than the student, the "hard" questions may not be well-defined. Additionally, the performance on very easy questions might not see significant improvement, as the replay buffer focuses on hard cases. The generalization to domains where the teacher's knowledge is sparse or incorrect is also a potential limitation not fully explored.
ZPPO offers a pathway to more efficient and robust fine-tuning of small language and vision-language models, which are crucial for edge deployment and cost-effective AI services. By improving the performance of smaller models, it could democratize access to high-quality AI capabilities. The method's focus on stability and generalization addresses key pain points in current RLHF pipelines, potentially leading to more reliable and safer AI systems. However, the increased complexity in training loops and prompt generation could lead to higher energy consumption if not optimized, which is a consideration for broader environmental impact.
Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.
Primary: Duke University
All Institutions: Duke University
This paper introduces Temporal Self-Imitation Learning (TSIL), a novel reinforcement learning framework that leverages temporal efficiency as a self-supervisory signal to improve long-horizon robot manipulation. The method's innovative use of configuration-conditioned adaptive temporal targets and efficiency-weighted self-imitation learning, coupled with extensive empirical validation across 15 distinct tasks, demonstrates significant improvements in learning efficiency, task-completion efficiency, and robustness, positioning it as a highly impactful contribution to the field of robotics and reinforcement learning.
The proposed Temporal Self-Imitation Learning (TSIL) framework presents a well-conceived approach to address critical challenges in long-horizon robot manipulation: inefficient reward exploitation and the forgetting of rare, efficient behaviors. TSIL's core innovation lies in leveraging temporal efficiency itself as a self-supervisory signal. This is achieved through two main mechanisms:
1. **Configuration-conditioned adaptive temporal targets:** Instead of relying on static reward shaping, TSIL dynamically derives temporal targets from the fastest successful trajectories observed so far, conditioned on the current state (configuration). This makes the learning targets progressively more challenging and context-aware, pushing the policy towards increasingly efficient solutions. This adaptive mechanism is a significant improvement over fixed reward functions, which can often be exploited or become suboptimal as the policy improves.
2. **Efficiency-weighted self-imitation learning:** TSIL explicitly preserves and replays these fast, successful behaviors. By weighting the imitation loss based on the temporal efficiency of past trajectories, it prioritizes learning from the most optimal experiences. This directly combats the problem of catastrophic forgetting of rare but highly effective actions, ensuring that the policy continuously refines its understanding of efficient pathways.
The methodology is coherent, directly targets known limitations of existing RL approaches in complex robotic tasks, and offers a scalable way to generate self-supervision.
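The two mechanisms can be made concrete with a small sketch. Everything here is an assumption made for illustration: the discrete `config` key, the choice of "fastest success so far" as the temporal target, and the ratio `target / steps` as the imitation weight. The paper's actual target and weighting functions are not given in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    config: str      # hypothetical discrete key for the task configuration
    steps: int       # episode length in environment steps (assumed > 0)
    success: bool


class TemporalTargets:
    """Tracks the fastest successful episode length seen per configuration."""

    def __init__(self):
        self.best = {}

    def update(self, traj: Trajectory) -> None:
        # Only successful episodes can tighten the target.
        if traj.success:
            previous = self.best.get(traj.config, traj.steps)
            self.best[traj.config] = min(previous, traj.steps)

    def target(self, config: str):
        return self.best.get(config)   # None until a success has been seen


def imitation_weight(traj: Trajectory, targets: TemporalTargets) -> float:
    """Weight in [0, 1]: 1.0 for the fastest known success, smaller for slower
    successes, 0.0 for failures or unseen configurations."""
    t = targets.target(traj.config)
    if not traj.success or t is None:
        return 0.0
    return min(1.0, t / traj.steps)
```

Under these assumptions, the weight would scale a behavior-cloning loss on replayed trajectories, so faster successes contribute more and the target tightens as the policy finds quicker solutions.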
The experimental evaluation appears strong: the paper claims consistent improvements across "15 distinct long-horizon manipulation tasks." This breadth of evaluation is crucial for demonstrating the generalizability and robustness of the TSIL framework beyond specific, hand-picked scenarios. The metrics of interest (learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions) are all highly relevant for practical robot learning. The abstract's claim of "consistently improves" implies repeatable gains across tasks, which is a high bar for empirical success in this domain. If these claims hold, the empirical evidence strongly supports the method's effectiveness and practical utility, making it a significant contribution to the field.
The mention of a project URL (`https://generalroboticslab.com/TSIL`) is a strong positive indicator for reproducibility. Project pages often include code implementations, detailed experimental setups, datasets, and potentially pre-trained models or videos, which are essential for researchers to verify and build upon the work. The structured nature of the paper (Method, Experiments sections) also implies a detailed description of the algorithm and experimental protocols.
While the paper presents a very strong case, potential limitations might include:
1. **Initial Success Requirement:** TSIL relies on mining "fast successful trajectories." If initial task success is extremely rare or non-existent, the method might struggle to bootstrap.
2. **Computational Overhead:** Mining, storing, and adaptively managing a growing set of efficient trajectories, especially in high-dimensional state spaces, could introduce computational overhead.
3. **Definition of "Configuration-conditioned":** The complexity of defining and implementing "configuration-conditioned" targets might vary significantly with the task and state representation, potentially requiring careful engineering.
4. **Generalizability beyond temporal efficiency:** While temporal efficiency is critical, some tasks might have other primary optimization criteria (e.g., energy consumption, safety, precision) that TSIL, in its current form, might not directly optimize.
TSIL offers a powerful paradigm shift in how self-supervision can be generated for reinforcement learning, moving beyond manually engineered reward shaping. By demonstrating that the temporal structure of successful behavior itself provides a scalable self-supervisory signal, it opens new avenues for designing more efficient, robust, and autonomous learning systems for complex robotic tasks. This could significantly reduce the burden of reward engineering, accelerate policy learning, and lead to more dexterous and capable robots in real-world applications. The principles could also inspire similar approaches in other long-horizon decision-making problems beyond robotics.
Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.
Primary: ETH AI Center
All Institutions: ETH AI Center
The main contribution of this paper is the introduction of the Geometric Action Model, which effectively combines geometric reasoning with language-conditioned action prediction, demonstrating superior performance in robotic manipulation tasks. This work represents a significant advancement in the field of robotics, addressing critical challenges in the integration of perception and action in complex environments.
The proposed Geometric Action Model (GAM) introduces a novel architecture that leverages a pretrained geometric foundation model (GFM) to integrate perception, temporal prediction, and action decoding in a single framework. By splitting the GFM and inserting a causal transformer, GAM effectively addresses the spatial ambiguities that traditional models face, allowing for more accurate and robust manipulation policies. This approach is innovative as it combines language conditioning with geometric reasoning, which is crucial for real-world robotic applications. The methodology is well-structured and presents a clear advancement over existing models.
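The data flow of the split-backbone design can be sketched with plain Python callables standing in for network blocks. This is a toy illustration of the routing only: `gam_forward`, the `heads` dictionary, and the toy blocks in the usage note are hypothetical, and real GFM blocks, tokens, and the causal predictor are far richer than lists of floats.

```python
from typing import Callable, Dict, List, Sequence

Tokens = List[float]
Block = Callable[[Tokens], Tokens]


def run_blocks(blocks: Sequence[Block], tokens: Tokens) -> Tokens:
    """Apply backbone blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


def gam_forward(blocks: Sequence[Block], split: int, predictor, heads: Dict[str, Callable],
                observation: Tokens, context) -> dict:
    """Route an observation through a backbone split at index `split`."""
    latent = run_blocks(blocks[:split], observation)   # shallow blocks act as the observation encoder
    future = predictor(latent, context)                # predictor forecasts future latent tokens
    features = run_blocks(blocks[split:], future)      # remaining blocks propagate the predicted tokens
    return {name: head(features) for name, head in heads.items()}
```

As a usage example, `gam_forward(blocks, 1, predictor, {"geometry": geometry_head, "action": action_head}, obs, ctx)` would return both decoded outputs from one pass. In the paper, `context` corresponds to language, proprioception, and action history.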
The paper reports extensive evaluations on both simulation and real-robot manipulation benchmarks, demonstrating that GAM outperforms current state-of-the-art models in terms of accuracy, robustness, speed, and efficiency. The experiments are comprehensive, covering a variety of scenarios that highlight the model's capabilities. However, specific details regarding the datasets used and the metrics for evaluation could be elaborated upon to strengthen the findings.
The paper does not provide explicit links to code or datasets, which raises concerns regarding reproducibility. While the methodology is described in detail, the absence of accessible resources limits the ability of other researchers to replicate the results. Including a project URL or supplementary materials with implementation details would significantly enhance reproducibility.
The authors acknowledge that the language reasoning and commonsense capabilities of GAM are constrained by the frozen text encoder. They suggest that integrating a larger language model or an external reasoning module could be beneficial, indicating a clear avenue for future work. Additionally, the paper could benefit from a discussion on potential failure cases or scenarios where the model may not perform optimally.
The GAM framework has significant implications for the field of robotics, particularly in enhancing the capabilities of robots to understand and interact with their environments in a more human-like manner. Its ability to unify geometric reasoning with action prediction could lead to advancements in various applications, including assistive robotics, autonomous vehicles, and interactive systems. The work could inspire further research into multimodal models that integrate language, vision, and action.
Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology, Honda Research Institute
The paper presents HOLO-MPPI, a novel hierarchical framework for multi-scenario motion planning that combines high-level policy learning with low-level optimization, demonstrating significant improvements in performance and robustness in autonomous driving tasks.
The paper introduces HOLO-MPPI, a hierarchical framework that effectively combines high-level policy learning with low-level stochastic optimal control for multi-scenario motion planning. The methodology is well-structured, leveraging reinforcement learning to generate robust priors that guide the MPPI optimization process. This dual-layer approach allows for real-time adaptation and enhances the robustness of motion planning across diverse scenarios, which is a significant advancement in the field of robotics. The use of an abstract action space for high-level planning is particularly innovative, as it enables the system to generalize across different driving scenarios without the need for scenario-specific tuning.
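To illustrate how a policy-generated prior feeds MPPI, here is a minimal sketch on a one-dimensional integrator. It shows the standard MPPI update (sample control sequences around a prior mean, weight them by exponentiated negative cost, average), not the paper's driving setup. The dynamics, cost, hyperparameters, and `policy_prior` (a hand-written stand-in for the learned high-level policy) are all assumptions.

```python
import math
import random


def rollout_cost(x0, controls, goal, dt=0.1):
    """Toy dynamics x += u * dt with a tracking cost and a small control penalty."""
    x, cost = x0, 0.0
    for u in controls:
        x += u * dt
        cost += (x - goal) ** 2 + 0.01 * u ** 2
    return cost


def policy_prior(x0, goal, horizon, dt=0.1):
    """Hypothetical stand-in for the learned policy: constant velocity toward the goal."""
    return [(goal - x0) / (horizon * dt)] * horizon


def mppi_step(x0, prior, goal, samples=256, sigma=0.5, temperature=1.0, rng=None):
    """One MPPI update: perturb the prior, weight samples by exp(-cost / temperature)."""
    rng = rng or random.Random(0)
    candidates = [[u + rng.gauss(0.0, sigma) for u in prior] for _ in range(samples)]
    costs = [rollout_cost(x0, c, goal) for c in candidates]
    best = min(costs)                                   # subtract the minimum for numerical stability
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    return [sum(w * c[t] for w, c in zip(weights, candidates)) / total
            for t in range(len(prior))]
```

A call such as `mppi_step(0.0, policy_prior(0.0, 1.0, 10), 1.0)` refines the policy's proposal. A good prior matters because samples are only drawn near it, which is why a learned prior can replace hand-designed ones across scenarios.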
The experiments are comprehensive, evaluating the proposed method against multiple baselines in a realistic autonomous driving benchmark. The results demonstrate clear improvements in success rates and control smoothness compared to both vanilla MPPI and end-to-end reinforcement learning approaches. The paper provides sufficient detail on the experimental setup, including the scenarios tested and the metrics used for evaluation, which strengthens the credibility of the findings.
The paper lacks a dedicated section for reproducibility, and there are no URLs provided for code or data repositories. While the methodology is described in detail, the absence of supplementary materials may hinder the ability of other researchers to replicate the results. Providing access to the code and datasets would significantly enhance reproducibility.
One limitation noted is the reliance on a learned world model that may not fully capture the dynamics of neighboring vehicles in complex environments. This could affect performance in more interactive scenarios, such as urban driving. Additionally, while the hierarchical approach shows promise, it may require careful tuning and validation in real-world applications to ensure reliability.
The proposed method has the potential to significantly impact the field of robotics, particularly in autonomous driving applications. By enabling robots to operate effectively across diverse scenarios without extensive retuning, this framework could facilitate the deployment of autonomous systems in real-world environments, improving safety and efficiency. The hierarchical approach may also inspire further research into combining learning and optimization techniques in other robotic domains.
Federated fine-tuning of large language models using parameter-efficient methods such as LoRA enables privacy-preserving adaptation of foundation models. Heterogeneous hardware resources introduce challenges, as clients with different adapter ranks cannot be directly aggregated. While existing methods enable aggregation under heterogeneous ranks, they fail to control how information is distributed across rank dimensions, leading to suboptimal use of shared low-rank representations. Instead, we propose PreLort: a nested low-rank formulation for federated LoRA that organizes adapter dimensions into a prefix hierarchy. Our approach ensures that lower-rank dimensions encode task-relevant information, while higher-rank dimensions capture additional capacity. Building on this, we introduce (i) a segment-wise aggregation rule that averages only over clients contributing to each rank segment, avoiding dilution from zero-padded lower-rank clients, and (ii) a prefix-nested training strategy that optimizes each adapter under multiple rank truncations, encouraging useful signal to concentrate in low-rank prefix dimensions. Together, these components encourage a consistent low-rank prefix capturing the most task-relevant information, while higher-rank dimensions learn additional capacity. This allows low-rank clients to benefit from richer information contributed by higher-rank clients, as prefix dimensions are consistently learned and aggregated. Experiments demonstrate that our method consistently outperforms prior heterogeneous federated LoRA methods in accuracy and ROUGE-L, while achieving lower or comparable perplexity across multiple base models.
Primary: University of Cambridge
All Institutions: University of Cambridge
The paper presents PreLort, a federated LoRA fine-tuning method that effectively manages rank heterogeneity through a novel nested training and aggregation strategy, demonstrating significant improvements in model performance across multiple benchmarks. This contribution is poised to influence future research and applications in federated learning and parameter-efficient fine-tuning.
The proposed methodology, PreLort, introduces a novel nested low-rank formulation for federated LoRA that effectively addresses rank heterogeneity among clients. By organizing adapter dimensions into a prefix hierarchy and employing a segment-wise aggregation rule, the approach ensures that lower-rank dimensions capture task-relevant information while higher-rank dimensions provide additional capacity. This dual focus on training and aggregation is a significant advancement over existing methods, which primarily address aggregation without considering representation alignment during training.
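The difference between segment-wise aggregation and zero-padded averaging can be shown on a toy example. The sketch below treats one adapter factor as a list of rank rows per client and uses uniform client weights. Both simplifications are assumptions, as are the function names; how the paper handles the two LoRA factors and client weighting is not specified here.

```python
def segment_wise_average(client_rows):
    """Average each rank row only over clients whose rank includes that row.
    client_rows[i] is a list of rows; its length is client i's rank."""
    max_rank = max(len(rows) for rows in client_rows)
    aggregated = []
    for k in range(max_rank):
        contributors = [rows[k] for rows in client_rows if len(rows) > k]
        dim = len(contributors[0])
        aggregated.append([sum(r[j] for r in contributors) / len(contributors)
                           for j in range(dim)])
    return aggregated


def zero_pad_average(client_rows):
    """Baseline: pad lower-rank clients with zero rows, then average over all clients."""
    max_rank = max(len(rows) for rows in client_rows)
    dim = len(client_rows[0][0])
    n = len(client_rows)
    out = []
    for k in range(max_rank):
        row = [0.0] * dim
        for rows in client_rows:
            if len(rows) > k:
                row = [a + b for a, b in zip(row, rows[k])]
        out.append([v / n for v in row])            # zero rows dilute the higher-rank segment
    return out
```

With a rank-1 client holding `[[2.0, 2.0]]` and a rank-2 client holding `[[4.0, 4.0], [6.0, 6.0]]`, both rules agree on the shared first row, but zero-padding halves the second row while segment-wise averaging keeps it intact. A low-rank client would then take the prefix `aggregated[:r]` of the result.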
The experiments are well-structured, utilizing standard instruction-tuning benchmarks such as Alpaca and databricks-dolly-15k, along with a classification dataset, 20 Newsgroups. The results demonstrate consistent improvements in accuracy and ROUGE-L metrics across multiple base models, indicating the effectiveness of the proposed method. The comparisons against several baseline methods, including ZeroPad, HetLoRA, and FLoRA, provide a comprehensive evaluation of PreLort's performance.
The paper provides sufficient implementation details, including hyperparameters, training setups, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits the ease of reproducibility for the broader research community.
One identified limitation is the dependency of the nested training strategy on sufficient local optimization steps, which may increase computational overhead. This trade-off between communication efficiency and training time could hinder practical deployment in resource-constrained environments. Additionally, the paper does not address potential scalability issues when applied to a significantly larger number of clients or more complex models.
The implications of this work are substantial, as it addresses a critical challenge in federated learning, rank heterogeneity, enabling more effective collaboration among clients with varying computational resources. This advancement could facilitate the deployment of large language models in real-world applications where privacy and resource constraints are paramount, thus enhancing the accessibility of advanced AI technologies.