Last 7 Days (June 17 – June 23, 2026)
Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
Primary: Max Planck Institute for Human Development
All Institutions: Max Planck Institute for Human Development, University of Konstanz, Barcelona Supercomputing Center, University of Basel
This paper provides a critical psychometric analysis demonstrating that apparent psychological profiles in LLMs are largely artifacts of response bias rather than genuine traits, fundamentally challenging current evaluation practices and calling for more rigorous, orthogonal measurement frameworks in AI research.
The paper employs a rigorous psychometric framework to deconstruct the measurement of LLM "personalities." By modeling responses as a function of latent trait and response bias, and utilizing the concept of "response orthogonality" (the proportion of items for which trait and bias point in opposite directions, in practice the reverse-keyed items), the authors provide a mathematically sound method to separate genuine trait variance from systematic response artifacts. This approach is theoretically robust and directly addresses a fundamental confound in current LLM evaluation methodologies.
The experimental design is comprehensive, testing 56 instruction-tuned LLMs against a battery of 29 instruments (personality and risk preference) and comparing them against large human reference samples. The results are striking and consistent: LLMs show positive forward-reverse correlations (indicating bias dominance) whereas humans show negative correlations (indicating trait dominance). The variance decomposition, which attributes 81-90% of between-model variation to response bias against 9-16% in humans, is a powerful empirical finding. The robustness checks across prompting conditions, model sizes, and elicitation methods strengthen the validity of the conclusions.
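The sign flip in forward-reverse correlations can be reproduced with a toy simulation. The sketch below is not the authors' measurement model: it assumes a simple additive form in which a raw (unrecoded) response equals `key * trait + bias + noise`, with `key = +1` for forward items and `-1` for reverse items. All parameter values (`trait_sd`, `bias_sd`, `noise_sd`, sample sizes) are illustrative assumptions.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def forward_reverse_correlation(trait_sd, bias_sd, n_respondents=2000,
                                n_items=10, noise_sd=0.5, seed=0):
    """Correlate raw forward-item and raw reverse-item scale scores.

    Assumed toy model (not taken from the paper):
        response = key * trait + bias + noise
    """
    rng = random.Random(seed)
    fwd_scores, rev_scores = [], []
    for _ in range(n_respondents):
        trait = rng.gauss(0, trait_sd)   # what the instrument targets
        bias = rng.gauss(0, bias_sd)     # content-independent directional bias
        fwd = [trait + bias + rng.gauss(0, noise_sd) for _ in range(n_items)]
        rev = [-trait + bias + rng.gauss(0, noise_sd) for _ in range(n_items)]
        fwd_scores.append(sum(fwd) / n_items)
        rev_scores.append(sum(rev) / n_items)
    return pearson(fwd_scores, rev_scores)


# Trait-dominated respondents (human-like under this toy model).
human_like = forward_reverse_correlation(trait_sd=1.0, bias_sd=0.3)
# Bias-dominated respondents (LLM-like under this toy model).
llm_like = forward_reverse_correlation(trait_sd=0.3, bias_sd=1.0)

print(f"trait-dominated: r = {human_like:.2f}")
print(f"bias-dominated:  r = {llm_like:.2f}")
```

Under this model the covariance between the two raw scores is the bias variance minus the trait variance, so the correlation is negative when trait dominates and positive when bias dominates, which is the pattern the review describes.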
The paper provides detailed methodological descriptions, including the specific instruments used, the prompting strategies, and the mathematical derivations for the measurement model. The code repository is explicitly linked, ensuring high reproducibility. The use of standard psychometric instruments and clear definitions of variables enhances the clarity of the experimental setup.
The study focuses primarily on post-trained models and treats each model as a single respondent, which may not capture within-model variability or the nuances of different prompting strategies (e.g., persona adoption). The proprietary model sample size is small (N=10), limiting statistical power for those specific comparisons. Additionally, the study is limited to personality and risk domains; while the authors argue for broader applicability, empirical validation in other domains (e.g., moral reasoning, cognitive biases) is left for future work.
This paper has significant implications for the field of AI safety, alignment, and the use of LLMs as proxies for human participants in research. It challenges the validity of current LLM profiling practices and calls for a re-evaluation of how we measure and interpret LLM behaviors. The concept of response orthogonality offers a new standard for designing valid evaluation instruments for AI systems.
Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.
Primary: Zhejiang University
All Institutions: Zhejiang University, Huawei, Harbin Institute of Technology, Shenzhen
SafeSpec has a significant positive broader impact by enabling the deployment of safer and more efficient large language models. By addressing the fundamental incompatibility between speculative inference and existing safety defenses, it paves the way for LLMs to be used in more sensitive and performance-critical applications. This framework can help mitigate the risks of harmful content generation, thereby increasing public trust and responsible AI deployment. The approach of recovering safe continuations rather than hard refusal also improves user experience and model helpfulness. The paper does not highlight any specific negative societal consequences beyond the general risks associated with LLMs, which its work aims to mitigate. SafeSpec introduces a novel safety-aware speculative inference framework that elegantly integrates risk estimation and recovery directly into the LLM decoding process. This work makes a substantial technical contribution by demonstrating a superior safety-efficiency trade-off through a lightweight latent safety head and a dynamic rollback-and-reflect multi-sampling mechanism, offering a practical and impactful solution for deploying safer and faster LLMs in real-world applications.
The methodology of SafeSpec is well-conceived and addresses a critical gap in LLM deployment: integrating safety guarantees into speculative inference without negating its acceleration benefits. The core innovation lies in its dual-head verification mechanism. By attaching a lightweight, boundary-aligned latent safety head to the target model, SafeSpec enables simultaneous assessment of semantic validity and safety in a single forward pass. This design is elegant as it leverages the target model's existing computation for quality scoring, incurring negligible additional overhead for safety checks. The boundary-aligned extraction of hidden states for the safety head is a clever detail, preventing interference from the quality scoring prompt. The training methodology for the safety head, using step-wise prefix construction and a guard model for labeling, is sound for aligning the head with the inference process. The "rollback-and-reflect" mechanism, coupled with safety-guided multi-sampling, is a significant departure from traditional hard refusal strategies. Framing jailbreak attacks as distributional shifts where harmful continuations become more probable but safe ones are not entirely eliminated provides a strong theoretical underpinning for the multi-sampling approach. The rollback to a previous, potentially "cleaner" state, combined with a reflection prompt, effectively reshapes the sampling space, increasing the probability of finding a safe continuation. This soft intervention strategy is crucial for maintaining utility and helpfulness, avoiding the common pitfall of over-refusal. The probabilistic view of multi-sampling is clearly articulated, demonstrating how increasing sample size $K$ improves the chance of recovery.
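The probabilistic argument for multi-sampling can be made concrete with a short calculation. This is a sketch under a simplifying assumption that is not stated in the source: each of the $K$ samples is drawn independently and is safe with the same probability `p_safe`. The function names are hypothetical.

```python
import math


def recovery_probability(p_safe: float, k: int) -> float:
    """Probability that at least one of k independent samples is safe."""
    if not 0.0 <= p_safe <= 1.0:
        raise ValueError("p_safe must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p_safe) ** k


def min_samples(p_safe: float, target: float) -> int:
    """Smallest k with recovery_probability(p_safe, k) >= target."""
    if not 0.0 < p_safe < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_safe and target must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_safe))


# Even when an attack leaves only a 20% chance of a safe continuation per
# sample, a handful of samples raises the chance of recovery considerably.
for k in (1, 2, 4, 8):
    print(k, round(recovery_probability(0.2, k), 3))
```

Because the failure probability shrinks geometrically in $K$, the benefit of extra samples saturates quickly, which is one reason a rollback or reflection step that raises the per-sample safe probability can matter more than a larger $K$.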
The experimental evaluation is comprehensive and rigorous. The authors use two distinct model families (Qwen3-32B and DeepSeek-R1-Distill-Llama-70B) with appropriate draft models, demonstrating the framework's scalability and versatility. Evaluation metrics cover three critical dimensions: defense against seven advanced adversarial attacks (ASR), over-refusal rates (XSTest), and general capabilities/efficiency (GSM8K, MATH, GPQA-diamond, and inference speedup). This multi-faceted evaluation provides a holistic view of SafeSpec's performance. SafeSpec consistently achieves state-of-the-art defense performance, significantly reducing ASR (e.g., by 15% on Qwen3-32B) while preserving substantial inference speedups on benign workloads (2.06x on Qwen3-32B, 1.76x on DeepSeek-70B). Crucially, it maintains low over-refusal rates and negligible accuracy degradation on general reasoning tasks, showcasing a superior safety-efficiency trade-off compared to strong baselines like SafeDecoding and SecDecoding. The ablation studies are well-designed, clearly demonstrating the necessity and synergistic effect of both the reflection prompt and multi-sampling. The comparison with a hard refusal strategy effectively highlights the benefits of SafeSpec's recovery mechanism. Hyperparameter analysis provides valuable insights into the trade-offs involved with sample size, safety threshold, and quality threshold. The detailed latency breakdown in the appendix is particularly insightful, transparently explaining the performance characteristics on benign vs. adversarial inputs and justifying the reduced throughput on jailbreak inputs as a feature of the defense. The comparison with a standalone guard model further validates SafeSpec's efficiency and user experience advantages.
The paper demonstrates good reproducibility. Code is made available on GitHub. The appendix provides detailed information on evaluation datasets, jailbreak prompt construction, quality scoring prompt, safety head configurations (architecture, parameter counts), and training setup (data sources, sampling, hyperparameters, data isolation). Layer choice ablation and per-benchmark sensitivity analysis for quality threshold are also included, providing further confidence in the design choices. The use of a fixed random seed is also mentioned.
1. **Reliance on Guard Model for Labeling**: The training data for the safety head is labeled using Qwen3Guard-Gen-8B. The performance and biases of this external guard model could implicitly limit the safety head's effectiveness and generalization, especially if the guard model itself is imperfect or susceptible to certain attacks.
2. **Heuristic Nature of Reflection Prompt**: While effective, the reflection prompt is a handcrafted heuristic. Its optimal design might be sensitive to the target model or specific attack types, and its generalizability across all future attacks is not guaranteed.
3. **Performance on Adversarial Inputs**: Although justified as a necessary cost for safety, the significant slowdown on jailbreak inputs (throughput below 1x) means that if an attacker can consistently trigger Safety Mode, they can effectively degrade the system's performance, even if they don't get a harmful response. This could be a denial-of-service vector.
4. **Adversarial Attacks on Safety Head**: As the safety head is a lightweight classifier, it might be susceptible to direct adversarial attacks designed to bypass it, rather than just the main LLM. The paper does not explore this.
5. **Fixed Rollback State**: The rollback mechanism reverts to the "previous state." For deeply embedded or multi-turn attacks, a single step rollback might not always be sufficient to reach a truly "clean" context, potentially requiring more sophisticated context recovery.
Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.
Primary: Duke University
All Institutions: Duke University
TSIL offers a powerful paradigm shift in how self-supervision can be generated for reinforcement learning, moving beyond manually engineered reward shaping. By demonstrating that the temporal structure of successful behavior itself provides a scalable self-supervisory signal, it opens new avenues for designing more efficient, robust, and autonomous learning systems for complex robotic tasks. This could significantly reduce the burden of reward engineering, accelerate policy learning, and lead to more dexterous and capable robots in real-world applications. The principles could also inspire similar approaches in other long-horizon decision-making problems beyond robotics. This paper introduces Temporal Self-Imitation Learning (TSIL), a novel reinforcement learning framework that leverages temporal efficiency as a self-supervisory signal to improve long-horizon robot manipulation. The method's innovative use of configuration-conditioned adaptive temporal targets and efficiency-weighted self-imitation learning, coupled with extensive empirical validation across 15 distinct tasks, demonstrates significant improvements in learning efficiency, task-completion efficiency, and robustness, positioning it as a highly impactful contribution to the field of robotics and reinforcement learning.
The proposed Temporal Self-Imitation Learning (TSIL) framework presents a well-conceived approach to address critical challenges in long-horizon robot manipulation: inefficient reward exploitation and the forgetting of rare, efficient behaviors. TSIL's core innovation lies in leveraging temporal efficiency itself as a self-supervisory signal. This is achieved through two main mechanisms:

1. **Configuration-conditioned adaptive temporal targets:** Instead of relying on static reward shaping, TSIL dynamically derives temporal targets from the fastest successful trajectories observed so far, conditioned on the current state (configuration). This makes the learning targets progressively more challenging and context-aware, pushing the policy towards increasingly efficient solutions. This adaptive mechanism is a significant improvement over fixed reward functions, which can often be exploited or become suboptimal as the policy improves.
2. **Efficiency-weighted self-imitation learning:** TSIL explicitly preserves and replays these fast, successful behaviors. By weighting the imitation loss based on the temporal efficiency of past trajectories, it prioritizes learning from the most optimal experiences. This directly combats the problem of catastrophic forgetting of rare but highly effective actions, ensuring that the policy continuously refines its understanding of efficient pathways.

The methodology is coherent, directly targets known limitations of existing RL approaches in complex robotic tasks, and offers a scalable way to generate self-supervision.
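The two mechanisms described above can be illustrated with a small bookkeeping structure. This is a hypothetical sketch, not TSIL's implementation: the class name `TemporalTargetBuffer`, the use of a hashable `config` key, the per-configuration capacity, and the weighting rule `best_length / length` are all assumptions chosen for illustration. The source does not specify how targets or weights are actually computed.

```python
from collections import defaultdict


class TemporalTargetBuffer:
    """Keeps the fastest successful trajectories per task configuration.

    Hypothetical structure for illustration only.
    """

    def __init__(self, capacity_per_config=4):
        self.capacity = capacity_per_config
        self.episodes = defaultdict(list)  # config -> [(length, trajectory)]

    def add(self, config, trajectory, success):
        """Store a trajectory only if it succeeded; keep the shortest ones."""
        if not success:
            return
        bucket = self.episodes[config]
        bucket.append((len(trajectory), trajectory))
        bucket.sort(key=lambda entry: entry[0])
        del bucket[self.capacity:]  # drop the slowest beyond capacity

    def target(self, config):
        """Adaptive temporal target: the fastest success seen so far."""
        bucket = self.episodes.get(config)
        return bucket[0][0] if bucket else None

    def weighted_samples(self, config):
        """Return (weight, trajectory) pairs; faster episodes weigh more.

        Assumed weighting: best_length / length, so the fastest episode
        gets weight 1.0 and slower ones get proportionally less.
        """
        bucket = self.episodes.get(config, [])
        if not bucket:
            return []
        best = bucket[0][0]
        return [(best / length, traj) for length, traj in bucket]


buf = TemporalTargetBuffer()
buf.add("drawer-left", list(range(10)), success=True)   # slow success
buf.add("drawer-left", list(range(5)), success=True)    # fast success
buf.add("drawer-left", list(range(3)), success=False)   # failure, ignored
print(buf.target("drawer-left"))
print([w for w, _ in buf.weighted_samples("drawer-left")])
```

The target tightens automatically as faster successes arrive, and the weights would multiply a per-trajectory imitation loss in a full training loop.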
The experimental evaluation is exceptionally strong, claiming consistent improvements across "15 distinct long-horizon manipulation tasks." This breadth of evaluation is crucial for demonstrating the generalizability and robustness of the TSIL framework beyond specific, hand-picked scenarios. The metrics of interest—learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions—are all highly relevant and impactful for practical robot learning. The abstract's claim of "consistently improves" suggests statistically significant and repeatable gains, which is a high bar for empirical success in this domain. If these claims hold, the empirical evidence strongly supports the method's effectiveness and practical utility, making it a significant contribution to the field.
The mention of a project URL (`https://generalroboticslab.com/TSIL`) is a strong positive indicator for reproducibility. Project pages often include code implementations, detailed experimental setups, datasets, and potentially pre-trained models or videos, which are essential for researchers to verify and build upon the work. The structured nature of the paper (Method, Experiments sections) also implies a detailed description of the algorithm and experimental protocols.
While the paper presents a very strong case, potential limitations might include:

1. **Initial Success Requirement:** TSIL relies on mining "fast successful trajectories." If initial task success is extremely rare or non-existent, the method might struggle to bootstrap.
2. **Computational Overhead:** Mining, storing, and adaptively managing a growing set of efficient trajectories, especially in high-dimensional state spaces, could introduce computational overhead.
3. **Definition of "Configuration-conditioned":** The complexity of defining and implementing "configuration-conditioned" targets might vary significantly with the task and state representation, potentially requiring careful engineering.
4. **Generalizability beyond temporal efficiency:** While temporal efficiency is critical, some tasks might have other primary optimization criteria (e.g., energy consumption, safety, precision) that TSIL, in its current form, might not directly optimize.
Graph convolutional networks (GCNs) have demonstrated significant success in capturing complex user-item relationships for collaborative filtering (CF). However, due to their reliance on extensive model training, training-free graph filtering (GF)-based CF methods have emerged as a promising alternative, offering computational efficiency by smoothing graph signals via matrix operations. In particular, polynomial GF-based approaches demonstrate improved accuracy through their ability to design more expressive and flexible filtering functions. Despite these advantages, existing GF methods suffer from a critical memory bottleneck: they necessitate storing the full item similarity graph, incurring prohibitive memory costs for large-scale datasets, which limits their practical applicability. To tackle this challenge, we propose Mem-GF (Memory-efficient GF), a new GF-based CF method that departs from conventional designs by principally leveraging the structure of Krylov subspaces as a core mechanism for approximating polynomial graph filters without explicitly storing the item similarity graph. We theoretically analyze the minimum Krylov subspace size that guarantees lossless approximation. Through extensive experiments, we demonstrate that Mem-GF achieves up to 5.74$\times$ lower memory usage and 4.38$\times$ speedup in runtime, while consistently exceeding the recommendation accuracy of state-of-the-art GF and GCN-based methods. Mem-GF robustly scales to datasets with tens of millions of interactions, establishing itself as a practically viable and theoretically grounded solution for efficient CF.
Primary: Yonsei University
All Institutions: Yonsei University
Mem-GF has a significant broader impact on the field of recommender systems and potentially other areas of graph machine learning. By effectively addressing the memory bottleneck, it makes high-accuracy, polynomial graph filtering techniques practically viable for large-scale collaborative filtering, a critical requirement for modern online platforms. This enables faster preprocessing, low-latency real-time inference, and superior recommendation quality on standard hardware, democratizing access to advanced graph-based CF. The principled use of Krylov subspaces as a core filtering mechanism, rather than merely a computational shortcut, could inspire similar memory-efficient approaches in other graph signal processing or graph machine learning contexts where large, implicitly defined matrices are a challenge. The strong theoretical grounding further enhances the trustworthiness and potential for generalization of this methodology. Mem-GF proposes a novel, memory-efficient, and training-free graph filtering method for collaborative filtering that leverages Krylov subspaces to approximate high-order polynomial graph filters without explicitly storing the full item similarity graph. This paper makes a substantial technical contribution by elegantly solving a critical memory bottleneck in graph filtering-based collaborative filtering, enabling scalable and high-accuracy recommendations on large datasets. The method's theoretical grounding, combined with comprehensive experimental validation demonstrating significant memory savings, speedups, and state-of-the-art accuracy, establishes Mem-GF as a practically viable and theoretically sound solution that will likely influence future research and deployment of graph-based recommender systems.
The paper effectively identifies a critical memory bottleneck in existing Graph Filtering (GF)-based Collaborative Filtering (CF) methods, which stems from the need to explicitly store a full item similarity graph $P$ of size $|I| \times |I|$. The proposed Mem-GF method offers an elegant and principled solution by leveraging Krylov subspaces. Instead of forming and storing $P$, Mem-GF approximates polynomial graph filters $f(P)r_u$ by projecting $P$ onto a user-specific Krylov subspace $\mathcal{K}_K(P, r_u)$, generated by the user's interaction vector $r_u$. This projection is efficiently computed using the Lanczos algorithm, which yields an orthonormal basis $Q_u$ and a much smaller tridiagonal matrix $T_u$. The filtering operation is then performed in this reduced space as $\|r_u\|_2 Q_u f(T_u) e_1$. A key methodological strength is that the matrix-vector product $Pq_{u,j}$ required by Lanczos is computed as $R^T(Rq_{u,j})$, completely bypassing the explicit construction of $P$. The theoretical analysis provides a clear and strong guarantee: for a polynomial filter of degree $N$, setting the Krylov subspace size $K > N$ ensures lossless approximation under exact arithmetic. This theoretical foundation is crucial for understanding and applying the method. Furthermore, the ability to operate within a low-dimensional subspace grants Mem-GF the flexibility to design and utilize high-order polynomial filters (e.g., approximating a Gaussian filter), which are typically infeasible for conventional methods due to memory constraints, thereby enhancing filter expressiveness and accuracy. The "training-free" nature aligns with the paper's goal of computational efficiency.
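The Lanczos-based filtering step and the $K > N$ condition can be checked on a toy example. The sketch below is a generic textbook Lanczos iteration in pure Python, not the Mem-GF code: it assumes an unnormalized $P = R^T R$, uses a tiny made-up interaction matrix `R`, arbitrary filter coefficients `coeffs`, and no re-orthogonalization. All function names are hypothetical.

```python
import math


def matvec_P(R, q):
    """Return P q = R^T (R q) without ever forming the item-item matrix P."""
    Rq = [sum(r * x for r, x in zip(row, q)) for row in R]           # per user
    return [sum(R[u][i] * Rq[u] for u in range(len(R)))              # per item
            for i in range(len(q))]


def lanczos(R, r_u, K, tol=1e-12):
    """Run at most K Lanczos steps on P = R^T R starting from r_u."""
    norm = math.sqrt(sum(x * x for x in r_u))
    Q = [[x / norm for x in r_u]]
    alphas, betas = [], []
    q_prev, beta = [0.0] * len(r_u), 0.0
    for _ in range(K):
        w = matvec_P(R, Q[-1])
        alpha = sum(a * b for a, b in zip(w, Q[-1]))
        w = [wi - alpha * qi - beta * pi
             for wi, qi, pi in zip(w, Q[-1], q_prev)]
        alphas.append(alpha)
        beta = math.sqrt(sum(x * x for x in w))
        if len(alphas) == K or beta < tol:  # done, or invariant subspace hit
            break
        betas.append(beta)
        q_prev = Q[-1]
        Q.append([x / beta for x in w])
    return Q, alphas, betas, norm


def tridiag_matvec(alphas, betas, v):
    """Multiply the symmetric tridiagonal matrix T by the vector v."""
    out = [a * x for a, x in zip(alphas, v)]
    for i in range(len(alphas) - 1):
        out[i] += betas[i] * v[i + 1]
        out[i + 1] += betas[i] * v[i]
    return out


def poly_apply(matvec, coeffs, v):
    """Return sum_n coeffs[n] * A^n v using only matrix-vector products."""
    acc = [coeffs[0] * x for x in v]
    for c in coeffs[1:]:
        v = matvec(v)
        acc = [a + c * x for a, x in zip(acc, v)]
    return acc


def krylov_filter(R, r_u, coeffs, K):
    """Approximate f(P) r_u as ||r_u|| * Q f(T) e_1."""
    Q, alphas, betas, norm = lanczos(R, r_u, K)
    e1 = [1.0] + [0.0] * (len(alphas) - 1)
    y = poly_apply(lambda v: tridiag_matvec(alphas, betas, v), coeffs, e1)
    return [norm * sum(y[j] * Q[j][i] for j in range(len(Q)))
            for i in range(len(r_u))]


# Toy data: 4 users x 5 items, and a degree-3 filter (N = 3).
R = [[1, 1, 0, 0, 1], [0, 1, 1, 0, 0], [1, 0, 1, 1, 0], [0, 0, 1, 1, 1]]
r_u = [float(x) for x in R[0]]
coeffs = [0.0, 1.0, -0.5, 0.1]

approx = krylov_filter(R, r_u, coeffs, K=4)              # K > N
exact = poly_apply(lambda v: matvec_P(R, v), coeffs, r_u)
print(max(abs(a - b) for a, b in zip(approx, exact)))
```

With $K = 4 > N = 3$ the two results agree up to floating-point rounding, matching the lossless guarantee the review cites. In finite precision and at larger $K$, Lanczos vectors can lose orthogonality, which is the instability the limitations paragraph alludes to.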
The experimental evaluation is exceptionally comprehensive and provides strong empirical evidence for all claims. Experiments are conducted on three widely used CF benchmark datasets: Yelp, Amazon-book, and the large-scale MovieLens-20M, covering diverse scales and characteristics. A broad range of 21 baselines is included, encompassing various CF categories (MF, Autoencoder, GCN, Generative, LinkProp), with a particular focus on other GF-based methods. Key metrics such as memory usage (VRAM, RAM), runtime (preprocessing and inference), and recommendation accuracy (Recall@K, NDCG@K) are rigorously evaluated. The results are highly impactful: Mem-GF achieves up to 5.74x lower memory usage and 4.38x speedup during preprocessing, and a remarkable 26.2x speedup during inference. Crucially, these significant efficiency gains are accompanied by state-of-the-art recommendation accuracy, consistently outperforming both GF and GCN-based methods across most datasets and metrics. The scalability analysis on synthetic datasets further validates the method's linear complexity with respect to the number of users, items, and interactions, confirming its practical applicability for real-world, large-scale deployments. The empirical validation of the theoretical condition ($K > N$), along with analyses of different polynomial filters and hyperparameter sensitivity, adds to the robustness and thoroughness of the evaluation.
The paper demonstrates a strong commitment to reproducibility. A GitHub link to the source code (`https://github.com/jindeok/Mem-GF`) is provided, which is a critical component for enabling replication. Detailed hyperparameters for Mem-GF are explicitly stated for each dataset. Furthermore, the paper outlines the data splitting, evaluation protocols, hardware specifications (CPU, GPU, RAM), and software environment (PyTorch), along with the method for generating synthetic datasets. These comprehensive details provide sufficient information for researchers to reproduce the reported results.
While Mem-GF's "training-free" nature offers efficiency, it inherently implies less flexibility compared to learnable GCNs that can adapt their filters through end-to-end optimization. The polynomial coefficients are found by approximating a target frequency response, which is still a predefined approach rather than a fully learned one. The theoretical guarantee of lossless approximation holds under *exact arithmetic* and when the polynomial degree $N$ is less than the Krylov subspace size $K$. While the paper mentions that finite-precision arithmetic or $N \ge K$ might lead to instability, a deeper exploration of these practical implications beyond empirical observation would be beneficial. The method still requires tuning of hyperparameters such as $s$ (Hadamard power) and $\delta$ (damping factor for the Gaussian filter). Although Mem-GF enables user-specific filtering in the Krylov subspace, the underlying polynomial filter itself is still globally defined, rather than being truly personalized to each user's unique spectral characteristics.
We study first-order methods for solving monotone variational inequalities arising in min-max optimization. Classical approaches such as the extragradient method rely on two gradient queries per iteration, which limits their analysis and applicability in the online and stochastic settings. We propose a family of Generalized Optimistic Methods with Anchoring (GOMA), which combine two-time-scale optimistic updates with an anchoring term inspired by Halpern iteration. In the deterministic setting, GOMA achieves the optimal accelerated last-iterate rate $O(1/k^2)$ on the squared gradient norm for monotone Lipschitz operators. In the stochastic setting with unbounded variance, a simplified single-call variant of GOMA achieves a last-iterate convergence rate of $O(1/\sqrt{k})$ on the squared gradient norm. To the best of our knowledge, this is the first such guarantee for stochastic monotone Lipschitz variational inequalities in the unconstrained setting without variance reduction or growing batches.
Primary: Université de Montréal
All Institutions: Université de Montréal, Mila - Quebec AI Institute, Mohammed Bin Zayed University of Artificial Intelligence, CIFAR AI Chair
This paper introduces Generalized Optimistic Method with Anchoring (GOMA), a novel first-order method for monotone variational inequalities that achieves optimal $O(1/k^2)$ last-iterate convergence in the deterministic setting and, critically, provides the first last-iterate $O(1/N)$ convergence guarantee on the expected squared operator norm for stochastic monotone Lipschitz VIs in the unconstrained setting without variance reduction, growing batches, or bounded variance assumptions. The work makes a significant theoretical advancement by demonstrating that strong last-iterate guarantees are compatible with single-sample online models under highly challenging noise conditions, supported by empirical evidence on synthetic problems.
The paper proposes Generalized Optimistic Methods with Anchoring (GOMA) for solving monotone variational inequalities (VIs) in min-max optimization. GOMA combines two key ideas: two-time-scale optimistic updates (from generalized optimistic methods) and an anchoring term (inspired by Halpern iteration). The method is presented in a general form (Eq. 7) with separate step sizes for exploration and update, and an anchoring coefficient. In the deterministic setting, GOMA is analyzed under two parameter setups (larger update step or larger exploration step), both achieving the optimal accelerated last-iterate rate of $O(1/k^2)$ on the squared gradient norm for monotone Lipschitz operators. The proof relies on a potential-based analysis, which is a standard and robust technique. A notable aspect is the claim of a "pseudo fixed-step size scheme" that simplifies hyperparameter tuning compared to some prior methods. For the stochastic setting, the paper introduces a simplified single-call variant of GOMA (Eq. 16) by setting the optimistic update coefficient to zero, effectively replacing extrapolation with anchoring to the initial point. This variant is analyzed under state-dependent noise (Assumption 1), where the variance can grow with the squared norm of the operator, a challenging setting. The proof strategy involves comparing noisy iterates to a deterministic reference trajectory and bounding the mean-square deviation. Theorem 3.1 establishes a last-iterate convergence rate of $O(1/N)$ on the expected squared operator norm $E\|G(x_N)\|^2$. This is a significant theoretical contribution, as the paper claims it is the first such guarantee for stochastic monotone Lipschitz VIs in the unconstrained setting without variance reduction or growing batches, and under unbounded variance. A critical issue is the inconsistency in reporting the stochastic convergence rate.
The abstract states $O(1/\sqrt{k})$ on the squared gradient norm, which implies $O(1/k^{1/4})$ on the gradient norm. Theorem 3.1, however, states $E\|G(x_N)\|^2 = O(1/N)$, which implies $O(1/\sqrt{N})$ on the gradient norm. The comparison table (Table 1) and parts of the discussion further add to this confusion, sometimes stating $O(1/k^{1/4})$ on $E\|G(x_k)\|$ and sometimes $O(1/k)$ on $E\|G(x_k)\|^2$ (which are inconsistent with each other). Assuming Theorem 3.1 is the most accurate statement of the result, the rate is $O(1/N)$ on $E\|G(x_N)\|^2$, which is a strong result given the challenging assumptions.
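The conversions between squared-norm and norm rates used in this comparison follow from Jensen's inequality. A short derivation, where $C$ is a generic constant introduced here for illustration and not taken from the paper:

$$
E\|G(x_k)\| \le \sqrt{E\|G(x_k)\|^2}
\quad\Longrightarrow\quad
\begin{cases}
E\|G(x_k)\|^2 \le \dfrac{C}{\sqrt{k}} \;\Rightarrow\; E\|G(x_k)\| \le \dfrac{\sqrt{C}}{k^{1/4}}, \\[2ex]
E\|G(x_N)\|^2 \le \dfrac{C}{N} \;\Rightarrow\; E\|G(x_N)\| \le \dfrac{\sqrt{C}}{\sqrt{N}}.
\end{cases}
$$

Read this way, $O(1/k^{1/4})$ on the norm pairs with the abstract's $O(1/\sqrt{k})$ on the squared norm, while $O(1/k)$ on the squared norm pairs with Theorem 3.1, so the two statements in Table 1 correspond to two different results.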
The experimental evaluation is conducted on toy problems, which is common for theoretical optimization papers.
1. **Negative-Comonotone Quadratic Saddle Point (Deterministic)**: This experiment uses a problem instance outside the theoretical scope (negative comonotonicity vs. monotonicity), but it's a standard benchmark for comparing VI algorithms. GOMA and FEG show accelerated convergence, while others diverge. GOMA empirically achieves a better constant factor than FEG.
2. **Stochastic Bilinear Game (Bounded Variance)**: On a low-dimensional bilinear game with additive Gaussian noise ($=1$), GOMA significantly outperforms baselines (DSEG, FEG, E-Halpern, RAIN++, Nesterov), achieving the fastest convergence and a residual an order of magnitude smaller. This supports the claim of robustness without variance reduction.
3. **Finite-Sum Saddle-Point Problem (State-Dependent Variance)**: On a higher-dimensional finite-sum problem with multiplicative noise ($>1$), GOMA and RAIN++ show convergence, while DSEG stagnates. This experiment directly validates GOMA's ability to handle state-dependent, unbounded variance, a key theoretical claim.
Overall, the experiments, despite being on synthetic problems, effectively demonstrate the empirical advantages of GOMA, particularly in stochastic settings with challenging noise characteristics, aligning well with the theoretical claims.
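To give a feel for why anchoring matters on bilinear games, here is a minimal deterministic sketch. It does not implement GOMA, whose update rules (Eq. 7 and Eq. 16) are not reproduced in this review. Instead it uses a swapped-in, well-known relative: an extragradient step with a Halpern-style anchor to the initial point (often called extra anchored gradient). The game $f(x, y) = xy$, the step size $1/8$, the anchor weight $1/(k+2)$, and the iteration count are all illustrative assumptions.

```python
import math


def F(z):
    """Operator of the bilinear game f(x, y) = x * y: F(x, y) = (df/dx, -df/dy)."""
    x, y = z
    return (y, -x)


def norm(v):
    return math.hypot(v[0], v[1])


def gda(z0, step, iters):
    """Plain simultaneous gradient descent-ascent (no anchor, single call)."""
    z = z0
    for _ in range(iters):
        g = F(z)
        z = (z[0] - step * g[0], z[1] - step * g[1])
    return z


def extra_anchored_gradient(z0, step, iters):
    """Extragradient with a Halpern-style anchor pulling iterates toward z0."""
    z = z0
    for k in range(iters):
        beta = 1.0 / (k + 2)  # anchor weight, decays over time
        anchored = (z[0] + beta * (z0[0] - z[0]), z[1] + beta * (z0[1] - z[1]))
        g = F(z)
        z_half = (anchored[0] - step * g[0], anchored[1] - step * g[1])
        g_half = F(z_half)  # second operator call at the extrapolated point
        z = (anchored[0] - step * g_half[0], anchored[1] - step * g_half[1])
    return z


z0 = (1.0, 1.0)
initial_residual = norm(F(z0))
gda_residual = norm(F(gda(z0, 0.125, 2000)))
anchored_residual = norm(F(extra_anchored_gradient(z0, 0.125, 2000)))
print("initial:", initial_residual)
print("GDA:", gda_residual)
print("anchored extragradient:", anchored_residual)
```

On this toy problem the unanchored iteration spirals outward while the anchored last iterate shrinks the operator norm. The anchored variant here uses two operator calls per step, which is exactly the cost the single-call GOMA variant is described as avoiding.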
The paper provides algorithmic details, step size choices, and parameter schedules for GOMA. For baselines, it refers to existing implementations or settings from prior work. However, specific hyperparameters for all methods are deferred to the appendix, and no code repository is provided. While the theoretical derivations are detailed, the lack of a public code release or highly detailed hyperparameter tuning instructions (beyond the appendix reference) might hinder direct reproducibility for practitioners.
1. **Stochastic Rate Inconsistency**: As noted, there is a significant discrepancy in the reported stochastic convergence rates across the abstract, main text, theorem statement, and comparison table. This undermines the clarity and rigor of the paper's central stochastic contribution. Assuming the theorem ($O(1/N)$ on $E\|G(x_N)\|^2$) is correct, the other statements are misleading.
2. **Slower Optimal Rate**: The paper acknowledges that GOMA's stochastic rate does not match the optimal $O(1/N)$ rate on $E\|G(x_N)\|^2$ achieved by methods using variance reduction or growing batches. This gap is consistent with the abstract's $O(1/\sqrt{k})$ rate but not with the $O(1/N)$ rate in Theorem 3.1, under which the two rates would coincide, so this point inherits the inconsistency in item 1. Closing the gap without such mechanisms remains an open question.
3. **Toy Experiments**: The empirical validation is limited to synthetic and relatively low-dimensional problems. Scaling GOMA to large-scale deep learning applications (e.g., adversarial training) and demonstrating its practical benefits there would strengthen the work.
4. **Unconstrained Setting**: The analysis is restricted to unconstrained VIs. Extending it to constrained settings, where the convergence measure often shifts to the gap function, is an open direction.
5. **Monotonicity Assumption**: The theoretical guarantees rely on the monotonicity of the operator, which is a strong assumption not always met in practical deep learning min-max problems.
This paper contributes to the fundamental understanding and development of optimization algorithms for variational inequalities and min-max optimization, which are crucial in various machine learning applications like adversarial training, GANs, and multi-agent reinforcement learning. By providing a method that offers last-iterate convergence in challenging stochastic settings (single-call, no variance reduction, no growing batches, unbounded variance), GOMA could enable more efficient and stable training of models in online or resource-constrained environments. The explicit acknowledgment of AI assistant use in proof development is also a noteworthy aspect regarding research methodology. The impact statement correctly identifies the potential for more efficient use of computing resources but also cautions about the Jevons paradox. This paper introduces Generalized Optimistic Method with Anchoring (GOMA), a novel first-order method for monotone variational inequalities that achieves optimal $O(1/k^2)$ last-iterate convergence in the deterministic setting and, critically, provides the first last-iterate $O(1/N)$ convergence guarantee on the expected squared operator norm for stochastic monotone Lipschitz VIs in the unconstrained setting without variance reduction, growing batches, or bounded variance assumptions. The work makes a significant theoretical advancement by demonstrating that strong last-iterate guarantees are compatible with single-sample online models under highly challenging noise conditions, supported by empirical evidence on synthetic problems.
Recently, Muon has gained substantial attention as an appealing alternative to Adam-like optimizers, with many works highlighting its advantages through spectral normalization and improved conditioning. Yet this positive theoretical narrative contrasts with its empirical performance in large language model (LLM) training, where Muon's gains over Adam/AdamW are often mixed, schedule-sensitive, and not uniformly superior. To address this gap, we develop a trajectory-level theory characterizing both the strengths and limitations of Muon. We introduce a mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components, capturing a mixture of anisotropic structure and long-tail information reminiscent of LLM training. On top of it, we adopted a river-valley perspective in which we view the landscape as composed of a river direction flowing to the desired solution and hill directions encoding nuisance or task-irrelevant information. In the momentum-free setting, we show that Muon moves faster along the information-bearing river direction during early optimization, but can converge much more slowly near the river bottom than gradient descent. We then extend the river-valley perspective to general nonconvex objectives with momentum by studying points on the spectral river. There, while Muon converges faster early on, its orthogonalized update removes residual scale information, making it prone to overshooting and oscillation near the target solution. Together, these results suggest that our characterizations extend beyond spiked matrix sensing and motivate switching to GD-like refinement optimizers in the final phase, rather than relying only on a fixed learning-rate schedule for Muon. We also provide preliminary evidence supporting this two-stage approach in language model training experiments.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania, City University of Hong Kong, Shanghai University of Finance and Economics
This paper provides a rigorous theoretical framework explaining the strengths and limitations of the Muon optimizer, proposing a two-stage optimization strategy supported by preliminary LLM experiments. The work introduces a novel mixed-spiked matrix sensing model and leverages a river-valley perspective to characterize Muon's fast early exploration and late-stage convergence difficulties, offering valuable insights for the design of more effective training schedules for large language models.
The paper develops a sophisticated theoretical framework to analyze the Muon optimizer, particularly its behavior in anisotropic landscapes reminiscent of LLM training. The core methodology involves introducing a novel "mixed-spiked matrix sensing (MS) model" where the sensing operator decomposes into signal, spike, and bulk components. This model is well-motivated by empirical observations of covariance spectra in deep learning. The authors then adopt a "river-valley perspective," a geometric view that decomposes the optimization landscape into a "river" direction (aligned with meaningful progress) and "hill" directions (nuisance information). This perspective is applied to both a simplified, momentum-free Muon and extended to generalized nonconvex objectives with momentum. The analysis uses invariant manifolds to reduce matrix-valued dynamics to low-dimensional scalar systems, enabling tractable analysis of continuous and discrete-time dynamics for both vanilla GD and simplified Muon. Key theoretical results (Theorems 1, 2, 3) rigorously characterize Muon's early-stage fast exploration and late-stage convergence difficulties (overshooting, oscillation) compared to GD. The extension to generalized settings using a "spectral river" further strengthens the broader applicability of their insights. The mathematical derivations are thorough and provide a deep understanding of the underlying mechanisms.
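The late-stage overshooting described above can be seen in a one-dimensional toy problem. This is a minimal sketch under stated assumptions, not the paper's model: for a 1x1 matrix the idealized orthogonalized update reduces to the sign of the gradient, so momentum-free Muon becomes sign descent. The quadratic loss, the step size, the switch point, and the helper names are illustrative, and plain GD stands in for the refinement optimizer.

```python
def grad(w, target=1.0):
    """Gradient of the toy loss 0.5 * (w - target)**2."""
    return w - target


def sign(g):
    """Polar factor of a nonzero 1x1 matrix, i.e. the scalar orthogonalized update."""
    return (g > 0) - (g < 0)


def run(w, lr, steps, orthogonalize):
    """Run `steps` updates; the orthogonalized branch discards gradient scale."""
    for _ in range(steps):
        g = grad(w)
        update = sign(g) if orthogonalize else g
        w -= lr * update
    return w


w0, lr = 0.0, 0.3
w_sign = run(w0, lr, 50, orthogonalize=True)    # orthogonalized updates only
w_gd = run(w0, lr, 50, orthogonalize=False)     # gradient descent only
w_two_stage = run(run(w0, lr, 10, True), lr, 40, False)  # switch after 10 steps

print("sign-only error:", abs(w_sign - 1.0))
print("GD-only error:", abs(w_gd - 1.0))
print("two-stage error:", abs(w_two_stage - 1.0))
```

Because the orthogonalized step always has length `lr`, it keeps hopping across the minimum at a fixed step size, whereas the gradient-scaled step shrinks as the residual shrinks. Switching to the scaled update in the second stage removes the oscillation in this toy setting.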
The experimental evaluation, while described as "preliminary," provides valuable empirical evidence supporting the theoretical claims. The authors train a 250M-parameter LLaMA-style decoder-only Transformer from scratch on OpenWebText2, a relevant and challenging setting for LLM research. They compare Muon-only baselines with various learning rate schedules against a proposed two-stage hybrid approach (Muon followed by AdamW). The results demonstrate that constant-LR Muon indeed exhibits the fastest initial loss decrease, consistent with its early-stage exploratory power. Crucially, the "Muon -> AdamW" hybrid strategy leads to more stable loss trajectories and achieves lower final validation loss compared to Muon-only baselines, even with tuned schedules. This directly supports the theoretical recommendation of using Muon for early exploration and switching to a GD-like optimizer for late-stage refinement. The inclusion of experiments with different switching times and post-switch AdamW LR schedules further strengthens the robustness of their findings. While the scale of the model (250M) is not "large" by today's cutting-edge LLM standards, it is sufficiently large to demonstrate the practical relevance of the theoretical insights.
The paper provides a project website (https://muon-river-valley.github.io/) which typically includes code and experimental details, enhancing reproducibility. The experimental setup details are reasonably well-described, including model architecture (LLaMA-style decoder-only Transformer), parameter count (250M), tokenizer (GPT-2), dataset (OpenWebText2), and training iterations (4k). Learning rate schedules (cosine, linear, cos_inf) and switching points are also mentioned. While not all hyperparameter details are in the main text, the appendix and project website are expected to fill these gaps. The theoretical derivations are detailed in the appendix, allowing for verification.
The primary theoretical analysis relies on a simplified, momentum-free Muon and a specific mixed-spiked MS model, although the paper attempts to generalize these insights to more complex settings. The empirical evidence, while supportive, is explicitly stated as "preliminary" and conducted on a 250M-parameter model, which is modest compared to state-of-the-art LLMs. Further large-scale experiments on diverse architectures and tasks would strengthen the practical implications. The paper also acknowledges that the river-valley decomposition is only one lens and suggests integrating it with other phenomena like edge-of-stability behavior as future work, indicating a limitation in the current scope of analysis.
This paper significantly advances the theoretical understanding of spectral optimizers like Muon, which have gained attention but lacked a comprehensive explanation for their mixed empirical performance. The "river-valley perspective" and the mixed-spiked MS model provide valuable tools for analyzing optimization landscapes in deep learning, particularly in the context of anisotropic gradients observed in LLMs. The practical implication of a two-stage optimization strategy (Muon for exploration, GD-like for refinement) could lead to more efficient and stable training schedules for large models, reducing the need for extensive learning rate tuning. This work has the potential to influence the design and application of future optimizers and contribute to a more principled approach to deep learning training. This paper provides a rigorous theoretical framework explaining the strengths and limitations of the Muon optimizer, proposing a two-stage optimization strategy supported by preliminary LLM experiments. The work introduces a novel mixed-spiked matrix sensing model and leverages a river-valley perspective to characterize Muon's fast early exploration and late-stage convergence difficulties, offering valuable insights for the design of more effective training schedules for large language models.
Model-free reinforcement learning algorithms such as Proximal Policy Optimization (PPO) treat the environment as a black box, estimating policy gradients from sampled rewards; this process demands millions of interactions and relies on high-variance advantage estimates. When environment dynamics are differentiable, the return is an end-to-end differentiable function of the policy parameters, enabling exact gradient computation via backpropagation through simulation. We term this approach Analytic Policy Gradients (APG) and evaluate it against PPO on four continuous control tasks of increasing dynamical complexity: a one-dimensional point-mass target-reaching task, a 2D point-mass navigation task with obstacle avoidance, a 2D rigid-body T-block pushing task, and a 7-DOF Franka FR3 end-effector reaching task. Both algorithms share identical model architectures, observation normalization, and optimizer settings. To decouple sample efficiency from compute efficiency, we design a multi-axis evaluation protocol that records performance against environment steps and gradient steps. We report a segmented backpropagation scheme with MC and critic-based bootstrap modes that mitigates gradient degradation on long-horizon tasks, and present ablations over segment length and bootstrap strategy.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
This paper rigorously validates the benefits of Analytic Policy Gradients in differentiable continuous control. It provides a robust benchmarking framework, introduces a practical gradient bridge for complex physics engines, and offers valuable insights into segmented backpropagation strategies, significantly advancing the practical applicability and understanding of differentiable reinforcement learning.
The paper presents Analytic Policy Gradients (APG) as a method for continuous control, leveraging differentiable environment dynamics to compute exact policy gradients via backpropagation through simulation. While the core concept of APG is not new, the paper's strength lies in its meticulous methodological contributions and rigorous implementation. A unified benchmarking harness is developed, allowing for a highly controlled comparison between APG and PPO by ensuring identical actor-critic architectures, observation normalization, and optimizer settings. This standardization is crucial for drawing fair conclusions about the gradient source's impact. The paper adopts a segmented backpropagation scheme to address vanishing/exploding gradients in long-horizon tasks. A key methodological contribution is the detailed exploration and comparison of two bootstrap modes for these segments: Monte Carlo (MC) bootstrap and critic-based bootstrap. The MC bootstrap, which pre-computes future returns from detached rewards, is shown to be a more robust option for shorter segment lengths, providing valuable practical guidance. A significant engineering contribution is the custom `torch.autograd.Function` that bridges NVIDIA Warp/Newton's tape-based autodiff with PyTorch's autograd. This "gradient bridge" enables APG to be applied to complex, GPU-accelerated physics engines that do not natively expose PyTorch-compatible derivatives, thereby expanding the practical applicability of differentiable RL to more realistic and complex robotic tasks like the 7-DOF Franka arm. The use of the reparameterization trick for action sampling ensures proper gradient flow through stochastic policies. Overall, the methodology is sound, well-explained, and effectively tackles practical challenges in implementing differentiable RL.
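The core idea of differentiating the return through the simulator can be shown without any autodiff library. The sketch below is an assumption-laden toy, not the paper's environment or code: a scalar state with dynamics `x[t+1] = x[t] + dt * a[t]`, a hypothetical linear policy `a = -k * x`, and a quadratic cost. The backward loop is a hand-written adjoint recursion, which is what reverse-mode backpropagation through the rollout computes, and it is checked against a finite-difference estimate.

```python
def rollout(k, x0=1.0, dt=0.1, horizon=20):
    """Simulate x[t+1] = x[t] + dt * a[t] with the linear policy a[t] = -k * x[t]."""
    xs = [x0]
    for _ in range(horizon):
        xs.append((1.0 - dt * k) * xs[-1])
    return xs


def cost(k, x0=1.0, dt=0.1, horizon=20):
    """Negative return: sum of squared distances to the target at 0."""
    return sum(x * x for x in rollout(k, x0, dt, horizon)[1:])


def analytic_gradient(k, x0=1.0, dt=0.1, horizon=20):
    """Exact d(cost)/dk via a backward (adjoint) pass over the stored trajectory."""
    xs = rollout(k, x0, dt, horizon)
    lam = 0.0   # adjoint: d(cost)/d(x[t]) accumulated from later steps
    grad = 0.0
    for t in range(horizon, 0, -1):
        lam = 2.0 * xs[t] + (1.0 - dt * k) * lam  # direct cost term + downstream effect
        grad += lam * (-dt * xs[t - 1])           # d(x[t])/dk at fixed x[t-1]
    return grad


k = 0.5
h = 1e-5
finite_difference = (cost(k + h) - cost(k - h)) / (2.0 * h)
print("analytic:", analytic_gradient(k))
print("finite difference:", finite_difference)
```

A single rollout yields the exact gradient here, whereas a model-free estimator would need many sampled rollouts to approximate it. The product of `(1 - dt * k)` factors in the adjoint is also where vanishing or exploding gradients over long horizons come from, which is the problem segmented backpropagation targets.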
The experimental evaluation is exceptionally thorough and well-designed. The authors evaluate APG against PPO on four continuous control tasks of increasing dynamical complexity: a 1D point-mass, a 2D point-mass navigation with obstacles, a 2D rigid-body T-block pushing task, and a 7-DOF Franka FR3 end-effector reaching task. This diverse suite effectively demonstrates APG's performance across various scenarios. A key strength of the evaluation is the multi-axis logging protocol, which records performance against both environment steps (measuring sample efficiency) and gradient steps (measuring compute efficiency). This approach is critical for a fair comparison, as APG and PPO consume these resources at different rates. Results are reported as mean ± standard deviation over multiple random seeds, enhancing statistical reliability. Performance thresholds and success rates are clearly defined, providing a comprehensive view of agent capabilities beyond just episodic return. The results consistently show that APG achieves higher final episodic returns and often higher success rates than PPO, particularly on simpler tasks. More importantly, APG demonstrates substantial sample efficiency gains, requiring significantly fewer gradient steps (up to 15.9x fewer on FrankaReach) and environment steps to reach comparable performance thresholds. This strongly validates the benefit of lower-variance analytic gradients. The ablation studies on PointMassNavigate are particularly insightful. They clearly demonstrate that MC bootstrap is robust across varying segment lengths, degrading gracefully even at very short horizons. In contrast, critic bootstrap is highly sensitive to segment length, collapsing entirely at short lengths due to unstable value targets and only becoming competitive at longer segments. This finding provides crucial practical guidance for practitioners. 
The successful application of the Warp-PyTorch gradient bridge on the FrankaReach task further validates its feasibility and impact.
Reproducibility is a standout feature of this paper. The authors have made their entire implementation, including environment definitions, training scripts, and plotting utilities, open-source on GitHub. They provide detailed instructions, a `requirements.txt` for dependencies, and explicit commands to reproduce each figure and table presented in the paper. The unified benchmarking harness itself contributes significantly to reproducibility by standardizing the comparison between algorithms. This commitment to open science is exemplary and greatly enhances the credibility and utility of the research.
The paper transparently discusses several important limitations inherent to the Analytic Policy Gradients approach:
1. **Environment Differentiability**: APG fundamentally requires the environment dynamics and reward function to be differentiable. This restricts its application to specific simulators and excludes real-world training or environments with non-differentiable elements (e.g., discrete contact events, complex procedural generation).
2. **Gradient Chain Length Issues**: Despite the use of segmented backpropagation, long episodes can still lead to vanishing or exploding gradients. The effectiveness of APG remains sensitive to the choice of segment length and bootstrap strategy, as demonstrated by the ablation studies.
3. **Compute Overhead**: Maintaining the full computation graph during environment rollouts incurs higher memory and computational overhead compared to model-free methods like PPO, which use detached rollouts. This can be a practical concern for very complex environments or extremely long horizons.
4. **Model Bias (for future work)**: While the current work uses ground-truth differentiable dynamics, the authors acknowledge that extending APG to learned differentiable world models would introduce model bias, which could potentially counteract the variance reduction benefits.
This paper has a significant positive broader impact, primarily within the differentiable reinforcement learning and robotics communities.
1. **Advancing Differentiable RL**: It provides compelling empirical evidence for the sample efficiency benefits of leveraging differentiable environment dynamics, addressing a critical bottleneck in real-world RL applications.
2. **Practical Tools and Enablement**: The Warp-PyTorch gradient bridge is a crucial engineering contribution that makes complex, GPU-accelerated physics engines (like NVIDIA Newton/Warp) more accessible for differentiable RL research within the widely used PyTorch framework. This can accelerate progress in areas such as robot manipulation and locomotion.
3. **Improved Evaluation Standards**: The unified benchmarking harness and multi-axis evaluation protocol set a higher standard for comparing model-free and differentiable RL algorithms, promoting more rigorous and fair assessments across the field.
4. **Guidance for Practitioners**: The detailed ablation studies on bootstrap strategies and segment lengths offer practical, actionable advice for researchers and engineers designing differentiable RL systems, helping them make more informed choices for robust training.
5. **Open Science Contribution**: The release of the full codebase and environment suite fosters open research, enabling others to reproduce, verify, and build upon this work, accelerating collective progress in the field.
This paper rigorously validates the benefits of Analytic Policy Gradients in differentiable continuous control. It provides a robust benchmarking framework, introduces a practical gradient bridge for complex physics engines, and offers valuable insights into segmented backpropagation strategies, significantly advancing the practical applicability and understanding of differentiable reinforcement learning.
Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
Primary: Max Planck Institute for Human Development
All Institutions: Max Planck Institute for Human Development, University of Konstanz, Barcelona Supercomputing Center, University of Basel
This paper provides a critical psychometric analysis demonstrating that apparent psychological profiles in LLMs are largely artifacts of response bias rather than genuine traits, fundamentally challenging current evaluation practices and calling for more rigorous, orthogonal measurement frameworks in AI research.
The paper employs a rigorous psychometric framework to deconstruct the measurement of LLM "personalities." By modeling responses as a function of latent trait and response bias, and utilizing the concept of "response orthogonality" (the proportion of items for which trait and bias point in opposite directions, typically reverse-keyed items), the authors provide a mathematically sound method to separate genuine trait variance from systematic response artifacts. This approach is theoretically robust and directly addresses a fundamental confound in current LLM evaluation methodologies.
The experimental design is comprehensive, testing 56 instruction-tuned LLMs against a battery of 29 instruments (personality and risk preference) and comparing them against large human reference samples. The results are striking and consistent: LLMs show positive forward-reverse correlations (indicating bias dominance) whereas humans show negative correlations (indicating trait dominance). The variance decomposition showing 81-90% of LLM variation is bias-driven is a powerful empirical finding. The robustness checks across prompting conditions, model sizes, and elicitation methods strengthen the validity of the conclusions.
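The sign flip in forward-reverse correlations can be reproduced with a small simulation. This is a minimal sketch under a deliberately simple additive model that is assumed here for illustration and is not the paper's measurement model: a raw response to a forward-keyed item is `trait + bias + noise`, and to a reverse-keyed item `-trait + bias + noise`. The standard deviations, sample sizes, and function names are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def forward_reverse_correlation(trait_sd, bias_sd, n_respondents=500,
                                n_items=10, noise_sd=0.3, seed=0):
    """Correlate mean raw scores on forward-keyed and reverse-keyed items."""
    rng = random.Random(seed)
    forward, reverse = [], []
    for _ in range(n_respondents):
        trait = rng.gauss(0.0, trait_sd)  # what the instrument targets
        bias = rng.gauss(0.0, bias_sd)    # pull toward one end of the scale
        forward.append(sum(trait + bias + rng.gauss(0.0, noise_sd)
                           for _ in range(n_items)) / n_items)
        reverse.append(sum(-trait + bias + rng.gauss(0.0, noise_sd)
                           for _ in range(n_items)) / n_items)
    return pearson(forward, reverse)


trait_dominated = forward_reverse_correlation(trait_sd=1.0, bias_sd=0.3)
bias_dominated = forward_reverse_correlation(trait_sd=0.3, bias_sd=1.0)
print("trait-dominated respondents:", trait_dominated)
print("bias-dominated respondents:", bias_dominated)
```

When trait variance dominates, raw forward and reverse scores move in opposite directions, the pattern reported for humans. When bias variance dominates, they move together, the pattern reported for LLMs. An instrument with only forward-keyed items cannot tell the two cases apart, which is why the share of opposing items matters.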
The paper provides detailed methodological descriptions, including the specific instruments used, the prompting strategies, and the mathematical derivations for the measurement model. The code repository is explicitly linked, ensuring high reproducibility. The use of standard psychometric instruments and clear definitions of variables enhances the clarity of the experimental setup.
The study focuses primarily on post-trained models and treats each model as a single respondent, which may not capture within-model variability or the nuances of different prompting strategies (e.g., persona adoption). The proprietary model sample size is small (N=10), limiting statistical power for those specific comparisons. Additionally, the study is limited to personality and risk domains; while the authors argue for broader applicability, empirical validation in other domains (e.g., moral reasoning, cognitive biases) is left for future work.
This paper has significant implications for the field of AI safety, alignment, and the use of LLMs as proxies for human participants in research. It challenges the validity of current LLM profiling practices and calls for a re-evaluation of how we measure and interpret LLM behaviors. The concept of response orthogonality offers a new standard for designing valid evaluation instruments for AI systems.
Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.
Primary: Zhejiang University
All Institutions: Zhejiang University, Huawei, Harbin Institute of Technology, Shenzhen
SafeSpec introduces a novel safety-aware speculative inference framework that elegantly integrates risk estimation and recovery directly into the LLM decoding process. This work makes a substantial technical contribution by demonstrating a superior safety-efficiency trade-off through a lightweight latent safety head and a dynamic rollback-and-reflect multi-sampling mechanism, offering a practical and impactful solution for deploying safer and faster LLMs in real-world applications.
The methodology of SafeSpec is well-conceived and addresses a critical gap in LLM deployment: integrating safety guarantees into speculative inference without negating its acceleration benefits. The core innovation lies in its dual-head verification mechanism. By attaching a lightweight, boundary-aligned latent safety head to the target model, SafeSpec enables simultaneous assessment of semantic validity and safety in a single forward pass. This design is elegant as it leverages the target model's existing computation for quality scoring, incurring negligible additional overhead for safety checks. The boundary-aligned extraction of hidden states for the safety head is a clever detail, preventing interference from the quality scoring prompt. The training methodology for the safety head, using step-wise prefix construction and a guard model for labeling, is sound for aligning the head with the inference process. The "rollback-and-reflect" mechanism, coupled with safety-guided multi-sampling, is a significant departure from traditional hard refusal strategies. Framing jailbreak attacks as distributional shifts where harmful continuations become more probable but safe ones are not entirely eliminated provides a strong theoretical underpinning for the multi-sampling approach. The rollback to a previous, potentially "cleaner" state, combined with a reflection prompt, effectively reshapes the sampling space, increasing the probability of finding a safe continuation. This soft intervention strategy is crucial for maintaining utility and helpfulness, avoiding the common pitfall of over-refusal. The probabilistic view of multi-sampling is clearly articulated, demonstrating how increasing sample size $K$ improves the chance of recovery.
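The probabilistic view of multi-sampling can be sketched as follows. This is a simplified illustration, not SafeSpec's implementation: it assumes the $K$ samples are independent and that each is safe with the same probability `p_safe`. The helper `pick_safe_continuation`, its selection rule (keep the lowest-risk candidate), and the toy risk scores are all hypothetical.

```python
def recovery_probability(p_safe, k):
    """P(at least one of k independent samples is safe) = 1 - (1 - p_safe)^k."""
    if not 0.0 <= p_safe <= 1.0:
        raise ValueError("p_safe must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_safe) ** k


def pick_safe_continuation(sample_fn, risk_fn, k, risk_threshold):
    """Draw k candidates and return the lowest-risk one if it clears the threshold.

    sample_fn: callable returning one candidate continuation (hypothetical).
    risk_fn:   callable mapping a candidate to a risk score (hypothetical
               stand-in for a safety head).
    Returns None when no candidate is safe enough.
    """
    candidates = [sample_fn() for _ in range(k)]
    best = min(candidates, key=risk_fn)
    return best if risk_fn(best) < risk_threshold else None


# Even if an attack leaves only a 30% chance that a single sample is safe,
# drawing more samples quickly raises the chance of recovering one.
for k in (1, 2, 4, 8):
    print(k, round(recovery_probability(0.3, k), 3))

# Toy usage with made-up risk scores.
toy_risk = {"harmful draft": 0.9, "hedged draft": 0.4, "safe draft": 0.1}
drafts = iter(toy_risk)
chosen = pick_safe_continuation(lambda: next(drafts), toy_risk.get,
                                k=3, risk_threshold=0.5)
print(chosen)  # safe draft
```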
The experimental evaluation is comprehensive and rigorous. The authors use two distinct model families (Qwen3-32B and DeepSeek-R1-Distill-Llama-70B) with appropriate draft models, demonstrating the framework's scalability and versatility. Evaluation metrics cover three critical dimensions: defense against seven advanced adversarial attacks (ASR), over-refusal rates (XSTest), and general capabilities/efficiency (GSM8K, MATH, GPQA-diamond, and inference speedup). This multi-faceted evaluation provides a holistic view of SafeSpec's performance. SafeSpec consistently achieves state-of-the-art defense performance, significantly reducing ASR (e.g., by 15% on Qwen3-32B) while preserving substantial inference speedups on benign workloads (2.06x on Qwen3-32B, 1.76x on DeepSeek-R1-Distill-Llama-70B). Crucially, it maintains low over-refusal rates and negligible accuracy degradation on general reasoning tasks, showcasing a superior safety-efficiency trade-off compared to strong baselines like SafeDecoding and SecDecoding. The ablation studies are well-designed, clearly demonstrating the necessity and synergistic effect of both the reflection prompt and multi-sampling. The comparison with a hard refusal strategy effectively highlights the benefits of SafeSpec's recovery mechanism. Hyperparameter analysis provides valuable insights into the trade-offs involved with sample size, safety threshold, and quality threshold. The detailed latency breakdown in the appendix is particularly insightful, transparently explaining the performance characteristics on benign vs. adversarial inputs and justifying the reduced throughput on jailbreak inputs as a feature of the defense. The comparison with a standalone guard model further validates SafeSpec's efficiency and user experience advantages.
The paper demonstrates good reproducibility. Code is made available on GitHub. The appendix provides detailed information on evaluation datasets, jailbreak prompt construction, quality scoring prompt, safety head configurations (architecture, parameter counts), and training setup (data sources, sampling, hyperparameters, data isolation). Layer choice ablation and per-benchmark sensitivity analysis for quality threshold are also included, providing further confidence in the design choices. The use of a fixed random seed is also mentioned.
1. **Reliance on Guard Model for Labeling**: The training data for the safety head is labeled using Qwen3Guard-Gen-8B. The performance and biases of this external guard model could implicitly limit the safety head's effectiveness and generalization, especially if the guard model itself is imperfect or susceptible to certain attacks.
2. **Heuristic Nature of Reflection Prompt**: While effective, the reflection prompt is a handcrafted heuristic. Its optimal design might be sensitive to the target model or specific attack types, and its generalizability across all future attacks is not guaranteed.
3. **Performance on Adversarial Inputs**: Although justified as a necessary cost for safety, the significant slowdown on jailbreak inputs (throughput below 1x) means that if an attacker can consistently trigger Safety Mode, they can effectively degrade the system's performance, even if they don't get a harmful response. This could be a denial-of-service vector.
4. **Adversarial Attacks on Safety Head**: As the safety head is a lightweight classifier, it might be susceptible to direct adversarial attacks designed to bypass it, rather than just the main LLM. The paper does not explore this.
5. **Fixed Rollback State**: The rollback mechanism reverts to the "previous state." For deeply embedded or multi-turn attacks, a single step rollback might not always be sufficient to reach a truly "clean" context, potentially requiring more sophisticated context recovery.
SafeSpec has a significant positive broader impact by enabling the deployment of safer and more efficient large language models. By addressing the fundamental incompatibility between speculative inference and existing safety defenses, it paves the way for LLMs to be used in more sensitive and performance-critical applications. This framework can help mitigate the risks of harmful content generation, thereby increasing public trust and responsible AI deployment. The approach of recovering safe continuations rather than hard refusal also improves user experience and model helpfulness. The paper does not highlight any specific negative societal consequences beyond the general risks associated with LLMs, which its work aims to mitigate.
Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge. Our analysis shows that conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search. We then examine detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR. We evaluate a proof-of-concept realization of this strategy through Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
Primary: Keysight Technologies Inc.
All Institutions: Keysight Technologies Inc.
This paper introduces a theoretically grounded and empirically validated "detect-and-misdirect" defense strategy that significantly reduces the success rate of model-guided automated attacks on agentic AI systems. Through a probabilistic model, the authors demonstrate the inherent vulnerability of conventional detect-and-block defenses to iterative search and show how misdirection, by inducing false positives in the attacker's judge, can bound asymptotic attack success. The practical instantiation, Contextual Misdirection via Progressive Engagement (CMPE), is shown to be highly effective in end-to-end evaluations against state-of-the-art attack frameworks, nearly eliminating verified attack success and causing premature termination, thereby offering a crucial new paradigm for enhancing the security and robustness of autonomous AI.
The paper introduces a novel "detect-and-misdirect" defense strategy against model-guided automated attacks on agentic AI systems, contrasting it with conventional "detect-and-block" approaches. The methodology is robust, starting with a probabilistic model of the attack-defense setting. This model rigorously demonstrates a fundamental limitation of detect-and-block defenses: predictable refusals provide useful feedback to automated search, allowing attacker success rate (ASR) to approach one as the query budget grows. This theoretical insight is crucial. The proposed detect-and-misdirect strategy is then formalized, showing that by introducing misdirection-induced false positives (MI-FP) in the attacker's automated judge, the positive predictive value (PPV) of attacker-selected candidates is reduced, leading to a bounded asymptotic ASR. This theoretical underpinning is a significant strength. The paper then instantiates this strategy with Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational mechanism. CMPE comprises three components: a positive-intent preamble, safe context expansion via prompt-reshaping (token-level transformations, lexical injection, shuffling), and a follow-up question. This design specifically aims to appear cooperative and semantically plausible to an LLM-based judge, while containing no genuinely harmful content, thus exploiting known limitations of such judges that often rely on heuristic cues (tone, structure, perceived intent) rather than strict semantic correctness. The CMPE algorithm is clearly described, enhancing its practicality and reproducibility. The methodology effectively bridges theoretical analysis with a concrete, implementable defense.
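The contrast between the two defenses can be illustrated with a toy model. This is a deliberately simplified sketch, not the paper's probabilistic model. It assumes that queries are independent, that each query yields a genuinely harmful response with probability `p_success`, that every other query receives a refusal (block) or a misdirection response (misdirect), and that the attacker stops at the first response its judge accepts. The function names and the numeric values are hypothetical.

```python
def asr_detect_and_block(p_success, n_queries):
    """Refusals are unambiguous, so the attacker simply keeps searching.

    ASR after n queries is the chance that at least one query truly succeeds.
    """
    return 1.0 - (1.0 - p_success) ** n_queries


def asr_detect_and_misdirect(p_success, judge_tpr, judge_fpr, n_queries):
    """Attacker stops at the first response its automated judge accepts.

    judge_tpr: probability the judge accepts a genuinely harmful response.
    judge_fpr: probability the judge accepts a misdirection response
               (a misdirection-induced false positive).
    """
    accept_true = p_success * judge_tpr
    accept_false = (1.0 - p_success) * judge_fpr
    accept_any = accept_true + accept_false
    if accept_any == 0.0:
        return 0.0
    ppv = accept_true / accept_any                  # positive predictive value
    p_stopped = 1.0 - (1.0 - accept_any) ** n_queries
    return ppv * p_stopped                          # tends to ppv as n grows


for n in (10, 100, 1000):
    block = asr_detect_and_block(0.01, n)
    misdirect = asr_detect_and_misdirect(0.01, judge_tpr=0.9, judge_fpr=0.5,
                                         n_queries=n)
    print(f"n={n:5d}  block={block:.4f}  misdirect={misdirect:.4f}")
```

In this toy model the block curve climbs toward one as the query budget grows, while the misdirect curve saturates at the judge's positive predictive value, mirroring the bounded asymptotic ASR described above.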
The experimental evaluation is comprehensive and compelling, validating the proposed strategy through both simulated ASR bounds and end-to-end attack runs.
1. **Judge Error Rate Estimation**: The authors evaluate CMPE's ability to induce misdirection-induced false positives (MI-FP) across a diverse set of six modern automated judge models (including rubric-based LLM judges like StrongREJECT and PAIR, and classifier judges like HB-FT-LLaMA2-13B, GPTFuzz-RoBERTa, and Llama-Guard-3-8B). This is performed on 500 high-risk jailbreak prompts from the AdvBench dataset. Each prompt-response pair is evaluated 10 times to account for stochasticity. The results clearly show that CMPE responses consistently receive high harmfulness scores from judges, demonstrating its effectiveness in generating MI-FPs.
2. **Simulated ASR Evaluation**: Using these per-sample judge error estimates, the paper computes simulated maximum ASR upper bounds for various attacker-defender judge configurations. The results show that CMPE substantially reduces the estimated ASR upper bound, often by one to two orders of magnitude, compared to the detect-and-block baseline. This directly supports the theoretical prediction that misdirection degrades the attacker's PPV and bounds their success.
3. **End-to-End Attack Framework Evaluation**: This is the most impactful part of the evaluation. CMPE is tested against two representative model-guided attack frameworks, GPTFuzz and PAIR, using both an aligned victim model (Vicuna-13b-v1.5) and a refusal-suppressed model (NeuralDaredevil-8B-abliterated). The experiments emulate a realistic agentic security setting. The results are striking: CMPE nearly eliminates verified attack success (reducing ASR from 10-20% to 0-2%) and causes the automated attack frameworks to terminate prematurely due to accepting misdirection responses as successful. This demonstrates that CMPE effectively disrupts the attack loop by making apparent successes untrustworthy.

The use of manual validation with a secondary LLM judge for final verification adds credibility to the reported true positive rates. The experimental setup is well-controlled, with local defense and attack components hosted on separate systems.
The paper provides sufficient details to facilitate reproducibility. The probabilistic model is clearly defined with equations. The CMPE algorithm is presented in detail, including its three components and an example. Specific models used for response generation (NeuralDaredevil-8B-abliterated) and judging (various LLM and classifier judges, including their backend models) are named, along with the dataset (AdvBench) and its source. The experimental setup for both simulations and end-to-end runs (number of prompts, iterations, victim models, attacker models, defense models, validation judges, hardware) is described. URLs for the AdvBench dataset and the NeuralDaredevil model are provided. This level of detail is commendable and supports the reproducibility of the work.
1. **Attacker Adaptation**: While the paper discusses potential attacker adaptations (e.g., judge ensembling, stricter calibration), it acknowledges that these introduce trade-offs (e.g., increased false negatives for the attacker). However, the arms race between attackers and defenders is continuous, and more sophisticated misdirection detection methods might emerge.
2. **Generality of Misdirection**: CMPE is a specific instantiation of the detect-and-misdirect strategy. While effective, developing equally lightweight and effective misdirection for other types of prompt injection or agentic attacks might require different approaches.
3. **Complexity of Misdirection Generation**: Although CMPE is described as lightweight, generating consistently plausible and misleading responses without inadvertently triggering harmful behavior or being easily detectable as non-operational could become challenging for more complex attack scenarios or highly sophisticated attackers.
4. **Focus on Jailbreak**: The evaluation primarily focuses on jailbreak attacks. While the theoretical framework is general, the CMPE instantiation and empirical validation are specific to jailbreaking. Its effectiveness against other prompt injection variants (e.g., data exfiltration, tool misuse) would need further investigation.
5. **Human Oversight in Validation**: The final validation of true positives still requires manual inspection and a secondary LLM judge, highlighting the inherent difficulty in fully automating the verification of malicious intent, even for the defender.
This work has significant broader impact for the security and robustness of agentic AI systems. 1. **Paradigm Shift in Defense**: It proposes a fundamental shift from reactive blocking to proactive misdirection, offering a new conceptual framework for designing defenses against automated adversarial attacks. This could inspire new research directions in active defense strategies for LLMs. 2. **Enhanced Agentic AI Security**: As agentic AI systems become more prevalent, their susceptibility to automated attacks is a critical concern. The detect-and-misdirect strategy provides a practical and effective method to improve the resilience of these systems, making them safer for deployment in real-world applications. 3. **Improved Red Teaming**: The insights into how automated judges can be misled can inform the development of more robust red teaming methodologies, pushing attackers to develop more sophisticated (and costly) verification mechanisms. 4. **Understanding LLM Limitations**: The work highlights and leverages specific limitations of LLM-based judges, particularly their reliance on heuristic cues. This contributes to a deeper understanding of how LLMs interpret and evaluate text, which is valuable for both defense and general LLM development. 5. **Deterrence**: By making apparent successes less trustworthy, misdirection can serve as a psychological and operational deterrent, increasing the cost and uncertainty for attackers.
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology, Unknown Institution 2, Unknown Institution 3
This paper introduces the Independent Combinatorial Tokens (ICT) framework, a novel approach to stabilize Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning by focusing sparse updates on "unique tokens" identified via Jensen-Shannon divergence of logit distributions. The work provides a strong theoretical foundation, meticulously analyzing entropy dynamics with second-order Rényi entropy, and demonstrates significant, consistent empirical gains across multiple LLM scales and diverse reasoning benchmarks, offering a principled and efficient method for enhancing LLM exploration and reasoning capabilities.
The paper introduces the Independent Combinatorial Tokens (ICT) framework to address the fundamental optimization instability (entropy collapse/explosion) in Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning. The core idea is to shift the optimization focus from scalar uncertainty (Shannon Entropy) to the distributional properties of token logits. Specifically, ICT leverages Jensen-Shannon (JS) divergence to identify "unique tokens" whose logit distributions significantly deviate from the sequence-level average distribution. These unique tokens are posited as critical branching points for guiding effective exploration. The theoretical analysis is a major strength. It rigorously grounds the ICT framework in both Shannon and second-order Rényi entropy ($H_2$). The paper meticulously derives the gradient dynamics of $H_2$ and introduces the concept of "strategy purity" (collision probability) to formalize entropy bifurcation into regimes of collapse (updating high-probability tokens) and explosion (updating low-probability tokens). It proves that selectively updating unique tokens, which are shown to reside near the strategy purity threshold, regulates policy concentration by reducing overall Shannon entropy while controlling probability concentration via $H_2$. The appendix provides extensive derivations, including the homogeneity of $H_1$ and $H_2$ gradients under certain conditions, and a formal bridge connecting JS-unique tokens to these critical branching points. The ICT framework then integrates this insight into a sparse policy gradient estimator built upon Group Relative Policy Optimization (GRPO). An ICT distributional selector constructs a binary mask, retaining only the top-k percentile of tokens based on their JS uniqueness scores. This sparse mask is applied to the GRPO objective, ensuring that optimization resources are focused on high-information learning signals. 
The methodology is well-articulated, providing a principled approach to stabilize RLVR training.
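The selection step described above can be sketched in a few lines. This is a minimal, pure-Python illustration, not the authors' implementation: it assumes the "sequence-level average distribution" is the mean of the per-token softmax distributions, scores each position by its JS divergence from that mean, and keeps roughly the top `keep_fraction` of positions. The function names and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def unique_token_mask(token_logits, keep_fraction=0.1):
    """Binary mask over positions, keeping the most distinctive distributions.

    token_logits: one list of vocabulary logits per token position.
    Returns (mask, scores); the mask would gate per-token loss terms.
    """
    probs = [softmax(row) for row in token_logits]
    n = len(probs)
    mean = [sum(col) / n for col in zip(*probs)]    # assumed sequence average
    scores = [js_divergence(p, mean) for p in probs]
    k = max(1, round(keep_fraction * n))            # always keep at least one
    ranked = sorted(range(n), key=scores.__getitem__, reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(n)], scores


# Toy sequence: nine similar positions and one with a different preference.
toy_logits = [[2.0, 0.0, 0.0, 0.0] for _ in range(10)]
toy_logits[3] = [0.0, 0.0, 0.0, 3.0]
mask, scores = unique_token_mask(toy_logits, keep_fraction=0.1)
print(mask)  # only position 3 is selected
```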
The experimental evaluation is comprehensive and robust. The authors evaluate ICT on the Qwen2.5 series of models (0.5B, 1.5B, 7B), demonstrating scalability across different model sizes. Seven benchmarks are used, spanning diverse reasoning tasks including math (GSM8K, Math500, AIME23/24/25), commonsense (GPQA), and general knowledge (MMLU-Stem). This broad evaluation scope strengthens the generalization claims. ICT is compared against strong baselines: GRPO (the backbone RLVR algorithm), 20-Entropy, and STAPO. The results consistently show that ICT achieves the highest average Pass@1, Pass@4, and Total scores across all model scales and benchmarks. The average Pass@4 improvement of 4.58% (with a maximum gain of 14.9%) over baselines is significant, especially considering that only the top 10% of unique tokens are updated. This "less is more" finding is compelling. A key empirical finding is the differential improvement in Pass@4 versus Pass@1, indicating enhanced exploration capacity. Ablation studies further validate the theory:
1. **Update Ratios**: Comparing different sparsity ratios (10%, 20%, 50%, 100%) shows that updating only the top 10% of unique tokens yields the best performance, aligning with the hypothesis that focusing on critical branching points is optimal.
2. **Composition of Unique Tokens**: The analysis reveals that the ratio of high-entropy to low-entropy tokens among the selected unique tokens is approximately 1:1 (1.03 on GSM8K, 0.99 on MATH). This empirically confirms the theoretical prediction that unique tokens are drawn from both entropy collapse (Regime H) and entropy explosion (Regime L) regimes, thereby maintaining balanced entropy dynamics.

The experiments are well-designed to support the theoretical claims and demonstrate practical efficacy.
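For readers unfamiliar with the Pass@k metric, the sketch below shows a commonly used unbiased estimator computed from `n` sampled solutions of which `c` are correct. The review does not state how the paper computes Pass@1 and Pass@4, so treating this estimator as the one used is an assumption, and the sample counts are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k drawn samples is correct).

    n: number of sampled solutions for a problem.
    c: number of those samples that are correct.
    k: evaluation budget (e.g. 1 or 4).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 8 samples, only 1 correct.
print(pass_at_k(8, 1, 1))  # 0.125
print(pass_at_k(8, 1, 4))  # 0.5
```

A policy that occasionally finds a correct path gains far more at k = 4 than at k = 1, which is why a larger Pass@4 improvement is read as a sign of better exploration.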
The paper states that the training pipeline is built upon VeRL and closely follows the GRPO training recipe, with the only difference being the sparse updates. It mentions using the mean across 5 independent random seeds and provides details on datasets and baselines. More implementation details are said to be in the Appendix, which does provide extensive theoretical derivations but not explicit code or hyperparameter tables. While the methodology is clearly described, the absence of a public code repository or a detailed hyperparameter table in the main paper or appendix makes full reproducibility challenging without significant effort to replicate the VeRL/GRPO setup and then implement ICT.
1. **Strictness of H1/H2 Homogeneity Condition**: The paper acknowledges that the condition for co-directional Shannon entropy gradients (a token probability above $e^{-1} \approx 0.37$) is restrictive for typical LLM token probabilities. While the $H_2$ analysis is unconditionally valid and an extended homogeneity for top-k tokens is argued, this still highlights a nuance in the theoretical claims. 2. **First-Order Approximation**: The theoretical derivations rely on a first-order Taylor approximation for entropy change, which assumes infinitesimally small step sizes. While justified for small learning rates, aggressive learning rates or large advantage spikes could lead to non-negligible higher-order effects. 3. **Computational Overhead**: Although the paper claims negligible computational overhead for JS divergence computation due to parallel batch processing, it still represents an additional step in the training loop. The primary savings come from the sparse backward pass, but the forward pass still involves this calculation. 4. **Generalizability Beyond Reasoning**: While the paper demonstrates strong results on reasoning tasks, the applicability of "unique tokens" identified via JS divergence to other LLM tasks (e.g., creative writing, summarization, dialogue) is not explored. 5. **Code Availability**: The lack of a publicly available code repository is a common limitation for arXiv papers and hinders direct reproducibility and adoption by the community.
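One way to see where an $e^{-1}$ threshold of the kind named in item 1 can arise is to look at a single token's contribution to each entropy in isolation. This is an illustrative simplification, not the paper's derivation: the notation $p_a$ for the probability of token $a$ is assumed here, and the normalization constraint on the distribution is ignored.

$$
\frac{\partial}{\partial p_a}\bigl(-p_a \ln p_a\bigr) = -(1 + \ln p_a) < 0 \iff p_a > e^{-1} \approx 0.37
$$

$$
H_2 = -\ln \sum_b p_b^2, \qquad \frac{\partial H_2}{\partial p_a} = -\frac{2\,p_a}{\sum_b p_b^2} < 0 \quad \text{for every } p_a > 0
$$

Under this simplification the Shannon term changes sign at $p_a = e^{-1}$, whereas the second-order Rényi term has the same sign for every token, which is at least consistent with the review's remark that the $H_2$ analysis holds without a probability condition.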
The ICT framework offers a principled solution to a fundamental instability in RLVR, a critical technique for aligning LLMs with objective correctness in complex domains like mathematics and programming. By enabling more stable and effective exploration, ICT can lead to: 1. **Improved LLM Reasoning**: The demonstrated gains in Pass@4 suggest that models trained with ICT can explore more diverse and correct reasoning paths, leading to higher-quality solutions. 2. **Enhanced Training Efficiency**: The "less is more" finding, where updating only 10% of tokens yields superior results, implies potential for significant computational savings during RL fine-tuning, making it more accessible and scalable. 3. **Deeper Understanding of LLM Dynamics**: The shift from scalar entropy to distributional properties provides a new lens for understanding how LLMs make decisions and explore, potentially inspiring further research into information-theoretic approaches for policy optimization. 4. **Foundation for Future RLVR Algorithms**: The token-centric foundation established by this work could serve as a building block for the next generation of RLVR algorithms, moving beyond uniform updates to more intelligent, selective gradient application. This paper introduces the Independent Combinatorial Tokens (ICT) framework, a novel approach to stabilize Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning by focusing sparse updates on "unique tokens" identified via Jensen-Shannon divergence of logit distributions. The work provides a strong theoretical foundation, meticulously analyzing entropy dynamics with second-order Rényi entropy, and demonstrates significant, consistent empirical gains across multiple LLM scales and diverse reasoning benchmarks, offering a principled and efficient method for enhancing LLM exploration and reasoning capabilities.
A/B testing has become the gold standard for data-driven decision-making in large-scale online experimentation, providing critical guidance for feature launch, pricing optimization, and user experience enhancement. To maximize statistical sensitivity, many technology companies routinely employ Controlled-experiment Using Pre-Experiment Data (CUPED), a technique that achieves substantial variance reduction while preserving the unbiasedness of estimating the average treatment effect. Despite its widespread adoption, several critical methodological and practical nuances of CUPED remain underexplored. This paper systematically addresses five frequently encountered yet overlooked questions regarding the application of CUPED. First, we provide a comparative analysis of various post-CUPED estimators to identify the optimal adjustment specification. Second, we evaluate the validity of regression-based adjustments and delineate robust variance estimation methods tailored for such frameworks. Finally, we extend our investigation to complex but common scenarios, including multi-arm experiments and two-stage sampling designs. Our findings reveal that in these settings, naive reliance on standard variance estimators can lead to severely misleading inferences. By offering rigorous theoretical insights and extensive experimental validation, this work deepens the conceptual understanding of CUPED. Notably, the recommended methodologies have been successfully deployed and integrated into ByteDance's experimentation platform.
Primary: Zhongtai Securities Institute for Financial Studies
All Institutions: Zhongtai Securities Institute for Financial Studies, Shandong University, ByteDance
This paper provides a rigorous and highly practical investigation into the nuances of CUPED for online A/B testing, offering critical insights and actionable recommendations for ensuring trustworthy inference in complex experimental designs. Through a combination of robust theoretical proofs and extensive empirical validation using real-world data from ByteDance, the authors systematically address common pitfalls in variance estimation and adjustment specification, particularly in multi-arm and two-stage sampling scenarios, thereby significantly enhancing the reliability and efficiency of large-scale experimentation.
The paper adopts a design-based framework under a completely randomized design, which is standard and robust for A/B testing. It systematically addresses five key questions regarding CUPED, focusing on practical nuances often overlooked in industry. The methodology involves a rigorous comparative analysis of various CUPED estimators: full-sample, pooled-sample, split-sample, and regression-based (with and without interaction). A significant strength is the detailed evaluation of inferential validity, particularly concerning variance estimation. The paper highlights the critical need for robust variance estimators (sandwich estimators) in regression-based CUPED, especially under heteroscedasticity and imbalanced group sizes, directly addressing Freedman's critique. The core methodological contributions lie in extending CUPED analysis to complex, yet common, scenarios: multi-arm experiments and two-stage sampling designs. For multi-arm experiments, the paper theoretically proves that using the full-sample covariate mean for adjustment is more efficient than local (pairwise) adjustments and derives a necessary variance correction for the local approach to maintain inferential validity. For two-stage sampling, it demonstrates that split-sample CUPED retains its efficiency advantages but requires a variance correction to account for the compounded randomness from the initial sampling stage. The theoretical results are presented as theorems with detailed proofs provided in the appendix, demonstrating a high level of mathematical rigor. The paper also provides a nuanced discussion on the choice between model-free and regression-based approaches, emphasizing computational efficiency and metric structure over perceived robustness differences.
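For readers unfamiliar with the adjustment being compared, the following is a minimal sketch of a textbook single-covariate CUPED estimate on simulated data. It is not claimed to match any of the paper's named estimators: the function name `cuped_difference`, the simulation parameters, and the choice to estimate the coefficient on all units and center at the full-sample covariate mean are assumptions made for illustration.

```python
import numpy as np


def cuped_difference(y_t, x_t, y_c, x_c):
    """Difference in means after a single-covariate CUPED adjustment.

    theta is estimated on all units and the covariate is centered at the
    full-sample mean (an illustrative choice, not the paper's specification).
    """
    y = np.concatenate([y_t, y_c])
    x = np.concatenate([x_t, x_c])
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    x_bar = x.mean()
    adj_t = y_t - theta * (x_t - x_bar)
    adj_c = y_c - theta * (x_c - x_bar)
    estimate = adj_t.mean() - adj_c.mean()
    # Plug-in variance that treats theta as fixed (a simplifying assumption).
    variance = adj_t.var(ddof=1) / adj_t.size + adj_c.var(ddof=1) / adj_c.size
    return estimate, variance


# Simulated experiment: the pre-experiment covariate x predicts the outcome y.
rng = np.random.default_rng(0)
n, true_effect = 20_000, 0.1
x = rng.normal(size=n)
treat = rng.random(n) < 0.5
y = 0.8 * x + rng.normal(scale=0.6, size=n) + true_effect * treat

y_t, x_t, y_c, x_c = y[treat], x[treat], y[~treat], x[~treat]
raw_estimate = y_t.mean() - y_c.mean()
raw_variance = y_t.var(ddof=1) / y_t.size + y_c.var(ddof=1) / y_c.size
cuped_estimate, cuped_variance = cuped_difference(y_t, x_t, y_c, x_c)

print(f"raw:   {raw_estimate:.4f} (var {raw_variance:.2e})")
print(f"cuped: {cuped_estimate:.4f} (var {cuped_variance:.2e})")
```

Both estimates target the same treatment effect; the adjusted one has a smaller variance because the covariate explains part of the outcome noise. The plug-in variance here ignores the estimation of `theta`, which is one of the places where the review says naive variance formulas need correction in more complex designs.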
The experimental evaluation is comprehensive and highly relevant to real-world applications. It combines simulation studies with validation using real-world data from ByteDance's experimentation platform. 1. **Type I Error Rate and Power Simulations**: For regression-based CUPED, simulations demonstrate the unreliability of standard OLS variance estimators in imbalanced scenarios, showing inflated Type I errors or excessive conservatism. In contrast, the sandwich estimator consistently maintains stable, conservative control. Similar simulations are conducted for multi-arm and two-stage sampling, clearly illustrating the under-coverage of confidence intervals by naive variance estimators and the effectiveness of the proposed corrections. 2. **Real-World Data Validation**: The paper uses proprietary data from ByteDance's core business metrics (e.g., GMV, user feedback) to empirically validate the theoretical findings. For multi-arm experiments, it shows that the full-sample covariate mean consistently achieves higher variance reduction compared to the corrected local approach. For two-stage sampling, the empirical coverage rates confirm the theoretical predictions regarding the necessity of variance corrections. The experiments are well-designed, covering various allocation schemes and sampling probabilities, which enhances the generalizability of the findings within industrial settings. The use of a large number of replications ($10^4$ to $10^5$) ensures statistical reliability of the simulation results. The direct application and integration into ByteDance's platform serve as a strong testament to the practical utility and validity of the proposed methodologies.
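The contrast between the standard OLS variance and a sandwich estimator under imbalanced, heteroscedastic groups can be sketched as follows. This is an illustration under stated assumptions rather than the paper's simulation: the HC2 variant of the sandwich estimator is chosen here because, for a regression on a treatment indicator alone, it reproduces the familiar two-sample variance; the group sizes, noise scales, and the function name `ols_with_variances` are hypothetical.

```python
import numpy as np


def ols_with_variances(X, y):
    """Return OLS coefficients plus classical and HC2 sandwich covariances."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Classical covariance: assumes one common error variance for all units.
    sigma2 = resid @ resid / (n - k)
    cov_classical = sigma2 * xtx_inv
    # HC2 sandwich: squared residuals rescaled by (1 - leverage).
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    meat = X.T @ (X * (resid**2 / (1.0 - leverage))[:, None])
    cov_hc2 = xtx_inv @ meat @ xtx_inv
    return beta, cov_classical, cov_hc2


# Imbalanced design in which the small treatment arm is also the noisier one.
rng = np.random.default_rng(1)
n_t, n_c = 200, 1800
y_t = rng.normal(loc=0.0, scale=3.0, size=n_t)
y_c = rng.normal(loc=0.0, scale=1.0, size=n_c)
y = np.concatenate([y_t, y_c])
treat = np.concatenate([np.ones(n_t), np.zeros(n_c)])
X = np.column_stack([np.ones_like(treat), treat])

beta, cov_classical, cov_hc2 = ols_with_variances(X, y)
two_sample_var = y_t.var(ddof=1) / n_t + y_c.var(ddof=1) / n_c

print("classical var of effect:", cov_classical[1, 1])
print("HC2 var of effect:      ", cov_hc2[1, 1])
print("two-sample var:         ", two_sample_var)
```

In this setup the classical formula pools the two error variances and so understates the uncertainty of the effect estimate, which is the mechanism behind inflated Type I errors. The same function accepts a design matrix with an additional pre-experiment covariate column for a regression-based adjustment.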
The paper provides a good level of detail for reproducibility of its theoretical claims. All theorems are accompanied by formal proofs in the appendix, allowing independent verification. For the simulation experiments, the data generation process is explicitly described, including distributions and parameters, which should enable reproduction of the simulation results. While the real-world data from ByteDance is proprietary and thus not directly reproducible by external researchers, the methodologies applied to this data are clearly articulated. The overall clarity of the methodological descriptions and the theoretical backing contribute positively to reproducibility.
1. **Focus on Mean-Based Metrics**: While the paper acknowledges ratio metrics and suggests model-free approaches with the delta method, a deeper dive into the specific challenges and optimal CUPED strategies for various complex ratio metrics (e.g., conversion rates, revenue per user) could be beneficial, as these are prevalent in online experimentation. 2. **Generalizability of ByteDance's Scenarios**: While the use of ByteDance's data is a strength, the specific characteristics of their platform and user behavior might influence the magnitude of the observed effects (e.g., variance reduction ratios). While the theoretical results are general, the empirical gains might vary across different platforms. 3. **Computational Cost of Sandwich Estimators**: While sandwich estimators are theoretically robust, their computational cost can be higher than standard OLS estimators, especially with very large datasets or complex models. The paper doesn't extensively discuss the practical implications of this trade-off for real-time A/B testing platforms. 4. **Assumptions of Design-Based Framework**: The paper operates under a design-based framework. While robust, exploring the implications or comparisons with super-population inference frameworks, especially for the two-stage sampling where `p` approaches 0, could offer a more complete picture.
This paper has a significant broader impact on the practice of online experimentation across the technology industry. By systematically addressing critical, yet often misunderstood, aspects of CUPED, it provides a robust framework for ensuring trustworthy A/B testing. 1. **Improved Decision-Making**: By preventing inflated Type I errors and increasing statistical power, the methodologies will lead to more reliable causal inferences, enabling companies to make better, data-driven decisions on feature launches, pricing, and user experience. 2. **Enhanced Efficiency**: The recommendations for optimal adjustment specifications and variance reduction in complex settings will allow experiments to detect smaller effects with fewer users or in shorter durations, accelerating product iteration and reducing opportunity costs. 3. **Standardization of Best Practices**: The paper's clear guidelines and rigorous validations can help standardize CUPED implementation across the industry, moving away from ad-hoc heuristics towards statistically sound practices. 4. **Educational Value**: It serves as a valuable resource for practitioners and researchers, deepening their conceptual understanding of CUPED and its nuances in various experimental designs. The successful integration of these methodologies into ByteDance's platform demonstrates their immediate and large-scale utility, paving the way for wider adoption and a higher standard of rigor in online experimentation. This paper provides a rigorous and highly practical investigation into the nuances of CUPED for online A/B testing, offering critical insights and actionable recommendations for ensuring trustworthy inference in complex experimental designs. 
Through a combination of robust theoretical proofs and extensive empirical validation using real-world data from ByteDance, the authors systematically address common pitfalls in variance estimation and adjustment specification, particularly in multi-arm and two-stage sampling scenarios, thereby significantly enhancing the reliability and efficiency of large-scale experimentation.
Testing conditional independence is fundamental yet intrinsically difficult: without additional assumptions, Type I error control is impossible in general. The "Model-X" paradigm addresses this difficulty by assuming exact knowledge of a relevant conditional distribution. While small deviations from this assumption can sometimes be tolerated in classical one-shot testing, existing sequential conditional independence tests typically require the Model-X conditional to be known exactly, making them fragile when it must instead be estimated. We propose a new approach that is substantially more robust to such estimation error. Our method applies testing-by-betting to an adaptively optimized Kernel Conditional Independence statistic, together with a normalization scheme and a truncate-and-shift calibration strategy. These modifications greatly reduce Type I error inflation while preserving high power across high-dimensional synthetic benchmarks and real-world fairness tasks, outperforming existing sequential Model-X approaches. Code is available at https://github.com/he-zh/SKCI.
Primary: University of British Columbia
All Institutions: University of British Columbia, Alberta Machine Intelligence Institute
This paper introduces SKCI, a robust sequential conditional independence test using adaptive betting and kernel methods that maintains valid Type I error control even when the conditional distribution is estimated online. The work represents a significant advancement in sequential hypothesis testing, particularly for conditional independence, a notoriously difficult problem. By integrating testing-by-betting with adaptive kernel methods and rigorous calibration techniques, the authors address a critical gap in the literature where existing methods fail under distributional estimation error. The theoretical guarantees and strong empirical performance make this a valuable contribution to the statistical learning and machine learning communities, offering a practical tool for real-world applications requiring reliable, anytime-valid inference.
The paper proposes a novel sequential testing framework for Conditional Independence (CI) called SKCI, addressing the fragility of existing Model-X based sequential tests when the conditional distribution $P(A|C)$ must be estimated rather than known. The core innovation lies in combining testing-by-betting with an adaptively optimized Kernel Conditional Independence (KCI) statistic. Key methodological contributions include: 1) A self-normalized payoff function using a cross-U-statistic structure to handle scale invariance; 2) A "truncate-and-shift" mechanism to ensure the wealth process remains a valid supermartingale (or close to it) despite estimation errors in the conditional mean embeddings; 3) A Gaussian approximation strategy to estimate the necessary shift parameter for calibration; and 4) An adaptive optimization loop for kernel hyperparameters and betting fractions using empirical log-wealth proxies. The theoretical analysis provides a finite-sample bound on Type I error inflation, decomposing the drift into Gaussian approximation error and calibration mismatch, which is a rigorous and non-trivial theoretical contribution.
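The generic testing-by-betting mechanism that this methodology builds on can be sketched in a few lines. The sketch takes the payoffs as given numbers in $[-1, 1]$; SKCI's actual payoff (the self-normalized kernel statistic with its calibration step) is not reproduced here, and the function name `betting_test` and the fixed `bet_fraction` are assumptions for illustration, whereas the paper adapts the betting fraction online.

```python
def betting_test(payoffs, alpha=0.05, bet_fraction=0.5):
    """Sequential test by betting with a fixed betting fraction.

    Each payoff must lie in [-1, 1]. If every payoff has non-positive
    conditional mean under the null, the wealth process is a non-negative
    supermartingale, and Ville's inequality bounds the probability that it
    ever reaches 1/alpha by alpha.
    Returns (stopping_time, wealth_path); stopping_time is None if the
    null is never rejected.
    """
    if not 0.0 <= bet_fraction < 1.0:
        raise ValueError("bet_fraction must be in [0, 1)")
    wealth = 1.0
    wealth_path = []
    for t, payoff in enumerate(payoffs, start=1):
        if not -1.0 <= payoff <= 1.0:
            raise ValueError("payoffs must lie in [-1, 1]")
        wealth *= 1.0 + bet_fraction * payoff
        wealth_path.append(wealth)
        if wealth >= 1.0 / alpha:
            return t, wealth_path  # reject the null at time t
    return None, wealth_path


# Consistently positive payoffs make the wealth grow until rejection.
stop_time, path = betting_test([0.5] * 50)
print("rejected at step:", stop_time)

# Payoffs of zero leave the wealth at 1, so the null is never rejected.
never_stop, flat_path = betting_test([0.0] * 50)
print("rejected at step:", never_stop)
```

The sketch also shows why estimation error matters: if the payoffs have a small positive drift under the null because the conditional distribution is misestimated, the wealth can grow without any real dependence, which is the failure mode the calibration step is meant to limit.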
The experimental evaluation is comprehensive and convincing. The authors test SKCI against strong baselines (e-CRT, DAVT, EC2ST) across multiple regimes: Oracle (known conditional), Pretrained (offline estimated conditional), and Online (sequential estimation). They use challenging synthetic benchmarks (Gaussian, CI Hardness, RatInABox) and real-world applications (dSprites, Car Insurance Discrimination). The results demonstrate that SKCI significantly outperforms baselines in terms of Type I error control in the Online and Pretrained settings, where other methods suffer from severe inflation or loss of power. The inclusion of fairness and biological data adds practical relevance. The ablation studies and sensitivity analysis support the theoretical claims regarding batch size and regularization.
The paper provides a clear algorithm description, detailed theoretical proofs in the appendix, and a link to the source code. The experimental setup is well-described, including data splits and hyperparameter selection strategies. The code availability ensures high reproducibility.
The method relies on kernel ridge regression for conditional mean embeddings, which can scale poorly with very large datasets ($O(N^3)$ or $O(N^2)$ depending on implementation). The Gaussian approximation for the shift parameter is an assumption that may not hold perfectly in finite samples with heavy-tailed distributions, although the theory bounds this error. The adaptive optimization of kernel parameters adds computational overhead per batch compared to fixed-kernel methods.
Conditional independence testing is fundamental for causal discovery, fairness auditing, and robust machine learning. By providing a robust sequential test that works with estimated conditionals, this work enables more reliable and flexible inference in online settings, such as real-time fairness monitoring or adaptive experimental design. This has positive societal implications by improving the reliability of automated decision-making systems. This paper introduces SKCI, a robust sequential conditional independence test using adaptive betting and kernel methods that maintains valid Type I error control even when the conditional distribution is estimated online. The work represents a significant advancement in sequential hypothesis testing, particularly for conditional independence, a notoriously difficult problem. By integrating testing-by-betting with adaptive kernel methods and rigorous calibration techniques, the authors address a critical gap in the literature where existing methods fail under distributional estimation error. The theoretical guarantees and strong empirical performance make this a valuable contribution to the statistical learning and machine learning communities, offering a practical tool for real-world applications requiring reliable, anytime-valid inference.
Math and science reasoning benchmarks rely on pass@k, the fraction of sampled chains that reach gold, as the canonical per-example difficulty signal. The same signal drives RL with verifiable rewards, math data curation, synthetic curricula, and verifier training. We show this proxy has a persistent blind spot on its hardest stratum: on the eight free-form math cells we test (GSM8K and MATH across four open-weight models), 10.3-22.9% of the examples that no sampling seed solves in six tries are instead solved at matched compute by a six-chain deterministic regime. These are greedy decoding plus five cheap residual-stream perturbations applied via activation grafting, while greedy alone solves at most 6% on these math cells. Recovery scales with the additional budget, across perturbations whose mechanistic distinctness we verify across all twelve cells (cross-kind fix-set Jaccard <= 0.47 in every setup). Activation grafting is used as an intervention on internal representations, not a decoding method; we use it purely as a diagnostic and diversification tool, and our recovered items show that the pass@k = 0% stratum is structurally identifiable in the residual stream rather than that the unmodified model reaches them under ordinary inference.
Primary: Sapienza University of Rome
All Institutions: Sapienza University of Rome, Not Diamond, Paradigma
This paper presents a critical diagnostic study revealing a "sampling blind spot" in pass@k-based difficulty estimation for math reasoning tasks. Through rigorous experimentation across multiple LLMs and benchmarks, it demonstrates that a significant fraction (10-29%) of problems deemed "hardest" by sampling are, in fact, solvable by the same model under a matched-compute deterministic regime using activation grafting, challenging a fundamental assumption in LLM evaluation and data curation.
The paper introduces a novel diagnostic methodology to investigate the "sampling blind spot" in pass@k-based difficulty estimation for math reasoning tasks. The core idea is to use activation grafting as an intervention on internal representations, rather than a decoding method, to explore deterministic trajectories that are distinct from standard stochastic sampling. The methodology is sound and well-justified: 1. **Problem Framing**: Clearly defines the pass@k=0 stratum as the target for investigation, representing examples deemed "hardest" by sampling-only methods. 2. **Activation Grafting as Diagnostic**: This is a key methodological strength. By replacing the last prompt-token hidden state with various cheap synthetic vectors (zero, random, BOS-token, etc.), the authors create distinct deterministic decoding paths. This allows them to probe whether the model *can* solve these problems if its internal state is slightly perturbed, without changing the model's weights or the decoding algorithm (greedy). 3. **Matched Compute**: The comparison is rigorously set up by matching the number of forward passes between the sampling regime (k samples) and the deterministic regime (greedy + k-1 grafts). This ensures a fair comparison of "compute budget." 4. **Mechanistic Distinctness**: The authors meticulously verify that the different graft types indeed lead to mechanistically distinct trajectories, as evidenced by low cross-kind fix-set Jaccard similarity and analysis of hidden-state divergence. This is crucial for arguing that the deterministic regime explores genuinely different solution paths, not just noisy variations of one. 5. **Robustness Checks**: The methodology includes checks for robustness against sampling temperature and layer selection for grafting, further strengthening the claims.
The experimental evaluation is comprehensive and robust, covering a good range of models and benchmarks. 1. **Models and Benchmarks**: Experiments are conducted on four open-weight instruction-tuned models (Qwen-2.5-3B, Llama-3.2-3B, Llama-3.1-8B, Mistral-Nemo-12B) and three reasoning benchmarks (GSM8K, MATH, MMLU-Pro). This breadth demonstrates the generality of the findings across different model sizes and reasoning domains. The focus on free-form math (GSM8K, MATH) where the effect is largest is appropriate. 2. **Key Findings**: * **Greedy Competitiveness**: Shows that greedy decoding can be competitive or even better than single-sample accuracy, challenging a common assumption. * **Persistent Blind Spot**: Demonstrates that the pass@k=0 stratum is substantial (5.1-43.5% of prompts at k=6) and persists even with additional samples, indicating it's not an artifact of undersampling. * **Deterministic Recovery**: The central finding is that a six-chain deterministic regime (greedy + five grafts) recovers 10.3-22.9% of the pass@k=0 examples on free-form math cells (10-29% across all 12 cells). This is a significant fraction of items previously deemed "unsolvable." * **Scaling and Diversity**: Recovery scales with the deterministic budget, and the distinctness of grafts (low Jaccard index) confirms that different grafts probe different subsets of the problem space. 3. **Mechanistic Analysis**: Detailed analysis of hidden-state divergence and attention weight changes provides strong evidence that grafts inject content vectors that propagate through the residual stream, rather than merely rerouting attention. This supports the claim of distinct mechanistic axes. 4. **Practical Utility**: Two deployable recipes are presented: a matched-cost substitution (replacing one sample with an `avg` graft for better coverage) and a label-free curation flag (using chain disagreement to identify recoverable items). These demonstrate direct applicability of the diagnostic insights. 
5. **Quantitative Rigor**: Results are presented with clear percentages and absolute counts, and statistical significance is implicitly supported by the consistent trends across many setups. The Jaccard similarity metric is well-chosen to quantify fix-set diversity.
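The bookkeeping behind the reported quantities (the pass@k=0 stratum, the deterministic recovery rate, and the cross-kind fix-set Jaccard) can be sketched on toy data. The function names, the chain labels, and the correctness flags below are all hypothetical; only the definitions follow the review's description.

```python
def pass_at_k_zero(sample_correct):
    """Indices of examples where none of the k sampled chains is correct."""
    return {i for i, row in enumerate(sample_correct) if not any(row)}


def fix_sets(stratum, chain_correct):
    """For each deterministic chain, the stratum items that chain solves."""
    return {
        name: {i for i in stratum if flags[i]}
        for name, flags in chain_correct.items()
    }


def jaccard(a, b):
    """Jaccard similarity of two sets, defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recovery_rate(stratum, fixes):
    """Fraction of the stratum solved by at least one deterministic chain."""
    recovered = set().union(*fixes.values())
    return len(recovered) / len(stratum) if stratum else 0.0


# Toy data: 6 examples, k = 3 sampled chains each (made-up flags).
sample_correct = [
    [True, False, False],
    [False, False, False],
    [False, False, False],
    [False, True, True],
    [False, False, False],
    [False, False, False],
]
# One correctness flag per example for each deterministic chain (made up).
chain_correct = {
    "greedy": [True, False, False, True, False, False],
    "zero": [False, True, False, False, False, False],
    "random": [False, True, True, False, False, False],
}

stratum = pass_at_k_zero(sample_correct)
fixes = fix_sets(stratum, chain_correct)
print("pass@k=0 stratum:", sorted(stratum))
print("recovery rate:", recovery_rate(stratum, fixes))
print("Jaccard(zero, random):", jaccard(fixes["zero"], fixes["random"]))
```

A low Jaccard value between two graft kinds means they recover largely different items, which is the sense in which the review describes the perturbations as probing different subsets of the problem space.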
The paper provides a good level of detail for reproducibility: * **Models and Benchmarks**: Specific models and benchmarks are named. * **Decoding Parameters**: Sampling temperature (T=0.7, p_top=0.9) and max_new_tokens are specified. * **Grafting Details**: The layer (26) and position (last prompt token) for grafting are fixed, and the types of graft vectors (zero, random, BOS-token, etc.) are described. The process of applying grafts via `register_forward_hook` is mentioned. * **Compute Matching**: The definition of "matched compute" is clear. * **Worked Examples**: Concrete examples of recovered items with explanations of where sampling failed and grafts succeeded are provided, aiding understanding. While the exact code for activation grafting and evaluation is not provided in the paper text, the methodological descriptions are sufficiently detailed for an experienced researcher to reimplement the experiments.
The authors acknowledge several limitations, and the review aligns with them: 1. **Scope of Models/Benchmarks**: The study covers 3B-12B open-weight models and three reasoning benchmarks. While substantial, it doesn't cover larger frontier models or other domains (e.g., code generation, creative writing). 2. **Unreached vs. Intrinsically Hard**: Even with 8 deterministic chains, 66-88% of the pass@k=0 stratum remains unreached. These could be genuinely hard or reachable by other diversity axes not probed. The paper is careful not to overstate the "easy" claim for all unreached items. 3. **Point Estimates**: The recovery rates are point estimates without bootstrap confidence intervals or repeated-seed-set variance, which means per-cell rates should be interpreted with some uncertainty, though the qualitative direction holds. 4. **Label-Free Identification Precision**: The label-free probe for identifying recoverable items works well on free-form math but degrades on multiple-choice benchmarks due to chance agreement. A calibrated precision guarantee would require a small labeled dev set. 5. **Small Strata Noise**: For very small pass@k=0 strata (e.g., 51-58 examples), absolute recovery counts are small, leading to a higher noise floor in recovery rates. 6. **Reverse Asymmetry**: The paper explicitly states it does not quantify items the deterministic regime misses but sampling reaches, focusing only on the direction relevant to auditing current pass@k practices. This is a reasonable scope choice but still a limitation for a complete picture of decoding regime differences.
This paper has significant broader impact across several areas of ML research and practice:
1. **LLM Evaluation**: It fundamentally challenges the reliance on pass@k as the sole or primary signal for per-example difficulty, especially in reasoning tasks. This could lead to more nuanced and robust difficulty estimation methods for benchmarks.
2. **Data Curation and Synthetic Curricula**: Pipelines that filter out or downweight problems based on pass@k=0 (e.g., for RL with verifiable rewards, math data curation, synthetic curricula) are shown to discard a non-trivial fraction of problems that the model *can* solve. This implies wasted compute and potentially biased datasets, leading to less effective training.
3. **Verifier and Reward Model Training**: Datasets for training verifiers and reward models, built from sampled-chain correctness, will inherit this blind spot. Items that are solvable deterministically but missed by sampling contribute only negative examples, potentially misguiding the verifier.
4. **Interpretability and Mechanistic Understanding**: The use of activation grafting as a diagnostic tool provides a concrete method for probing the "reachability" of solutions within the residual stream, offering insights into how internal representations can be perturbed to unlock different behaviors.
5. **Resource Efficiency**: By identifying that a fraction of "hard" problems are merely "unreached," the paper suggests that auditing these items with cheap deterministic perturbations can improve data quality and reduce the need for generating more samples or discarding valuable data.
The impact statement correctly notes that this is a diagnostic study, not a new inference method, and does not pose dual-use risks. Its primary benefit is to improve the rigor and efficiency of LLM development and evaluation.
This paper presents a critical diagnostic study revealing a "sampling blind spot" in pass@k-based difficulty estimation for math reasoning tasks. Through rigorous experimentation across multiple LLMs and benchmarks, it demonstrates that a significant fraction (10-29%) of problems deemed "hardest" by sampling are, in fact, solvable by the same model under a matched-compute deterministic regime using activation grafting, challenging a fundamental assumption in LLM evaluation and data curation.
Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary -- where successes and failures are roughly balanced -- contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.
Primary: Ant Group
All Institutions: Ant Group, Inclusion AI, Shanghai Innovation Institute, Westlake University, Zhejiang University
This paper presents a highly impactful and efficient method for online data synthesis in multi-turn tool-use agents, demonstrating that strategic, variance-driven data curation can drastically reduce the sample complexity of RL training while maintaining or improving performance.
The paper introduces RODS (Reward-Driven Online Data Synthesis), a method designed to address the "sample depletion" problem in multi-turn tool-use reinforcement learning (RL). The core theoretical insight leverages Popoviciu’s inequality to argue that GRPO (Group Relative Policy Optimization) gradients are dominated by samples with high reward variance—specifically, those near the agent's current capability boundary. RODS operationalizes this by using the variance of progress rewards from existing rollouts as a zero-cost proxy for identifying these "boundary" samples. It then synthesizes new, structurally similar multi-turn trajectories to replenish the training pool. The methodology is elegant in its simplicity: it repurposes existing rollout data for data curation without requiring additional expensive inference passes or complex reward models. The approach effectively closes the loop between policy improvement and data generation, maintaining a dynamic replay buffer that co-evolves with the policy.
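The variance argument can be illustrated with a minimal sketch. Popoviciu's inequality bounds the variance of any quantity confined to an interval by a quarter of the squared interval width, so rewards in [0, 1] have variance at most 0.25, reached when successes and failures are evenly split. The function names, the [0, 1] reward range, and the 0.5 selection threshold below are assumptions made for illustration and are not taken from the RODS code.

```python
from statistics import pvariance


def popoviciu_bound(r_min: float = 0.0, r_max: float = 1.0) -> float:
    """Upper bound on the variance of any reward confined to [r_min, r_max]."""
    return (r_max - r_min) ** 2 / 4.0


def boundary_scores(rollout_rewards: dict[str, list[float]]) -> dict[str, float]:
    """Rollout reward variance per task, normalized by the bound to lie in [0, 1]."""
    bound = popoviciu_bound()
    return {task: pvariance(rs) / bound for task, rs in rollout_rewards.items()}


def select_boundary_tasks(rollout_rewards: dict[str, list[float]],
                          min_score: float = 0.5) -> list[str]:
    """Tasks whose normalized variance meets a (hypothetical) threshold, highest first."""
    scores = boundary_scores(rollout_rewards)
    picked = [task for task, score in scores.items() if score >= min_score]
    return sorted(picked, key=lambda task: -scores[task])


# Rewards from rollouts that training has already produced (illustrative values).
example = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],  # zero variance: no learning signal
    "balanced": [1.0, 0.0, 1.0, 0.0],       # variance hits the bound of 0.25
    "mostly_solved": [1.0, 1.0, 1.0, 0.0],
    "always_failed": [0.0, 0.0, 0.0, 0.0],  # zero variance: no learning signal
}
print(select_boundary_tasks(example))
```

Because the scores reuse rewards from rollouts that were computed anyway, ranking tasks this way adds no inference cost, which is the sense in which the detector is described as zero-cost.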
The empirical evaluation demonstrates that RODS achieves performance comparable to a large-scale (17K sample) offline training pipeline while using only ~800 active samples and 400 human seeds. This represents a significant efficiency gain (roughly 20x fewer trajectories). The paper compares RODS against fixed-data RL baselines and environment augmentation techniques, showing consistent improvements. The results suggest that the quality and strategic selection of training data (via boundary detection) are more critical than sheer volume in the context of tool-use agents. The controlled setting validates the hypothesis that static datasets become uninformative as the policy improves, and that online synthesis of boundary samples mitigates this degradation.
The authors provide a GitHub repository link (https://github.com/inclusionAI/AWorld-RL/tree/main/RODS) and model weights (HuggingFace), which strongly supports reproducibility. The method relies on standard RL components (GRPO, rollouts) and a clear algorithmic step for data synthesis, making it relatively straightforward to implement for other researchers. The use of open-source models (Qwen3-4B) further aids in independent verification.
The primary limitation is the dependency on the quality of the "skill-aligned resampling pipeline." If the mechanism for synthesizing new variants fails to preserve the structural complexity or semantic validity of the original boundary samples, the benefits may diminish. Additionally, the approach assumes that reward variance is a reliable indicator of the capability boundary, which might not hold in all environments with sparse or noisy rewards. The evaluation is currently limited to specific tool-use benchmarks; generalization to other multi-turn decision-making tasks (e.g., complex reasoning without tools) is not fully explored. The "zero-cost" claim is relative; while it doesn't require extra *inference* for reward modeling, the synthesis step does require computational resources.
This work has significant implications for making RL for LLMs more scalable and cost-effective. By reducing the dependency on massive static datasets and expensive data collection pipelines, RODS lowers the barrier to entry for training capable agents. It shifts the focus from data quantity to data *stratification* and *dynamism*. This could accelerate the development of autonomous agents in resource-constrained settings. However, as with all RL methods, there are risks related to reward hacking or over-optimization on specific synthetic patterns, which should be monitored in broader deployments.
Long-context reasoning is an essential capability for large language models, particularly when they are deployed as autonomous agents that must reason over lengthy trajectories. Reinforcement learning (RL) has recently emerged as a dominant paradigm for improving this ability, yet existing work largely focuses on reward engineering while diverse training data remains scarce. We revisit this problem from a data-centric perspective and show that a simple yet effective data recipe alone, paired with a minimal outcome-based GRPO setup, suffices to substantially improve long-context reasoning. Our recipe targets three complementary task families -- retrieval, multi-evidence synthesis, and reasoning -- for which we construct and curate eight datasets totaling ~14K examples. Experiments on three models (Qwen3-4B/8B/30B-A3B) yield average gains of +7.2/+3.2/+6.4 points across seven long-context benchmarks, surpassing prior RL training sets. We further demonstrate that these gains transfer to agentic tasks, where continuing RL training on an agent-tuned model with our data recipe improves GAIA by +4.8 and BrowseComp by +7.0 points. We will release our datasets to facilitate future research.
Primary: Tsinghua University
All Institutions: Tsinghua University
This paper presents a compelling data-centric solution to long-context reinforcement learning, demonstrating that a carefully curated mixture of retrieval, synthesis, and reasoning tasks can significantly enhance model performance without complex reward engineering. The rigorous evaluation and transfer to agentic tasks make it a valuable contribution to the field, with high potential for adoption and further research.
The paper proposes a data-centric approach to long-context reinforcement learning (RL), arguing that diverse, high-quality training data is more critical than complex reward engineering. The core methodological contribution is a curated mixture of eight datasets (~14K examples) spanning three complementary task families: Retrieval (FuzzyNeedle, MultiNeedle), Multi-evidence Synthesis (CrossEntity, WebSearch, MultiQuery, KeyChain, LongDocQA), and Reasoning (LongMath). The authors employ a minimal outcome-based GRPO setup, demonstrating that this specific data recipe yields significant gains without auxiliary process rewards. The methodology is sound, leveraging synthetic data generation guided by LLMs to create "hard" samples that target specific failure modes of current long-context models (e.g., lexical shortcuts, incomplete coverage). The ablation studies effectively isolate the contribution of each task family, providing strong empirical support for the hypothesis that these three abilities are complementary and necessary for robust long-context reasoning.
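As a rough illustration of what an outcome-only GRPO signal looks like, the sketch below scores each sampled answer with a binary reward and converts the group's rewards into group-relative advantages (reward minus group mean, divided by group standard deviation). The exact-match reward, the function names, and the `eps` constant are assumptions for illustration; the paper's actual reward extraction and training code may differ.

```python
from statistics import mean, pstdev


def outcome_reward(prediction: str, reference: str) -> float:
    """Binary outcome reward: 1.0 on a normalized exact match, else 0.0 (assumed rule)."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled answers, reference answer "42" (illustrative values).
samples = ["42", " 42 ", "41", "forty-two"]
rewards = [outcome_reward(s, "42") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

No per-step or process reward appears anywhere: the only supervision is whether the final answer is correct, and a group in which every answer receives the same reward yields all-zero advantages.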
The experimental evaluation is comprehensive and rigorous. The authors test their method on three Qwen3 model variants (4B, 8B, 30B-A3B) across seven long-context benchmarks, including multi-hop QA, holistic reasoning, and synthetic reasoning tasks. The results show consistent improvements over base models and prior RL training sets (DocQA-RL, KeyChain). Notably, the gains transfer to agentic tasks (GAIA, BrowseComp), suggesting broader utility. The evaluation also includes an analysis of generalization to contexts longer than the training distribution (up to 230K tokens), which is a crucial and impressive finding. The ablation studies on task balancing and reward design further strengthen the validity of the claims. The use of LLM-as-a-judge for certain metrics is noted, but the consistency of results across different evaluation protocols mitigates concerns.
The paper provides detailed descriptions of the data construction pipelines, including the specific datasets used, the synthetic generation prompts (implied by the description), and the RL training hyperparameters (GRPO, batch sizes, learning rates). The authors commit to releasing the datasets, which is a significant positive factor for reproducibility. The training setup is described in sufficient detail for replication by other researchers with similar computational resources. The use of standard frameworks (Miles, Megatron-LM, SGLang) also aids reproducibility.
The primary limitation is the scale of the training data (~14K examples), which is small compared to the pre-training data of the base models. While effective for RL fine-tuning, the generalizability to even larger models or different model families is not fully explored. The synthetic nature of most datasets raises questions about their alignment with real-world long-context distributions, although the transfer to agentic tasks suggests some degree of realism. The reliance on LLM-as-a-judge for evaluation and reward calculation introduces potential biases, although the authors attempt to mitigate this with rule-based extraction where possible. The computational cost, while manageable for a research lab, is non-trivial.
This work has significant implications for the development of long-context LLMs and autonomous agents. By demonstrating that a simple data recipe can outperform complex reward engineering, it shifts the focus of the community towards data curation and quality. The release of the datasets will facilitate further research in this area. The transfer to agentic tasks highlights the potential for improving real-world AI systems that operate in long-context environments. However, the reliance on synthetic data and LLM judges warrants careful consideration regarding bias and robustness.
Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.
Primary: Duke University
All Institutions: Duke University
This paper introduces Temporal Self-Imitation Learning (TSIL), a novel reinforcement learning framework that leverages temporal efficiency as a self-supervisory signal to improve long-horizon robot manipulation. The method's innovative use of configuration-conditioned adaptive temporal targets and efficiency-weighted self-imitation learning, coupled with extensive empirical validation across 15 distinct tasks, demonstrates significant improvements in learning efficiency, task-completion efficiency, and robustness, positioning it as a highly impactful contribution to the field of robotics and reinforcement learning.
The proposed Temporal Self-Imitation Learning (TSIL) framework presents a well-conceived approach to address critical challenges in long-horizon robot manipulation: inefficient reward exploitation and the forgetting of rare, efficient behaviors. TSIL's core innovation lies in leveraging temporal efficiency itself as a self-supervisory signal. This is achieved through two main mechanisms:
1. **Configuration-conditioned adaptive temporal targets:** Instead of relying on static reward shaping, TSIL dynamically derives temporal targets from the fastest successful trajectories observed so far, conditioned on the current state (configuration). This makes the learning targets progressively more challenging and context-aware, pushing the policy towards increasingly efficient solutions. This adaptive mechanism is a significant improvement over fixed reward functions, which can often be exploited or become suboptimal as the policy improves.
2. **Efficiency-weighted self-imitation learning:** TSIL explicitly preserves and replays these fast, successful behaviors. By weighting the imitation loss based on the temporal efficiency of past trajectories, it prioritizes learning from the most optimal experiences. This directly combats the problem of catastrophic forgetting of rare but highly effective actions, ensuring that the policy continuously refines its understanding of efficient pathways.
The methodology is coherent, directly targets known limitations of existing RL approaches in complex robotic tasks, and offers a scalable way to generate self-supervision.
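The two mechanisms can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the class name, the use of episode step counts as the efficiency measure, the discrete `config_key`, and the exponential weighting with a `temperature` parameter are all assumptions introduced here.

```python
import math
from collections import defaultdict


class EfficientTrajectoryBuffer:
    """Stores successful episodes per configuration and tracks the fastest one."""

    def __init__(self, temperature: float = 10.0):
        self.temperature = temperature       # assumed softness of the weighting
        self.best_steps = {}                 # config_key -> fewest steps seen so far
        self.episodes = defaultdict(list)    # config_key -> [(steps, trajectory), ...]

    def add_success(self, config_key: str, steps: int, trajectory: list) -> None:
        """Record a successful episode and tighten the target if it is faster."""
        self.episodes[config_key].append((steps, trajectory))
        best = self.best_steps.get(config_key)
        if best is None or steps < best:
            self.best_steps[config_key] = steps

    def temporal_target(self, config_key: str):
        """Adaptive target: the fastest success so far, or None if none exists yet."""
        return self.best_steps.get(config_key)

    def imitation_weights(self, config_key: str) -> list[float]:
        """Normalized weights that favor episodes close to the fastest success."""
        stored = self.episodes.get(config_key, [])
        if not stored:
            return []
        best = self.best_steps[config_key]
        raw = [math.exp(-(steps - best) / self.temperature) for steps, _ in stored]
        total = sum(raw)
        return [w / total for w in raw]


buffer = EfficientTrajectoryBuffer()
for steps in (120, 80, 100):  # illustrative episode lengths for one configuration
    buffer.add_success("drawer_left", steps, trajectory=[])
print(buffer.temporal_target("drawer_left"), buffer.imitation_weights("drawer_left"))
```

The target only ever tightens as faster successes arrive, and slower successes are kept but down-weighted rather than discarded, which mirrors the preserve-and-replay idea described above.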
The experimental evaluation appears strong, with the paper reporting consistent improvements across 15 distinct long-horizon manipulation tasks. This breadth of evaluation is crucial for demonstrating the generalizability and robustness of the TSIL framework beyond specific, hand-picked scenarios. The metrics of interest (learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions) are all highly relevant for practical robot learning. The abstract's claim that TSIL "consistently improves" these metrics implies repeatable gains, which is a high bar for empirical success in this domain, although the abstract itself reports no effect sizes. If these claims hold, the empirical evidence strongly supports the method's effectiveness and practical utility, making it a significant contribution to the field.
The mention of a project URL (`https://generalroboticslab.com/TSIL`) is a strong positive indicator for reproducibility. Project pages often include code implementations, detailed experimental setups, datasets, and potentially pre-trained models or videos, which are essential for researchers to verify and build upon the work. The structured nature of the paper (Method, Experiments sections) also implies a detailed description of the algorithm and experimental protocols.
While the paper presents a very strong case, potential limitations might include:
1. **Initial Success Requirement:** TSIL relies on mining "fast successful trajectories." If initial task success is extremely rare or non-existent, the method might struggle to bootstrap.
2. **Computational Overhead:** Mining, storing, and adaptively managing a growing set of efficient trajectories, especially in high-dimensional state spaces, could introduce computational overhead.
3. **Definition of "Configuration-conditioned":** The complexity of defining and implementing "configuration-conditioned" targets might vary significantly with the task and state representation, potentially requiring careful engineering.
4. **Generalizability beyond temporal efficiency:** While temporal efficiency is critical, some tasks might have other primary optimization criteria (e.g., energy consumption, safety, precision) that TSIL, in its current form, might not directly optimize.
TSIL offers a powerful paradigm shift in how self-supervision can be generated for reinforcement learning, moving beyond manually engineered reward shaping. By demonstrating that the temporal structure of successful behavior itself provides a scalable self-supervisory signal, it opens new avenues for designing more efficient, robust, and autonomous learning systems for complex robotic tasks. This could significantly reduce the burden of reward engineering, accelerate policy learning, and lead to more dexterous and capable robots in real-world applications. The principles could also inspire similar approaches in other long-horizon decision-making problems beyond robotics.