Last 7 Days (June 18 – June 24, 2026)
Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
Primary: Max Planck Institute for Human Development
All Institutions: Max Planck Institute for Human Development, University of Konstanz, Barcelona Supercomputing Center, University of Basel
This paper provides a critical psychometric analysis demonstrating that apparent psychological profiles in LLMs are largely artifacts of response bias rather than genuine traits, fundamentally challenging current evaluation practices and calling for more rigorous, orthogonal measurement frameworks in AI research.
The paper uses a formal psychometric framework to deconstruct the measurement of LLM "personalities." It models each response as a function of a latent trait and a directional response bias, and introduces "response orthogonality": the proportion of items for which trait and bias point in opposite directions, as with reverse-keyed items under a bias toward agreement. This gives a principled way to separate genuine trait variance from systematic response artifacts, and it directly addresses a fundamental confound in current LLM evaluation methodologies.
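To make the definition concrete, here is a minimal sketch of how response orthogonality could be computed for one instrument. It assumes each item is keyed either +1 (agreeing indicates more of the trait) or -1 (reverse-keyed) and that the bias has a single known direction. The function name and the keying convention are illustrative assumptions, not taken from the paper's code.

```python
from typing import Sequence


def response_orthogonality(item_keys: Sequence[int], bias_direction: int = 1) -> float:
    """Share of items whose trait keying opposes the response-bias direction.

    item_keys: +1 if a higher response means more of the trait, -1 if reverse-keyed.
    bias_direction: +1 for a drift toward the high end of the scale, -1 for the low end.
    Both conventions are assumptions made for this sketch.
    """
    if not item_keys:
        raise ValueError("item_keys must not be empty")
    if bias_direction not in (1, -1):
        raise ValueError("bias_direction must be +1 or -1")
    # An item is "orthogonal" here when trait and bias pull the response in opposite directions.
    opposed = sum(1 for key in item_keys if key * bias_direction < 0)
    return opposed / len(item_keys)
```

Under these assumptions, an instrument with no reverse-keyed items scores 0.0 for an agreement-biased respondent, so bias and trait cannot be told apart on it, while a balanced instrument scores 0.5.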
The experimental design is comprehensive, testing 56 instruction-tuned LLMs against a battery of 29 instruments (personality and risk preference) and comparing them against large human reference samples. The results are striking and consistent: LLMs show positive forward-reverse correlations (indicating bias dominance) whereas humans show negative correlations (indicating trait dominance). The variance decomposition showing 81-90% of LLM variation is bias-driven is a powerful empirical finding. The robustness checks across prompting conditions, model sizes, and elicitation methods strengthen the validity of the conclusions.
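The sign pattern described above can be reproduced with a small simulation. This is a sketch under an assumed additive model (raw forward-keyed score = trait + bias + noise, raw reverse-keyed score = -trait + bias + noise, before any reverse scoring). The model, the parameter values, and the function names are assumptions for illustration, not the paper's estimation procedure.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation for two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def forward_reverse_correlation(sd_trait, sd_bias, n_respondents=500, sd_noise=0.5, seed=0):
    """Correlate raw forward-keyed and reverse-keyed scores across simulated respondents."""
    rng = random.Random(seed)
    forward, reverse = [], []
    for _ in range(n_respondents):
        trait = rng.gauss(0.0, sd_trait)
        bias = rng.gauss(0.0, sd_bias)
        # The trait pushes the two item types apart; the bias pushes them the same way.
        forward.append(trait + bias + rng.gauss(0.0, sd_noise))
        reverse.append(-trait + bias + rng.gauss(0.0, sd_noise))
    return pearson(forward, reverse)
```

With `sd_bias` much larger than `sd_trait` the correlation comes out positive (the pattern reported for LLMs), and with the ratio flipped it comes out negative (the pattern reported for humans).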
The paper provides detailed methodological descriptions, including the specific instruments used, the prompting strategies, and the mathematical derivations for the measurement model. The code repository is explicitly linked, ensuring high reproducibility. The use of standard psychometric instruments and clear definitions of variables enhances the clarity of the experimental setup.
The study focuses primarily on post-trained models and treats each model as a single respondent, which may not capture within-model variability or the nuances of different prompting strategies (e.g., persona adoption). The proprietary model sample size is small (N=10), limiting statistical power for those specific comparisons. Additionally, the study is limited to personality and risk domains; while the authors argue for broader applicability, empirical validation in other domains (e.g., moral reasoning, cognitive biases) is left for future work.
This paper has significant implications for the field of AI safety, alignment, and the use of LLMs as proxies for human participants in research. It challenges the validity of current LLM profiling practices and calls for a re-evaluation of how we measure and interpret LLM behaviors. The concept of response orthogonality offers a new standard for designing valid evaluation instruments for AI systems.
Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.
Primary: Zhejiang University
All Institutions: Zhejiang University, Huawei, Harbin Institute of Technology, Shenzhen
SafeSpec has a significant positive broader impact by enabling the deployment of safer and more efficient large language models. By addressing the fundamental incompatibility between speculative inference and existing safety defenses, it paves the way for LLMs to be used in more sensitive and performance-critical applications. This framework can help mitigate the risks of harmful content generation, thereby increasing public trust and responsible AI deployment. The approach of recovering safe continuations rather than hard refusal also improves user experience and model helpfulness. The paper does not highlight any specific negative societal consequences beyond the general risks associated with LLMs, which its work aims to mitigate. SafeSpec introduces a novel safety-aware speculative inference framework that elegantly integrates risk estimation and recovery directly into the LLM decoding process. This work makes a substantial technical contribution by demonstrating a superior safety-efficiency trade-off through a lightweight latent safety head and a dynamic rollback-and-reflect multi-sampling mechanism, offering a practical and impactful solution for deploying safer and faster LLMs in real-world applications.
The methodology of SafeSpec is well-conceived and addresses a critical gap in LLM deployment: integrating safety guarantees into speculative inference without negating its acceleration benefits. The core innovation lies in its dual-head verification mechanism. By attaching a lightweight, boundary-aligned latent safety head to the target model, SafeSpec enables simultaneous assessment of semantic validity and safety in a single forward pass. This design is elegant as it leverages the target model's existing computation for quality scoring, incurring negligible additional overhead for safety checks. The boundary-aligned extraction of hidden states for the safety head is a clever detail, preventing interference from the quality scoring prompt. The training methodology for the safety head, using step-wise prefix construction and a guard model for labeling, is sound for aligning the head with the inference process. The "rollback-and-reflect" mechanism, coupled with safety-guided multi-sampling, is a significant departure from traditional hard refusal strategies. Framing jailbreak attacks as distributional shifts where harmful continuations become more probable but safe ones are not entirely eliminated provides a strong theoretical underpinning for the multi-sampling approach. The rollback to a previous, potentially "cleaner" state, combined with a reflection prompt, effectively reshapes the sampling space, increasing the probability of finding a safe continuation. This soft intervention strategy is crucial for maintaining utility and helpfulness, avoiding the common pitfall of over-refusal. The probabilistic view of multi-sampling is clearly articulated, demonstrating how increasing sample size $K$ improves the chance of recovery.
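The probabilistic argument for multi-sampling can be illustrated with a short calculation. The sketch below assumes that, after rollback and reflection, each of the $K$ samples is independently safe with the same probability `p_safe`. That independence assumption and the function names are ours, and the paper's exact formulation may differ.

```python
import math


def recovery_probability(p_safe: float, k: int) -> float:
    """Probability that at least one of k independent samples is safe."""
    if not 0.0 <= p_safe <= 1.0:
        raise ValueError("p_safe must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_safe) ** k


def samples_needed(p_safe: float, target: float) -> int:
    """Smallest k whose recovery probability reaches the target (illustrative helper)."""
    if not 0.0 < p_safe <= 1.0:
        raise ValueError("p_safe must lie in (0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie in (0, 1)")
    if p_safe == 1.0:
        return 1
    return max(1, math.ceil(math.log(1.0 - target) / math.log(1.0 - p_safe)))
```

Under this model, even when an attack leaves only a modest per-sample chance of a safe continuation, a handful of samples raises the recovery probability quickly, which is the intuition behind preferring recovery to hard refusal.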
The experimental evaluation is comprehensive and rigorous. The authors use two distinct model families (Qwen3-32B and DeepSeek-R1-Distill-Llama-70B) with appropriate draft models, demonstrating the framework's scalability and versatility. Evaluation metrics cover three critical dimensions: defense against seven advanced adversarial attacks (ASR), over-refusal rates (XSTest), and general capabilities/efficiency (GSM8K, MATH, GPQA-diamond, and inference speedup). This multi-faceted evaluation provides a holistic view of SafeSpec's performance. SafeSpec consistently achieves state-of-the-art defense performance, significantly reducing ASR (e.g., 15% on Qwen3-32B) while preserving substantial inference speedups on benign workloads (2.06x on Qwen3-32B, 1.76x on DeepSeek-70B). Crucially, it maintains low over-refusal rates and negligible accuracy degradation on general reasoning tasks, showcasing a superior safety-efficiency trade-off compared to strong baselines like SafeDecoding and SecDecoding. The ablation studies are well-designed, clearly demonstrating the necessity and synergistic effect of both the reflection prompt and multi-sampling. The comparison with a hard refusal strategy effectively highlights the benefits of SafeSpec's recovery mechanism. Hyperparameter analysis provides valuable insights into the trade-offs involved with sample size, safety threshold, and quality threshold. The detailed latency breakdown in the appendix is particularly insightful, transparently explaining the performance characteristics on benign vs. adversarial inputs and justifying the reduced throughput on jailbreak inputs as a feature of the defense. The comparison with a standalone guard model further validates SafeSpec's efficiency and user experience advantages.
The paper demonstrates good reproducibility. Code is made available on GitHub. The appendix provides detailed information on evaluation datasets, jailbreak prompt construction, quality scoring prompt, safety head configurations (architecture, parameter counts), and training setup (data sources, sampling, hyperparameters, data isolation). Layer choice ablation and per-benchmark sensitivity analysis for quality threshold are also included, providing further confidence in the design choices. The use of a fixed random seed is also mentioned.
1. **Reliance on Guard Model for Labeling**: The training data for the safety head is labeled using Qwen3Guard-Gen-8B. The performance and biases of this external guard model could implicitly limit the safety head's effectiveness and generalization, especially if the guard model itself is imperfect or susceptible to certain attacks.
2. **Heuristic Nature of Reflection Prompt**: While effective, the reflection prompt is a handcrafted heuristic. Its optimal design might be sensitive to the target model or specific attack types, and its generalizability across all future attacks is not guaranteed.
3. **Performance on Adversarial Inputs**: Although justified as a necessary cost for safety, the significant slowdown on jailbreak inputs (throughput below 1x) means that if an attacker can consistently trigger Safety Mode, they can effectively degrade the system's performance, even if they don't get a harmful response. This could be a denial-of-service vector.
4. **Adversarial Attacks on Safety Head**: As the safety head is a lightweight classifier, it might be susceptible to direct adversarial attacks designed to bypass it, rather than just the main LLM. The paper does not explore this.
5. **Fixed Rollback State**: The rollback mechanism reverts to the "previous state." For deeply embedded or multi-turn attacks, a single step rollback might not always be sufficient to reach a truly "clean" context, potentially requiring more sophisticated context recovery.
Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.
Primary: Duke University
All Institutions: Duke University
TSIL offers a powerful paradigm shift in how self-supervision can be generated for reinforcement learning, moving beyond manually engineered reward shaping. By demonstrating that the temporal structure of successful behavior itself provides a scalable self-supervisory signal, it opens new avenues for designing more efficient, robust, and autonomous learning systems for complex robotic tasks. This could significantly reduce the burden of reward engineering, accelerate policy learning, and lead to more dexterous and capable robots in real-world applications. The principles could also inspire similar approaches in other long-horizon decision-making problems beyond robotics. This paper introduces Temporal Self-Imitation Learning (TSIL), a novel reinforcement learning framework that leverages temporal efficiency as a self-supervisory signal to improve long-horizon robot manipulation. The method's innovative use of configuration-conditioned adaptive temporal targets and efficiency-weighted self-imitation learning, coupled with extensive empirical validation across 15 distinct tasks, demonstrates significant improvements in learning efficiency, task-completion efficiency, and robustness, positioning it as a highly impactful contribution to the field of robotics and reinforcement learning.
The proposed Temporal Self-Imitation Learning (TSIL) framework presents a well-conceived approach to address critical challenges in long-horizon robot manipulation: inefficient reward exploitation and the forgetting of rare, efficient behaviors. TSIL's core innovation lies in leveraging temporal efficiency itself as a self-supervisory signal. This is achieved through two main mechanisms:
1. **Configuration-conditioned adaptive temporal targets:** Instead of relying on static reward shaping, TSIL dynamically derives temporal targets from the fastest successful trajectories observed so far, conditioned on the current state (configuration). This makes the learning targets progressively more challenging and context-aware, pushing the policy towards increasingly efficient solutions. This adaptive mechanism is a significant improvement over fixed reward functions, which can often be exploited or become suboptimal as the policy improves.
2. **Efficiency-weighted self-imitation learning:** TSIL explicitly preserves and replays these fast, successful behaviors. By weighting the imitation loss based on the temporal efficiency of past trajectories, it prioritizes learning from the most optimal experiences. This directly combats the problem of catastrophic forgetting of rare but highly effective actions, ensuring that the policy continuously refines its understanding of efficient pathways.
The methodology is coherent, directly targets known limitations of existing RL approaches in complex robotic tasks, and offers a scalable way to generate self-supervision.
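The two mechanisms can be sketched together as a small buffer that tracks the fastest success per configuration and weights stored trajectories by how close they come to that best time. This is an illustrative sketch only: the discrete configuration key, the exponential weighting, the `temperature` parameter, and the class name are all assumptions, and the paper's actual targets and loss are not specified in this digest.

```python
import math
from collections import defaultdict


class TemporalTargetBuffer:
    """Hypothetical store of successful trajectories, keyed by task configuration."""

    def __init__(self, temperature: float = 10.0):
        self.temperature = temperature       # assumed softness of the efficiency weighting
        self.best_steps = {}                 # config -> fewest steps of any success so far
        self.successes = defaultdict(list)   # config -> list of (steps, trajectory)

    def add_success(self, config, steps: int, trajectory) -> None:
        """Record a successful episode and tighten the temporal target if it is faster."""
        self.successes[config].append((steps, trajectory))
        best = self.best_steps.get(config)
        if best is None or steps < best:
            self.best_steps[config] = steps

    def temporal_target(self, config, default: int) -> int:
        """Adaptive target: the fastest success seen for this configuration, else a default."""
        return self.best_steps.get(config, default)

    def imitation_weights(self, config):
        """Normalized weights that favor trajectories closest to the fastest success."""
        entries = self.successes.get(config, [])
        if not entries:
            return []
        best = self.best_steps[config]
        raw = [math.exp(-(steps - best) / self.temperature) for steps, _ in entries]
        total = sum(raw)
        return [w / total for w in raw]
```

In a training loop, the weights would scale a per-trajectory imitation loss, so that a rare fast success keeps contributing to updates instead of being diluted by slower ones.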
The experimental evaluation appears strong: the abstract claims consistent improvements across "15 distinct long-horizon manipulation tasks." This breadth matters for demonstrating that the TSIL framework generalizes beyond specific, hand-picked scenarios. The metrics of interest (learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions) are all relevant to practical robot learning. The phrase "consistently improves" suggests repeatable gains, although the abstract itself reports no effect sizes or significance tests. If these claims hold, the empirical evidence supports the method's effectiveness and practical utility, making it a significant contribution to the field.
The mention of a project URL (`https://generalroboticslab.com/TSIL`) is a strong positive indicator for reproducibility. Project pages often include code implementations, detailed experimental setups, datasets, and potentially pre-trained models or videos, which are essential for researchers to verify and build upon the work. The structured nature of the paper (Method, Experiments sections) also implies a detailed description of the algorithm and experimental protocols.
While the paper presents a very strong case, potential limitations might include:
1. **Initial Success Requirement:** TSIL relies on mining "fast successful trajectories." If initial task success is extremely rare or non-existent, the method might struggle to bootstrap.
2. **Computational Overhead:** Mining, storing, and adaptively managing a growing set of efficient trajectories, especially in high-dimensional state spaces, could introduce computational overhead.
3. **Definition of "Configuration-conditioned":** The complexity of defining and implementing "configuration-conditioned" targets might vary significantly with the task and state representation, potentially requiring careful engineering.
4. **Generalizability beyond temporal efficiency:** While temporal efficiency is critical, some tasks might have other primary optimization criteria (e.g., energy consumption, safety, precision) that TSIL, in its current form, might not directly optimize.
Modern coding agents pair LLM generators with various tools, including cheap diagnostics and expensive verifiers. The tool-use decisions are typically governed by orchestrators that often use fixed rules and ignore uncertainty. We formulate orchestration as cost-sensitive sequential hypothesis testing: a Bayesian controller maintains a belief over candidate correctness and dynamically decides whether to gather more evidence, refine the candidate, verify it, or stop. Across six generators and nine coding benchmarks, Bayesian control proves to be most valuable when verification is costly and critics are informative but imperfect. Beyond control, the belief state yields an interpretable correctness score that outperforms token-probability and raw tool-success baselines for uncertainty quantification.
Primary: Google DeepMind
All Institutions: Google DeepMind
This paper has significant broader impact for the design and deployment of robust and efficient LLM-based agents. By introducing a principled, uncertainty-aware control mechanism, it moves beyond heuristic orchestration, paving the way for more reliable and cost-effective AI systems. The ability to dynamically adapt decisions based on evidence and cost considerations is crucial for real-world applications where resources (e.g., API calls, human review) are limited and errors are costly. The belief state's superior uncertainty quantification capability can lead to more trustworthy AI systems, allowing users to better understand the confidence level of an agent's output. This framework could be extended to other domains beyond coding, such as scientific discovery, complex problem-solving, or even general-purpose autonomous agents, fostering the development of more intelligent and adaptive AI. This paper introduces a novel Bayesian control framework for LLM coding agents, formulating orchestration as cost-sensitive sequential hypothesis testing to dynamically manage tool use and uncertainty. The methodology, grounded in decision theory, significantly outperforms heuristic baselines in terms of cost-efficiency and success rate, especially when verification is expensive, and provides a superior correctness score for uncertainty quantification, marking a substantial step towards more robust and principled LLM agent design.
The paper proposes a principled Bayesian control framework for orchestrating LLM-based coding agents, framing the problem as cost-sensitive sequential hypothesis testing. This is a significant departure from the prevalent heuristic-based orchestrators. The core of the methodology lies in maintaining a belief state—a probability distribution over the true correctness of the generated code—which is dynamically updated using Bayes' rule based on observations from various tools (diagnostics, verifiers). The decision policy is derived from a partially observable Markov decision process (POMDP) formulation, aiming to minimize expected costs associated with refinement, verification, and incorrect stopping. To make the POMDP tractable, the authors introduce practical simplifications, such as a fixed maximum number of refinement steps, allowing for a finite-horizon dynamic programming approach. The critic models (diagnostics and verifier) are characterized by their likelihoods, which are learned or estimated. A notable strength is the dual utility of the belief state: it not only guides optimal decision-making but also provides an interpretable correctness score for uncertainty quantification. The methodology is theoretically sound, drawing from established decision theory, and provides a robust, uncertainty-aware mechanism for agent control.
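The belief update at the heart of this framework can be sketched for a single binary critic. The code below assumes the critic is summarized by two pass rates (given correct code and given incorrect code) and pairs the update with a deliberately simple one-step decision rule. The rule, the `refine_below` threshold, and all names are illustrative assumptions, not the paper's finite-horizon dynamic program.

```python
def update_belief(prior: float, critic_passed: bool,
                  p_pass_given_correct: float, p_pass_given_incorrect: float) -> float:
    """Posterior probability that the candidate is correct after one critic observation."""
    if critic_passed:
        like_correct, like_incorrect = p_pass_given_correct, p_pass_given_incorrect
    else:
        like_correct, like_incorrect = 1.0 - p_pass_given_correct, 1.0 - p_pass_given_incorrect
    numerator = like_correct * prior
    evidence = numerator + like_incorrect * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("observation has zero probability under the critic model")
    return numerator / evidence


def choose_action(belief: float, verify_cost: float, wrong_stop_cost: float,
                  refine_below: float = 0.2) -> str:
    """Myopic rule (an assumption of this sketch): stop, refine, or verify."""
    # Stop when the expected loss from shipping a wrong answer is below the price of verifying.
    if (1.0 - belief) * wrong_stop_cost <= verify_cost:
        return "stop"
    # When the candidate is very likely wrong, refining beats paying for verification.
    if belief < refine_below:
        return "refine"
    return "verify"
```

The same posterior doubles as the interpretable correctness score mentioned above: it can be reported directly, whatever action the controller takes.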
The experimental evaluation is comprehensive and rigorous. The authors test their Bayesian control framework across a diverse set of six LLM generators (including GPT-3.5, GPT-4, Gemini 1.0 Pro, and open-source models like CodeLlama and StarCoder) and nine coding benchmarks (HumanEval, MBPP, and APPS at various difficulty levels). This broad coverage demonstrates the generalizability of the approach. Baselines include several fixed-rule orchestrators (e.g., "Always Refine," "Refine until pass," "Verify immediately") and uncertainty quantification methods (token probability, raw tool success). The results clearly show that Bayesian control consistently outperforms fixed-rule baselines, particularly when verification is costly and diagnostic critics are informative but imperfect. The value proposition of Bayesian control is shown to increase significantly with higher verification costs. Furthermore, the belief state's correctness score demonstrates superior performance in uncertainty quantification, achieving higher AUC scores than token probability and raw tool success in predicting code correctness. The experiments effectively validate the core hypotheses and highlight the conditions under which Bayesian control is most beneficial.
The paper provides a detailed appendix outlining the experimental setup, including specific LLM models, benchmarks, critic configurations, and hyper-parameters used for the Bayesian controller. This level of detail is commendable and greatly aids in understanding the experimental procedure. However, the paper states, "Our code and data are available at [anonymized for review]," indicating that the code is not publicly accessible at the time of review. While the detailed methodology and experimental setup provide a strong basis, the lack of publicly available code and data slightly hinders immediate, independent reproducibility. Should the code be released, the paper's reproducibility would be excellent.
The authors acknowledge several important limitations. The performance of the Bayesian controller is heavily dependent on the quality and accurate modeling of the critic likelihoods. If critics are unreliable or their characteristics are poorly estimated, the belief state and subsequent decisions may be suboptimal. The full POMDP formulation is computationally intractable, necessitating simplifications like a fixed maximum number of refinement steps, which might not always be optimal. The current framework assumes fixed costs for actions, which may not hold in dynamic real-world scenarios. The action space is also limited to "refine," "verify," and "stop," without considering more complex actions like re-planning or seeking human assistance. Finally, the work focuses specifically on coding agents, and its generalization to other LLM agent domains requires further investigation.
Long-context large language model (LLM) inference is increasingly constrained by the memory footprint and decoding cost of key-value (KV) caches, limiting sustainable deployment on resource-constrained hardware. Existing KV cache eviction methods typically apply heuristic token scoring over all heads in GQA-based LLMs. These methods ignore the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs. To address this issue, we propose CompressKV, a resource-efficient KV-cache compression framework for GQA-based LLMs. Instead of aggregating attention scores from all heads, CompressKV identifies Semantic Retrieval Heads (SRHs) that capture both the initial and final tokens of a prompt and semantically important mid-context evidence, and uses them to select tokens whose KV pairs should be retained. Furthermore, CompressKV allocates cache budgets across layers according to offline estimates of layer-wise eviction error. Experiments on LongBench and Needle-in-a-Haystack show that CompressKV consistently outperforms existing KV-cache eviction methods across memory budgets. Notably, it preserves over 97% of full-cache performance using only 3% of the KV cache on LongBench question-answering tasks and achieves 90% accuracy with just 0.7% KV storage on Needle-in-a-Haystack. These results demonstrate an improved resource-performance trade-off for long-context LLM inference. Our code is publicly available at: https://github.com/TUDa-HWAI/CompressKV
Primary: Technical University of Darmstadt
All Institutions: Technical University of Darmstadt, Darmstadt, Germany; University of Notre Dame, Notre Dame, IN, USA; Technical University of Ilmenau, Ilmenau, Germany
CompressKV makes a significant contribution towards enabling more resource-efficient and sustainable deployment of long-context LLMs, particularly on memory-constrained hardware. By substantially reducing the KV-cache memory footprint while preserving high performance, it can facilitate wider adoption of advanced LLMs in edge devices, mobile applications, or large-scale inference clusters where memory is a critical bottleneck. The principled approach to identifying semantically important tokens and allocating cache budgets could inspire further research into fine-grained attention head functionalities and more accurate error modeling for compression. Its demonstrated compatibility and additive benefits with other efficiency techniques suggest it can be a foundational component in a multi-faceted approach to LLM optimization, contributing to the overall goal of making powerful LLMs more accessible and cost-effective. CompressKV introduces a novel KV-cache compression framework for GQA-based LLMs, leveraging Semantic Retrieval Heads for robust token selection and an offline error-aware mechanism for layer-adaptive budget allocation, significantly improving the resource-performance trade-off for long-context inference. This paper presents a highly effective and well-validated approach to a critical problem in large language model inference, offering a principled and practical solution to the memory footprint of KV caches. The core innovations, including the span-aggregation-based Semantic Retrieval Heads and the offline Frobenius norm error-aware layer allocation, are well-motivated and address clear limitations of prior work. The experimental validation is exceptionally thorough, demonstrating consistent and significant performance gains across multiple LLMs and benchmarks, especially under tight memory constraints, and proving orthogonality with other efficiency techniques. 
This work has strong practical implications for deploying long-context LLMs more efficiently and sustainably.
CompressKV proposes a two-fold framework for KV-cache compression in GQA-based LLMs. The first key component is the identification and utilization of Semantic Retrieval Heads (SRHs) for token selection. Unlike prior methods that often aggregate attention scores across all heads or rely on strict top-k attention hits (Traditional Retrieval Heads), SRHs are identified by aggregating attention mass over the *entire answer span* during correct answer generation on a calibration dataset. This novel span-aggregation approach allows SRHs to capture broader semantic context, effectively mitigating the "streaming head dominance" issue where critical mid-context tokens might be evicted. The selected SRHs then guide the importance scoring for tokens to be retained. The second component is an error-aware layer-adaptive cache allocation strategy. Instead of using online attention statistics, CompressKV quantifies the compression error for each layer by computing the Frobenius norm of the difference between attention-block outputs with full and compressed caches. This error estimation is performed *offline*, which is a significant practical advantage as it introduces no additional runtime overhead during inference. The total cache budget is then distributed proportionally to these precomputed layer-wise error scores, with practical minimum and maximum allocation constraints. The methodology is well-motivated, directly addresses identified limitations of existing methods, and offers a principled, efficient, and practical approach to KV-cache management.
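The error-aware allocation step described above can be illustrated with a minimal sketch. It assumes strictly positive per-layer error scores and a simple clip-and-redistribute loop; the function names `layer_error` and `allocate_layer_budgets` and the redistribution rule are illustrative assumptions, not the authors' implementation, which should be taken from their repository.

```python
import numpy as np

def layer_error(out_full, out_compressed):
    """Frobenius norm of the gap between attention-block outputs
    computed with the full cache and with the compressed cache."""
    return float(np.linalg.norm(out_full - out_compressed))

def allocate_layer_budgets(errors, total_budget, min_budget, max_budget):
    """Split total_budget across layers in proportion to their error scores,
    keeping every layer inside [min_budget, max_budget].

    Assumption: layers that hit a bound are fixed at that bound and the
    remaining budget is re-shared among the other layers.
    """
    errors = np.asarray(errors, dtype=float)
    budgets = np.zeros_like(errors)
    active = np.ones(errors.shape, dtype=bool)  # layers still shared proportionally
    remaining = float(total_budget)
    while active.any():
        idx = np.flatnonzero(active)
        share = remaining * errors[idx] / errors[idx].sum()
        clipped = np.clip(share, min_budget, max_budget)
        hit = clipped != share
        if not hit.any():
            budgets[idx] = share
            break
        # Fix the layers that hit a bound, then re-share what is left.
        budgets[idx[hit]] = clipped[hit]
        remaining -= clipped[hit].sum()
        active[idx[hit]] = False
    return budgets

# Three hypothetical layers with a 100-token average budget (m=32, M=3*100).
demo = allocate_layer_budgets([0.01, 1.0, 1.0], total_budget=300,
                              min_budget=32, max_budget=300)
```

In this toy case the low-error layer is raised to the minimum of 32 tokens and the other two layers split the remaining 268 tokens equally. Because the scores are precomputed offline, a table like `demo` would simply be loaded at inference time.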
The experimental evaluation is exceptionally comprehensive and robust. CompressKV is rigorously benchmarked against six strong, state-of-the-art KV-cache eviction baselines (StreamingLLM, SnapKV, PyramidKV, CAKE, HeadKV, AdaKV). The evaluation spans multiple GQA-based LLMs, including Llama-3.1-8B, Mistral-7B, Qwen2.5-14B, and Qwen2.5-32B, demonstrating broad applicability. Performance is assessed on two crucial long-context benchmarks: LongBench (covering diverse tasks like QA, summarization, few-shot learning) and Needle-in-a-Haystack (focused on retrieval accuracy). The results consistently show CompressKV's superior performance across models and memory budgets, with particularly impressive gains under tight memory constraints. For instance, it preserves over 97% of full-cache performance using only 3% of the KV cache on LongBench and achieves 90% accuracy with just 0.7% KV storage on Needle-in-a-Haystack. Extensive ablation studies confirm the individual contributions and complementary nature of SRH-driven token selection and error-aware layer-adaptive allocation. A causal ablation further highlights the critical role of SRHs compared to Traditional Retrieval Heads. Crucially, the paper also demonstrates CompressKV's orthogonality and additive benefits when combined with other efficiency techniques such as prefilling acceleration, KV-cache quantization, and head-level allocation, underscoring its potential as a general improvement. Memory and latency measurements further validate the practical benefits, showing stable decoding latency and reduced peak memory at long contexts.
The paper explicitly states that the code is publicly available at `https://github.com/TUDa-HWAI/CompressKV`, which is a strong indicator of reproducibility. Key implementation details are provided, including the use of FlashAttention-2, greedy decoding, specific local attention parameters (window_size=8, kernel_size=5), the number of selected SRHs per layer (top four), and the min/max budget constraints for layer allocation (m=32, M=3*B_per-layer). The offline nature of SRH identification and error-aware allocation, along with the mention of a calibration dataset (following prior work and provided in their codebase), further aids reproducibility by clearly defining the precomputation steps.
One potential limitation is the reliance on a calibration dataset with ground-truth answers for the identification of Semantic Retrieval Heads. While the paper states this data is provided and follows prior work, it implies that applying CompressKV to entirely new tasks or models without such a dataset might require an initial data collection and calibration step, which could be an overhead for certain niche applications. The method is specifically designed for GQA-based LLMs, and its direct applicability or performance on other attention mechanisms (e.g., MQA, MHA) is not explicitly discussed or evaluated. Although the offline computation is a strength for efficiency, it means the SRH identification and layer budgets are fixed and do not adapt dynamically to specific input prompts or changing task characteristics during inference, which might be a trade-off for ultimate adaptability.
Accurate user modeling often depends on rich interaction histories, which are unavailable for billions of low-activity users. Large Language Models (LLMs) can infer latent user states from static profiles, but this reasoning becomes unreliable when profiles are sparse, and applying an LLM to billions of users is prohibitively expensive. We present ScaleToT, which learns structured reasoning from a small LLM-processed subset and extends it to the broader low-activity user population. To improve reasoning reliability, ScaleToT constructs typed user-state chains with a bounded entropy-guided Tree-of-Thought (ToT) refinement procedure. To make this structured reasoning usable from sparse profiles, the teacher-curated chains are used to train a student model on static profiles through supervised fine-tuning (SFT) and Outcome-Driven Segment-Aware Implicit Reward Policy Optimization (OSIPO). ScaleToT then transfers the student's reasoning representations to a lightweight profile encoder, providing shared reasoning signals for the remaining users without LLM inference. We evaluate ScaleToT on lifetime value (LTV) prediction in a billion-scale advertising deployment. A randomized online A/B test increased LT30 by 6.738%, while offline reasoning covered only 7.32% of the potential population, greatly reducing compute cost compared with full-population reasoning.
Primary: Kuaishou Technology
All Institutions: Kuaishou Technology
ScaleToT presents a robust industrial solution for scaling LLM-based structured reasoning to billion-scale low-activity user modeling, achieving significant online gains through a novel combination of entropy-guided ToT refinement, segment-aware RL distillation, and vector-quantized reasoning transfer.
The paper proposes ScaleToT, a framework for low-activity user modeling that bridges the gap between expensive LLM reasoning and scalable inference. The core methodological innovation lies in the "Bounded Typed Tree-of-Thought" (ToT) construction, which uses entropy-guided refinement to create structured, typed user-state chains from sparse profiles using privileged context during training. This is followed by a distillation phase where a student model learns to generate these chains via Supervised Fine-Tuning (SFT) and a novel Outcome-Driven Segment-Aware Implicit Reward Policy Optimization (OSIPO). Finally, the reasoning representations are transferred to the full population using Vector Quantization (VQ) and a profile-conditioned gate, allowing inference without LLM calls. The approach is technically sound, addressing specific industrial constraints (sparsity, cost) with a multi-stage pipeline that combines structured reasoning, RL-based alignment, and representation learning.
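As a rough illustration of the final transfer stage, the sketch below shows a generic vector-quantization lookup followed by a scalar gate. Everything here is a hypothetical stand-in: the codebook, the gate parameters `w_gate` and `b_gate`, and the convex-combination fusion are assumptions made for illustration, since the summary does not specify how ScaleToT's gate is parameterized.

```python
import numpy as np

def vq_lookup(query, codebook):
    """Return the index and vector of the codebook entry nearest to query
    (squared Euclidean distance), i.e. a standard VQ assignment."""
    dists = ((codebook - query) ** 2).sum(axis=1)
    k = int(np.argmin(dists))
    return k, codebook[k]

def gated_fusion(profile_emb, code_vec, w_gate, b_gate=0.0):
    """Blend the retrieved code vector with the profile embedding.
    The gate is a sigmoid of a linear function of the profile embedding
    (an assumed form of a 'profile-conditioned gate')."""
    gate = 1.0 / (1.0 + np.exp(-(profile_emb @ w_gate + b_gate)))
    return gate * code_vec + (1.0 - gate) * profile_emb

# Toy example: two codes in a 2-d space, no LLM call at inference.
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
profile = np.array([0.9, 0.8])
index, code = vq_lookup(profile, codebook)
fused = gated_fusion(profile, code, w_gate=np.zeros(2))
```

With a zero gate weight the gate equals 0.5, so `fused` is the midpoint of the profile embedding and its nearest code. The point of the design is that only a nearest-neighbor lookup and a small gate run per user.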
The evaluation is conducted on a billion-scale industrial dataset for Lifetime Value (LTV) prediction. The paper reports a 6.738% increase in LT30 (cumulative active days) in a randomized online A/B test, which is a significant and practically meaningful metric for an advertising platform. Offline metrics (Ranking AUC) also show improvements over baselines like Direct LLM, Free-Form CoT, and Sequential CoT. The ablation studies effectively isolate the contributions of the entropy-guided refinement and the OSIPO reward signal. The scalability analysis demonstrates that high performance can be maintained with reasoning coverage of only ~7.32% of the population, validating the efficiency claims.
The paper provides detailed descriptions of the model architectures, hyperparameters (learning rates, batch sizes, codebook sizes), and the specific LLM backbones used (Qwen3 series). The algorithms for entropy-guided refinement and reasoning transfer are formally defined. However, as is common with industrial papers, the exact dataset statistics and proprietary features are anonymized, which may limit exact replication. The code is not publicly available.
The method relies heavily on the assumption that latent user states can be represented by a finite set of typed fields, which may not hold for all user modeling tasks. The "privileged context" used during training (post-return feedback) is not available at inference, creating a distribution shift that the student must learn to approximate from sparse profiles alone; while the results are good, this is an inherent limitation of the cold-start setting. The VQ retrieval mechanism, while efficient, introduces a quantization error that might discard nuanced reasoning patterns.
This work has significant implications for the deployment of LLMs in large-scale recommendation and advertising systems. By demonstrating how to distill structured reasoning into lightweight, scalable models, it provides a blueprint for applying complex LLM capabilities to billion-user populations where direct inference is infeasible. It highlights the value of structured, interpretable reasoning in user modeling, potentially shifting the field away from black-box sequence modeling towards more explicit state inference for cold-start users. ScaleToT presents a robust industrial solution for scaling LLM-based structured reasoning to billion-scale low-activity user modeling, achieving significant online gains through a novel combination of entropy-guided ToT refinement, segment-aware RL distillation, and vector-quantized reasoning transfer.
Graph convolutional networks (GCNs) have demonstrated significant success in capturing complex user-item relationships for collaborative filtering (CF). However, due to their reliance on extensive model training, training-free graph filtering (GF)-based CF methods have emerged as a promising alternative, offering computational efficiency by smoothing graph signals via matrix operations. In particular, polynomial GF-based approaches demonstrate improved accuracy through their ability to design more expressive and flexible filtering functions. Despite these advantages, existing GF methods suffer from a critical memory bottleneck: they necessitate storing the full item similarity graph, incurring prohibitive memory costs for large-scale datasets, which limits their practical applicability. To tackle this challenge, we propose Mem-GF (Memory-efficient GF), a new GF-based CF method that departs from conventional designs by principally leveraging the structure of Krylov subspaces as a core mechanism for approximating polynomial graph filters without explicitly storing the item similarity graph. We theoretically analyze the minimum Krylov subspace size that guarantees lossless approximation. Through extensive experiments, we demonstrate that Mem-GF achieves up to 5.74$\times$ lower memory usage and 4.38$\times$ speedup in runtime, while consistently exceeding the recommendation accuracy of state-of-the-art GF and GCN-based methods. Mem-GF robustly scales to datasets with tens of millions of interactions, establishing itself as a practically viable and theoretically grounded solution for efficient CF.
Primary: Yonsei University
All Institutions: Yonsei University
Mem-GF has a significant broader impact on the field of recommender systems and potentially other areas of graph machine learning. By effectively addressing the memory bottleneck, it makes high-accuracy, polynomial graph filtering techniques practically viable for large-scale collaborative filtering, a critical requirement for modern online platforms. This enables faster preprocessing, low-latency real-time inference, and superior recommendation quality on standard hardware, democratizing access to advanced graph-based CF. The principled use of Krylov subspaces as a core filtering mechanism, rather than merely a computational shortcut, could inspire similar memory-efficient approaches in other graph signal processing or graph machine learning contexts where large, implicitly defined matrices are a challenge. The strong theoretical grounding further enhances the trustworthiness and potential for generalization of this methodology. Mem-GF proposes a novel, memory-efficient, and training-free graph filtering method for collaborative filtering that leverages Krylov subspaces to approximate high-order polynomial graph filters without explicitly storing the full item similarity graph. This paper makes a substantial technical contribution by elegantly solving a critical memory bottleneck in graph filtering-based collaborative filtering, enabling scalable and high-accuracy recommendations on large datasets. The method's theoretical grounding, combined with comprehensive experimental validation demonstrating significant memory savings, speedups, and state-of-the-art accuracy, establishes Mem-GF as a practically viable and theoretically sound solution that will likely influence future research and deployment of graph-based recommender systems.
The paper effectively identifies a critical memory bottleneck in existing Graph Filtering (GF)-based Collaborative Filtering (CF) methods, which stems from the necessity to explicitly store a full item similarity graph $P$ of size $|I| \times |I|$. The proposed Mem-GF method offers an elegant and principled solution by leveraging Krylov subspaces. Instead of forming and storing $P$, Mem-GF approximates polynomial graph filters $f(P)r_u$ by projecting $P$ onto a user-specific Krylov subspace $\mathcal{K}_K(P, r_u)$, generated by the user's interaction vector $r_u$. This projection is efficiently computed using the Lanczos algorithm, which yields an orthonormal basis $Q_u$ and a much smaller tridiagonal matrix $T_u$. The filtering operation is then performed in this reduced space as $\|r_u\|_2 Q_u f(T_u) e_1$. A key methodological strength is that the matrix-vector product $Pq_{u,j}$ required by Lanczos is computed as $R^T(Rq_{u,j})$, completely bypassing the explicit construction of $P$. The theoretical analysis provides a clear and strong guarantee: for a polynomial filter of degree $N$, setting the Krylov subspace size $K > N$ ensures lossless approximation under exact arithmetic. This theoretical foundation is crucial for understanding and applying the method. Furthermore, the ability to operate within a low-dimensional subspace grants Mem-GF the flexibility to design and utilize high-order polynomial filters (e.g., approximating a Gaussian filter), which are typically infeasible for conventional methods due to memory constraints, thereby enhancing filter expressiveness and accuracy. The "training-free" nature aligns with the paper's goal of computational efficiency.
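The Krylov projection can be made concrete with a small NumPy sketch. It assumes the unnormalized similarity $P = R^T R$ and a filter given by monomial coefficients; the function name `lanczos_filter`, the random matrix `R`, and the coefficients are hypothetical, and the paper's normalization and Hadamard-power steps are omitted.

```python
import numpy as np

def lanczos_filter(R, r_u, coeffs, K):
    """Approximate f(P) r_u with P = R^T R and f(x) = sum_n coeffs[n] * x**n,
    using K Lanczos steps and never forming P."""
    n = r_u.size
    norm_r = np.linalg.norm(r_u)
    Q = np.zeros((n, K))
    alpha = np.zeros(K)
    beta = np.zeros(K - 1)
    Q[:, 0] = r_u / norm_r
    for j in range(K):
        w = R.T @ (R @ Q[:, j])              # P q_j computed as R^T (R q_j)
        alpha[j] = Q[:, j] @ w
        w = w - alpha[j] * Q[:, j]
        if j > 0:
            w = w - beta[j - 1] * Q[:, j - 1]
        # Full reorthogonalization for numerical stability (small K only).
        w = w - Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)
        if j == K - 1:
            break
        b = np.linalg.norm(w)
        if b < 1e-12:                        # invariant subspace reached early
            Q, alpha, beta = Q[:, :j + 1], alpha[:j + 1], beta[:j]
            break
        beta[j] = b
        Q[:, j + 1] = w / b
    T = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    evals, evecs = np.linalg.eigh(T)
    f_vals = np.polyval(coeffs[::-1], evals)  # polyval wants highest degree first
    fT_e1 = evecs @ (f_vals * evecs[0, :])    # f(T) e_1 via eigendecomposition
    return norm_r * (Q @ fT_e1)

# Check the K > N condition on a toy problem: degree N = 3, K = 4.
rng = np.random.default_rng(0)
R = rng.random((20, 8))                       # hypothetical user-item matrix
r_u = R[0]
coeffs = [0.0, 1.0, -0.3, 0.05]
approx = lanczos_filter(R, r_u, coeffs, K=4)

P = R.T @ R                                   # dense reference, toy scale only
exact = sum(c * np.linalg.matrix_power(P, i) @ r_u for i, c in enumerate(coeffs))
rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
```

With $K = 4 > N = 3$, `rel_err` sits at floating-point round-off level, consistent with the lossless-approximation claim. The dense `P` appears only to build the reference; the filter itself touches `R` alone.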
The experimental evaluation is exceptionally comprehensive and provides strong empirical evidence for all claims. Experiments are conducted on three widely used CF benchmark datasets: Yelp, Amazon-book, and the large-scale MovieLens-20M, covering diverse scales and characteristics. A broad range of 21 baselines is included, encompassing various CF categories (MF, Autoencoder, GCN, Generative, LinkProp), with a particular focus on other GF-based methods. Key metrics such as memory usage (VRAM, RAM), runtime (preprocessing and inference), and recommendation accuracy (Recall@K, NDCG@K) are rigorously evaluated. The results are highly impactful: Mem-GF achieves up to 5.74x lower memory usage and 4.38x speedup during preprocessing, and a remarkable 26.2x speedup during inference. Crucially, these significant efficiency gains are accompanied by state-of-the-art recommendation accuracy, consistently outperforming both GF and GCN-based methods across most datasets and metrics. The scalability analysis on synthetic datasets further validates the method's linear complexity with respect to the number of users, items, and interactions, confirming its practical applicability for real-world, large-scale deployments. The empirical validation of the theoretical condition ($K > N$), along with analyses of different polynomial filters and hyperparameter sensitivity, adds to the robustness and thoroughness of the evaluation.
The paper demonstrates a strong commitment to reproducibility. A GitHub link to the source code (`https://github.com/jindeok/Mem-GF`) is provided, which is a critical component for enabling replication. Detailed hyperparameters for Mem-GF are explicitly stated for each dataset. Furthermore, the paper outlines the data splitting, evaluation protocols, hardware specifications (CPU, GPU, RAM), and software environment (PyTorch), along with the method for generating synthetic datasets. These comprehensive details provide sufficient information for researchers to reproduce the reported results.
While Mem-GF's "training-free" nature offers efficiency, it inherently implies less flexibility compared to learnable GCNs that can adapt their filters through end-to-end optimization. The polynomial coefficients are found by approximating a target frequency response, which is still a predefined approach rather than a fully learned one. The theoretical guarantee of lossless approximation holds under *exact arithmetic* and when the polynomial degree $N$ is less than the Krylov subspace size $K$. While the paper mentions that finite-precision arithmetic or $N \ge K$ might lead to instability, a deeper exploration of these practical implications beyond empirical observation would be beneficial. The method still requires tuning of hyperparameters such as $s$ (Hadamard power) and $\delta$ (damping factor for the Gaussian filter). Although Mem-GF enables user-specific filtering in the Krylov subspace, the underlying polynomial filter itself is still globally defined, rather than being truly personalized to each user's unique spectral characteristics.
We study first-order methods for solving monotone variational inequalities arising in min-max optimization. Classical approaches such as the extragradient method rely on two gradient queries per iteration, which limits their analysis and applicability in the online and stochastic settings. We propose a family of Generalized Optimistic Methods with Anchoring (GOMA), which combine two-time-scale optimistic updates with an anchoring term inspired by Halpern iteration. In the deterministic setting, GOMA achieves the optimal accelerated last-iterate rate $O(1/k^2)$ on the squared gradient norm for monotone Lipschitz operators. In the stochastic setting with unbounded variance, a simplified single-call variant of GOMA achieves a last-iterate convergence rate of $O(1/\sqrt{k})$ on the squared gradient norm. To the best of our knowledge, this is the first such guarantee for stochastic monotone Lipschitz variational inequalities in the unconstrained setting without variance reduction or growing batches.
Primary: Université de Montréal
All Institutions: Université de Montréal, Mila - Quebec AI Institute, Mohammed Bin Zayed University of Artificial Intelligence, CIFAR AI Chair
This paper contributes to the fundamental understanding and development of optimization algorithms for variational inequalities and min-max optimization, which are crucial in various machine learning applications like adversarial training, GANs, and multi-agent reinforcement learning. By providing a method that offers last-iterate convergence in challenging stochastic settings (single-call, no variance reduction, no growing batches, unbounded variance), GOMA could enable more efficient and stable training of models in online or resource-constrained environments. The explicit acknowledgment of AI assistant use in proof development is also a noteworthy aspect regarding research methodology. The impact statement correctly identifies the potential for more efficient use of computing resources but also cautions about the Jevons paradox. This paper introduces Generalized Optimistic Method with Anchoring (GOMA), a novel first-order method for monotone variational inequalities that achieves optimal $O(1/k^2)$ last-iterate convergence in the deterministic setting and, critically, provides the first last-iterate $O(1/N)$ convergence guarantee on the expected squared operator norm for stochastic monotone Lipschitz VIs in the unconstrained setting without variance reduction, growing batches, or bounded variance assumptions. The work makes a significant theoretical advancement by demonstrating that strong last-iterate guarantees are compatible with single-sample online models under highly challenging noise conditions, supported by empirical evidence on synthetic problems.
The paper proposes Generalized Optimistic Methods with Anchoring (GOMA) for solving monotone variational inequalities (VIs) in min-max optimization. GOMA combines two key ideas: two-time-scale optimistic updates (from generalized optimistic methods) and an anchoring term (inspired by Halpern iteration). The method is presented in a general form (Eq. 7) with separate step sizes for exploration and update, and an anchoring coefficient. In the deterministic setting, GOMA is analyzed under two parameter setups (larger update step or larger exploration step), both achieving the optimal accelerated last-iterate rate of $O(1/k^2)$ on the squared gradient norm for monotone Lipschitz operators. The proof relies on a potential-based analysis, which is a standard and robust technique. A notable aspect is the claim of a "pseudo fixed-step size scheme" that simplifies hyperparameter tuning compared to some prior methods. For the stochastic setting, the paper introduces a simplified single-call variant of GOMA (Eq. 16) by setting the optimistic update coefficient to zero, effectively replacing extrapolation with anchoring to the initial point. This variant is analyzed under state-dependent noise (Assumption 1) where the variance can grow with the squared norm of the operator, a challenging setting. The proof strategy involves comparing noisy iterates to a deterministic reference trajectory and bounding the mean-square deviation. Theorem 3.1 establishes a last-iterate convergence rate of $O(1/N)$ on the expected squared operator norm $E\|G(x_N)\|^2$. This is a significant theoretical contribution, as the paper claims it is the first such guarantee for stochastic monotone Lipschitz VIs in the unconstrained setting without variance reduction or growing batches, and under unbounded variance. A critical issue is the inconsistency in reporting the stochastic convergence rate. The abstract states $O(1/\sqrt{k})$ on the squared gradient norm, which implies $O(1/k^{1/4})$ on the gradient norm. Theorem 3.1, however, states $E\|G(x_N)\|^2 = O(1/N)$, which implies $O(1/\sqrt{N})$ on the gradient norm. The comparison table (Table 1) and parts of the discussion further add to this confusion, sometimes stating $O(1/k^{1/4})$ on $E\|G(x_k)\|$ and sometimes $O(1/k)$ on $E\|G(x_k)\|^2$ (which are inconsistent with each other). Assuming Theorem 3.1 is the most accurate statement of the result, the rate is $O(1/N)$ on $E\|G(x_N)\|^2$, which is a strong result given the challenging assumptions.
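The conversions between squared-norm and norm rates used in this comparison follow from Jensen's inequality, and are therefore upper bounds on the expected norm rather than exact rates. For any iterate $x_k$:

$$
\mathbb{E}\,\|G(x_k)\| \;\le\; \sqrt{\mathbb{E}\,\|G(x_k)\|^2},
\qquad\text{so}\qquad
\mathbb{E}\,\|G(x_k)\|^2 = O\!\left(k^{-1/2}\right) \;\Rightarrow\; \mathbb{E}\,\|G(x_k)\| = O\!\left(k^{-1/4}\right),
\qquad
\mathbb{E}\,\|G(x_k)\|^2 = O\!\left(k^{-1}\right) \;\Rightarrow\; \mathbb{E}\,\|G(x_k)\| = O\!\left(k^{-1/2}\right).
$$

The two readings therefore differ by a square root in the exponent: $k^{-1/2}$ versus $k^{-1}$ on the squared norm, and correspondingly $k^{-1/4}$ versus $k^{-1/2}$ on the norm.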
The experimental evaluation is conducted on toy problems, which is common for theoretical optimization papers. 1. **Negative-Comonotone Quadratic Saddle Point (Deterministic)**: This experiment uses a problem instance outside the theoretical scope (negative comonotonicity vs. monotonicity), but it's a standard benchmark for comparing VI algorithms. GOMA and FEG show accelerated convergence, while others diverge. GOMA empirically achieves a better constant factor than FEG. 2. **Stochastic Bilinear Game (Bounded Variance)**: On a low-dimensional bilinear game with additive Gaussian noise ($=1$), GOMA significantly outperforms baselines (DSEG, FEG, E-Halpern, RAIN++, Nesterov), achieving the fastest convergence and a residual an order of magnitude smaller. This supports the claim of robustness without variance reduction. 3. **Finite-Sum Saddle-Point Problem (State-Dependent Variance)**: On a higher-dimensional finite-sum problem with multiplicative noise ($>1$), GOMA and RAIN++ show convergence, while DSEG stagnates. This experiment directly validates GOMA's ability to handle state-dependent, unbounded variance, a key theoretical claim. Overall, the experiments, despite being on synthetic problems, effectively demonstrate the empirical advantages of GOMA, particularly in stochastic settings with challenging noise characteristics, aligning well with the theoretical claims.
The paper provides algorithmic details, step size choices, and parameter schedules for GOMA. For baselines, it refers to existing implementations or settings from prior work. However, specific hyperparameters for all methods are deferred to the appendix, and no code repository is provided. While the theoretical derivations are detailed, the lack of a public code release or highly detailed hyperparameter tuning instructions (beyond the appendix reference) might hinder direct reproducibility for practitioners.
1. **Stochastic Rate Inconsistency**: As noted, there is a significant discrepancy in the reported stochastic convergence rates across the abstract, main text, theorem statement, and comparison table. This undermines the clarity and rigor of the paper's central stochastic contribution. Assuming the theorem ($O(1/N)$ on $E\|G(x_N)\|^2$) is correct, the other statements are misleading. 2. **Suboptimal Stochastic Rate**: The paper acknowledges that GOMA's stochastic rate (stated in the abstract as $O(1/\sqrt{k})$ on the squared gradient norm) does not match the optimal $O(1/N)$ rate on $E\|G(x_N)\|^2$ achieved by methods using variance reduction or growing batches. Closing this gap without such mechanisms remains an open question. 3. **Toy Experiments**: The empirical validation is limited to synthetic and relatively low-dimensional problems. Scaling GOMA to large-scale deep learning applications (e.g., adversarial training) and demonstrating its practical benefits there would strengthen the work. 4. **Unconstrained Setting**: The analysis is restricted to unconstrained VIs. Extending it to constrained settings, where the convergence measure often shifts to the gap function, is an open direction. 5. **Monotonicity Assumption**: The theoretical guarantees rely on the monotonicity of the operator, which is a strong assumption not always met in practical deep learning min-max problems.
This paper contributes to the fundamental understanding and development of optimization algorithms for variational inequalities and min-max optimization, which are crucial in various machine learning applications like adversarial training, GANs, and multi-agent reinforcement learning. By providing a method that offers last-iterate convergence in challenging stochastic settings (single-call, no variance reduction, no growing batches, unbounded variance), GOMA could enable more efficient and stable training of models in online or resource-constrained environments. The explicit acknowledgment of AI assistant use in proof development is also a noteworthy aspect regarding research methodology. The impact statement correctly identifies the potential for more efficient use of computing resources but also cautions about the Jevons paradox. This paper introduces Generalized Optimistic Method with Anchoring (GOMA), a novel first-order method for monotone variational inequalities that achieves optimal $O(1/k^2)$ last-iterate convergence in the deterministic setting and, critically, provides the first last-iterate $O(1/N)$ convergence guarantee on the expected squared operator norm for stochastic monotone Lipschitz VIs in the unconstrained setting without variance reduction, growing batches, or bounded variance assumptions. The work makes a significant theoretical advancement by demonstrating that strong last-iterate guarantees are compatible with single-sample online models under highly challenging noise conditions, supported by empirical evidence on synthetic problems.
Recently, Muon has gained substantial attention as an appealing alternative to Adam-like optimizers, with many works highlighting its advantages through spectral normalization and improved conditioning. Yet this positive theoretical narrative contrasts with its empirical performance in large language model (LLM) training, where Muon's gains over Adam/AdamW are often mixed, schedule-sensitive, and not uniformly superior. To address this gap, we develop a trajectory-level theory characterizing both the strengths and limitations of Muon. We introduce a mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components, capturing a mixture of anisotropic structure and long-tail information reminiscent of LLM training. On top of this model, we adopt a river-valley perspective in which we view the landscape as composed of a river direction flowing to the desired solution and hill directions encoding nuisance or task-irrelevant information. In the momentum-free setting, we show that Muon moves faster along the information-bearing river direction during early optimization, but can converge much more slowly near the river bottom than gradient descent. We then extend the river-valley perspective to general nonconvex objectives with momentum by studying points on the spectral river. There, while Muon converges faster early on, its orthogonalized update removes residual scale information, making it prone to overshooting and oscillation near the target solution. Together, these results suggest that our characterizations extend beyond spiked matrix sensing and motivate switching to GD-like refinement optimizers in the final phase, rather than relying only on a fixed learning-rate schedule for Muon. We also provide preliminary evidence supporting this two-stage approach in language model training experiments.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania, City University of Hong Kong, Shanghai University of Finance and Economics
This paper provides a rigorous theoretical framework explaining the strengths and limitations of the Muon optimizer, proposing a two-stage optimization strategy supported by preliminary LLM experiments. The work introduces a novel mixed-spiked matrix sensing model and leverages a river-valley perspective to characterize Muon's fast early exploration and late-stage convergence difficulties, offering valuable insights for the design of more effective training schedules for large language models.
The paper develops a sophisticated theoretical framework to analyze the Muon optimizer, particularly its behavior in anisotropic landscapes reminiscent of LLM training. The core methodology involves introducing a novel "mixed-spiked matrix sensing (MS) model" where the sensing operator decomposes into signal, spike, and bulk components. This model is well-motivated by empirical observations of covariance spectra in deep learning. The authors then adopt a "river-valley perspective," a geometric view that decomposes the optimization landscape into a "river" direction (aligned with meaningful progress) and "hill" directions (nuisance information). This perspective is applied to both a simplified, momentum-free Muon and extended to generalized nonconvex objectives with momentum. The analysis uses invariant manifolds to reduce matrix-valued dynamics to low-dimensional scalar systems, enabling tractable analysis of continuous and discrete-time dynamics for both vanilla GD and simplified Muon. Key theoretical results (Theorems 1, 2, 3) rigorously characterize Muon's early-stage fast exploration and late-stage convergence difficulties (overshooting, oscillation) compared to GD. The extension to generalized settings using a "spectral river" further strengthens the broader applicability of their insights. The mathematical derivations are thorough and provide a deep understanding of the underlying mechanisms.
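The early-fast, late-slow behavior described above can be illustrated with a scalar caricature. This is a sketch under strong assumptions, not the paper's matrix-valued analysis: for a 1×1 matrix, orthogonalizing the gradient reduces to taking its sign, so a momentum-free orthogonalized step becomes a fixed-size sign step. The curvature, learning rates, and step counts below are hypothetical values chosen so that the flat "river" direction is visible.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def run_gd(w0, curvature, lr, steps):
    """Gradient descent on f(w) = 0.5 * curvature * w**2."""
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w
    return w


def run_sign(w0, curvature, lr, steps):
    """Scalar stand-in for an orthogonalized update: step along sign(gradient)."""
    w = w0
    for _ in range(steps):
        w -= lr * sign(curvature * w)
    return w


W0, CURVATURE = 10.0, 0.01   # flat direction with small curvature (assumed values)
GD_LR, SIGN_LR = 1.0, 0.3    # illustrative learning rates, not tuned

# Distance from the minimizer w = 0 after a few steps and after many steps.
early_gd = abs(run_gd(W0, CURVATURE, GD_LR, 20))
early_sign = abs(run_sign(W0, CURVATURE, SIGN_LR, 20))
late_gd = abs(run_gd(W0, CURVATURE, GD_LR, 5000))
late_sign = abs(run_sign(W0, CURVATURE, SIGN_LR, 5000))

print(f"after 20 steps:   gd={early_gd:.3f}  sign={early_sign:.3f}")
print(f"after 5000 steps: gd={late_gd:.3e}  sign={late_sign:.3f}")
```

Because the sign step ignores gradient magnitude, it covers ground quickly where the gradient is tiny but keeps stepping by the full learning rate near the minimum, which is the scale-information loss the review attributes to the orthogonalized update.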
The experimental evaluation, while described as "preliminary," provides valuable empirical evidence supporting the theoretical claims. The authors train a 250M-parameter LLaMA-style decoder-only Transformer from scratch on OpenWebText2, a relevant and challenging setting for LLM research. They compare Muon-only baselines with various learning rate schedules against a proposed two-stage hybrid approach (Muon followed by AdamW). The results demonstrate that constant-LR Muon indeed exhibits the fastest initial loss decrease, consistent with its early-stage exploratory power. Crucially, the "Muon -> AdamW" hybrid strategy leads to more stable loss trajectories and achieves lower final validation loss compared to Muon-only baselines, even with tuned schedules. This directly supports the theoretical recommendation of using Muon for early exploration and switching to a GD-like optimizer for late-stage refinement. The inclusion of experiments with different switching times and post-switch AdamW LR schedules further strengthens the robustness of their findings. While the scale of the model (250M) is not "large" by today's cutting-edge LLM standards, it is sufficiently large to demonstrate the practical relevance of the theoretical insights.
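The two-stage idea can be sketched on a one-dimensional quadratic. This is a toy under stated assumptions, not the paper's training setup: the first stage uses a sign step as a scalar stand-in for Muon, the second stage uses plain gradient descent as a stand-in for the AdamW refinement phase, and the switch step, curvature, and learning rates are hypothetical.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def train(w0, curvature, steps, switch_step, sign_lr=0.3, gd_lr=1.0):
    """Minimize f(w) = 0.5 * curvature * w**2 and return the final |w|.

    Steps before `switch_step` use the sign update (scalar stand-in for Muon);
    later steps use plain gradient descent (stand-in for the refinement stage).
    """
    w = w0
    for step in range(steps):
        grad = curvature * w
        if step < switch_step:
            w -= sign_lr * sign(grad)
        else:
            w -= gd_lr * grad
    return abs(w)


STEPS = 500
sign_only = train(10.0, 0.01, STEPS, switch_step=STEPS)  # never switches
gd_only = train(10.0, 0.01, STEPS, switch_step=0)        # switches immediately
hybrid = train(10.0, 0.01, STEPS, switch_step=40)        # explore, then refine

print(f"sign only: {sign_only:.4f}  gd only: {gd_only:.4f}  hybrid: {hybrid:.4f}")
```

In this toy the hybrid ends closer to the minimizer than either single-stage run, since the sign stage covers the flat region quickly and the gradient stage then shrinks the remaining error instead of oscillating around it.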
The paper provides a project website (https://muon-river-valley.github.io/) which typically includes code and experimental details, enhancing reproducibility. The experimental setup details are reasonably well-described, including model architecture (LLaMA-style decoder-only Transformer), parameter count (250M), tokenizer (GPT-2), dataset (OpenWebText2), and training iterations (4k). Learning rate schedules (cosine, linear, cos_inf) and switching points are also mentioned. While not all hyperparameter details are in the main text, the appendix and project website are expected to fill these gaps. The theoretical derivations are detailed in the appendix, allowing for verification.
The primary theoretical analysis relies on a simplified, momentum-free Muon and a specific mixed-spiked MS model, although the paper attempts to generalize these insights to more complex settings. The empirical evidence, while supportive, is explicitly stated as "preliminary" and conducted on a 250M-parameter model, which is modest compared to state-of-the-art LLMs. Further large-scale experiments on diverse architectures and tasks would strengthen the practical implications. The paper also acknowledges that the river-valley decomposition is only one lens and suggests integrating it with other phenomena like edge-of-stability behavior as future work, indicating a limitation in the current scope of analysis.
This paper significantly advances the theoretical understanding of spectral optimizers like Muon, which have gained attention but lacked a comprehensive explanation for their mixed empirical performance. The "river-valley perspective" and the mixed-spiked MS model provide valuable tools for analyzing optimization landscapes in deep learning, particularly in the context of anisotropic gradients observed in LLMs. The practical implication of a two-stage optimization strategy (Muon for exploration, GD-like for refinement) could lead to more efficient and stable training schedules for large models, reducing the need for extensive learning rate tuning. This work has the potential to influence the design and application of future optimizers and contribute to a more principled approach to deep learning training.
Model-free reinforcement learning algorithms such as Proximal Policy Optimization (PPO) treat the environment as a black box, estimating policy gradients from sampled rewards; this process demands millions of interactions and relies on high-variance advantage estimates. When environment dynamics are differentiable, the return is an end-to-end differentiable function of the policy parameters, enabling exact gradient computation via backpropagation through simulation. We term this approach Analytic Policy Gradients (APG) and evaluate it against PPO on four continuous control tasks of increasing dynamical complexity: a one-dimensional point-mass target-reaching task, a 2D point-mass navigation task with obstacle avoidance, a 2D rigid-body T-block pushing task, and a 7-DOF Franka FR3 end-effector reaching task. Both algorithms share identical model architectures, observation normalization, and optimizer settings. To decouple sample efficiency from compute efficiency, we design a multi-axis evaluation protocol that records performance against environment steps and gradient steps. We report a segmented backpropagation scheme with MC and critic-based bootstrap modes that mitigates gradient degradation on long-horizon tasks, and present ablations over segment length and bootstrap strategy.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
This paper has a significant positive broader impact, primarily within the differentiable reinforcement learning and robotics communities. 1. **Advancing Differentiable RL**: It provides compelling empirical evidence for the sample efficiency benefits of leveraging differentiable environment dynamics, addressing a critical bottleneck in real-world RL applications. 2. **Practical Tools and Enablement**: The Warp--PyTorch gradient bridge is a crucial engineering contribution that makes complex, GPU-accelerated physics engines (like NVIDIA Newton/Warp) more accessible for differentiable RL research within the widely used PyTorch framework. This can accelerate progress in areas such as robot manipulation and locomotion. 3. **Improved Evaluation Standards**: The unified benchmarking harness and multi-axis evaluation protocol set a higher standard for comparing model-free and differentiable RL algorithms, promoting more rigorous and fair assessments across the field. 4. **Guidance for Practitioners**: The detailed ablation studies on bootstrap strategies and segment lengths offer practical, actionable advice for researchers and engineers designing differentiable RL systems, helping them make more informed choices for robust training. 5. **Open Science Contribution**: The release of the full codebase and environment suite fosters open research, enabling others to reproduce, verify, and build upon this work, accelerating collective progress in the field. This paper rigorously validates the benefits of Analytic Policy Gradients in differentiable continuous control. It provides a robust benchmarking framework, introduces a practical gradient bridge for complex physics engines, and offers valuable insights into segmented backpropagation strategies, significantly advancing the practical applicability and understanding of differentiable reinforcement learning.
The paper presents Analytic Policy Gradients (APG) as a method for continuous control, leveraging differentiable environment dynamics to compute exact policy gradients via backpropagation through simulation. While the core concept of APG is not new, the paper's strength lies in its meticulous methodological contributions and rigorous implementation. A unified benchmarking harness is developed, allowing for a highly controlled comparison between APG and PPO by ensuring identical actor-critic architectures, observation normalization, and optimizer settings. This standardization is crucial for drawing fair conclusions about the gradient source's impact. The paper adopts a segmented backpropagation scheme to address vanishing/exploding gradients in long-horizon tasks. A key methodological contribution is the detailed exploration and comparison of two bootstrap modes for these segments: Monte Carlo (MC) bootstrap and critic-based bootstrap. The MC bootstrap, which pre-computes future returns from detached rewards, is shown to be a more robust option for shorter segment lengths, providing valuable practical guidance. A significant engineering contribution is the custom `torch.autograd.Function` that bridges NVIDIA Warp/Newton's tape-based autodiff with PyTorch's autograd. This "gradient bridge" enables APG to be applied to complex, GPU-accelerated physics engines that do not natively expose PyTorch-compatible derivatives, thereby expanding the practical applicability of differentiable RL to more realistic and complex robotic tasks like the 7-DOF Franka arm. The use of the reparameterization trick for action sampling ensures proper gradient flow through stochastic policies. Overall, the methodology is sound, well-explained, and effectively tackles practical challenges in implementing differentiable RL.
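Backpropagation through simulation can be sketched without any autodiff library. The example below is an illustration under assumed dynamics, not the paper's implementation: a 1D kinematic point with update $x_{t+1} = x_t + \Delta t \, a_t$, a one-parameter linear policy $a_t = -\theta x_t$, and reward $-x_{t+1}^2$. The reverse sweep is written by hand and checked against a central finite difference.

```python
def rollout(theta, x0=1.0, dt=0.1, horizon=50):
    """Simulate x_{t+1} = x_t + dt * a_t with policy a_t = -theta * x_t."""
    xs = [x0]
    for _ in range(horizon):
        action = -theta * xs[-1]
        xs.append(xs[-1] + dt * action)
    return xs


def episode_return(theta, x0=1.0, dt=0.1, horizon=50):
    """J(theta) = sum_t -(x_{t+1})^2: reward for staying near the target at 0."""
    xs = rollout(theta, x0, dt, horizon)
    return -sum(x * x for x in xs[1:])


def analytic_gradient(theta, x0=1.0, dt=0.1, horizon=50):
    """Hand-written reverse-mode sweep through the stored trajectory."""
    xs = rollout(theta, x0, dt, horizon)
    grad_theta = 0.0
    carry = 0.0  # adjoint flowing back from later states
    for t in reversed(range(horizon)):
        adjoint = -2.0 * xs[t + 1] + carry      # total dJ/dx_{t+1}
        grad_theta += adjoint * (-dt * xs[t])   # direct effect of theta on x_{t+1}
        carry = adjoint * (1.0 - dt * theta)    # propagate the adjoint to x_t
    return grad_theta


theta = 0.5
exact = analytic_gradient(theta)
eps = 1e-6
finite_diff = (episode_return(theta + eps) - episode_return(theta - eps)) / (2 * eps)
print(f"analytic: {exact:.6f}  finite difference: {finite_diff:.6f}")
```

The analytic gradient needs one rollout plus one backward sweep, while a sampled policy-gradient estimate of the same quantity would need many rollouts to reach comparable accuracy; that contrast is the sample-efficiency argument the review summarizes.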
The experimental evaluation is exceptionally thorough and well-designed. The authors evaluate APG against PPO on four continuous control tasks of increasing dynamical complexity: a 1D point-mass, a 2D point-mass navigation with obstacles, a 2D rigid-body T-block pushing task, and a 7-DOF Franka FR3 end-effector reaching task. This diverse suite effectively demonstrates APG's performance across various scenarios. A key strength of the evaluation is the multi-axis logging protocol, which records performance against both environment steps (measuring sample efficiency) and gradient steps (measuring compute efficiency). This approach is critical for a fair comparison, as APG and PPO consume these resources at different rates. Results are reported as mean ± standard deviation over multiple random seeds, enhancing statistical reliability. Performance thresholds and success rates are clearly defined, providing a comprehensive view of agent capabilities beyond just episodic return. The results consistently show that APG achieves higher final episodic returns and often higher success rates than PPO, particularly on simpler tasks. More importantly, APG demonstrates substantial sample efficiency gains, requiring significantly fewer gradient steps (up to 15.9x fewer on FrankaReach) and environment steps to reach comparable performance thresholds. This strongly validates the benefit of lower-variance analytic gradients. The ablation studies on PointMassNavigate are particularly insightful. They clearly demonstrate that MC bootstrap is robust across varying segment lengths, degrading gracefully even at very short horizons. In contrast, critic bootstrap is highly sensitive to segment length, collapsing entirely at short lengths due to unstable value targets and only becoming competitive at longer segments. This finding provides crucial practical guidance for practitioners. The successful application of the Warp-PyTorch gradient bridge on the FrankaReach task further validates its feasibility and impact.
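One plausible reading of the MC bootstrap is sketched below. It is an assumption about the scheme, not the authors' code: each segment's objective is the discounted reward inside the segment plus a discounted return-to-go from the segment boundary, where that tail is precomputed from rewards and treated as a constant. The function names and the reward sequence are hypothetical.

```python
def returns_to_go(rewards, gamma):
    """Discounted return from each step onward, computed by a reverse scan."""
    out = [0.0] * (len(rewards) + 1)   # out[T] = 0 after the final step
    for t in reversed(range(len(rewards))):
        out[t] = rewards[t] + gamma * out[t + 1]
    return out


def segment_objectives(rewards, gamma, segment_len):
    """Split an episode into segments and attach an MC bootstrap to each.

    In a real autodiff setting the in-segment rewards would stay attached to
    the graph, while the bootstrap term is a plain number (detached), so no
    gradient flows past the segment boundary.
    """
    tail = returns_to_go(rewards, gamma)   # treated as constants
    objectives = []
    for start in range(0, len(rewards), segment_len):
        end = min(start + segment_len, len(rewards))
        in_segment = sum(gamma ** (t - start) * rewards[t] for t in range(start, end))
        bootstrap = gamma ** (end - start) * tail[end]
        objectives.append((start, in_segment + bootstrap))
    return objectives


rewards = [1.0, 0.5, -0.2, 0.3, 0.8, -0.1, 0.4]   # made-up episode
gamma = 0.9
full = returns_to_go(rewards, gamma)
objs = segment_objectives(rewards, gamma, segment_len=3)
for start, value in objs:
    print(f"segment starting at t={start}: objective={value:.4f}")
```

Each segment objective equals the full return-to-go at the segment start, so the value being optimized is unchanged; only the length of the gradient chain is cut. A critic bootstrap would replace `tail[end]` with a learned value estimate, which is where the instability at short segment lengths reported above would enter.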
Reproducibility is a standout feature of this paper. The authors have made their entire implementation, including environment definitions, training scripts, and plotting utilities, open-source on GitHub. They provide detailed instructions, a `requirements.txt` for dependencies, and explicit commands to reproduce each figure and table presented in the paper. The unified benchmarking harness itself contributes significantly to reproducibility by standardizing the comparison between algorithms. This commitment to open science is exemplary and greatly enhances the credibility and utility of the research.
The paper transparently discusses several important limitations inherent to the Analytic Policy Gradients approach:
1. **Environment Differentiability**: APG fundamentally requires the environment dynamics and reward function to be differentiable. This restricts its application to specific simulators and excludes real-world training or environments with non-differentiable elements (e.g., discrete contact events, complex procedural generation).
2. **Gradient Chain Length Issues**: Despite the use of segmented backpropagation, long episodes can still lead to vanishing or exploding gradients. The effectiveness of APG remains sensitive to the choice of segment length and bootstrap strategy, as demonstrated by the ablation studies.
3. **Compute Overhead**: Maintaining the full computation graph during environment rollouts incurs higher memory and computational overhead compared to model-free methods like PPO, which use detached rollouts. This can be a practical concern for very complex environments or extremely long horizons.
4. **Model Bias (for future work)**: While the current work uses ground-truth differentiable dynamics, the authors acknowledge that extending APG to learned differentiable world models would introduce model bias, which could potentially counteract the variance reduction benefits.
Psychological instruments designed for humans are increasingly used to assign large language models (LLMs) stable psychological profiles that affect their usability, safety assessment, and use as proxies for human participants in research. Using a formal psychometric framework, we show that these profiles are largely a measurement artifact. Administering a battery of personality and risk-preference instruments spanning self-reports and behavioral tasks to 56 instruction-tuned LLMs alongside large human reference samples, we report four findings. First, differences between models are driven not by the traits an instrument targets but by a directional response bias, a tendency to respond toward one end of the scale, or one labeled option, regardless of item content; a variance decomposition attributes 81-90% of between-model variation to this bias, against 9-16% in humans. Second, the bias declines with model capability but is not eliminated by it. Third, because bias rather than trait drives responding, an instrument's apparent reliability is almost entirely predicted by its response orthogonality, a term we coin for the proportion of items for which trait and bias point in opposite directions. Fourth, the profile a model appears to have shifts with the items used and can be manufactured through item selection. These results demonstrate that the apparent psychological profiles of LLMs are artifacts of the instrument used to measure them, not properties of the models themselves. As instruments borrowed from human psychology are rarely fully orthogonal and may inherently lack validity for LLMs, we call for dedicated assessments centered on response orthogonality.
Primary: Max Planck Institute for Human Development
All Institutions: Max Planck Institute for Human Development, University of Konstanz, Barcelona Supercomputing Center, University of Basel
This paper provides a critical psychometric analysis demonstrating that apparent psychological profiles in LLMs are largely artifacts of response bias rather than genuine traits, fundamentally challenging current evaluation practices and calling for more rigorous, orthogonal measurement frameworks in AI research.
The paper employs a rigorous psychometric framework to deconstruct the measurement of LLM "personalities." By modeling responses as a function of latent trait and response bias, and utilizing the concept of "response orthogonality" (the proportion of reverse-keyed items), the authors provide a mathematically sound method to separate genuine trait variance from systematic response artifacts. This approach is theoretically robust and directly addresses a fundamental confound in current LLM evaluation methodologies.
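A small sketch shows why the share of reverse-keyed items matters. It rests on assumptions that go beyond the text: a 1 to 5 Likert scale, a respondent whose bias points to the top option on every item, and standard reverse-key recoding. All function names are hypothetical.

```python
def response_orthogonality(keys):
    """Share of reverse-keyed items; keys are +1 (forward) or -1 (reverse)."""
    return sum(1 for k in keys if k < 0) / len(keys)


def scale_score(responses, keys, low=1, high=5):
    """Mean item score after recoding reverse-keyed items."""
    recoded = [r if k > 0 else (low + high - r) for r, k in zip(responses, keys)]
    return sum(recoded) / len(recoded)


def always_agree(n_items, high=5):
    """Respondent with pure directional bias: picks the top option every time."""
    return [high] * n_items


all_forward = [+1] * 10      # no reverse-keyed items
balanced = [+1, -1] * 5      # half of the items reverse-keyed

score_forward = scale_score(always_agree(10), all_forward)
score_balanced = scale_score(always_agree(10), balanced)

print(response_orthogonality(all_forward), score_forward)
print(response_orthogonality(balanced), score_balanced)
```

With no reverse-keyed items the biased respondent appears to sit at the top of the trait scale, while the balanced instrument pulls the same responses back to the scale midpoint. The apparent trait level is therefore set by the instrument's keying rather than by anything about the respondent.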
The experimental design is comprehensive, testing 56 instruction-tuned LLMs against a battery of 29 instruments (personality and risk preference) and comparing them against large human reference samples. The results are striking and consistent: LLMs show positive forward-reverse correlations (indicating bias dominance) whereas humans show negative correlations (indicating trait dominance). The variance decomposition showing 81-90% of LLM variation is bias-driven is a powerful empirical finding. The robustness checks across prompting conditions, model sizes, and elicitation methods strengthen the validity of the conclusions.
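The sign of the forward-reverse correlation can be reproduced in a small simulation. This is a sketch under assumptions that are not taken from the paper: raw (unrecoded) responses are modeled as trait plus bias plus noise on forward items and as negative trait plus bias plus noise on reverse items, with Gaussian components whose standard deviations are made up.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n, trait_sd, bias_sd, noise_sd=0.3, seed=0):
    """Raw mean responses on forward and reverse items for n respondents."""
    rng = random.Random(seed)
    forward, reverse = [], []
    for _ in range(n):
        trait = rng.gauss(0, trait_sd)
        bias = rng.gauss(0, bias_sd)
        forward.append(trait + bias + rng.gauss(0, noise_sd))
        reverse.append(-trait + bias + rng.gauss(0, noise_sd))
    return forward, reverse


# Bias-dominated population (the pattern reported for LLMs) versus a
# trait-dominated population (the pattern reported for humans).
bias_dominated = pearson(*simulate(500, trait_sd=0.3, bias_sd=1.0))
trait_dominated = pearson(*simulate(500, trait_sd=1.0, bias_sd=0.3))

print(f"bias-dominated:  r = {bias_dominated:+.2f}")
print(f"trait-dominated: r = {trait_dominated:+.2f}")
```

In this model the covariance between forward and reverse responses is the bias variance minus the trait variance, so a positive correlation indicates that bias outweighs trait and a negative one indicates the opposite.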
The paper provides detailed methodological descriptions, including the specific instruments used, the prompting strategies, and the mathematical derivations for the measurement model. The code repository is explicitly linked, ensuring high reproducibility. The use of standard psychometric instruments and clear definitions of variables enhances the clarity of the experimental setup.
The study focuses primarily on post-trained models and treats each model as a single respondent, which may not capture within-model variability or the nuances of different prompting strategies (e.g., persona adoption). The proprietary model sample size is small (N=10), limiting statistical power for those specific comparisons. Additionally, the study is limited to personality and risk domains; while the authors argue for broader applicability, empirical validation in other domains (e.g., moral reasoning, cognitive biases) is left for future work.
This paper has significant implications for the field of AI safety, alignment, and the use of LLMs as proxies for human participants in research. It challenges the validity of current LLM profiling practices and calls for a re-evaluation of how we measure and interpret LLM behaviors. The concept of response orthogonality offers a new standard for designing valid evaluation instruments for AI systems.
Speculative inference accelerates large language model (LLM) decoding but provides no inherent safety guarantees. Existing safety defenses are largely incompatible with speculative inference: they either introduce additional computation or disrupt the draft-verify mechanism, negating acceleration benefits. This reveals a fundamental incompatibility between current safety methods and speculative decoding. We propose SafeSpec, a safety-aware speculative inference framework that integrates risk estimation directly into the verification process. SafeSpec attaches a lightweight latent safety head to the target model to jointly evaluate semantic validity and safety in a single forward pass. When unsafe generations are detected, SafeSpec applies rollback and safety-guided reflective multi-sampling to recover safe continuations rather than terminating generation. We model jailbreak attacks as distributional shifts over generative trajectories, where adversarial prompts increase the probability of harmful continuations without eliminating safe ones. Under this model, SafeSpec performs risk-aware trajectory recovery within the speculative decoding process. Across multiple models and adversarial benchmarks, SafeSpec achieves a substantially improved safety-efficiency trade-off. On Qwen3-32B, SafeSpec reduces attack success rates by 15% while preserving a 2.06x inference speedup on benign workloads, demonstrating that speculative acceleration and inference-time safety can be jointly optimized.
Primary: Zhejiang University
All Institutions: Zhejiang University, Huawei, Harbin Institute of Technology, Shenzhen
SafeSpec has a significant positive broader impact by enabling the deployment of safer and more efficient large language models. By addressing the fundamental incompatibility between speculative inference and existing safety defenses, it paves the way for LLMs to be used in more sensitive and performance-critical applications. This framework can help mitigate the risks of harmful content generation, thereby increasing public trust and responsible AI deployment. The approach of recovering safe continuations rather than hard refusal also improves user experience and model helpfulness. The paper does not highlight any specific negative societal consequences beyond the general risks associated with LLMs, which its work aims to mitigate. SafeSpec introduces a novel safety-aware speculative inference framework that elegantly integrates risk estimation and recovery directly into the LLM decoding process. This work makes a substantial technical contribution by demonstrating a superior safety-efficiency trade-off through a lightweight latent safety head and a dynamic rollback-and-reflect multi-sampling mechanism, offering a practical and impactful solution for deploying safer and faster LLMs in real-world applications.
The methodology of SafeSpec is well-conceived and addresses a critical gap in LLM deployment: integrating safety guarantees into speculative inference without negating its acceleration benefits. The core innovation lies in its dual-head verification mechanism. By attaching a lightweight, boundary-aligned latent safety head to the target model, SafeSpec enables simultaneous assessment of semantic validity and safety in a single forward pass. This design is elegant as it leverages the target model's existing computation for quality scoring, incurring negligible additional overhead for safety checks. The boundary-aligned extraction of hidden states for the safety head is a clever detail, preventing interference from the quality scoring prompt. The training methodology for the safety head, using step-wise prefix construction and a guard model for labeling, is sound for aligning the head with the inference process. The "rollback-and-reflect" mechanism, coupled with safety-guided multi-sampling, is a significant departure from traditional hard refusal strategies. Framing jailbreak attacks as distributional shifts where harmful continuations become more probable but safe ones are not entirely eliminated provides a strong theoretical underpinning for the multi-sampling approach. The rollback to a previous, potentially "cleaner" state, combined with a reflection prompt, effectively reshapes the sampling space, increasing the probability of finding a safe continuation. This soft intervention strategy is crucial for maintaining utility and helpfulness, avoiding the common pitfall of over-refusal. The probabilistic view of multi-sampling is clearly articulated, demonstrating how increasing sample size $K$ improves the chance of recovery.
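The control flow and the multi-sampling argument can be sketched as follows. Everything here is a hypothetical stand-in rather than SafeSpec's implementation: the scorers, thresholds, and resampling function are toy callables, and the recovery formula assumes that the $K$ samples are independent and each is safe with the same probability $p$, giving $1 - (1 - p)^K$. In SafeSpec the quality and safety scores come from a single forward pass, whereas this sketch calls two separate functions for readability.

```python
def recovery_probability(p_safe, k):
    """Chance that at least one of k independent samples is safe."""
    return 1.0 - (1.0 - p_safe) ** k


def verify_block(draft, quality_fn, risk_fn, resample_fn,
                 tau_quality=0.5, tau_risk=0.5, k=4):
    """One verification step with hypothetical scorers.

    Returns (tokens, status), where status is 'accepted', 'rejected',
    'recovered', or 'refused'.
    """
    if risk_fn(draft) > tau_risk:
        # Unsafe: discard the draft (rollback) and draw k reflective samples.
        candidates = [resample_fn(i) for i in range(k)]
        safe = [c for c in candidates if risk_fn(c) <= tau_risk]
        if not safe:
            return [], "refused"
        return min(safe, key=risk_fn), "recovered"
    if quality_fn(draft) < tau_quality:
        return [], "rejected"   # caller falls back to the target model's tokens
    return draft, "accepted"


# Toy stand-ins: risk is the share of flagged tokens, quality is constant.
FLAGGED = {"harmful"}


def risk(tokens):
    return sum(t in FLAGGED for t in tokens) / max(len(tokens), 1)


def quality(tokens):
    return 1.0


def resample(i):
    # The first two samples stay unsafe; later ones are safe continuations.
    return ["harmful"] if i < 2 else ["safe", f"option{i}"]


benign = verify_block(["hello", "world"], quality, risk, resample)
attacked = verify_block(["harmful", "harmful"], quality, risk, resample)
print(benign)
print(attacked)
print(recovery_probability(0.5, 3))
```

The benign draft passes straight through, so the speculative speedup is untouched, while the flagged draft triggers rollback and resampling, which is the source of the extra latency on adversarial inputs that the review discusses.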
The experimental evaluation is comprehensive and rigorous. The authors use two distinct model families (Qwen3-32B and DeepSeek-R1-Distill-Llama-70B) with appropriate draft models, demonstrating the framework's scalability and versatility. Evaluation metrics cover three critical dimensions: defense against seven advanced adversarial attacks (ASR), over-refusal rates (XSTest), and general capabilities/efficiency (GSM8K, MATH, GPQA-diamond, and inference speedup). This multi-faceted evaluation provides a holistic view of SafeSpec's performance. SafeSpec consistently achieves state-of-the-art defense performance, significantly reducing ASR (e.g., 15% on Qwen3-32B) while preserving substantial inference speedups on benign workloads (2.06x on Qwen3-32B, 1.76x on DeepSeek-70B). Crucially, it maintains low over-refusal rates and negligible accuracy degradation on general reasoning tasks, showcasing a superior safety-efficiency trade-off compared to strong baselines like SafeDecoding and SecDecoding. The ablation studies are well-designed, clearly demonstrating the necessity and synergistic effect of both the reflection prompt and multi-sampling. The comparison with a hard refusal strategy effectively highlights the benefits of SafeSpec's recovery mechanism. Hyperparameter analysis provides valuable insights into the trade-offs involved with sample size, safety threshold, and quality threshold. The detailed latency breakdown in the appendix is particularly insightful, transparently explaining the performance characteristics on benign vs. adversarial inputs and justifying the reduced throughput on jailbreak inputs as a feature of the defense. The comparison with a standalone guard model further validates SafeSpec's efficiency and user experience advantages.
The paper demonstrates good reproducibility. Code is made available on GitHub. The appendix provides detailed information on evaluation datasets, jailbreak prompt construction, quality scoring prompt, safety head configurations (architecture, parameter counts), and training setup (data sources, sampling, hyperparameters, data isolation). Layer choice ablation and per-benchmark sensitivity analysis for quality threshold are also included, providing further confidence in the design choices. The use of a fixed random seed is also mentioned.
1. **Reliance on Guard Model for Labeling**: The training data for the safety head is labeled using Qwen3Guard-Gen-8B. The performance and biases of this external guard model could implicitly limit the safety head's effectiveness and generalization, especially if the guard model itself is imperfect or susceptible to certain attacks. 2. **Heuristic Nature of Reflection Prompt**: While effective, the reflection prompt is a handcrafted heuristic. Its optimal design might be sensitive to the target model or specific attack types, and its generalizability across all future attacks is not guaranteed. 3. **Performance on Adversarial Inputs**: Although justified as a necessary cost for safety, the significant slowdown on jailbreak inputs (throughput below 1x) means that if an attacker can consistently trigger Safety Mode, they can effectively degrade the system's performance, even if they don't get a harmful response. This could be a denial-of-service vector. 4. **Adversarial Attacks on Safety Head**: As the safety head is a lightweight classifier, it might be susceptible to direct adversarial attacks designed to bypass it, rather than just the main LLM. The paper does not explore this. 5. **Fixed Rollback State**: The rollback mechanism reverts to the "previous state." For deeply embedded or multi-turn attacks, a single step rollback might not always be sufficient to reach a truly "clean" context, potentially requiring more sophisticated context recovery.
SafeSpec has a significant positive broader impact by enabling the deployment of safer and more efficient large language models. By addressing the fundamental incompatibility between speculative inference and existing safety defenses, it paves the way for LLMs to be used in more sensitive and performance-critical applications. This framework can help mitigate the risks of harmful content generation, thereby increasing public trust and responsible AI deployment. The approach of recovering safe continuations rather than hard refusal also improves user experience and model helpfulness. The paper does not highlight any specific negative societal consequences beyond the general risks associated with LLMs, which its work aims to mitigate. SafeSpec introduces a novel safety-aware speculative inference framework that elegantly integrates risk estimation and recovery directly into the LLM decoding process. This work makes a substantial technical contribution by demonstrating a superior safety-efficiency trade-off through a lightweight latent safety head and a dynamic rollback-and-reflect multi-sampling mechanism, offering a practical and impactful solution for deploying safer and faster LLMs in real-world applications.
Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge. Our analysis shows that conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search. We then examine detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR. We evaluate a proof-of-concept realization of this strategy through Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
Primary: Keysight Technologies Inc.
All Institutions: Keysight Technologies Inc.
This work has significant broader impact for the security and robustness of agentic AI systems. 1. **Paradigm Shift in Defense**: It proposes a fundamental shift from reactive blocking to proactive misdirection, offering a new conceptual framework for designing defenses against automated adversarial attacks. This could inspire new research directions in active defense strategies for LLMs. 2. **Enhanced Agentic AI Security**: As agentic AI systems become more prevalent, their susceptibility to automated attacks is a critical concern. The detect-and-misdirect strategy provides a practical and effective method to improve the resilience of these systems, making them safer for deployment in real-world applications. 3. **Improved Red Teaming**: The insights into how automated judges can be misled can inform the development of more robust red teaming methodologies, pushing attackers to develop more sophisticated (and costly) verification mechanisms. 4. **Understanding LLM Limitations**: The work highlights and leverages specific limitations of LLM-based judges, particularly their reliance on heuristic cues. This contributes to a deeper understanding of how LLMs interpret and evaluate text, which is valuable for both defense and general LLM development. 5. **Deterrence**: By making apparent successes less trustworthy, misdirection can serve as a psychological and operational deterrent, increasing the cost and uncertainty for attackers. This paper introduces a theoretically grounded and empirically validated "detect-and-misdirect" defense strategy that significantly reduces the success rate of model-guided automated attacks on agentic AI systems. Through a probabilistic model, the authors demonstrate the inherent vulnerability of conventional detect-and-block defenses to iterative search and show how misdirection, by inducing false positives in the attacker's judge, can bound asymptotic attack success. 
The practical instantiation, Contextual Misdirection via Progressive Engagement (CMPE), is shown to be highly effective in end-to-end evaluations against state-of-the-art attack frameworks, nearly eliminating verified attack success and causing premature termination, thereby offering a crucial new paradigm for enhancing the security and robustness of autonomous AI.
The paper introduces a novel "detect-and-misdirect" defense strategy against model-guided automated attacks on agentic AI systems, contrasting it with conventional "detect-and-block" approaches. The methodology is robust, starting with a probabilistic model of the attack-defense setting. This model rigorously demonstrates a fundamental limitation of detect-and-block defenses: predictable refusals provide useful feedback to automated search, allowing attacker success rate (ASR) to approach one as the query budget grows. This theoretical insight is crucial. The proposed detect-and-misdirect strategy is then formalized, showing that by introducing misdirection-induced false positives (MI-FP) in the attacker's automated judge, the positive predictive value (PPV) of attacker-selected candidates is reduced, leading to a bounded asymptotic ASR. This theoretical underpinning is a significant strength. The paper then instantiates this strategy with Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational mechanism. CMPE comprises three components: a positive-intent preamble, safe context expansion via prompt-reshaping (token-level transformations, lexical injection, shuffling), and a follow-up question. This design specifically aims to appear cooperative and semantically plausible to an LLM-based judge, while containing no genuinely harmful content, thus exploiting known limitations of such judges that often rely on heuristic cues (tone, structure, perceived intent) rather than strict semantic correctness. The CMPE algorithm is clearly described, enhancing its practicality and reproducibility. The methodology effectively bridges theoretical analysis with a concrete, implementable defense.
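The asymptotic argument can be illustrated with a toy calculation. The sketch below is not the paper's probabilistic model: it assumes independent queries, an attacker who stops at the first response its judge accepts, and two hypothetical per-query probabilities, `p_true` (a genuinely harmful response that the judge accepts) and `p_false` (a non-harmful response, such as a misdirection decoy, that the judge wrongly accepts).

```python
def asr_after_budget(p_true: float, p_false: float, budget: int) -> float:
    """Probability that the first judge-accepted response within `budget`
    queries is a genuine success (toy model, independent queries)."""
    p_accept = p_true + p_false
    if p_accept == 0.0:
        return 0.0
    # Positive predictive value of a single judge "accept".
    ppv = p_true / p_accept
    # Probability that at least one response is accepted within the budget.
    p_any_accept = 1.0 - (1.0 - p_accept) ** budget
    return ppv * p_any_accept


budgets = (10, 100, 1000)
# Detect-and-block: refusals are assumed never to be mistaken for success.
block = [asr_after_budget(0.01, 0.0, n) for n in budgets]
# Detect-and-misdirect: decoys are assumed to fool the judge half the time.
misdirect = [asr_after_budget(0.01, 0.50, n) for n in budgets]
```

With `p_false = 0` the value climbs toward 1 as the budget grows; with a non-zero `p_false` it can never exceed `p_true / (p_true + p_false)`, which is the bounded-ASR behavior described above.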
The experimental evaluation is comprehensive and compelling, validating the proposed strategy through both simulated ASR bounds and end-to-end attack runs. 1. **Judge Error Rate Estimation**: The authors evaluate CMPE's ability to induce misdirection-induced false positives (MI-FP) across a diverse set of six modern automated judge models (including rubric-based LLM judges like StrongREJECT and PAIR, and classifier judges like HB-FT-LLaMA2-13B, GPTFuzz-RoBERTa, and Llama-Guard-3-8B). This is performed on 500 high-risk jailbreak prompts from the AdvBench dataset. Each prompt-response pair is evaluated 10 times to account for stochasticity. The results clearly show that CMPE responses consistently receive high harmfulness scores from judges, demonstrating its effectiveness in generating MI-FPs. 2. **Simulated ASR Evaluation**: Using these per-sample judge error estimates, the paper computes simulated maximum ASR upper bounds for various attacker-defender judge configurations. The results show that CMPE substantially reduces the estimated ASR upper bound, often by one to two orders of magnitude, compared to the detect-and-block baseline. This directly supports the theoretical prediction that misdirection degrades the attacker's PPV and bounds their success. 3. **End-to-End Attack Framework Evaluation**: This is the most impactful part of the evaluation. CMPE is tested against two representative model-guided attack frameworks, GPTFuzz and PAIR, using both an aligned victim model (Vicuna-13b-v1.5) and a refusal-suppressed model (NeuralDaredevil-8B-abliterated). The experiments emulate a realistic agentic security setting. The results are striking: CMPE nearly eliminates verified attack success (reducing ASR from 10-20% to 0-2%) and causes the automated attack frameworks to terminate prematurely due to accepting misdirection responses as successful. This demonstrates that CMPE effectively disrupts the attack loop by making apparent successes untrustworthy. 
The use of manual validation with a secondary LLM judge for final verification adds credibility to the reported true positive rates. The experimental setup is well-controlled, with local defense and attack components hosted on separate systems.
The paper provides sufficient details to facilitate reproducibility. The probabilistic model is clearly defined with equations. The CMPE algorithm is presented in detail, including its three components and an example. Specific models used for response generation (NeuralDaredevil-8B-abliterated) and judging (various LLM and classifier judges, including their backend models) are named, along with the dataset (AdvBench) and its source. The experimental setup for both simulations and end-to-end runs (number of prompts, iterations, victim models, attacker models, defense models, validation judges, hardware) is described. URLs for the AdvBench dataset and the NeuralDaredevil model are provided. This level of detail is commendable and supports the reproducibility of the work.
1. **Attacker Adaptation**: The paper discusses potential attacker adaptations (e.g., judge ensembling, stricter calibration) and notes that these introduce trade-offs, such as increased false negatives for the attacker. Even so, the arms race between attackers and defenders is continuous, and more sophisticated misdirection detection methods might emerge. 2. **Generality of Misdirection**: CMPE is a specific instantiation of the detect-and-misdirect strategy. While effective, developing equally lightweight and effective misdirection for other types of prompt injection or agentic attacks might require different approaches. 3. **Complexity of Misdirection Generation**: Although CMPE is described as lightweight, generating consistently plausible and misleading responses without inadvertently triggering harmful behavior or being easily detectable as non-operational could become challenging for more complex attack scenarios or highly sophisticated attackers. 4. **Focus on Jailbreak**: The evaluation primarily focuses on jailbreak attacks. While the theoretical framework is general, the CMPE instantiation and empirical validation are specific to jailbreaking. Its effectiveness against other prompt injection variants (e.g., data exfiltration, tool misuse) would need further investigation. 5. **Human Oversight in Validation**: The final validation of true positives still requires manual inspection and a secondary LLM judge, highlighting the inherent difficulty in fully automating the verification of malicious intent, even for the defender.
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced Large Language Model (LLM) reasoning; however, it faces a fundamental optimization instability: uniform token updates precipitate entropy collapse, leading to premature convergence to suboptimal strategies, whereas excessive Shannon Entropy maximization can cause entropy explosion, driving blind exploration toward incoherent reasoning chains. To resolve this dichotomy, we introduce the Independent Combinatorial Tokens (ICT) framework, which shifts the optimization focus from scalar uncertainty to the distributional properties of token logits. By leveraging the Jensen-Shannon (JS) divergence between token logits distributions, ICT identifies tokens with distinctive distributional patterns as critical branching points for guiding effective exploration in LLM reasoning. Our theoretical analysis, grounded in both Shannon and second-order Rényi entropy, proves that selectively updating on these tokens regulates policy concentration: it reduces the overall distribution uncertainty measured by Shannon entropy, while controlling probability concentration captured by second-order Rényi entropy. This dual effect prevents over-concentrated token generation from weakening exploration and effectively stabilizes the training landscape. Empirical results demonstrate that updating only the top 10% of unique tokens on Qwen2.5 (0.5B/1.5B/7B) models yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over GRPO, 20-Entropy, and STAPO baselines across seven benchmarks spanning math, commonsense, and Olympiad-level problems.
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology, Unknown Institution 2, Unknown Institution 3
The ICT framework offers a principled solution to a fundamental instability in RLVR, a critical technique for aligning LLMs with objective correctness in complex domains like mathematics and programming. By enabling more stable and effective exploration, ICT can lead to: 1. **Improved LLM Reasoning**: The demonstrated gains in Pass@4 suggest that models trained with ICT can explore more diverse and correct reasoning paths, leading to higher-quality solutions. 2. **Enhanced Training Efficiency**: The "less is more" finding, where updating only 10% of tokens yields superior results, implies potential for significant computational savings during RL fine-tuning, making it more accessible and scalable. 3. **Deeper Understanding of LLM Dynamics**: The shift from scalar entropy to distributional properties provides a new lens for understanding how LLMs make decisions and explore, potentially inspiring further research into information-theoretic approaches for policy optimization. 4. **Foundation for Future RLVR Algorithms**: The token-centric foundation established by this work could serve as a building block for the next generation of RLVR algorithms, moving beyond uniform updates to more intelligent, selective gradient application. This paper introduces the Independent Combinatorial Tokens (ICT) framework, a novel approach to stabilize Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning by focusing sparse updates on "unique tokens" identified via Jensen-Shannon divergence of logit distributions. The work provides a strong theoretical foundation, meticulously analyzing entropy dynamics with second-order Rényi entropy, and demonstrates significant, consistent empirical gains across multiple LLM scales and diverse reasoning benchmarks, offering a principled and efficient method for enhancing LLM exploration and reasoning capabilities.
The paper introduces the Independent Combinatorial Tokens (ICT) framework to address the fundamental optimization instability (entropy collapse/explosion) in Reinforcement Learning with Verifiable Rewards (RLVR) for LLM reasoning. The core idea is to shift the optimization focus from scalar uncertainty (Shannon Entropy) to the distributional properties of token logits. Specifically, ICT leverages Jensen-Shannon (JS) divergence to identify "unique tokens" whose logit distributions significantly deviate from the sequence-level average distribution. These unique tokens are posited as critical branching points for guiding effective exploration. The theoretical analysis is a major strength. It rigorously grounds the ICT framework in both Shannon and second-order Rényi entropy ($H_2$). The paper meticulously derives the gradient dynamics of $H_2$ and introduces the concept of "strategy purity" (collision probability) to formalize entropy bifurcation into regimes of collapse (updating high-probability tokens) and explosion (updating low-probability tokens). It proves that selectively updating unique tokens, which are shown to reside near the strategy purity threshold, regulates policy concentration by reducing overall Shannon entropy while controlling probability concentration via $H_2$. The appendix provides extensive derivations, including the homogeneity of $H_1$ and $H_2$ gradients under certain conditions, and a formal bridge connecting JS-unique tokens to these critical branching points. The ICT framework then integrates this insight into a sparse policy gradient estimator built upon Group Relative Policy Optimization (GRPO). An ICT distributional selector constructs a binary mask, retaining only the top-k percentile of tokens based on their JS uniqueness scores. This sparse mask is applied to the GRPO objective, ensuring that optimization resources are focused on high-information learning signals. 
The methodology is well-articulated, providing a principled approach to stabilize RLVR training.
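The selection step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes logits are turned into distributions with a plain softmax, that the reference distribution is the per-sequence mean of those distributions, and that the mask is a simple top-k cut. The function names (`js_divergence`, `unique_token_mask`) are hypothetical.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Jensen-Shannon divergence (natural log) between rows of p and q."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)


def unique_token_mask(logits: np.ndarray, keep_ratio: float = 0.10) -> np.ndarray:
    """Binary mask over positions: 1.0 for the top `keep_ratio` JS scores."""
    probs = softmax(logits)                          # (seq_len, vocab)
    mean_dist = probs.mean(axis=0, keepdims=True)    # assumed sequence-level average
    scores = js_divergence(probs, mean_dist)         # (seq_len,)
    k = max(1, int(round(keep_ratio * len(scores))))
    top = np.argsort(scores)[-k:]                    # indices of the k largest scores
    mask = np.zeros(len(scores), dtype=np.float32)
    mask[top] = 1.0
    return mask


# Random logits stand in for a model's per-position outputs.
rng = np.random.default_rng(0)
logits = rng.normal(size=(50, 32))
mask = unique_token_mask(logits, keep_ratio=0.10)
```

In a GRPO-style objective, a mask of this kind would multiply the per-token loss so that only the selected positions contribute gradient; how the paper wires this in is not specified in the review.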
The experimental evaluation is comprehensive and robust. The authors evaluate ICT on the Qwen2.5 series of models (0.5B, 1.5B, 7B), demonstrating scalability across different model sizes. Seven benchmarks are used, spanning diverse reasoning tasks including math (GSM8K, Math500, AIME23/24/25), commonsense (GPQA), and general knowledge (MMLU-Stem). This broad evaluation scope strengthens the generalization claims. ICT is compared against strong baselines: GRPO (the backbone RLVR algorithm), 20-Entropy, and STAPO. The results consistently show that ICT achieves the highest average Pass@1, Pass@4, and Total scores across all model scales and benchmarks. The average Pass@4 improvement of 4.58% (with a maximum gain of 14.9%) over baselines is significant, especially considering that only the top 10% of unique tokens are updated. This "less is more" finding is compelling. A key empirical finding is the differential improvement in Pass@4 versus Pass@1, indicating enhanced exploration capacity. Ablation studies further validate the theory: 1. **Update Ratios**: Comparing different sparsity ratios (10%, 20%, 50%, 100%) shows that updating only the top 10% of unique tokens yields the best performance, aligning with the hypothesis that focusing on critical branching points is optimal. 2. **Composition of Unique Tokens**: The analysis reveals that the ratio of high-entropy to low-entropy tokens among the selected unique tokens is approximately 1:1 (1.03 on GSM8K, 0.99 on MATH). This empirically confirms the theoretical prediction that unique tokens are drawn from both entropy collapse (Regime H) and entropy explosion (Regime L) regimes, thereby maintaining balanced entropy dynamics. The experiments are well-designed to support the theoretical claims and demonstrate practical efficacy.
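For readers comparing Pass@1 with Pass@4, the sketch below shows the widely used unbiased pass@k estimator introduced with the Codex evaluation (Chen et al., 2021). Whether the paper computes Pass@k this way is an assumption; the review does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 8 samples drawn, 2 of them correct.
p1 = pass_at_k(n=8, c=2, k=1)   # 1 - 6/8
p4 = pass_at_k(n=8, c=2, k=4)   # 1 - 15/70
```

A method that spreads probability over more distinct correct reasoning paths tends to raise pass@4 faster than pass@1, which is why the gap between the two is read as a sign of exploration capacity.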
The paper states that the training pipeline is built upon VeRL and closely follows the GRPO training recipe, with the only difference being the sparse updates. It mentions using the mean across 5 independent random seeds and provides details on datasets and baselines. More implementation details are said to be in the Appendix, which does provide extensive theoretical derivations but not explicit code or hyperparameter tables. While the methodology is clearly described, the absence of a public code repository or a detailed hyperparameter table in the main paper or appendix makes full reproducibility challenging without significant effort to replicate the VeRL/GRPO setup and then implement ICT.
1. **Strictness of H1/H2 Homogeneity Condition**: The paper acknowledges that the condition for co-directional Shannon entropy gradients ($\pi(a) > e^{-1} \approx 0.37$, with $\pi(a)$ the probability of token $a$) is restrictive for typical LLM token probabilities. While the $H_2$ analysis is unconditionally valid and an extended homogeneity for top-k tokens is argued, this still highlights a nuance in the theoretical claims. 2. **First-Order Approximation**: The theoretical derivations rely on a first-order Taylor approximation for entropy change, which assumes infinitesimally small step sizes. While justified for small learning rates, aggressive learning rates or large advantage spikes could lead to non-negligible higher-order effects. 3. **Computational Overhead**: Although the paper claims negligible computational overhead for JS divergence computation due to parallel batch processing, it still represents an additional step in the training loop. The primary savings come from the sparse backward pass, but the forward pass still involves this calculation. 4. **Generalizability Beyond Reasoning**: While the paper demonstrates strong results on reasoning tasks, the applicability of "unique tokens" identified via JS divergence to other LLM tasks (e.g., creative writing, summarization, dialogue) is not explored. 5. **Code Availability**: The lack of a publicly available code repository is a common limitation for arXiv papers and hinders direct reproducibility and adoption by the community.
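For reference, the standard definitions behind the two entropies and the $e^{-1}$ threshold in the first limitation can be written out as follows. The notation ($\pi(a)$ for the probability of token $a$) is an assumption of this note, and the per-term derivative ignores the simplex constraint, so it is only a plausible reading of where the threshold comes from, not the paper's derivation.

```latex
H_1(\pi) = -\sum_{a} \pi(a)\,\log \pi(a),
\qquad
H_2(\pi) = -\log \sum_{a} \pi(a)^2 .

\frac{\partial}{\partial \pi(a)}\Bigl(-\pi(a)\log \pi(a)\Bigr)
  = -\bigl(1 + \log \pi(a)\bigr),
\quad \text{which is negative exactly when } \pi(a) > e^{-1} \approx 0.37 .
```

The sum $\sum_a \pi(a)^2$ inside $H_2$ is the collision probability, the quantity the review refers to as "strategy purity".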
Commercial large language models bill, scale latency, and budget context per token. Yet tokenizers assign more subword tokens to the same meaning in some languages than in others, so speakers of languages with high token-fertility pay a structural penalty before a model is ever invoked. This penalty is documented for multilingual settings in general, but it has not been measured systematically for African languages at the level of enterprise deployment economics and cognitive context capacity. We measure it across 20 African languages spanning five language families and three scripts (Latin, Ge'ez/Ethiopic, N'Ko; 19 appear in the primary FLORES-200+ corpus, with Nigerian Pidgin measured via MAFAND-MT only), using parallel corpora so that the language effect is isolated from content. Across 11 frontier and open tokenizers on FLORES-200+, every African language carries a tokenization premium above English (median 1.88x on GPT-5 / o200k_base, up to 8.92x for N'Ko); the penalty is largest for Ethiopic and N'Ko scripts (reaching 7-9x) and is near-invariant across corpora (FLORES vs SIB-200 Pearson r = 0.9998). Translated into deployment terms, this results in up to 8.9x inference cost and an equivalent generation-latency multiplier (N'Ko vs English on GPT-5; 7.4x for Amharic), and as little as 11% of English's effective context window. The best currently available tokenizer for African languages, Gemma 4, reduces the mean premium from 3.31x (cl100k_base) to 2.38x, but no tokenizer eliminates the penalty. We release an open measurement tool (afri-fertility), a public leaderboard, a results dataset, and mitigation guidance for African builders. The penalty falls hardest on the languages whose speakers can least afford it, a digital divide encoded directly into the subword vocabulary.
Primary: DataLens Africa Research
All Institutions: DataLens Africa Research, CipherSense AI
This paper has profound broader impact, particularly for AI equity and the digital divide. 1. **AI Equity and Fairness:** It rigorously quantifies a structural unfairness baked into LLMs, where speakers of African languages pay more and get less utility (shorter context, longer latency) for the same service. This highlights a critical dimension of AI fairness beyond accuracy. 2. **Economic Viability:** The translation of the "tax" into concrete enterprise costs directly impacts the economic viability of deploying LLM-powered applications in African markets, where compute resources are often least affordable. This can hinder innovation and access to AI technologies. 3. **Call to Action for Developers:** The findings, especially the success of Gemma 4 for Ethiopic, provide clear evidence that tokenizer design and vocabulary inclusion can significantly mitigate the penalty. This serves as a strong call to action for LLM developers to explicitly consider low-resource and non-Latin script languages in tokenizer training. 4. **Empowerment for African Builders:** The release of the `afri-fertility` tool, leaderboard, dataset, and mitigation guidance empowers African builders to measure the penalty themselves, make informed decisions about tokenizer/model choice, and advocate for more equitable LLM infrastructure. 5. **Research Agenda:** The paper establishes a foundational benchmark and framework for future research on multilingual tokenization, fairness, and cost optimization, particularly for underrepresented languages. It links the pre-inference cost layer to existing accuracy benchmarks, encouraging a holistic view of LLM performance. This work is a crucial step towards making LLMs truly global and equitable. The paper quantifies the "African Language Tax," a structural penalty where African languages incur significantly higher tokenization costs, latency, and reduced context window efficiency in frontier LLMs due to tokenizer design. 
This comprehensive study, using parallel corpora across 20 African languages and 11 tokenizers, reveals a median 1.88x premium over English (up to 8.92x for N'Ko), translating into substantial economic burdens and operational limitations for African builders, and provides an open measurement tool and mitigation guidance.
The methodology is exceptionally robust and well-designed to isolate and quantify the "African Language Tax." The core strength lies in the use of parallel corpora, which ensures that differences in token counts are attributed solely to the language and tokenizer, not content variations. The definition of metrics (Fertility, Premium, CPT, BPT, Context Efficiency) is clear and appropriate. The aggregation method ("sum-then-divide") correctly handles corpus-level metrics, avoiding biases from short sentences, and the inclusion of bootstrap confidence intervals demonstrates statistical rigor. A significant methodological contribution is the enterprise cost model, which translates abstract tokenization premiums into tangible economic terms (USD, local currency, latency, context erosion). This model is instantiated with realistic deployment scenarios (high-volume chat, output-heavy generation, context-constrained advisory), making the impact concrete for decision-makers. The "Economic Sensitivity" analysis, which accounts for the compounding effect of FX volatility on USD-denominated API pricing, is a particularly insightful and novel aspect of the cost model, directly addressing a critical real-world challenge for African builders. The `afri-fertility` tool itself is a methodological artifact, designed for determinism, reproducibility (caching, run manifest, `reproduce` command), and extensibility, which is a strong point. The inclusion of script-level controls and the consideration of normalization forms for non-Latin scripts further demonstrate careful methodological planning.
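The "sum-then-divide" aggregation and the bootstrap interval can be sketched as follows. This is an illustration, not the `afri-fertility` implementation: the per-sentence token counts are made up, the function names are hypothetical, and context efficiency is assumed to be the reciprocal of the premium (which matches the reported pairing of an 8.92x premium with roughly 11% of English's effective context).

```python
import random
from typing import Sequence, Tuple


def premium(lang_tokens: Sequence[int], en_tokens: Sequence[int]) -> float:
    """Corpus-level premium: total tokens in the language over total in English.
    Summing first keeps short sentences from dominating the ratio."""
    return sum(lang_tokens) / sum(en_tokens)


def bootstrap_ci(lang_tokens: Sequence[int], en_tokens: Sequence[int],
                 n_boot: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI, resampling aligned sentence pairs together."""
    rng = random.Random(seed)
    n = len(lang_tokens)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(premium([lang_tokens[i] for i in idx],
                             [en_tokens[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical token counts for five aligned sentences.
en = [20, 25, 18, 30, 22]
xx = [41, 44, 35, 60, 40]

p = premium(xx, en)              # 220 / 115
lo, hi = bootstrap_ci(xx, en)
context_efficiency = 1.0 / p     # assumed definition: share of English's context
```

Resampling sentence pairs jointly, and not each language separately, preserves the parallel-corpus alignment that isolates the language effect from content.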
The experimental evaluation is comprehensive and meticulously executed. The study covers 20 African languages spanning five language families and three scripts (Latin, Ge'ez/Ethiopic, N'Ko), providing a diverse and representative sample. The inclusion of dual-script languages (Hausa Latin/Ajami, Bambara Latin/N'Ko) is a clever design choice to isolate the script effect. Eleven frontier and open tokenizers are tested, including commercially dominant ones (OpenAI's o200k_base, Llama, Gemma, Mistral, Qwen, DeepSeek) and multilingual baselines (BLOOM, Aya), as well as opaque API-based tokenizers (Claude, Gemini) for spot checks. This broad coverage ensures the findings are relevant to current LLM deployment. Three parallel corpora (FLORES-200+, SIB-200, MAFAND-MT) are used, with FLORES-200+ as the primary, providing robustness checks across different text registers. The results are striking and clearly presented:

1. **Universal Premium (H1 confirmed):** Every African language in the study carries a tokenization premium above English (median 1.88x on o200k_base, up to 8.92x for N'Ko), with the lowest observed premium still 1.29x.
2. **Dominant Script Effect (H2 confirmed):** Non-Latin scripts incur significantly higher penalties (Ethiopic mean 7.08x, N'Ko 8.92x on o200k_base) compared to Latin-script African languages (mean 1.76x).
3. **Tokenizer Performance:** Gemma 4 is identified as a standout for Ethiopic languages, reducing the premium from 7-9x to ~2.65x, demonstrating that targeted vocabulary improvements can significantly mitigate the penalty. Qwen 3 also shows a notable reduction for N'Ko.
4. **Economic Impact:** The cost model translates these premiums into substantial annual inference costs (e.g., N'Ko on GPT-5 costs up to $1.6M/year vs. $183k for English), equivalent generation latency multipliers, and severe context window erosion (N'Ko having only 11% of English's effective context).
5. **FX Compounding:** The paper effectively illustrates how FX depreciation further compounds the tokenization tax for African builders, leading to even higher effective costs in local currency.

The experimental results are empirically sound, statistically supported, and translated into highly actionable insights for both LLM developers and African deployers.
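The headline cost and context figures follow from simple arithmetic on the premium, which the sketch below reproduces. It assumes cost scales linearly with token count (same per-token price and traffic mix as the English baseline); the function names and the FX rates are hypothetical and are not the paper's cost model.

```python
def annual_cost_usd(english_cost_usd, premium):
    """Scale an English baseline by the token premium (assumes linear per-token pricing)."""
    return english_cost_usd * premium

def effective_context_share(premium):
    """Share of English's effective context left when text costs `premium` times the tokens."""
    return 1.0 / premium

def local_currency_cost(cost_usd, fx_rate):
    """Convert a USD-denominated bill at a given local-units-per-USD rate."""
    return cost_usd * fx_rate

english_cost = 183_000   # USD/year, English baseline quoted above
nko_premium = 8.92       # N'Ko premium on o200k_base quoted above

nko_cost = annual_cost_usd(english_cost, nko_premium)
print(round(nko_cost))                                        # 1632360, about $1.6M
print(round(100 * effective_context_share(nko_premium), 1))   # 11.2 (percent)

# Hypothetical FX rates (local units per USD), not taken from the paper.
fx_before, fx_after = 1000.0, 1250.0
growth = local_currency_cost(nko_cost, fx_after) / local_currency_cost(nko_cost, fx_before)
print(round(growth, 2))   # 1.25: the local bill rises with the rate, on top of the premium
```

Under these assumptions the two effects multiply: a builder paying in local currency faces the 8.92x premium and the exchange-rate increase at the same time.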
Reproducibility is a major strength of this paper. The authors release `afri-fertility`, an open-source measurement tool (Apache-2.0 license) that performs all measurements deterministically. Key features ensuring reproducibility include:

* **Determinism:** Tokenization is deterministic, and the only randomness (bootstrap CIs) is seeded.
* **Caching:** Counts are cached on disk, keyed by content and tokenizer version, ensuring consistent results across re-runs.
* **Run Manifest:** Every run generates a manifest detailing tool version, tokenizer versions, price/FX snapshots, config hash, and segmentation method, allowing precise traceability.
* **Locked Study Config:** The entire study configuration is provided as a YAML file.
* **`afri-fertility reproduce` command:** A simple command is provided to run a small offline reference suite for quick verification.
* **Open Artifacts:** Beyond the tool, a public leaderboard and results dataset are released.

This commitment to open science and reproducibility is exemplary and significantly enhances the paper's impact and trustworthiness.
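The caching idea described above (counts stored on disk, keyed by content and tokenizer version) can be sketched as follows. This is an illustration only: `cache_key`, `CountCache`, and the whitespace-based `toy_count` stand-in are hypothetical names, and the JSON layout is an assumption rather than the tool's actual format.

```python
import hashlib
import json
import os
import tempfile

def cache_key(text, tokenizer_name, tokenizer_version):
    """Hash the content together with the tokenizer identity and version."""
    payload = json.dumps([tokenizer_name, tokenizer_version, text], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class CountCache:
    """Tiny on-disk cache: one JSON file mapping cache keys to token counts."""

    def __init__(self, path):
        self.path = path
        self.calls = 0          # how many times the tokenizer was actually invoked
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.store = json.load(f)
        else:
            self.store = {}

    def count(self, text, tokenizer_name, tokenizer_version, count_fn):
        key = cache_key(text, tokenizer_name, tokenizer_version)
        if key not in self.store:
            self.calls += 1
            self.store[key] = count_fn(text)
            with open(self.path, "w", encoding="utf-8") as f:
                json.dump(self.store, f)
        return self.store[key]

def toy_count(text):
    """Stand-in for a real tokenizer: counts whitespace-separated pieces."""
    return len(text.split())

cache_path = os.path.join(tempfile.mkdtemp(), "counts.json")
cache = CountCache(cache_path)
print(cache.count("selam alem", "toy", "1.0", toy_count))  # computed, then stored
print(cache.count("selam alem", "toy", "1.0", toy_count))  # served from the cache
print(cache.calls)                                         # 1
```

Because the tokenizer version is part of the key, upgrading a tokenizer produces new entries instead of silently reusing counts from the old version.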
The authors acknowledge several limitations:

1. **UAX-29 Word Segmentation:** The standard UAX-29 word segmentation, while applied uniformly, is imperfect for highly agglutinative languages (e.g., Kinyarwanda, isiXhosa) and Ethiopic script, where word boundaries may not align cleanly. The authors mitigate this by reporting character- and byte-normalized metrics (CPT, BPT) alongside fertility, ensuring conclusions don't solely rely on word counts.
2. **Opaque Tokenizers:** Claude and Gemini are included as count-only API checks, meaning their subword segmentation cannot be inspected, limiting deeper analysis of their internal mechanisms.
3. **Corpus Dependence:** While multiple corpora are used, the primary reliance on FLORES-200+ (a professionally translated, general-domain corpus) means the findings might vary slightly for highly specialized or informal text registers not covered. However, the robustness checks with SIB-200 and MAFAND-MT show near-invariance of rankings.
4. **Snapshot Nature:** The cost and FX rates are based on specific snapshots (June 2026), meaning the absolute monetary figures will change over time. However, the *relative* premiums and the *mechanism* of FX compounding remain valid.
Vision-language-action (VLA) models have made rapid progress in generalist robot manipulation by harnessing semantic knowledge from pretrained vision-language backbones, but their visual tokens remain grounded in 2D image coordinates rather than the calibrated geometry of the robot's cameras -- a mismatch especially pronounced in multi-camera setups, where views are coupled by known intrinsics and extrinsics yet processed as independent images. We propose G$^3$VLA, a camera-aware geometric module that injects calibrated structure into the visual-token stream of a pretrained VLA without altering its action space or imitation objective, combining intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion. Geometric supervision is provided either from ground-truth point maps when available, or from confidence-gated $\pi^3$X teacher predictions, requiring no depth sensors or manual annotations. Instantiated on $\pi_0$, G$^3$VLA yields consistent gains across the LIBERO suites, RoboCasa24, RoboTwin2.0, and real-robot settings, with the largest improvements on spatially and object-sensitive tasks. We further validate on $\pi_{0.5}$ and GR00T 1.5, with results suggesting that geometric transfer is most effective when geometry-aware tokens have direct access to the action generation pathway. Our project page is at https://sites.google.com/view/g3vla
Primary: unknown
All Institutions: unknown
G$^3$VLA represents a significant advancement towards making generalist robot manipulation more robust and precise, particularly in multi-camera environments. By injecting calibrated geometric inductive biases into existing VLA models without requiring architectural overhauls or explicit 3D sensor inputs, it provides a lightweight and practical pathway for improving spatial reasoning and out-of-distribution generalization. This approach could accelerate the deployment of VLAs in real-world settings where precise manipulation and adaptability to varying viewpoints are crucial. The method's compatibility with pretrained backbones means it can readily benefit from ongoing advancements in large vision-language models. The insights into architectural dependencies also guide future research in designing VLA models that can better leverage geometric information. This work contributes to bridging the gap between high-level semantic understanding and low-level spatial precision in robot learning. This paper introduces G$^3$VLA, a camera-aware geometric module that enhances pretrained Vision-Language-Action (VLA) models by injecting calibrated camera geometry into their visual-token stream without altering their core architecture or action space. The work presents a novel combination of intrinsic-conditioned ray embeddings, projective positional encoding (PRoPE), and bidirectional cross-view fusion, alongside a practical geometry distillation strategy from a $\pi^3$X teacher, to significantly improve spatial precision and out-of-distribution generalization in robot manipulation across diverse simulated and real-world benchmarks. 
The comprehensive experimental validation, including multi-backbone evaluation, extensive ablation studies, and crucial real-robot experiments demonstrating improved generalization under viewpoint shifts, firmly establishes G$^3$VLA as a valuable and practical advancement for the field of robot learning, offering a lightweight yet impactful solution to a critical limitation of current VLA systems.
The paper introduces G$^3$VLA, a camera-aware geometric module designed to inject calibrated structure into the visual-token stream of pretrained Vision-Language-Action (VLA) models. This addresses a crucial limitation where VLA models often process visual tokens grounded in 2D image coordinates, neglecting the known calibrated geometry of multi-camera setups. A key strength of G$^3$VLA is its "lightweight" and "backbone-preserving" nature, meaning it integrates with existing VLA architectures without altering their action space or imitation objective, making it highly compatible for practical adoption. The module comprises three main components: intrinsic-conditioned ray embeddings, which enrich each ViT patch token with its back-projected viewing direction; Projective Positional Encoding (PRoPE), which leverages camera intrinsics and extrinsics to provide a calibration-derived attention bias for cross-view projective relationships; and bidirectional cross-view fusion, which facilitates the exchange of geometric context across camera streams. This combination effectively imbues 2D visual tokens with essential 3D geometric awareness. For supervision, G$^3$VLA offers flexibility: it can use ground-truth point maps in simulation or, more practically, confidence-gated predictions from a $\pi^3$X teacher model, eliminating the need for depth sensors or manual 3D annotations. The training employs a two-stage curriculum: an initial pre-training phase for the geometric module with a dominant distillation loss, followed by full policy fine-tuning where the action loss takes precedence, with distillation serving as a regularizer. This staged approach is a well-considered strategy for effectively integrating a new module into a pretrained system.
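As an illustration of what a "back-projected viewing direction" per patch token means, the sketch below computes unit ray directions for ViT patch centres. It assumes an undistorted pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`; the function name, image size, and intrinsics values are hypothetical, and how G$^3$VLA turns these directions into embeddings is not covered here.

```python
import math

def patch_ray_directions(width, height, patch, fx, fy, cx, cy):
    """Unit viewing directions in the camera frame, one per patch centre (row-major)."""
    rays = []
    for row in range(height // patch):
        for col in range(width // patch):
            # Pixel coordinates of the patch centre.
            u = (col + 0.5) * patch
            v = (row + 0.5) * patch
            # Back-project through the pinhole model: K^{-1} [u, v, 1]^T.
            x = (u - cx) / fx
            y = (v - cy) / fy
            norm = math.sqrt(x * x + y * y + 1.0)
            rays.append((x / norm, y / norm, 1.0 / norm))
    return rays

# Hypothetical 224x224 image, 16-pixel patches, principal point at the image centre.
rays = patch_ray_directions(width=224, height=224, patch=16,
                            fx=200.0, fy=200.0, cx=112.0, cy=112.0)
print(len(rays))   # 196 patches (14 x 14)
print(rays[0])     # top-left patch looks up and to the left of the optical axis
```

Rotating each direction by the camera-to-world rotation from the extrinsics would express the rays in a frame shared by all cameras, which is the kind of relationship a cross-view encoding can exploit.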
The experimental evaluation is exceptionally comprehensive and rigorous, providing strong evidence for G$^3$VLA's effectiveness. The authors validate the method across three architecturally distinct VLA backbones ($\pi_0$, $\pi_{0.5}$, and GR00T 1.5), demonstrating broad generalizability. Performance is assessed on an extensive suite of simulation benchmarks, including the LIBERO suites (Goal, Spatial, Object, and 10), RoboCasa24, and RoboTwin2.0. Results consistently show significant gains, particularly on spatially and object-sensitive tasks within LIBERO, directly supporting the paper's core hypothesis. For instance, on $\pi_0$, G$^3$VLA (GT) improves LIBERO's macro-average success rate by +3.5 points, with even larger improvements on Object (+5.0) and Spatial (+4.0) tasks. The evaluation on $\pi_{0.5}$ confirms compatibility with stronger baselines, yielding consistent, albeit smaller, improvements even when the baseline is near saturation. An insightful finding emerges from GR00T 1.5, where mixed gains suggest that the effectiveness of geometric injection depends on how directly geometry-aware tokens access the action generation pathway, highlighting an important architectural consideration for future VLA designs. Crucially, the paper includes real-robot experiments on two manipulation tasks (Pick-and-Place Test Tube, Pouring Nut) using a bimanual UR5 setup. These experiments demonstrate substantial improvements in out-of-distribution (OOD) generalization under viewpoint shifts, a critical capability for robust robot deployment. For example, on the pouring task, OOD performance for $\pi_0$ improved from 70.8-75.0% to 83.3-87.5%. Thorough ablation studies confirm the individual contributions of ray embeddings, PRoPE, and the two-stage training curriculum. The comparison between ground-truth and $\pi^3$X distillation shows that while GT provides the strongest signal, $\pi^3$X distillation recovers most of the gains, making it a practical alternative. 
The identified failure case of $\pi^3$X in visually clean synthetic scenes (RoboTwin2.0) also provides valuable insight into the teacher model's limitations.
The paper provides a clear and detailed description of the G$^3$VLA module's architecture and the two-stage training process. It explicitly states that implementation details, camera-geometry preprocessing, teacher-target generation, and backbone-specific training hyperparameters are provided in the Appendix, which is excellent practice for reproducibility. The use of established benchmarks and publicly available VLA backbones (like $\pi_0$, $\pi_{0.5}$, GR00T 1.5) further aids in replicating the results. The inclusion of a project page URL also suggests that code and/or additional resources might be available. Given the level of detail in the main paper and the promise of comprehensive appendices, the work appears to be highly reproducible.
The authors thoughtfully discuss several limitations. G$^3$VLA relies on accurate camera intrinsics and extrinsics, making it sensitive to calibration drift, synchronization errors, and train-test mismatches. The dependence on a visual geometry teacher ($\pi^3$X) means its targets can be imperfect under challenging visual conditions such as occlusion, specularities, blur, or weak-prior viewpoints, even with confidence gating. The architectural dependence is another key limitation, as evidenced by the attenuated gains on GR00T 1.5, suggesting that the benefits are maximized when geometry-aware tokens have direct access to the action generation pathway. The method focuses solely on enhancing the visual-token representation, leaving other potential failure modes (e.g., in the action space, limited demonstrations, or weak language-action grounding) unaddressed. Finally, the teacher caches and auxiliary-head training add offline computational cost, although they are not needed at deployment.
Long-horizon robot manipulation policies trained with reward shaping can still exploit dense rewards through inefficient interaction, while rare efficient behaviors may be forgotten during training. We argue that temporal efficiency itself provides a powerful and underutilized source of self-supervision for reinforcement learning. We introduce Temporal Self-Imitation Learning (TSIL), a reinforcement learning framework that mines temporally efficient successful trajectories generated during learning and converts them into reusable supervision for future policy improvement. TSIL progressively refines learning using configuration-conditioned adaptive temporal targets derived from fast successful trajectories, while preserving and replaying efficient behaviors through efficiency-weighted self-imitation learning. Across 15 distinct long-horizon manipulation tasks, TSIL consistently improves learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions. More broadly, our results suggest that the temporal structure of successful behavior itself provides a scalable self-supervisory signal for reinforcement learning beyond manually engineered reward shaping alone.
Primary: Duke University
All Institutions: Duke University
TSIL offers a powerful paradigm shift in how self-supervision can be generated for reinforcement learning, moving beyond manually engineered reward shaping. By demonstrating that the temporal structure of successful behavior itself provides a scalable self-supervisory signal, it opens new avenues for designing more efficient, robust, and autonomous learning systems for complex robotic tasks. This could significantly reduce the burden of reward engineering, accelerate policy learning, and lead to more dexterous and capable robots in real-world applications. The principles could also inspire similar approaches in other long-horizon decision-making problems beyond robotics. This paper introduces Temporal Self-Imitation Learning (TSIL), a novel reinforcement learning framework that leverages temporal efficiency as a self-supervisory signal to improve long-horizon robot manipulation. The method's innovative use of configuration-conditioned adaptive temporal targets and efficiency-weighted self-imitation learning, coupled with extensive empirical validation across 15 distinct tasks, demonstrates significant improvements in learning efficiency, task-completion efficiency, and robustness, positioning it as a highly impactful contribution to the field of robotics and reinforcement learning.
The proposed Temporal Self-Imitation Learning (TSIL) framework presents a well-conceived approach to address critical challenges in long-horizon robot manipulation: inefficient reward exploitation and the forgetting of rare, efficient behaviors. TSIL's core innovation lies in leveraging temporal efficiency itself as a self-supervisory signal. This is achieved through two main mechanisms:

1. **Configuration-conditioned adaptive temporal targets:** Instead of relying on static reward shaping, TSIL dynamically derives temporal targets from the fastest successful trajectories observed so far, conditioned on the current state (configuration). This makes the learning targets progressively more challenging and context-aware, pushing the policy towards increasingly efficient solutions. This adaptive mechanism is a significant improvement over fixed reward functions, which can often be exploited or become suboptimal as the policy improves.
2. **Efficiency-weighted self-imitation learning:** TSIL explicitly preserves and replays these fast, successful behaviors. By weighting the imitation loss based on the temporal efficiency of past trajectories, it prioritizes learning from the most optimal experiences. This directly combats the problem of catastrophic forgetting of rare but highly effective actions, ensuring that the policy continuously refines its understanding of efficient pathways.

The methodology is coherent, directly targets known limitations of existing RL approaches in complex robotic tasks, and offers a scalable way to generate self-supervision.
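The two mechanisms can be pictured with a small bookkeeping sketch. Everything here is an assumption made for illustration: the class name, the use of a hashable configuration key, the fewest-steps target rule, and the `target / steps` weighting are hypothetical choices, not TSIL's published formulation.

```python
class TemporalTargetBuffer:
    """Tracks, per configuration, the fastest success seen so far (hypothetical design)."""

    def __init__(self):
        self.best_steps = {}   # configuration -> fewest steps among successful episodes
        self.successes = []    # (configuration, steps) for every stored success

    def add_episode(self, config, steps, success):
        """Store successful episodes only and tighten the target when one is faster."""
        if not success:
            return
        self.successes.append((config, steps))
        previous = self.best_steps.get(config)
        if previous is None or steps < previous:
            self.best_steps[config] = steps

    def target(self, config, default):
        """Adaptive temporal target; falls back to `default` before any success."""
        return self.best_steps.get(config, default)

    def imitation_weights(self):
        """Assumed weighting: target / steps, so the fastest success gets weight 1.0."""
        return [self.best_steps[c] / s for c, s in self.successes]

buffer = TemporalTargetBuffer()
buffer.add_episode("drawer-left", steps=120, success=True)
buffer.add_episode("drawer-left", steps=80, success=True)
buffer.add_episode("drawer-left", steps=60, success=False)  # failures never set targets
buffer.add_episode("drawer-right", steps=200, success=True)

print(buffer.target("drawer-left", default=300))   # 80
print(buffer.target("drawer-top", default=300))    # 300, no success yet
print(buffer.imitation_weights())
```

In this toy version the target for a configuration can only tighten over time, and slower successes stay in the buffer with smaller weights instead of being discarded.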
The experimental evaluation is broad, with the authors reporting consistent improvements across "15 distinct long-horizon manipulation tasks." This breadth is crucial for demonstrating the generalizability and robustness of the TSIL framework beyond specific, hand-picked scenarios. The metrics of interest (learning efficiency, task-completion efficiency, revisitation of fast successful behaviors, and robustness to unstable training conditions) are all highly relevant for practical robot learning. The abstract's claim that TSIL "consistently improves" these metrics points to repeatable gains, although the abstract alone does not establish statistical significance or effect sizes. If these claims hold, the empirical evidence strongly supports the method's effectiveness and practical utility, making it a significant contribution to the field.
The mention of a project URL (`https://generalroboticslab.com/TSIL`) is a strong positive indicator for reproducibility. Project pages often include code implementations, detailed experimental setups, datasets, and potentially pre-trained models or videos, which are essential for researchers to verify and build upon the work. The structured nature of the paper (Method, Experiments sections) also implies a detailed description of the algorithm and experimental protocols.
While the paper presents a very strong case, potential limitations might include:

1. **Initial Success Requirement:** TSIL relies on mining "fast successful trajectories." If initial task success is extremely rare or non-existent, the method might struggle to bootstrap.
2. **Computational Overhead:** Mining, storing, and adaptively managing a growing set of efficient trajectories, especially in high-dimensional state spaces, could introduce computational overhead.
3. **Definition of "Configuration-conditioned":** The complexity of defining and implementing "configuration-conditioned" targets might vary significantly with the task and state representation, potentially requiring careful engineering.
4. **Generalizability beyond temporal efficiency:** While temporal efficiency is critical, some tasks might have other primary optimization criteria (e.g., energy consumption, safety, precision) that TSIL, in its current form, might not directly optimize.