Last 14 Days (April 25 – May 08, 2026)
Every document format in existence was designed for a human reader moving linearly through text. Autonomous LLM agents do not read; they retrieve. This fundamental mismatch forces agents to inject entire documents into their context window, wasting tokens on irrelevant content, compounding state across multi-turn loops, and broadcasting information indiscriminately across agent roles. We argue this is not a prompt engineering problem, not a retrieval problem, and not a compression problem: it is a format problem. We introduce OBJECTGRAPH (.og), a file format that reconceives the document as a typed, directed knowledge graph to be traversed rather than a string to be injected. OBJECTGRAPH is a strict superset of Markdown (every .md file is a valid .og file), requires no infrastructure beyond a two-primitive query protocol, and is readable by both humans and agents without tooling. We formalize the Document Consumption Problem, characterize six structural properties no existing format satisfies simultaneously, and prove OBJECTGRAPH satisfies all six. We further introduce the Progressive Disclosure Model, the Role-Scoped Access Protocol, and Executable Assertion Nodes as native format primitives. Empirical evaluation across five document classes and eight agent task types demonstrates up to 95.3% token reduction with no statistically significant degradation in task accuracy (p > 0.05). Transpiler fidelity reaches 98.7% content preservation on a held-out document benchmark.
Primary: Open Gigantic
All Institutions: Open Gigantic
ObjectGraph has the potential for significant broader impact across several dimensions:

1. **Cost and Efficiency**: The dramatic reduction in token consumption (up to 95.3%) and mitigation of context compounding (36.5x reduction) can substantially lower the operational costs of LLM agents and enable more complex, multi-turn workflows within existing context window limits.
2. **Agent Capabilities**: By providing structured, queryable knowledge, ObjectGraph can enhance agent reasoning, planning, and execution capabilities, leading to more reliable and autonomous agents.
3. **System Simplification**: The "ObjectGraph as Infrastructure" concept is powerful. Role-scoped access control, executable assertions, and delta loading natively within the document format can eliminate the need for external middleware, validation prompt templates, and change tracking systems, simplifying the architecture of multi-agent deployments.
4. **Human-Agent Collaboration**: Being a strict superset of Markdown, ObjectGraph allows both humans and agents to interact with the same source document, reducing maintenance overhead and fostering better alignment between human-authored instructions and agent execution.
5. **Knowledge Management**: It offers a more robust framework for managing agent knowledge bases, enabling features like automated staleness detection and structured updates.
6. **New Paradigm for Documents**: This work challenges the fundamental assumption of linear document consumption, proposing a new paradigm for how information is structured and accessed in the agentic era. If widely adopted, it could lead to a new ecosystem of tools and practices for agent-native content creation and consumption.

This paper introduces ObjectGraph, a novel file format that re-imagines documents as typed knowledge graphs for LLM agents, achieving up to 95.3% token reduction and significant context compounding mitigation without degrading task accuracy.
The work presents a comprehensive, well-designed solution to a fundamental problem in LLM agent deployment, offering a paradigm shift in document consumption that promises to enhance agent efficiency, capabilities, and simplify multi-agent system architectures.
The paper introduces ObjectGraph (.og), a novel file format designed to address the "Document Consumption Problem" for LLM agents. The core methodology reconceives documents as typed, directed knowledge graphs rather than linear text strings. The authors formalize this problem and derive six structural properties (Query-Addressable Index, Layered Compression, Typed Dependency Graph, Role-Scoped Access Control, Executable Assertions, Human Readability) that existing formats fail to satisfy simultaneously. ObjectGraph is presented as a strict superset of Markdown, ensuring backward compatibility. Key methodological components include:

1. **ObjectGraph Format Specification**: A detailed structure comprising a file-level manifest (meta, index, changelog blocks) and atomic knowledge units (nodes). Nodes are typed containers with stable identifiers, scope annotations, confidence scores, and versioning metadata. Content-type tags (e.g., `code`, `steps`, `warning`) provide explicit semantic meaning beyond visual cues.
2. **Progressive Disclosure Model (PDM)**: A three-pass reading model (Index, Dense, Full) that enables agents to retrieve only relevant information at the necessary fidelity level, significantly reducing token consumption.
3. **Typed Edge Declarations**: Explicit, machine-traversable relationships between nodes (e.g., `:requires`, `:precedes`, `:see-also`), allowing automatic dependency resolution.
4. **Role-Based Access Control**: The `scope` attribute on nodes and index entries enables content filtering at the format level, eliminating the need for external middleware in multi-agent systems.
5. **Executable Assertion Nodes**: Validation logic, retry mechanisms, and escalation paths embedded directly within the document, triggered by the query protocol.
6. **Delta Loading via Changelog**: A `__changelog` meta-node facilitates incremental document updates, reducing the cost of checking for changes.
7. **LLM-Native Query Protocol**: A minimal two-primitive interface (`search_index`, `resolve_context`) that leverages the LLM itself as a "Router" for semantic index search, rather than relying on traditional keyword matching or embeddings. This is a particularly clever design choice.
8. **Transpiler**: A hybrid Markdown-to-ObjectGraph transpiler that uses deterministic parsers for content extraction and bounded LLM calls for metadata synthesis (dense blocks, index keywords), ensuring high fidelity and bounding hallucination risk.

The methodology is comprehensive, well-articulated, and addresses the identified problems systematically. The design choices, such as the Markdown superset and LLM-as-Router, are pragmatic and innovative.
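The two-primitive protocol is concrete enough to sketch. The toy implementation below is an illustration only: the node schema, the keyword-based index matching, and the disclosure levels are assumptions drawn from the description above, not the paper's reference implementation (which routes the index search through an LLM "Router" rather than keyword matching).

```python
# Minimal sketch of the two-primitive ObjectGraph query protocol.
# The node schema and matching rules are illustrative assumptions,
# not the paper's reference implementation.

NODES = {
    "deploy-steps": {
        "type": "steps",
        "scope": ["operator"],          # role-scoped access control
        "keywords": ["deploy", "release"],
        "requires": ["env-setup"],      # typed edge, in the spirit of :requires
        "dense": "3-step deploy: build, push, restart.",
        "full": "1. Build the image.\n2. Push to registry.\n3. Restart the service.",
    },
    "env-setup": {
        "type": "steps",
        "scope": ["operator", "reviewer"],
        "keywords": ["environment", "setup"],
        "requires": [],
        "dense": "Export required env vars before deploying.",
        "full": "Set DB_URL and API_KEY in the shell environment.",
    },
}

def search_index(query_terms, role):
    """Primitive 1: ids of nodes matching the query and visible to `role`."""
    return [
        nid for nid, n in NODES.items()
        if role in n["scope"] and any(t in n["keywords"] for t in query_terms)
    ]

def resolve_context(node_id, role, level="dense"):
    """Primitive 2: load a node at the requested disclosure level,
    pulling in its required dependencies automatically."""
    node = NODES[node_id]
    if role not in node["scope"]:
        return None  # filtered at the format level, no middleware needed
    deps = [resolve_context(d, role, level) for d in node["requires"]]
    return "\n".join([d for d in deps if d] + [node[level]])

hits = search_index(["deploy"], role="operator")
print(hits)                                  # ['deploy-steps']
print(resolve_context(hits[0], "operator"))  # dependency first, then the node
```

Only the dense layers of the two relevant nodes enter the context, rather than the whole document, which is the mechanism behind the reported token reductions.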
The empirical evaluation is robust and addresses key research questions effectively.

1. **Corpus**: A benchmark of 240 documents across five classes (Skill Files, Operational Runbooks, Execution Plans, Technical Documentation, Knowledge Bases), ranging from 200 to 15,000 tokens, provides a diverse testbed.
2. **Task Suite**: Eight distinct task types (information lookup, procedure execution, multi-step planning, role-conditional access, cross-node reasoning, update detection, assertion verification, multi-agent handoff) cover a broad range of agent interactions.
3. **Models & Baselines**: Evaluation uses Claude Sonnet 4.5 (primary), Claude Haiku 4.5 (Router), and GPT-4o (cross-model validation). Baselines include full Markdown injection, RAG (text-embedding-3-large), and SkillReducer-optimized Markdown.
4. **RQ1: Token Consumption**: ObjectGraph reduced mean token consumption from 2,340 to 187 tokens (92.0% average, up to 95.3%), demonstrating significant cost savings.
5. **RQ2: Context Compounding Reduction**: In a 5-turn workflow, ObjectGraph (Architecture B) reduced cumulative token cost by 36.5x compared to Markdown (46,000 vs. 1,260 tokens), effectively mitigating the super-linear growth of context.
6. **RQ3: Task Accuracy**: ObjectGraph matched or exceeded Markdown accuracy on 7 of 8 task types. Notably, it showed dramatic improvements on role-conditional access (+18.4%) and update detection (+30.1%), tasks where Markdown lacks native support. The "less-is-more" effect, where reduced context improves accuracy by reducing attention dilution, is a significant finding.
7. **RQ4: Transpiler Fidelity**: The transpiler achieved a mean fidelity of 0.987 (SD = 0.018) on 180 held-out documents, ensuring high content preservation.
8. **RQ5: Human Authoring Burden**: A user study with 18 participants rated authoring burden as low (mean 2.8/7), suggesting good usability for human authors.
9. **Ablation Study**: An ablation study clearly demonstrated the individual contributions of different ObjectGraph features to token reduction, providing valuable insights into the design's effectiveness.

The experimental setup is comprehensive, accuracy shows no statistically significant degradation relative to the full-document baseline (p > 0.05), and the findings strongly support the claims of the paper.
The paper provides a detailed specification of the ObjectGraph format, including its structure, node types, edge syntax, and query protocol. The LLM prompt template for metadata synthesis is explicitly provided. The algorithms for structural extraction and the query protocol are outlined. While no direct code repository or dataset links are provided, the level of detail in the format specification and methodology sections is high enough that a motivated researcher could likely implement the format and protocol. The benchmark corpus is described in terms of document classes and token ranges, but the specific documents are not publicly available. The LLM models used are identified. Overall, the paper offers a strong foundation for reproducibility, though direct code access would enhance it further.
The authors acknowledge several limitations:

1. **Scale**: The benchmark of 240 documents, while curated, may not fully represent the diversity of real-world enterprise-scale corpora.
2. **Cross-file Federation**: The current specification does not support cross-file edge resolution, limiting its applicability to mono-repo or single-domain knowledge bases. This is a significant limitation for truly distributed knowledge graphs.
3. **Standardisation**: Without a standards body or broad community adoption, the format risks fragmentation into incompatible dialects.
4. **Adversarial Inputs**: The evaluation did not consider adversarial document authors who might craft misleading `dense` blocks or `index` entries to manipulate agent routing.

An additional minor limitation is the reliance on LLMs for routing, which, while a feature, could introduce its own challenges (e.g., prompt engineering for optimal routing, potential misinterpretation if the index is poorly crafted).
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
Primary: Friedrich-Alexander-Universität Erlangen-Nürnberg
All Institutions: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen National High Performance Computing Center, Institute of Radiology, University Hospital Erlangen, Lab for AI in Medicine, RWTH Aachen University, Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Chair of Computer Science 10
This paper has significant broader impact, particularly for the development and deployment of AI in high-stakes domains like medicine. It fundamentally challenges the prevailing assumption that scaling LLMs (larger models, longer contexts, more compute) automatically leads to safer behavior. The finding that evidence quality is paramount and that safety and accuracy decouple will necessitate a paradigm shift in how clinical LLMs are evaluated and deployed. It highlights the critical need for multi-dimensional safety metrics beyond accuracy, including high-risk error, contradiction, and dangerous overconfidence. The identification of "synchronized failure" in ensembles is a crucial warning for system designers relying on model agreement for robustness. The paper provides a valuable framework (SaFE-Scale) and benchmark (RadSaFE-200) that can guide future research and development towards truly safe and reliable clinical AI systems. Its insights are also relevant to other high-stakes applications of LLMs where confident, high-risk errors are unacceptable. This study rigorously demonstrates that clinical LLM safety is not a passive consequence of scaling but a deployment property critically shaped by evidence quality, retrieval design, and context construction, often decoupling from accuracy. The paper introduces SaFE-Scale, a novel framework, and RadSaFE-200, a benchmark with clinician-defined multi-dimensional safety labels, to empirically show that clean evidence dramatically improves both accuracy and safety, while model scale, retrieval, and inference-time compute offer limited or even misleading safety gains, particularly due to unreliable confidence and synchronized failures in ensembles. This comprehensive analysis provides crucial insights for developing and deploying safer LLMs in high-stakes clinical environments, urging a shift from accuracy-centric evaluation to explicit safety-focused monitoring of high-risk errors.
The paper introduces SaFE-Scale, a well-structured framework for evaluating clinical LLM safety across various scaling dimensions. This framework is instantiated with RadSaFE-200, a novel benchmark of 200 multiple-choice radiology questions. A key methodological strength is the clinician-defined, multi-dimensional safety labels at the option level: high-risk error, unsafe answer, and evidence contradiction. This moves beyond simple accuracy to capture the nuanced risks in clinical settings. The experimental design is comprehensive, evaluating 34 diverse LLMs across six deployment conditions (closed-book, clean evidence, conflict evidence, standard RAG, agentic RAG, max-context prompting) and additional inference-time compute strategies (self-consistency, ensembling). The use of Radiopaedia as an external evidence source for RAG is appropriate for the radiology domain. The metrics chosen (high-risk error rate, unsafe answer rate, contradiction rate, dangerous overconfidence rate, alongside accuracy) are directly relevant to clinical safety. The variance decomposition analysis to quantify the contributions of model family vs. deployment condition is a robust statistical approach. The worst-case analysis at the question level further strengthens the methodology by identifying specific, recurrent failure modes.
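Given option-level labels of the kind described above, the safety metrics reduce to simple counting over each model's chosen options. The sketch below is illustrative: the label schema and the confidence threshold for "dangerous overconfidence" are assumptions, not RadSaFE-200's released format.

```python
# Illustrative computation of option-level safety metrics. The per-option
# label schema and the 0.8 confidence threshold are hypothetical, not the
# exact RadSaFE-200 format.

questions = [
    {  # one MCQ: the model's chosen option, its labels, and confidence
        "chosen": "B",
        "correct": "A",
        "labels": {"B": {"high_risk": True, "contradicts_evidence": False}},
        "confidence": 0.92,
    },
    {
        "chosen": "A",
        "correct": "A",
        "labels": {"A": {"high_risk": False, "contradicts_evidence": False}},
        "confidence": 0.88,
    },
]

def safety_metrics(questions, conf_threshold=0.8):
    n = len(questions)
    acc = sum(q["chosen"] == q["correct"] for q in questions) / n
    hre = sum(q["labels"][q["chosen"]]["high_risk"] for q in questions) / n
    contra = sum(q["labels"][q["chosen"]]["contradicts_evidence"] for q in questions) / n
    # dangerous overconfidence: a high-risk error made with high confidence
    doc = sum(
        q["labels"][q["chosen"]]["high_risk"] and q["confidence"] >= conf_threshold
        for q in questions
    ) / n
    return {"accuracy": acc, "high_risk_error": hre,
            "contradiction": contra, "dangerous_overconfidence": doc}

print(safety_metrics(questions))
# accuracy 0.5, high-risk error 0.5, contradiction 0.0, overconfidence 0.5
```

The point of the option-level labels is visible even in this toy example: the two models' answers are equally wrong by accuracy alone, but only the label-aware metrics reveal that one of the errors is confidently high-risk.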
The experimental evaluation is exceptionally thorough and rigorous. The study's scale, involving 34 LLMs from various families and sizes, provides a broad and representative assessment of current LLM capabilities. The comparison across six distinct deployment conditions is critical for understanding how practical choices impact safety. The results consistently demonstrate that evidence quality, specifically clinician-written clean evidence, is the most dominant factor for both accuracy and safety, far outweighing model scale or inference-time compute. This is a significant empirical finding. The decoupling of accuracy and safety is clearly illustrated, with agentic RAG improving accuracy but not necessarily safety. The analysis of confidence as an unreliable safety signal, with high confidence observed even in high-risk errors, is a crucial and concerning finding. The investigation into self-consistency and ensembling reveals their limited safety gains and introduces the important concept of "synchronized failure" in ensembles, where multiple models make the same high-risk error. The worst-case analysis effectively highlights that critical failures are not random but concentrate in specific, challenging questions, which is highly valuable for targeted mitigation efforts. The statistical analysis, including variance decomposition, supports the conclusions robustly.
The paper states that "Full prompt templates, output-format instructions, and inference protocols are provided in Supplementary Note [REF]". This commitment to detailing the experimental setup is a strong indicator of reproducibility. The RadSaFE-200 benchmark is intended for public release, albeit with source-specific redistribution restrictions for some components, which is understandable given the use of copyrighted material like RSNA Case Collection and Radiopaedia. The detailed description of benchmark construction, safety augmentation protocol, and model panel specifications further aids reproducibility. While no direct code repository URL is provided in the text, the level of detail suggests that the experiments could be replicated by other researchers with sufficient effort and access to the benchmark.
The authors provide a comprehensive and transparent discussion of limitations. These include:

1. **Benchmark Scope:** The text-based, multiple-choice format does not capture the full complexity of radiology practice (image interpretation, open-ended reasoning, multimodal aspects).
2. **Benchmark Size:** 200 questions, while curated, may be insufficient for highly granular subgroup analyses.
3. **Question Balance:** The benchmark is primarily diagnostic/classification-oriented, reflecting Radiopaedia case structures, and not fully balanced across all question types.
4. **Subjectivity of Safety Labels:** Clinician-defined labels, while informed by rules, involve clinical judgment and implicit assumptions, especially for technical, physics, radiation therapy, and negation-type questions. Future work should include multiple annotators and inter-rater agreement.
5. **Null Responses:** Final null responses were scored as incorrect but not assigned safety labels, potentially underestimating option-level safety failures.
6. **Controlled Evidence:** Clean and conflict evidence are experimental constructs; real-world RAG evidence can be noisier, redundant, or irrelevant in more complex ways.
7. **Specific Implementations:** The RAG and agentic RAG implementations are specific choices; other methods might yield different safety profiles.
8. **Confidence Measurement:** Confidence was derived from entropy-normalized repeated-sampling stability, not calibrated token probabilities, limiting its interpretation as a full calibration study.
9. **Inference-time Compute:** Self-consistency and ensemble experiments were targeted, not exhaustive, leaving room for more advanced aggregation methods.
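The confidence measure noted above (entropy-normalized repeated-sampling stability) admits a simple instantiation: sample the model several times, form the empirical answer distribution, and map its normalized entropy onto [0, 1]. The exact formula below is an assumption for illustration, not the paper's definition.

```python
import math
from collections import Counter

def stability_confidence(samples, num_options=4):
    """Confidence from repeated sampling: 1 - H(answers) / H_max.
    Returns 1.0 when all samples agree, 0.0 when answers are uniform
    over the options. One plausible normalization, not the paper's."""
    counts = Counter(samples)
    k = len(samples)
    entropy = -sum((c / k) * math.log2(c / k) for c in counts.values())
    h_max = math.log2(num_options)
    return 1.0 - entropy / h_max

print(stability_confidence(["A"] * 8))                 # 1.0: perfectly stable
print(stability_confidence(["A", "B", "C", "D"] * 2))  # 0.0: maximally unstable
```

As the authors caution, a measure of this kind captures answer stability, not calibrated probability, which is why it cannot stand in for a full calibration study.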
Recent megakernel designs for Mixture-of-Experts (MoE) inference fuse expert computation with fine-grained, GPU-initiated communication into a single persistent GPU kernel, and outperform collective-based MoE on a single node by overlapping data transfer with compute at tile granularity. This benefit does not carry over cleanly to multi-node inference, where experts span many nodes connected by an RDMA fabric. Communication-bound MoE models regress by up to $10\times$ on 8 nodes, and the regression worsens with node count. We trace this regression to hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline, and its cost grows with the number of concurrent transfers. As a result, models whose per-expert compute is too small to absorb this inflated network latency expose communication on the critical path. We present \emph{Perseus}, which eliminates this serialization through two techniques. \emph{Decoupled signaling} batches fences at per-destination granularity, reducing fence count by $8\times$. \emph{NIC-side ordering} replaces proxy stalls with hardware fence flags, so the proxy never blocks. On proxy-based transports, Perseus achieves up to 10.3$\times$ end-to-end speedup. Perseus on IBRC matches or exceeds IBGDA GPU-direct by up to 1.2$\times$, which shows that serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance.
Primary: Cornell University
All Institutions: Cornell University
Perseus has significant broader impact for the field of large-scale machine learning and distributed systems:

- **Enables larger MoE models:** By effectively addressing a critical scaling bottleneck, Perseus allows MoE models to be deployed and inferred across many nodes with much higher efficiency, pushing the boundaries of what's possible with LLMs.
- **Challenges conventional wisdom:** The finding that optimized proxy-based RDMA can outperform GPU-direct RDMA is a major insight that could shift design paradigms for distributed ML systems, potentially simplifying development by making proxy-based approaches more viable.
- **Influences future hardware/software design:** The identification of hidden serialization and the effectiveness of NIC-side ordering could inform future designs of network interface cards (NICs) and communication libraries, encouraging more hardware-level support for flexible ordering and completion signaling.
- **Applicability beyond MoE:** The principles of eliminating hidden serialization in fine-grained, GPU-initiated communication could be beneficial for other distributed workloads that exhibit similar communication patterns.

This paper identifies a critical, previously hidden performance bottleneck in multi-node megakernel communication for MoE inference, deeply analyzes its root cause, and proposes elegant and highly effective solutions that yield up to 10.3x speedup and challenge conventional wisdom regarding proxy-based vs. GPU-direct RDMA. The work is technically profound, experimentally rigorous, and has significant implications for scaling large language models and distributed systems research.
The paper identifies a critical and previously "hidden" serialization bottleneck in multi-node megakernel communication for Mixture-of-Experts (MoE) inference, specifically within proxy-based RDMA transports. The core insight is that the ordering requirement between each fine-grained tile transfer and its completion signal (a doorbell write) forces a `wmb` (write memory barrier) on the CPU-side proxy. This `wmb` drains the NIC pipeline, and its cost grows with the number of concurrent transfers, leading to significant performance regression for communication-bound MoE models. Perseus proposes two technically sound and elegant solutions:

1. **Decoupled Signaling:** This technique batches multiple tile transfers before issuing a single doorbell write and its associated `wmb`. By reducing the number of `wmb`s by up to 8x, it significantly mitigates the serialization overhead. The GPU manages completion tracking for these batches.
2. **NIC-side Ordering:** This more fundamental solution leverages RDMA write with immediate (`IBV_WR_RDMA_WRITE_WITH_IMM`). By embedding the completion signal (the immediate value) in the same RDMA operation as the data transfer, the NIC inherently guarantees ordering. This completely eliminates the need for a CPU-side `wmb`, allowing the proxy to never block. This is a particularly clever use of existing hardware capabilities to solve a software-induced serialization problem.

The methodology is robust, clearly dissecting the problem, proposing targeted solutions, and explaining their mechanisms in detail.
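The effect of decoupled signaling on fence count can be illustrated with a toy bookkeeping model: per-tile signaling issues one `wmb` per transfer, while per-destination batching issues one per flushed batch. This is pure arithmetic, not an RDMA implementation, and the batch size of 8 is chosen here only to mirror the paper's reported 8x reduction.

```python
# Toy model of fence (wmb) counts under two signaling schemes.
# Purely illustrative bookkeeping, not an RDMA implementation.

def fences_per_tile(transfers):
    """Baseline: each tile transfer is followed by its own doorbell + wmb."""
    return len(transfers)

def fences_decoupled(transfers, batch=8):
    """Decoupled signaling: group transfers per destination and issue one
    doorbell + wmb per batch of `batch` tiles (plus one for any remainder)."""
    per_dest = {}
    for dest, _tile in transfers:
        per_dest[dest] = per_dest.get(dest, 0) + 1
    return sum(-(-n // batch) for n in per_dest.values())  # ceil division

# 4 destination nodes x 16 tiles each = 64 transfers
transfers = [(d, t) for d in range(4) for t in range(16)]
print(fences_per_tile(transfers))   # 64 fences
print(fences_decoupled(transfers))  # 8 fences: an 8x reduction in this toy case
```

Because each fence stalls the proxy until the NIC pipeline drains, shrinking the fence count directly shrinks the serialized portion of the critical path; NIC-side ordering then removes the remaining fences entirely.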
The experimental evaluation is comprehensive and rigorous.

- **Setup:** Experiments are conducted on a realistic multi-node cluster (8 nodes, 16 A100 GPUs) connected by an InfiniBand HDR fabric, which is highly relevant for large-scale ML deployments.
- **Baselines:** The evaluation compares Perseus against strong baselines: IBRC (proxy-based RDMA, representing the problematic baseline) and IBGDA (GPU-direct RDMA, often considered the gold standard for high-performance communication).
- **Workloads:** Real-world MoE models, including Switch Transformer (1.6B, 2.3B, 137B parameters) and GShard (600M), are used, demonstrating the practical applicability of the solution.
- **Key Results:**
  - Perseus on IBRC achieves up to 10.3x end-to-end speedup over the baseline IBRC, a truly remarkable improvement.
  - Crucially, Perseus on IBRC matches or even exceeds IBGDA (GPU-direct) by up to 1.2x. This is a surprising and highly impactful finding, challenging the conventional wisdom that GPU-direct is inherently superior to proxy-based approaches for fine-grained communication. It demonstrates that serialization, not the choice of transport mechanism, was the primary bottleneck.
  - The paper provides a clear breakdown of the individual contributions of Decoupled Signaling and NIC-side Ordering, showing how they incrementally contribute to the overall speedup.
  - Microbenchmarks confirm the reduction in fence latency, validating the underlying hypothesis.
  - A sensitivity analysis over expert size further clarifies when Perseus provides the most benefit (models with smaller per-expert compute, where communication is more exposed).

The results are convincing, well-supported, and clearly demonstrate the effectiveness and significance of Perseus.
The paper provides a detailed description of the problem, the proposed solutions, and their implementation within a modified UCX transport layer. The experimental setup, including hardware specifications and workloads, is also well-documented. While the source code is not provided (common for arXiv preprints), the level of detail should enable skilled systems researchers to reproduce the core ideas and potentially the results, given access to similar hardware.
- **Specificity to RDMA/InfiniBand:** The solutions are tailored to the specifics of RDMA transports and the `wmb` behavior in proxy-based communication. While the underlying principle of serialization might exist in other network fabrics, the exact solutions might not directly apply without adaptation.
- **Generalizability to other communication patterns:** While the paper suggests applicability to other fine-grained, GPU-initiated communication, the primary focus and evaluation are on MoE's all-to-all pattern. Its effectiveness for other patterns would need further investigation.
- **Overhead for extremely small messages:** While MoE tiles are fine-grained, they are not extremely tiny. For truly byte-level communication, the batching overhead of decoupled signaling or the immediate-value processing might introduce new trade-offs, though this is not the target use case.
Perseus has significant broader impact for the field of large-scale machine learning and distributed systems:

- **Enables larger MoE models:** By effectively addressing a critical scaling bottleneck, Perseus allows MoE models to be deployed and inferred across many nodes with much higher efficiency, pushing the boundaries of what's possible with LLMs.
- **Challenges conventional wisdom:** The finding that optimized proxy-based RDMA can outperform GPU-direct RDMA is a major insight that could shift design paradigms for distributed ML systems, potentially simplifying development by making proxy-based approaches more viable.
- **Influences future hardware/software design:** The identification of hidden serialization and the effectiveness of NIC-side ordering could inform future designs of network interface cards (NICs) and communication libraries, encouraging more hardware-level support for flexible ordering and completion signaling.
- **Applicability beyond MoE:** The principles of eliminating hidden serialization in fine-grained, GPU-initiated communication could be beneficial for other distributed workloads that exhibit similar communication patterns.

This paper identifies a critical, previously hidden performance bottleneck in multi-node megakernel communication for MoE inference, deeply analyzes its root cause, and proposes elegant and highly effective solutions that yield up to 10.3x speedup and challenge conventional wisdom regarding proxy-based vs. GPU-direct RDMA. The work is technically profound, experimentally rigorous, and has significant implications for scaling large language models and distributed systems research.
Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
Primary: Unknown
All Institutions: Unknown
UniPool introduces a novel Mixture-of-Experts architecture that replaces layer-private expert ownership with a globally shared expert pool, demonstrating consistent performance improvements and significant parameter efficiency by enabling sublinear expert parameter growth with depth. This paper presents a well-motivated architectural innovation, supported by rigorous experiments across multiple scales, thorough ablation studies, and insightful analyses, offering a compelling new direction for scaling MoE models more efficiently.
The methodology for UniPool is robust, well-motivated, and addresses a critical architectural limitation in modern Mixture-of-Experts (MoE) models. The core idea of replacing rigid per-layer expert ownership with a single, global shared expert pool is a significant architectural departure. This is directly motivated by an empirical routing probe showing redundancy in deeper layers of vanilla MoE models. To enable stable and balanced training under this shared paradigm, the paper introduces two key technical components: a novel pool-level auxiliary loss that ensures balanced expert utilization across the entire global pool, and the adoption of NormRouter for sparse and scale-stable routing. The derivation of the pool-level auxiliary loss is clearly presented, and its necessity is well-justified by the global ownership structure. NormRouter's properties, such as L2 normalization and ReLU activation, are well-suited for routing into a larger, shared expert set where hidden state norms and logit scales might vary across layers. The overall approach effectively converts depth-induced redundancy into architectural reuse, decoupling the total expert parameter count from linear growth with depth.
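The shared-pool routing and pool-level balance loss described above can be sketched in plain Python. This is a minimal stand-in, not the paper's implementation: the sizes are made up, the expert FFNs themselves are omitted, and the loss uses a Switch-style fraction-times-probability form aggregated pool-wide across all layers, which may differ in detail from the paper's exact auxiliary loss.

```python
import math, random

random.seed(0)
D, POOL, TOPK, LAYERS = 8, 6, 2, 3  # hypothetical sizes

# Independent per-layer routers, all indexing the SAME global expert pool:
routers = [[[random.gauss(0, 1) for _ in range(D)] for _ in range(POOL)]
           for _ in range(LAYERS)]

def norm_router(h, W):
    """NormRouter stand-in: L2-normalize the hidden state, take linear
    logits, apply ReLU for sparsity, and select the top-k pool experts."""
    n = math.sqrt(sum(x * x for x in h)) or 1.0
    h = [x / n for x in h]
    logits = [max(0.0, sum(w * x for w, x in zip(row, h))) for row in W]
    top = sorted(range(POOL), key=lambda e: -logits[e])[:TOPK]
    return top, logits

def pool_aux_loss(tokens):
    """Pool-LEVEL balance loss: routing fractions and logit mass are
    accumulated over every layer before balancing, so utilization is
    equalized across the whole shared pool rather than per layer."""
    counts, probs, total = [0] * POOL, [0.0] * POOL, 0
    for h in tokens:
        for layer in range(LAYERS):
            top, logits = norm_router(h, routers[layer])
            z = sum(logits) or 1.0
            for e in top:
                counts[e] += 1
            for e in range(POOL):
                probs[e] += logits[e] / z
            total += 1
    return POOL * sum((c / (total * TOPK)) * (p / total)
                      for c, p in zip(counts, probs))

tokens = [[random.gauss(0, 1) for _ in range(D)] for _ in range(16)]
loss = pool_aux_loss(tokens)
```

The key structural point is that `counts` and `probs` are indexed by the global pool, not by a (layer, expert) pair, which is what decouples the expert-parameter count from depth.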
The experimental evaluation is comprehensive and rigorously conducted for the chosen scales. The authors train five LLaMA-architecture models ranging from 182M to 978M parameters on 30B tokens from the Pile dataset, providing a solid foundation for their claims. Crucially, UniPool is compared against vanilla MoE baselines that are matched in total expert FFNs and per-token FLOPs, ensuring that performance gains are attributable to the architectural changes rather than increased compute. UniPool consistently outperforms vanilla MoE in validation loss and perplexity across all tested scales, with significant reductions (up to 0.0386). The most impactful experimental finding is the performance of "reduced-pool" UniPool variants, which achieve comparable or superior performance to vanilla MoE using only 41.6%-66.7% of the expert parameters, demonstrating substantial parameter efficiency. The ablation studies are thorough, clearly isolating the contributions of the shared pool, pool-level auxiliary loss, and NormRouter, showing their synergistic effects. Further analyses, including a routing-randomization probe, effectively demonstrate that UniPool's routers become more "load-bearing" and experts more specialized, supporting the core hypothesis of reduced redundancy. Downstream zero-shot evaluations on seven benchmarks generally show UniPool performing on par or slightly better, indicating that perplexity gains translate to some task-level benefits. Training dynamics presented in the appendix further confirm consistent gains throughout optimization.
The paper demonstrates a strong commitment to reproducibility. The code for UniPool is open-sourced on GitHub, which is a critical factor. Detailed architectural specifications, MoE configurations, and complete hyperparameter settings are provided in the appendix, allowing for replication of the experiments. The authors also perform variance checks for the 182M model scale by averaging results over three random seeds, which adds confidence to the findings. Implementation details, such as the use of Megatron-LM, specific optimizer settings (AdamW, cosine LR, bf16), and distributed training strategies (sequence parallelism, distributed optimizer, activation checkpointing), are clearly stated.
The authors candidly acknowledge several limitations. The primary limitation is the scale of experiments, which are conducted up to 978M parameters and 30B training tokens. While consistent improvements across these scales are encouraging, validation at billion-parameter scales with longer training horizons is an important next step. The paper also notes that wall-clock throughput comparisons are not reported. While reduced-pool UniPool variants offer memory savings, the full-pool UniPool has the same expert parameter count as vanilla MoE, and potential overheads from the pool auxiliary loss (cross-layer statistic accumulation) and routing into a larger candidate pool are identified as areas for future work. Finally, the authors suggest that a broader downstream evaluation, including few-shot settings, would further strengthen the findings.
UniPool has significant broader impact potential for the design and scaling of large language models. By challenging the conventional per-layer expert ownership in MoE architectures, it introduces a more parameter-efficient paradigm where expert capacity can be treated as a reusable global budget. The demonstration that expert parameters can grow sublinearly with depth while improving performance offers a fundamental shift in MoE scaling laws, potentially leading to the development of larger, more capable models with reduced computational and memory footprints for their expert components. This work provides a valuable architectural blueprint and a methodological approach for identifying and addressing redundancy through targeted parameter sharing, which could inspire similar innovations in other complex neural network architectures.
We introduce PALACE (Persistence Adaptive-Landmark Analytic Classification Engine), the data-adaptive companion to PLACE, paying only a small cross-validation tier on three knobs (budget, radii, bandwidth; $\leq 5$ choices each). A cover-theoretic core (a Lebesgue-number criterion on the landmark cover) yields four closed-form guarantees. (i) A structural lower distortion bound $\lambda(\tau;\nu)$ on $\mathcal{D}_n$ under cross-diagram non-interference, with a $(D/L)^2$ budget reduction over the uniform grid when diagrams concentrate. (ii) Equal weights $w_k = K^{-1/2}$ maximizing $\lambda$, and farthest-point-sampling positions $2$-approximating the optimal $k$-center covering radius; both derived from training labels alone, with no gradient training. (iii) A kernel-RKHS classification rate $O((k-1)\sqrt{K}/(\gamma\sqrt{m_{\min}}))$ with binary necessity threshold $m = \Omega(\sqrt{K}/\gamma)$ from a matching Le Cam lower bound, and a closed-form filtration-selection rule. The kernel-Mahalanobis margin $\hat\rho_{\mathrm{Mah}}$ is the strongest closed-form ranker across the chemical-graph pool (mean Spearman $\rho \approx +0.60$); the isotropic surrogate $\hat\gamma/\sqrt{K}$ admits a selection-consistency rate, and $\widehat{\lambda}$ from (i) provides an independent data-level signal (positive on COX2 and PTC). (iv) A per-prediction certificate, in non-asymptotic Pinelis and asymptotic Gaussian forms, with no calibration split. Empirically, PALACE is the strongest closed-form diagram-based method on Orbit5k ($91.3 \pm 1.0\%$, matching Persformer), leads every diagram-based competitor on COX2 and MUTAG, and is competitive on DHFR (within 1 pp of ECP). At $8\times$ domain inflation, adaptive placement maintains $94\%$ accuracy while the uniform grid collapses to chance ($25\%$ on 4-class data).
Primary: not specified
All Institutions: not specified
PALACE has significant broader impact potential, particularly in domains requiring trustworthy and interpretable machine learning.

* **Certified AI**: The per-prediction certificates are a major step towards certified AI, offering quantifiable confidence in individual predictions. This is critical for high-stakes applications in medicine, materials science, and security where TDA is increasingly used.
* **Topological Data Analysis**: It advances the field of TDA by providing a principled, data-adaptive, and theoretically grounded method for persistence diagram vectorization and classification. It offers a strong alternative to purely black-box deep learning approaches, especially for researchers who prioritize mathematical guarantees and interpretability.
* **Graph and Point Cloud Learning**: By improving classification on graph and point cloud data, PALACE can benefit various applications in chemistry (drug discovery), materials science, computer graphics, and robotics.
* **Bridging Theory and Practice**: The paper successfully bridges advanced mathematical theory (cover theory, RKHS) with practical machine learning, demonstrating how rigorous theoretical guarantees can lead to competitive empirical performance. This could inspire further research into theoretically sound ML methods.
* **Reduced Budget**: The $(D/L)^2$ budget reduction mechanism for landmark placement is important for efficiency, especially when dealing with large datasets or complex persistence diagrams, making TDA more scalable.

PALACE introduces a data-adaptive, closed-form kernel for persistence diagram classification, providing novel theoretical guarantees including a lower distortion bound, optimal landmark placement, a kernel-RKHS classification rate, and per-prediction certificates, achieving strong empirical performance.
This paper makes a profound contribution to topological data analysis and certified machine learning by providing a robust, theoretically grounded framework for classifying structured data, offering explicit guarantees that are often missing in modern ML approaches. Its rigorous mathematical development, combined with competitive empirical results and the unique feature of per-prediction certificates, positions it as a highly significant work that can influence the development of more trustworthy and interpretable AI systems in TDA applications.
PALACE (Persistence Adaptive-Landmark Analytic Classification Engine) is a significant advancement over its predecessor, PLACE, addressing key limitations of fixed-grid persistence diagram vectorizations. The core methodological innovation lies in its data-adaptive landmark placement, which replaces a uniform grid with a configuration learned from training data via class-aware farthest-point sampling (FPS). This allows landmarks to concentrate where diagrams live, leading to a theoretically proven $(D/L)^2$ budget reduction. The paper develops a self-contained non-uniform cover theory based on a Lebesgue-number criterion to establish four closed-form guarantees:

1. **Structural Lower Distortion Bound**: A non-trivial lower bound $\lambda(\tau; \mathcal{C})$ on the embedding distortion, ensuring that bottleneck-separated diagrams remain separated in the embedding space. This is a crucial theoretical contribution, as most existing vectorizations only offer upper bounds.
2. **Optimal Configuration Choices**: It derives that equal weights $w_k = K^{-1/2}$ maximize the certificate and that FPS provides a 2-approximation to the optimal $k$-center covering radius for landmark positions. These choices are derived from training labels alone, without gradient training, maintaining the "closed-form" ethos.
3. **Kernel-RKHS Classification Rate**: A classification rate $O((k-1)\sqrt{K}/(\gamma\sqrt{m_{\min}}))$ for an RKHS-lifted embedding, with a matching Le Cam lower bound. This extends the analysis beyond linear classifiers, which is empirically shown to be necessary. The paper also provides closed-form filtration selection rules (e.g., kernel-Mahalanobis margin) with selection-consistency rates.
4. **Per-Prediction Certificate**: A non-asymptotic Pinelis and asymptotic Gaussian form certificate for individual predictions, requiring no calibration split. This is a strong feature for certified machine learning.
The methodology is rigorously grounded in mathematics, leveraging concepts from cover theory, metric geometry, and kernel methods. The transition from raw embedding to an RKHS via an additive landmark kernel is well-justified, and the paper carefully details the connections between the cover-level certificate and the kernel-margin-based classification.
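The FPS placement step behind the 2-approximation guarantee is the classic greedy k-center heuristic. A minimal plain-Python sketch (class-agnostic, unlike the paper's class-aware variant):

```python
import math

def fps(points, k):
    """Farthest-point sampling: repeatedly add the point farthest from the
    current landmark set. Greedy k-center is a 2-approximation to the
    optimal covering radius (Gonzalez, 1985)."""
    centers = [0]  # seed with the first point
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return centers, max(dist)  # landmark indices + achieved covering radius

# Landmarks concentrate where the data lives: two tight clusters get one
# landmark each, instead of a uniform grid wasting budget on empty space.
cloud = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
idx, radius = fps(cloud, k=2)
```

This concentration effect is exactly what drives the $(D/L)^2$ budget reduction: a uniform grid over the bounding box would spend most of its budget on empty regions of the birth-death plane.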
The experimental evaluation is comprehensive and well-structured, covering both synthetic and real-world datasets.

* **Datasets**: PALACE is evaluated on Orbit5k (point clouds) and five chemical graph benchmarks (COX2, DHFR, MUTAG, NCI1, PTC). It also includes a synthetic task to demonstrate the budget reduction.
* **Performance**:
  * On Orbit5k, PALACE achieves $91.3 \pm 1.0\%$, matching Persformer (a gradient-trained black-box transformer) and outperforming all other closed-form diagram-based methods, including its predecessor PLACE.
  * On COX2 and MUTAG, PALACE leads every diagram-based competitor.
  * On DHFR, it is competitive, within 1 percentage point of ECP.
  * The Mahalanobis-margin ranker is shown to be the strongest closed-form ranker across the chemical-graph pool (mean Spearman $\rho \approx +0.60$), providing a consistent positive signal.
  * The synthetic task clearly demonstrates the $(D/L)^2$ budget reduction, with adaptive placement maintaining $94\%$ accuracy at $8\times$ domain inflation where the uniform grid collapses to chance ($25\%$).
* **Certificates and Diagnostics**: The paper provides empirical validation of the certificate $\widehat{\lambda}$ as an independent data-level signal, positive on COX2 and PTC. It also includes an empirical audit of the non-interference hypothesis, acknowledging that it is rarely met pointwise on chemical diagrams but clarifying that the classification machinery operates at the kernel-margin level, which is robust.
* **Limitations in Experiments**: The paper notes "descriptor blindness" on NCI1 and PTC, indicating areas where the current features might not be sufficiently discriminative. It also defers headline accuracies for PROTEINS, DD, IMDB-B, IMDB-M, and NCI109 to a future revision, which slightly limits the completeness of the empirical picture but does not detract from the core claims validated.
Overall, the experiments provide strong empirical support for PALACE's theoretical claims and its competitive performance against state-of-the-art methods, especially considering its closed-form and certified nature.
The paper emphasizes its "closed-form" nature, meaning many components (weights, landmark placement strategy, classification rate, certificates) are analytically derived rather than learned via gradient descent. This inherently aids reproducibility. The small cross-validation tier (budget, radii, bandwidth; $\leq 5$ choices each) is clearly stated, indicating a limited hyperparameter search space. The use of `sklearn.svm.SVC` with `kernel='precomputed'` is mentioned, providing a specific implementation detail. The detailed theoretical derivations in the main text and appendix (not provided here, but implied by the text) would further support reproducibility. The methodology is described with sufficient detail for a technically proficient researcher to implement.
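The `kernel='precomputed'` pathway mentioned above is straightforward to reproduce: embed each diagram into $K$ landmark coordinates and hand the Gram matrix to the SVM. The sketch below is a hypothetical instantiation (Gaussian bumps with the equal weights $K^{-1/2}$ from guarantee (ii); the paper's exact coordinate map may differ):

```python
import math

def landmark_coords(diagram, landmarks, sigma=0.5):
    """Embed a persistence diagram (list of (birth, death) points) into
    len(landmarks) coordinates, each with equal weight 1/sqrt(K)."""
    w = 1.0 / math.sqrt(len(landmarks))
    return [w * sum(math.exp(-((b - lx) ** 2 + (d - ly) ** 2)
                             / (2 * sigma ** 2))
                    for b, d in diagram)
            for lx, ly in landmarks]

def gram(diagrams, landmarks):
    """Linear Gram matrix over landmark coordinates; this square matrix is
    what an SVM with kernel='precomputed' expects at fit time."""
    F = [landmark_coords(d, landmarks) for d in diagrams]
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in F] for fi in F]
```

With training labels `y`, fitting would then look like `sklearn.svm.SVC(kernel='precomputed').fit(gram(train_diagrams, landmarks), y)`, matching the implementation detail the paper mentions.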
1. **Non-Interference Hypothesis**: The paper acknowledges that the non-interference condition (a prerequisite for the lower distortion bound) is "essentially never met on chemical persistence diagrams" empirically. While the authors clarify that the classification rate relies on the kernel margin, this highlights a gap between the theoretical ideal and practical data characteristics.
2. **Descriptor Blindness**: PALACE exhibits "descriptor blindness" on NCI1 and PTC, suggesting that the current persistence diagram features, even with adaptive placement, may not capture sufficient information for all datasets.
3. **Cross-Validation Tier**: While small, the need for a cross-validation tier for budget, radii, and bandwidth means PALACE is not entirely tuning-free, unlike its predecessor PLACE. This is a trade-off for adaptivity and the RKHS lift.
4. **Deferred Results**: The deferral of headline accuracies for several graph datasets (PROTEINS, DD, IMDB-B, IMDB-M, NCI109) means the full empirical scope is not yet presented.
5. **Computational Cost**: While the paper mentions $O(K|A|)$ for coordinate calculation, the overall computational cost for FPS on large datasets or the Lebesgue number calculation is not explicitly detailed, though it's likely manageable given the closed-form nature.
Every document format in existence was designed for a human reader moving linearly through text. Autonomous LLM agents do not read - they retrieve. This fundamental mismatch forces agents to inject entire documents into their context window, wasting tokens on irrelevant content, compounding state across multi-turn loops, and broadcasting information indiscriminately across agent roles. We argue this is not a prompt engineering problem, not a retrieval problem, and not a compression problem: it is a format problem. We introduce OBJECTGRAPH (.og), a file format that reconceives the document as a typed, directed knowledge graph to be traversed rather than a string to be injected. OBJECTGRAPH is a strict superset of Markdown - every .md file is a valid .og file - requires no infrastructure beyond a two-primitive query protocol, and is readable by both humans and agents without tooling. We formalize the Document Consumption Problem, characterise six structural properties no existing format satisfies simultaneously, and prove OBJECTGRAPH satisfies all six. We further introduce the Progressive Disclosure Model, the Role-Scoped Access Protocol, and Executable Assertion Nodes as native format primitives. Empirical evaluation across five document classes and eight agent task types demonstrates up to 95.3 percent token reduction with no statistically significant degradation in task accuracy (p > 0.05). Transpiler fidelity reaches 98.7 percent content preservation on a held-out document benchmark.
Primary: Open Gigantic
All Institutions: Open Gigantic
ObjectGraph has the potential for significant broader impact across several dimensions:

1. **Cost and Efficiency**: The dramatic reduction in token consumption (up to 95.3%) and mitigation of context compounding (36.5x reduction) can substantially lower the operational costs of LLM agents and enable more complex, multi-turn workflows within existing context window limits.
2. **Agent Capabilities**: By providing structured, queryable knowledge, ObjectGraph can enhance agent reasoning, planning, and execution capabilities, leading to more reliable and autonomous agents.
3. **System Simplification**: The "ObjectGraph as Infrastructure" concept is powerful. Role-scoped access control, executable assertions, and delta loading natively within the document format can eliminate the need for external middleware, validation prompt templates, and change tracking systems, simplifying the architecture of multi-agent deployments.
4. **Human-Agent Collaboration**: Being a strict superset of Markdown, ObjectGraph allows both humans and agents to interact with the same source document, reducing maintenance overhead and fostering better alignment between human-authored instructions and agent execution.
5. **Knowledge Management**: It offers a more robust framework for managing agent knowledge bases, enabling features like automated staleness detection and structured updates.
6. **New Paradigm for Documents**: This work challenges the fundamental assumption of linear document consumption, proposing a new paradigm for how information is structured and accessed in the agentic era. If widely adopted, it could lead to a new ecosystem of tools and practices for agent-native content creation and consumption.

This paper introduces ObjectGraph, a novel file format that re-imagines documents as typed knowledge graphs for LLM agents, achieving up to 95.3% token reduction and significant context compounding mitigation without degrading task accuracy.
The work presents a comprehensive, well-designed solution to a fundamental problem in LLM agent deployment, offering a paradigm shift in document consumption that promises to enhance agent efficiency, capabilities, and simplify multi-agent system architectures.
The paper introduces ObjectGraph (.og), a novel file format designed to address the "Document Consumption Problem" for LLM agents. The core methodology reconceives documents as typed, directed knowledge graphs rather than linear text strings. The authors formalize this problem and derive six structural properties (Query-Addressable Index, Layered Compression, Typed Dependency Graph, Role-Scoped Access Control, Executable Assertions, Human Readability) that existing formats fail to satisfy simultaneously. ObjectGraph is presented as a strict superset of Markdown, ensuring backward compatibility. Key methodological components include:

1. **ObjectGraph Format Specification**: A detailed structure comprising a file-level manifest (meta, index, changelog blocks) and atomic knowledge units (nodes). Nodes are typed containers with stable identifiers, scope annotations, confidence scores, and versioning metadata. Content-type tags (e.g., `code`, `steps`, `warning`) provide explicit semantic meaning beyond visual cues.
2. **Progressive Disclosure Model (PDM)**: A three-pass reading model (Index, Dense, Full) that enables agents to retrieve only relevant information at the necessary fidelity level, significantly reducing token consumption.
3. **Typed Edge Declarations**: Supports explicit, machine-traversable relationships between nodes (e.g., `:requires`, `:precedes`, `:see-also`), allowing for automatic dependency resolution.
4. **Role-Based Access Control**: The `scope` attribute on nodes and index entries enables content filtering at the format level, eliminating the need for external middleware in multi-agent systems.
5. **Executable Assertion Nodes**: Allows embedding validation logic, retry mechanisms, and escalation paths directly within the document, triggered by the query protocol.
6. **Delta Loading via Changelog**: A `__changelog` meta-node facilitates incremental document updates, reducing the cost of checking for changes.
7. **LLM-Native Query Protocol**: A minimal two-primitive interface (`search_index`, `resolve_context`) that leverages the LLM itself as a "Router" for semantic index search, rather than relying on traditional keyword matching or embeddings. This is a particularly clever design choice.
8. **Transpiler**: A hybrid Markdown-to-ObjectGraph transpiler that uses deterministic parsers for content extraction and bounded LLM calls for metadata synthesis (dense blocks, index keywords), ensuring high fidelity and bounding hallucination risk.

The methodology is comprehensive, well-articulated, and addresses the identified problems systematically. The design choices, such as the Markdown superset and LLM-as-Router, are pragmatic and innovative.
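To make the two-primitive protocol concrete, here is a toy in-memory sketch. The node fields, edge names, and example content are illustrative stand-ins, and a simple keyword matcher substitutes for the paper's LLM Router; the actual node schema is defined by the format specification.

```python
# Toy .og document: nodes with a type, role scopes, a dense summary
# (what the Router sees), full content, and :requires edges.
DOC = {
    "deploy": {"type": "steps", "scope": ["ops"], "dense": "deploy service",
               "full": "1. build 2. push 3. restart", "requires": ["auth"]},
    "auth":   {"type": "code", "scope": ["ops", "dev"], "dense": "get a token",
               "full": "export TOKEN=$(login)", "requires": []},
}

def search_index(query, role):
    """Primitive 1: return ids of role-visible nodes whose dense summary
    matches the query (keyword match standing in for the LLM Router)."""
    return [nid for nid, n in DOC.items()
            if role in n["scope"] and any(w in n["dense"] for w in query.split())]

def resolve_context(node_id, role, seen=None):
    """Primitive 2: load a node's full content plus its :requires closure,
    in dependency order, skipping nodes outside the caller's scope."""
    seen = seen if seen is not None else set()
    if node_id in seen or role not in DOC[node_id]["scope"]:
        return []
    seen.add(node_id)
    out = []
    for dep in DOC[node_id]["requires"]:
        out += resolve_context(dep, role, seen)
    return out + [DOC[node_id]["full"]]
```

Note how role-scoped access falls out of the format itself: `resolve_context("deploy", "dev")` returns nothing because the node's `scope` excludes that role, with no external middleware involved.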
The empirical evaluation is robust and addresses key research questions effectively.

1. **Corpus**: A benchmark of 240 documents across five classes (Skill Files, Operational Runbooks, Execution Plans, Technical Documentation, Knowledge Bases), ranging from 200 to 15,000 tokens, provides a diverse testbed.
2. **Task Suite**: Eight distinct task types (information lookup, procedure execution, multi-step planning, role-conditional access, cross-node reasoning, update detection, assertion verification, multi-agent handoff) cover a broad range of agent interactions.
3. **Models & Baselines**: Evaluation uses Claude Sonnet 4.5 (primary), Claude Haiku 4.5 (Router), and GPT-4o (cross-model validation). Baselines include Full Markdown injection, RAG (text-embedding-3-large), and SkillReducer-optimized Markdown.
4. **RQ1: Token Consumption**: ObjectGraph achieved a mean token reduction from 2,340 to 187 tokens (92.0% average, up to 95.3%), demonstrating significant cost savings.
5. **RQ2: Context Compounding Reduction**: In a 5-turn workflow, ObjectGraph (Architecture B) reduced cumulative token cost by 36.5x compared to Markdown (46,000 vs. 1,260 tokens), effectively mitigating the super-linear growth of context.
6. **RQ3: Task Accuracy**: ObjectGraph matched or exceeded Markdown accuracy on 7 of 8 task types. Notably, it showed dramatic improvements on role-conditional access (+18.4%) and update detection (+30.1%), tasks where Markdown lacks native support. The "less-is-more" effect, where reduced context improves accuracy by reducing attention dilution, is a significant finding.
7. **RQ4: Transpiler Fidelity**: The transpiler achieved a mean fidelity of 0.987 (SD=0.018) on 180 held-out documents, ensuring high content preservation.
8. **RQ5: Human Authoring Burden**: A user study with 18 participants rated authoring burden as low (mean 2.8/7), suggesting good usability for human authors.
9. **Ablation Study**: An ablation study clearly demonstrated the individual contributions of different ObjectGraph features to token reduction, providing valuable insights into the design's effectiveness.

The experimental setup is comprehensive, the accuracy results show no statistically significant degradation (p > 0.05), and the findings strongly support the claims of the paper.
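The headline ratios in RQ1 and RQ2 can be re-derived directly from the reported raw token counts:

```python
# RQ1: mean per-query tokens, full Markdown injection vs. ObjectGraph.
reduction = 1 - 187 / 2340
print(f"{reduction:.1%}")   # ~92.0% average token reduction, as reported

# RQ2: 5-turn cumulative cost, Markdown vs. ObjectGraph (Architecture B).
ratio = 46000 / 1260
print(f"{ratio:.1f}x")      # ~36.5x compounding reduction, as reported
```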
The paper provides a detailed specification of the ObjectGraph format, including its structure, node types, edge syntax, and query protocol. The LLM prompt template for metadata synthesis is explicitly provided. The algorithms for structural extraction and the query protocol are outlined. While no direct code repository or dataset links are provided, the level of detail in the format specification and methodology sections is high enough that a motivated researcher could likely implement the format and protocol. The benchmark corpus is described in terms of document classes and token ranges, but the specific documents are not publicly available. The LLM models used are identified. Overall, the paper offers a strong foundation for reproducibility, though direct code access would enhance it further.
The authors acknowledge several limitations: 1. **Scale**: The benchmark of 240 documents, while curated, may not fully represent the diversity of real-world enterprise-scale corpora. 2. **Cross-file Federation**: The current specification does not support cross-file edge resolution, limiting its applicability to mono-repo or single-domain knowledge bases. This is a significant limitation for truly distributed knowledge graphs. 3. **Standardisation**: Without a standards body or broad community adoption, the format risks fragmentation into incompatible dialects. 4. **Adversarial Inputs**: The evaluation did not consider adversarial document authors who might craft misleading `dense` blocks or `index` entries to manipulate agent routing. Additional minor limitations could include the reliance on LLMs for routing, which, while a feature, could introduce its own set of challenges (e.g., prompt engineering for optimal routing, potential for misinterpretation if the index is poorly crafted).
The work presents a comprehensive, well-designed solution to a fundamental problem in LLM agent deployment, offering a paradigm shift in document consumption that promises to enhance agent efficiency and capabilities while simplifying multi-agent system architectures.
Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
Primary: unknown
All Institutions: unknown
X2SAM represents a significant step towards more generalized and intuitive multimodal AI. * **Enhanced Human-Computer Interaction:** The conversational interface supporting both text and visual prompts for pixel-level control across images and videos could lead to more natural and powerful interaction paradigms for visual editing, content creation, and data annotation. * **Advanced Video Understanding:** The ability to perform complex segmentation tasks with temporal consistency in videos opens doors for applications in autonomous driving, surveillance, robotics, and medical imaging, where precise spatio-temporal object understanding is critical. * **Foundation for Future MLLMs:** By demonstrating effective unification of image and video segmentation within an MLLM, X2SAM provides a strong baseline and architectural insights for developing even more capable multimodal foundation models. * **New Benchmarking:** The V-VGD benchmark provides a valuable tool for the community to evaluate and advance research in video visual grounded segmentation. X2SAM introduces a unified MLLM framework that extends "any-segmentation" from images to videos, integrating a novel Mask Memory module for temporal consistency and a unified joint training strategy. This paper makes a substantial technical contribution by enabling a single model to perform a wide array of image and video segmentation tasks with both textual and visual prompts, achieving strong performance across modalities and introducing a valuable new benchmark for video visual grounded segmentation.
X2SAM proposes a unified segmentation MLLM designed to extend "any-segmentation" capabilities from images to videos, supporting both textual and visual prompts. The core methodology addresses three key challenges: comprehensive prompt integration, spatio-temporal task formulation, and temporal coherence. 1. **Comprehensive Prompt Integration:** The model augments an LLM to process interleaved textual instructions and visual prompts (V-Prompts) for both image and video inputs, using dedicated special tokens to demarcate object conditions within the conversation stream.
The experimental evaluation is comprehensive and rigorous, covering 14 segmentation tasks across images and videos, along with out-of-domain benchmarks. * **Task Coverage:** X2SAM is evaluated on a broad suite of tasks including generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation for both images and videos. * **Datasets:** Training involves SA-1B for agnostic segmentation, and a diverse mix of image (COCO, RefCOCO/+/g, ReasonSeg, GLaMM-derived, COCO-VGD, LLaVA-1.5) and video (VIPSeg, VSPW, YT-VIS19, YT-RefVOS21, DAVIS17-RefVOS, ReVOS, VideoGLaMM-derived, YT-VOS19, YT19-VGD, VIPSeg-VGD, VideoInstruct100K) datasets. The introduction of the Video Visual Grounded (V-VGD) segmentation benchmark (YT19-VGD and VIPSeg-VGD) is a significant contribution. * **Performance:** * **Image Segmentation:** X2SAM remains competitive with image-centric generalists like X-SAM, notably improving image open-vocabulary segmentation (I-OV) from 20.9 to 31.2 PQ. * **Video Segmentation:** It significantly outperforms existing MLLM-based video generalists. For instance, it improves V-Ref. on Ref-YT21 and Ref-DV17 over UniPixel-7B, and achieves a +21.5 mIoU gain on V-GCG over VideoGLaMM (75.8 vs. 54.3). * **Reasoning Segmentation:** Achieves state-of-the-art results on both image (I-Rea. Seg.) and video (V-Rea. Seg.) reasoning tasks, outperforming HyperSeg and even the video-specialist ReferFormer-B. * **Out-of-Domain Generalization:** Demonstrates strong generalization on gRefCOCO, ADE20K, and YT-VIS-21, surpassing specialists and other MLLM generalists. * **Visual Grounded Segmentation:** Shows substantial improvements over SAM2-H in the video domain (V-VGD Seg.), with impressive AP scores on YT-VIS19 and VIPSeg.
* **Ablation Studies:** Thorough ablations validate key components: * **Mask Decoder:** Zero-initialization for Token-to-Image Attention is shown to be crucial for stable training and performance gains. * **Joint Training:** The unified joint training strategy significantly reduces training cost (3.3K vs 5.2K GPU hours) while maintaining performance. * **Mask Memory:** Mask guidance, class guidance, and multi-scale features in the Mask Memory module are shown to bring consistent and substantial gains, especially for video tasks. * **Memory Size:** An optimal memory size of 6 frames is identified, balancing historical information with potential noise.
The paper provides a good level of detail for reproducibility. * **Model Initialization:** Vision encoder, projector, and LLM from Qwen3-VL; mask encoder from SAM2; mask decoder from pre-trained agnostic segmentor. LoRA used for LLM fine-tuning. * **Training Details:** Specifics for both agnostic segmentor training (batch size 128, LR 1e-4) and unified joint training (projectors, LoRA, encoders, decoder, memory optimized; LR 1e-5 for mask encoder, 1e-4 for others; effective batch size 32 for video, 128 for image; AdamW optimizer, weight decay 0.05). * **Loss Functions:** Mask loss (BCE + Dice), auto-regressive loss, and focal loss. * **Dataset Sampling:** Consecutive frame sampling for video segmentation, global sampling for video GCG, 64 frames for video chat. * **Memory Capacity:** Default K=8 for ablations, K=6 for final model. The level of detail provided in the "Implementation Details" and "More Model Details" sections is sufficient for researchers to attempt to reproduce the results, although the sheer scale of training (32 NVIDIA H800 GPUs) might be a practical barrier for some.
The authors candidly discuss several limitations: 1. **Computational Expense:** Unified training over heterogeneous image and video datasets remains computationally expensive, especially for video samples with high memory costs. 2. **Fixed-Size Memory:** The fixed-size FIFO memory (K=6 frames) may be insufficient for very long videos, scenarios with prolonged occlusions, large appearance changes, or sparse target reappearance, limiting long-term temporal understanding. 3. **Generalist vs. Specialist Performance:** As a unified generalist model, X2SAM may still lag behind highly specialized models on narrowly focused tasks (e.g., optimized video object segmentation or image-only segmentation).
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
Primary: Friedrich-Alexander-Universität Erlangen-Nürnberg
All Institutions: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen National High Performance Computing Center, Institute of Radiology, University Hospital Erlangen, Lab for AI in Medicine, RWTH Aachen University, Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Chair of Computer Science 10
This paper has significant broader impact, particularly for the development and deployment of AI in high-stakes domains like medicine. It fundamentally challenges the prevailing assumption that scaling LLMs (larger models, longer contexts, more compute) automatically leads to safer behavior. The finding that evidence quality is paramount and that safety and accuracy decouple will necessitate a paradigm shift in how clinical LLMs are evaluated and deployed. It highlights the critical need for multi-dimensional safety metrics beyond accuracy, including high-risk error, contradiction, and dangerous overconfidence. The identification of "synchronized failure" in ensembles is a crucial warning for system designers relying on model agreement for robustness. The paper provides a valuable framework (SaFE-Scale) and benchmark (RadSaFE-200) that can guide future research and development towards truly safe and reliable clinical AI systems. Its insights are also relevant to other high-stakes applications of LLMs where confident, high-risk errors are unacceptable. This study rigorously demonstrates that clinical LLM safety is not a passive consequence of scaling but a deployment property critically shaped by evidence quality, retrieval design, and context construction, often decoupling from accuracy. The paper introduces SaFE-Scale, a novel framework, and RadSaFE-200, a benchmark with clinician-defined multi-dimensional safety labels, to empirically show that clean evidence dramatically improves both accuracy and safety, while model scale, retrieval, and inference-time compute offer limited or even misleading safety gains, particularly due to unreliable confidence and synchronized failures in ensembles. This comprehensive analysis provides crucial insights for developing and deploying safer LLMs in high-stakes clinical environments, urging a shift from accuracy-centric evaluation to explicit safety-focused monitoring of high-risk errors.
The paper introduces SaFE-Scale, a well-structured framework for evaluating clinical LLM safety across various scaling dimensions. This framework is instantiated with RadSaFE-200, a novel benchmark of 200 multiple-choice radiology questions. A key methodological strength is the clinician-defined, multi-dimensional safety labels at the option level: high-risk error, unsafe answer, and evidence contradiction. This moves beyond simple accuracy to capture the nuanced risks in clinical settings. The experimental design is comprehensive, evaluating 34 diverse LLMs across six deployment conditions (closed-book, clean evidence, conflict evidence, standard RAG, agentic RAG, max-context prompting) and additional inference-time compute strategies (self-consistency, ensembling). The use of Radiopaedia as an external evidence source for RAG is appropriate for the radiology domain. The metrics chosen (high-risk error rate, unsafe answer rate, contradiction rate, dangerous overconfidence rate, alongside accuracy) are directly relevant to clinical safety. The variance decomposition analysis to quantify the contributions of model family vs. deployment condition is a robust statistical approach. The worst-case analysis at the question level further strengthens the methodology by identifying specific, recurrent failure modes.
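The option-level safety scoring can be sketched in the style RadSaFE-200 describes: each chosen option carries clinician-assigned flags, and a run is scored on which flagged options the model selected. Field names and the 0.8 confidence threshold below are illustrative, not the benchmark's actual schema:

```python
def safety_rates(runs):
    """Compute per-run safety metrics from option-level labels.
    Each run: {"chosen": {"high_risk": bool, "contradicts_evidence": bool},
               "confidence": float in [0, 1]}."""
    n = len(runs)
    high_risk = sum(r["chosen"]["high_risk"] for r in runs) / n
    contradiction = sum(r["chosen"]["contradicts_evidence"] for r in runs) / n
    # "Dangerous overconfidence": a high-risk answer given with high
    # confidence (the threshold here is an illustrative choice).
    overconfident = sum(
        r["chosen"]["high_risk"] and r["confidence"] >= 0.8 for r in runs
    ) / n
    return {"high_risk": high_risk,
            "contradiction": contradiction,
            "dangerous_overconfidence": overconfident}
```

Separating these rates from plain accuracy is what lets the study show accuracy and safety decoupling under different deployment conditions.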
The experimental evaluation is exceptionally thorough and rigorous. The study's scale, involving 34 LLMs from various families and sizes, provides a broad and representative assessment of current LLM capabilities. The comparison across six distinct deployment conditions is critical for understanding how practical choices impact safety. The results consistently demonstrate that evidence quality, specifically clinician-written clean evidence, is the most dominant factor for both accuracy and safety, far outweighing model scale or inference-time compute. This is a significant empirical finding. The decoupling of accuracy and safety is clearly illustrated, with agentic RAG improving accuracy but not necessarily safety. The analysis of confidence as an unreliable safety signal, with high confidence observed even in high-risk errors, is a crucial and concerning finding. The investigation into self-consistency and ensembling reveals their limited safety gains and introduces the important concept of "synchronized failure" in ensembles, where multiple models make the same high-risk error. The worst-case analysis effectively highlights that critical failures are not random but concentrate in specific, challenging questions, which is highly valuable for targeted mitigation efforts. The statistical analysis, including variance decomposition, supports the conclusions robustly.
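The "synchronized failure" measurement can be sketched as the share of questions on which every ensemble member selects the same high-risk option, so agreement-based aggregation inherits the error rather than correcting it. The data layout is illustrative:

```python
def synchronized_failure_rate(ensemble_answers, high_risk_options):
    """ensemble_answers: {question_id: [one answer per ensemble member]}
    high_risk_options: {question_id: set of option labels flagged high-risk}.
    Returns the fraction of questions on which all members agree on a single
    high-risk option -- the case majority voting cannot recover from."""
    synced = 0
    for qid, answers in ensemble_answers.items():
        if len(set(answers)) == 1 and answers[0] in high_risk_options[qid]:
            synced += 1
    return synced / len(ensemble_answers)
```

A nonzero rate here is exactly the warning the paper raises: model agreement is not evidence of safety when failures are correlated.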
The paper states that "Full prompt templates, output-format instructions, and inference protocols are provided in Supplementary Note [REF]". This commitment to detailing the experimental setup is a strong indicator of reproducibility. The RadSaFE-200 benchmark is intended for public release, albeit with source-specific redistribution restrictions for some components, which is understandable given the use of copyrighted material like RSNA Case Collection and Radiopaedia. The detailed description of benchmark construction, safety augmentation protocol, and model panel specifications further aids reproducibility. While no direct code repository URL is provided in the text, the level of detail suggests that the experiments could be replicated by other researchers with sufficient effort and access to the benchmark.
The authors provide a comprehensive and transparent discussion of limitations. These include: 1. **Benchmark Scope:** Text-based, multiple-choice format does not capture the full complexity of radiology practice (image interpretation, open-ended reasoning, multimodal aspects). 2. **Benchmark Size:** 200 questions, while curated, may be insufficient for highly granular subgroup analyses. 3. **Question Balance:** The benchmark is primarily diagnostic/classification-oriented, reflecting Radiopaedia case structures, and not fully balanced across all question types. 4. **Subjectivity of Safety Labels:** Clinician-defined labels, while informed by rules, involve clinical judgment and implicit assumptions, especially for technical, physics, radiation therapy, and negation-type questions. Future work should include multiple annotators and inter-rater agreement. 5. **Null Responses:** Final null responses were scored as incorrect but not assigned safety labels, potentially underestimating option-level safety failures. 6. **Controlled Evidence:** Clean and conflict evidence are experimental constructs; real-world RAG evidence can be noisier, redundant, or irrelevant in more complex ways. 7. **Specific Implementations:** The RAG and agentic RAG implementations are specific choices; other methods might yield different safety profiles. 8. **Confidence Measurement:** Confidence was derived from entropy-normalized repeated-sampling stability, not calibrated token probabilities, limiting its interpretation as a full calibration study. 9. **Inference-time Compute:** Self-consistency and ensemble experiments were targeted, not exhaustive, leaving room for more advanced aggregation methods.
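One plausible reading of the entropy-normalized repeated-sampling stability in point 8 (the paper's exact formula is not reproduced in this review, so this is an assumed construction) is: repeat the query, take the empirical answer distribution, and map low entropy (stable answers) to high confidence:

```python
import math
from collections import Counter

def stability_confidence(sampled_answers, num_options):
    """Confidence in [0, 1] from repeated sampling: 1.0 when every sample
    agrees, 0.0 when answers are uniform over all options."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return 1.0 - entropy / math.log(num_options)  # normalize by max entropy
```

As the limitations note, a stability score of this kind is not a calibrated probability: a model can answer identically across samples, and therefore score 1.0, while being confidently wrong.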
Recent megakernel designs for Mixture-of-Experts (MoE) inference fuse expert computation with fine-grained, GPU-initiated communication into a single persistent GPU kernel, and outperform collective-based MoE on a single node by overlapping data transfer with compute at tile granularity. This benefit does not carry over cleanly to multi-node inference, where experts span many nodes connected by an RDMA fabric. Communication-bound MoE models regress by up to $10\times$ on 8 nodes, and the regression worsens with node count. We trace this regression to hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline, and its cost grows with the number of concurrent transfers. As a result, models whose per-expert compute is too small to absorb this inflated network latency expose communication on the critical path. We present \emph{Perseus}, which eliminates this serialization through two techniques. \emph{Decoupled signaling} batches fences at per-destination granularity, reducing fence count by $8\times$. \emph{NIC-side ordering} replaces proxy stalls with hardware fence flags, so the proxy never blocks. On proxy-based transports, Perseus achieves up to 10.3$\times$ end-to-end speedup. Perseus on IBRC matches or exceeds IBGDA GPU-direct by up to 1.2$\times$, which shows that serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance.
Primary: Cornell University
All Institutions: Cornell University
Perseus has significant broader impact for the field of large-scale machine learning and distributed systems: - **Enables larger MoE models:** By effectively addressing a critical scaling bottleneck, Perseus allows MoE models to be deployed and inferred across many nodes with much higher efficiency, pushing the boundaries of what's possible with LLMs. - **Challenges conventional wisdom:** The finding that optimized proxy-based RDMA can outperform GPU-direct RDMA is a major insight that could shift design paradigms for distributed ML systems, potentially simplifying development by making proxy-based approaches more viable. - **Influences future hardware/software design:** The identification of hidden serialization and the effectiveness of NIC-side ordering could inform future designs of network interface cards (NICs) and communication libraries, encouraging more hardware-level support for flexible ordering and completion signaling. - **Applicability beyond MoE:** The principles of eliminating hidden serialization in fine-grained, GPU-initiated communication could be beneficial for other distributed workloads that exhibit similar communication patterns. This paper identifies a critical, previously hidden performance bottleneck in multi-node megakernel communication for MoE inference, deeply analyzes its root cause, and proposes elegant and highly effective solutions that yield up to 10.3x speedup and challenge conventional wisdom regarding proxy-based vs. GPU-direct RDMA. The work is technically profound, experimentally rigorous, and has significant implications for scaling large language models and distributed systems research.
The paper identifies a critical and previously "hidden" serialization bottleneck in multi-node megakernel communication for Mixture-of-Experts (MoE) inference, specifically within proxy-based RDMA transports. The core insight is that the ordering requirement between each fine-grained tile transfer and its completion signal (a doorbell write) forces a `wmb` (write memory barrier) on the CPU-side proxy. This `wmb` drains the NIC pipeline, and its cost grows with the number of concurrent transfers, leading to significant performance regression for communication-bound MoE models. Perseus proposes two technically sound and elegant solutions: 1. **Decoupled Signaling:** This technique batches multiple tile transfers before issuing a single doorbell write and its associated `wmb`. By reducing the number of `wmb`s by up to 8x, it significantly mitigates the serialization overhead. The GPU manages completion tracking for these batches. 2. **NIC-side Ordering:** This more fundamental solution leverages RDMA write-with-immediate (`RDMA_WRITE_WITH_IMM`). By embedding the completion signal (immediate value) within the same RDMA operation as the data transfer, the NIC inherently guarantees ordering. This completely eliminates the need for a CPU-side `wmb`, allowing the proxy to never block. This is a particularly clever use of existing hardware capabilities to solve a software-induced serialization problem. The methodology is robust, clearly dissecting the problem, proposing targeted solutions, and explaining their mechanisms in detail.
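The fence arithmetic behind decoupled signaling can be illustrated with a toy count. This simulates only the bookkeeping, not RDMA itself; the 8x figure matches the fence reduction the paper reports:

```python
def fences_needed(tiles, batching=False):
    """Count ordering fences for a list of (destination, tile) transfers.
    Baseline: one fence per transfer, so the proxy stalls on every tile.
    Decoupled signaling: transfers are batched per destination, with one
    doorbell write (and one fence) per destination batch."""
    if not batching:
        return len(tiles)                       # fence after every transfer
    return len({dst for dst, _ in tiles})       # one fence per destination

# 64 tile transfers fanned out to 8 destination ranks:
tiles = [(dst, t) for dst in range(8) for t in range(8)]
baseline = fences_needed(tiles)                 # 64 fences
perseus  = fences_needed(tiles, batching=True)  # 8 fences -> 8x reduction
```

Since each fence drains the NIC pipeline, cutting the fence count per destination is what removes the serialization from the critical path; NIC-side ordering then removes the remaining fences entirely.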
The experimental evaluation is comprehensive and rigorous. - **Setup:** Experiments are conducted on a realistic multi-node cluster (8 nodes, 16 A100 GPUs) connected by an InfiniBand HDR fabric, which is highly relevant for large-scale ML deployments. - **Baselines:** The evaluation compares Perseus against strong baselines: IBRC (proxy-based RDMA, representing the problematic baseline) and IBGDA (GPU-direct RDMA, often considered the gold standard for high-performance communication). - **Workloads:** Real-world MoE models, including Switch Transformer (1.6B, 2.3B, 137B parameters) and GShard (600M), are used, demonstrating the practical applicability of the solution. - **Key Results:** - Perseus on IBRC achieves up to 10.3x end-to-end speedup over the baseline IBRC, a truly remarkable improvement. - Crucially, Perseus on IBRC matches or even exceeds IBGDA (GPU-direct) by up to 1.2x. This is a surprising and highly impactful finding, challenging the conventional wisdom that GPU-direct is inherently superior to proxy-based approaches for fine-grained communication. It demonstrates that serialization, not the choice of transport mechanism, was the primary bottleneck. - The paper provides a clear breakdown of the individual contributions of Decoupled Signaling and NIC-side Ordering, showing how they incrementally contribute to the overall speedup. - Microbenchmarks confirm the reduction in fence latency, validating the underlying hypothesis. - Sensitivity analysis to expert size further clarifies when Perseus provides the most benefit (models with smaller per-expert compute, where communication is more exposed). The results are convincing, well-supported, and clearly demonstrate the effectiveness and significance of Perseus.
The paper provides a detailed description of the problem, the proposed solutions, and their implementation within a modified UCX transport layer. The experimental setup, including hardware specifications and workloads, is also well-documented. While the source code is not provided (common for arXiv preprints), the level of detail should enable skilled systems researchers to reproduce the core ideas and potentially the results, given access to similar hardware.
- **Specificity to RDMA/InfiniBand:** The solutions are tailored to the specifics of RDMA transports and the `wmb` behavior in proxy-based communication. While the underlying principle of serialization might exist in other network fabrics, the exact solutions might not directly apply without adaptation. - **Generalizability to other communication patterns:** While the paper suggests applicability to other fine-grained, GPU-initiated communication, the primary focus and evaluation are on MoE's all-to-all pattern. Its effectiveness for other patterns would need further investigation. - **Overhead for extremely small messages:** While MoE tiles are fine-grained, they are not necessarily extremely tiny. For truly byte-level communication, the batching overhead of decoupled signaling or the immediate value processing might introduce new trade-offs, though this is not the target use case.