Week of April 26 – May 03, 2026
Every document format in existence was designed for a human reader moving linearly through text. Autonomous LLM agents do not read - they retrieve. This fundamental mismatch forces agents to inject entire documents into their context window, wasting tokens on irrelevant content, compounding state across multi-turn loops, and broadcasting information indiscriminately across agent roles. We argue this is not a prompt engineering problem, not a retrieval problem, and not a compression problem: it is a format problem. We introduce OBJECTGRAPH (.og), a file format that reconceives the document as a typed, directed knowledge graph to be traversed rather than a string to be injected. OBJECTGRAPH is a strict superset of Markdown - every .md file is a valid .og file - requires no infrastructure beyond a two-primitive query protocol, and is readable by both humans and agents without tooling. We formalize the Document Consumption Problem, characterise six structural properties no existing format satisfies simultaneously, and prove OBJECTGRAPH satisfies all six. We further introduce the Progressive Disclosure Model, the Role-Scoped Access Protocol, and Executable Assertion Nodes as native format primitives. Empirical evaluation across five document classes and eight agent task types demonstrates up to 95.3 percent token reduction with no statistically significant degradation in task accuracy (p > 0.05). Transpiler fidelity reaches 98.7 percent content preservation on a held-out document benchmark.
Primary: Open Gigantic
All Institutions: Open Gigantic
The paper introduces ObjectGraph (.og), a novel file format designed to address the "Document Consumption Problem" for LLM agents. The core methodology reconceives documents as typed, directed knowledge graphs rather than linear text strings. The authors formalize this problem and derive six structural properties (Query-Addressable Index, Layered Compression, Typed Dependency Graph, Role-Scoped Access Control, Executable Assertions, Human Readability) that existing formats fail to satisfy simultaneously. ObjectGraph is presented as a strict superset of Markdown, ensuring backward compatibility. Key methodological components include:

1. **ObjectGraph Format Specification**: A detailed structure comprising a file-level manifest (meta, index, changelog blocks) and atomic knowledge units (nodes). Nodes are typed containers with stable identifiers, scope annotations, confidence scores, and versioning metadata. Content-type tags (e.g., `code`, `steps`, `warning`) provide explicit semantic meaning beyond visual cues.
2. **Progressive Disclosure Model (PDM)**: A three-pass reading model (Index, Dense, Full) that enables agents to retrieve only relevant information at the necessary fidelity level, significantly reducing token consumption.
3. **Typed Edge Declarations**: Explicit, machine-traversable relationships between nodes (e.g., `:requires`, `:precedes`, `:see-also`) that allow automatic dependency resolution.
4. **Role-Based Access Control**: The `scope` attribute on nodes and index entries enables content filtering at the format level, eliminating the need for external middleware in multi-agent systems.
5. **Executable Assertion Nodes**: Validation logic, retry mechanisms, and escalation paths embedded directly within the document and triggered by the query protocol.
6. **Delta Loading via Changelog**: A `__changelog` meta-node facilitates incremental document updates, reducing the cost of checking for changes.
7. **LLM-Native Query Protocol**: A minimal two-primitive interface (`search_index`, `resolve_context`) that uses the LLM itself as a "Router" for semantic index search, rather than relying on traditional keyword matching or embeddings. This is a particularly clever design choice; a minimal sketch of the interface follows below.
8. **Transpiler**: A hybrid Markdown-to-ObjectGraph transpiler that uses deterministic parsers for content extraction and bounded LLM calls for metadata synthesis (dense blocks, index keywords), ensuring high fidelity and bounding hallucination risk.

The methodology is comprehensive, well-articulated, and addresses the identified problems systematically. The design choices, such as the Markdown superset and LLM-as-Router, are pragmatic and innovative.
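The paper names the two query primitives but this digest carries no reference implementation, so the following is a minimal Python sketch of how an agent-side consumer of an already-parsed .og file might expose them. The `Node` and `Document` structures, all field names, and the keyword-overlap routing (a cheap stand-in for the paper's LLM-as-Router) are illustrative assumptions, not the format specification.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One atomic knowledge unit of a parsed .og file (schematic)."""
    node_id: str
    node_type: str                              # e.g. "steps", "code", "warning"
    scope: set = field(default_factory=set)     # roles allowed to read this node
    dense: str = ""                             # compressed summary (PDM pass 2)
    full: str = ""                              # full content (PDM pass 3)
    edges: dict = field(default_factory=dict)   # e.g. {":requires": ["node-3"]}

@dataclass
class Document:
    index: dict                                 # keyword -> node ids (PDM pass 1)
    nodes: dict                                 # node id -> Node

def search_index(doc: Document, query_terms: list, role: str) -> list:
    """Pass 1: return candidate node ids visible to `role`. The paper routes
    semantically with a small LLM; keyword overlap is a stand-in here."""
    hits = []
    for term in query_terms:
        for nid in doc.index.get(term, []):
            node = doc.nodes[nid]
            if (not node.scope or role in node.scope) and nid not in hits:
                hits.append(nid)
    return hits

def resolve_context(doc: Document, node_ids: list, role: str,
                    fidelity: str = "dense") -> str:
    """Pass 2/3: materialize the selected nodes plus their `:requires`
    closure at the requested fidelity, respecting role scopes."""
    seen, parts, stack = set(), [], list(node_ids)
    while stack:
        nid = stack.pop()
        if nid in seen:
            continue
        seen.add(nid)
        node = doc.nodes[nid]
        if node.scope and role not in node.scope:
            continue                            # role-scoped access control
        parts.append(node.full if fidelity == "full" else node.dense)
        stack.extend(node.edges.get(":requires", []))
    return "\n\n".join(parts)
```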
The empirical evaluation is robust and addresses key research questions effectively.

1. **Corpus**: A benchmark of 240 documents across five classes (Skill Files, Operational Runbooks, Execution Plans, Technical Documentation, Knowledge Bases), ranging from 200 to 15,000 tokens, provides a diverse testbed.
2. **Task Suite**: Eight distinct task types (information lookup, procedure execution, multi-step planning, role-conditional access, cross-node reasoning, update detection, assertion verification, multi-agent handoff) cover a broad range of agent interactions.
3. **Models & Baselines**: Evaluation uses Claude Sonnet 4.5 (primary), Claude Haiku 4.5 (Router), and GPT-4o (cross-model validation). Baselines include full Markdown injection, RAG (text-embedding-3-large), and SkillReducer-optimized Markdown.
4. **RQ1: Token Consumption**: ObjectGraph achieved a mean token reduction from 2,340 to 187 tokens (92.0% average, up to 95.3%), demonstrating significant cost savings.
5. **RQ2: Context Compounding Reduction**: In a 5-turn workflow, ObjectGraph (Architecture B) reduced cumulative token cost by 36.5x compared to Markdown (46,000 vs. 1,260 tokens), effectively mitigating the super-linear growth of context (see the sketch after this list).
6. **RQ3: Task Accuracy**: ObjectGraph matched or exceeded Markdown accuracy on 7 of 8 task types. Notably, it showed dramatic improvements on role-conditional access (+18.4%) and update detection (+30.1%), tasks where Markdown lacks native support. The "less-is-more" effect, where reduced context improves accuracy by reducing attention dilution, is a significant finding.
7. **RQ4: Transpiler Fidelity**: The transpiler achieved a mean fidelity of 0.987 (SD=0.018) on 180 held-out documents, ensuring high content preservation.
8. **RQ5: Human Authoring Burden**: A user study with 18 participants rated authoring burden as low (mean 2.8/7), suggesting good usability for human authors.
9. **Ablation Study**: An ablation study clearly demonstrated the individual contributions of different ObjectGraph features to token reduction, providing valuable insights into the design's effectiveness.

The experimental setup is comprehensive, accuracy shows no statistically significant degradation (p > 0.05), and the findings strongly support the claims of the paper.
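To make the RQ2 compounding effect concrete, here is a back-of-the-envelope model in Python. Only the 2,340 and 187 token means come from the reported results; the per-turn message overhead and the exact accounting are assumptions, so the printed totals reproduce only the shape of the effect (quadratic vs. linear growth), not the exact reported figures.

```python
# Schematic model of context compounding across a multi-turn workflow.
# Only DOC_TOKENS and QUERY_TOKENS are the paper's reported means (RQ1);
# TURN_IO and the accounting below are illustrative assumptions.

DOC_TOKENS = 2_340    # mean full-document size
QUERY_TOKENS = 187    # mean ObjectGraph retrieval size
TURN_IO = 300         # assumed user message + model reply per turn

def full_injection_cost(turns: int) -> int:
    """The whole document re-enters every prompt and persists in the
    conversation history, so cumulative cost grows quadratically."""
    total = history = 0
    for _ in range(turns):
        total += DOC_TOKENS + history + TURN_IO
        history += DOC_TOKENS + TURN_IO
    return total

def query_based_cost(turns: int) -> int:
    """Only the small query result enters each prompt, so cost is linear."""
    return turns * (QUERY_TOKENS + TURN_IO)

print(full_injection_cost(5), query_based_cost(5))
# ~39,600 vs ~2,435 under these assumptions: the same order-of-magnitude
# gap as the reported 46,000 vs 1,260 five-turn totals.
```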
The paper provides a detailed specification of the ObjectGraph format, including its structure, node types, edge syntax, and query protocol. The LLM prompt template for metadata synthesis is explicitly provided. The algorithms for structural extraction and the query protocol are outlined. While no direct code repository or dataset links are provided, the level of detail in the format specification and methodology sections is high enough that a motivated researcher could likely implement the format and protocol. The benchmark corpus is described in terms of document classes and token ranges, but the specific documents are not publicly available. The LLM models used are identified. Overall, the paper offers a strong foundation for reproducibility, though direct code access would enhance it further.
The authors acknowledge several limitations:

1. **Scale**: The benchmark of 240 documents, while curated, may not fully represent the diversity of real-world enterprise-scale corpora.
2. **Cross-file Federation**: The current specification does not support cross-file edge resolution, limiting its applicability to mono-repo or single-domain knowledge bases. This is a significant limitation for truly distributed knowledge graphs.
3. **Standardisation**: Without a standards body or broad community adoption, the format risks fragmentation into incompatible dialects.
4. **Adversarial Inputs**: The evaluation did not consider adversarial document authors who might craft misleading `dense` blocks or `index` entries to manipulate agent routing.

Additional minor limitations could include the reliance on LLMs for routing, which, while a feature, could introduce its own set of challenges (e.g., prompt engineering for optimal routing, potential for misinterpretation if the index is poorly crafted).
ObjectGraph has the potential for significant broader impact across several dimensions:

1. **Cost and Efficiency**: The dramatic reduction in token consumption (up to 95.3%) and mitigation of context compounding (36.5x reduction) can substantially lower the operational costs of LLM agents and enable more complex, multi-turn workflows within existing context window limits.
2. **Agent Capabilities**: By providing structured, queryable knowledge, ObjectGraph can enhance agent reasoning, planning, and execution capabilities, leading to more reliable and autonomous agents.
3. **System Simplification**: The "ObjectGraph as Infrastructure" concept is powerful. Role-scoped access control, executable assertions, and delta loading natively within the document format can eliminate the need for external middleware, validation prompt templates, and change tracking systems, simplifying the architecture of multi-agent deployments.
4. **Human-Agent Collaboration**: Being a strict superset of Markdown, ObjectGraph allows both humans and agents to interact with the same source document, reducing maintenance overhead and fostering better alignment between human-authored instructions and agent execution.
5. **Knowledge Management**: It offers a more robust framework for managing agent knowledge bases, enabling features like automated staleness detection and structured updates.
6. **New Paradigm for Documents**: This work challenges the fundamental assumption of linear document consumption, proposing a new paradigm for how information is structured and accessed in the agentic era. If widely adopted, it could lead to a new ecosystem of tools and practices for agent-native content creation and consumption.

This paper introduces ObjectGraph, a novel file format that re-imagines documents as typed knowledge graphs for LLM agents, achieving up to 95.3% token reduction and significant context-compounding mitigation without degrading task accuracy. The work presents a comprehensive, well-designed solution to a fundamental problem in LLM agent deployment, offering a paradigm shift in document consumption that promises to enhance agent efficiency and capabilities and to simplify multi-agent system architectures.
Recent megakernel designs for Mixture-of-Experts (MoE) inference fuse expert computation with fine-grained, GPU-initiated communication into a single persistent GPU kernel, and outperform collective-based MoE on a single node by overlapping data transfer with compute at tile granularity. This benefit does not carry over cleanly to multi-node inference, where experts span many nodes connected by an RDMA fabric. Communication-bound MoE models regress by up to $10\times$ on 8 nodes, and the regression worsens with node count. We trace this regression to hidden serialization in proxy-based RDMA transports. The ordering requirement between each tile transfer and its completion signal forces a fence that drains the NIC pipeline, and its cost grows with the number of concurrent transfers. As a result, models whose per-expert compute is too small to absorb this inflated network latency expose communication on the critical path. We present \emph{Perseus}, which eliminates this serialization through two techniques. \emph{Decoupled signaling} batches fences at per-destination granularity, reducing fence count by $8\times$. \emph{NIC-side ordering} replaces proxy stalls with hardware fence flags, so the proxy never blocks. On proxy-based transports, Perseus achieves up to 10.3$\times$ end-to-end speedup. Perseus on IBRC matches or exceeds IBGDA GPU-direct by up to 1.2$\times$, which shows that serialization, rather than the choice between proxy-based and GPU-direct transport, is what bounds multi-node megakernel performance.
Primary: Cornell University
All Institutions: Cornell University
The paper identifies a critical and previously "hidden" serialization bottleneck in multi-node megakernel communication for Mixture-of-Experts (MoE) inference, specifically within proxy-based RDMA transports. The core insight is that the ordering requirement between each fine-grained tile transfer and its completion signal (a doorbell write) forces a `wmb` (write memory barrier) on the CPU-side proxy. This `wmb` drains the NIC pipeline, and its cost grows with the number of concurrent transfers, leading to significant performance regression for communication-bound MoE models. Perseus proposes two technically sound and elegant solutions (modeled schematically in the sketch below):

1. **Decoupled Signaling:** This technique batches multiple tile transfers before issuing a single doorbell write and its associated `wmb`. By reducing the number of `wmb`s by up to 8x, it significantly mitigates the serialization overhead. The GPU manages completion tracking for these batches.
2. **NIC-side Ordering:** This more fundamental solution leverages RDMA write-with-immediate (`RDMA_WRITE_WITH_IMM`). By embedding the completion signal (the immediate value) within the same RDMA operation as the data transfer, the NIC inherently guarantees ordering. This completely eliminates the need for a CPU-side `wmb`, so the proxy never blocks. It is a particularly clever use of existing hardware capabilities to solve a software-induced serialization problem.

The methodology is robust, clearly dissecting the problem, proposing targeted solutions, and explaining their mechanisms in detail.
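The two mechanisms are easy to state but easy to misread, so here is a schematic Python model of the proxy's submission loop contrasting per-tile fencing with decoupled signaling. The `NICQueue` object and its fence accounting are stand-ins, not transport code: real implementations live in the proxy thread and issue `wmb()` plus doorbell writes against hardware queues, and NIC-side ordering would remove the fence entirely by carrying the signal in a write-with-immediate.

```python
# Schematic model of the proxy submission loop. NICQueue is a stand-in
# for a hardware send queue: ring_doorbell_with_fence() represents the
# wmb()+doorbell pair whose cost grows with in-flight transfers.

from collections import defaultdict

class NICQueue:
    def __init__(self):
        self.pending = []

    def post_write(self, tile):
        self.pending.append(tile)            # enqueue an RDMA write descriptor

    def ring_doorbell_with_fence(self):
        self.pending.clear()                 # fence: drains the NIC pipeline

def baseline_proxy(tiles, queues):
    """One fence per tile: the ordering requirement between each transfer
    and its completion signal serializes the pipeline."""
    fences = 0
    for dst, tile in tiles:
        queues[dst].post_write(tile)
        queues[dst].ring_doorbell_with_fence()     # fence after EVERY tile
        fences += 1
    return fences

def decoupled_signaling_proxy(tiles, queues, batch=8):
    """Batch tiles per destination and fence once per batch, cutting the
    fence count by roughly the batch factor (8x in the paper)."""
    fences, counts = 0, defaultdict(int)
    for dst, tile in tiles:
        queues[dst].post_write(tile)
        counts[dst] += 1
        if counts[dst] == batch:                   # one completion signal
            queues[dst].ring_doorbell_with_fence() # covers the whole batch
            counts[dst] = 0
            fences += 1
    for dst, n in counts.items():                  # flush partial batches
        if n:
            queues[dst].ring_doorbell_with_fence()
            fences += 1
    return fences

queues = defaultdict(NICQueue)
tiles = [(dst, t) for dst in range(4) for t in range(16)]  # 64 transfers
print(baseline_proxy(tiles, queues),               # 64 fences
      decoupled_signaling_proxy(tiles, queues))    # 8 fences (64 / 8)
```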
The experimental evaluation is comprehensive and rigorous.

- **Setup:** Experiments are conducted on a realistic multi-node cluster (8 nodes, 16 A100 GPUs) connected by an InfiniBand HDR fabric, which is highly relevant for large-scale ML deployments.
- **Baselines:** The evaluation compares Perseus against strong baselines: IBRC (proxy-based RDMA, representing the problematic baseline) and IBGDA (GPU-direct RDMA, often considered the gold standard for high-performance communication).
- **Workloads:** Real-world MoE models, including Switch Transformer (1.6B, 2.3B, 137B parameters) and GShard (600M), are used, demonstrating the practical applicability of the solution.
- **Key Results:**
  - Perseus on IBRC achieves up to 10.3x end-to-end speedup over the baseline IBRC, a truly remarkable improvement.
  - Crucially, Perseus on IBRC matches or even exceeds IBGDA (GPU-direct) by up to 1.2x. This is a surprising and highly impactful finding, challenging the conventional wisdom that GPU-direct is inherently superior to proxy-based approaches for fine-grained communication. It demonstrates that serialization, not the choice of transport mechanism, was the primary bottleneck.
  - The paper provides a clear breakdown of the individual contributions of Decoupled Signaling and NIC-side Ordering, showing how they incrementally contribute to the overall speedup.
  - Microbenchmarks confirm the reduction in fence latency, validating the underlying hypothesis.
  - Sensitivity analysis to expert size further clarifies when Perseus provides the most benefit (models with smaller per-expert compute, where communication is more exposed).

The results are convincing, well-supported, and clearly demonstrate the effectiveness and significance of Perseus.
The paper provides a detailed description of the problem, the proposed solutions, and their implementation within a modified UCX transport layer. The experimental setup, including hardware specifications and workloads, is also well-documented. While the source code is not provided (common for arXiv preprints), the level of detail should enable skilled systems researchers to reproduce the core ideas and potentially the results, given access to similar hardware.
- **Specificity to RDMA/InfiniBand:** The solutions are tailored to the specifics of RDMA transports and the `wmb` behavior in proxy-based communication. While the underlying principle of serialization might exist in other network fabrics, the exact solutions might not directly apply without adaptation.
- **Generalizability to other communication patterns:** While the paper suggests applicability to other fine-grained, GPU-initiated communication, the primary focus and evaluation are on MoE's all-to-all pattern. Its effectiveness for other patterns would need further investigation.
- **Overhead for extremely small messages:** While MoE tiles are fine-grained, they are not necessarily extremely tiny. For truly byte-level communication, the batching overhead of decoupled signaling or the immediate-value processing might introduce new trade-offs, though this is not the target use case.
Perseus has significant broader impact for the field of large-scale machine learning and distributed systems:

- **Enables larger MoE models:** By effectively addressing a critical scaling bottleneck, Perseus allows MoE models to be deployed and inferred across many nodes with much higher efficiency, pushing the boundaries of what's possible with LLMs.
- **Challenges conventional wisdom:** The finding that optimized proxy-based RDMA can outperform GPU-direct RDMA is a major insight that could shift design paradigms for distributed ML systems, potentially simplifying development by making proxy-based approaches more viable.
- **Influences future hardware/software design:** The identification of hidden serialization and the effectiveness of NIC-side ordering could inform future designs of network interface cards (NICs) and communication libraries, encouraging more hardware-level support for flexible ordering and completion signaling.
- **Applicability beyond MoE:** The principles of eliminating hidden serialization in fine-grained, GPU-initiated communication could be beneficial for other distributed workloads that exhibit similar communication patterns.

This paper identifies a critical, previously hidden performance bottleneck in multi-node megakernel communication for MoE inference, deeply analyzes its root cause, and proposes elegant and highly effective solutions that yield up to 10.3x speedup and challenge conventional wisdom regarding proxy-based vs. GPU-direct RDMA. The work is technically profound, experimentally rigorous, and has significant implications for scaling large language models and distributed systems research.
Multimodal Large Language Models (MLLMs) have demonstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely support both textual and visual prompts in one interface. We introduce X2SAM, a unified segmentation MLLM that extends any-segmentation capabilities from images to videos. Given conversational instructions and visual prompts, X2SAM couples an LLM with a Mask Memory module that stores guided vision features for temporally consistent video mask generation. The same formulation supports generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation across image and video inputs. We further introduce the Video Visual Grounded (V-VGD) segmentation benchmark, which evaluates whether a model can segment object tracks in videos from interactive visual prompts. With a unified joint training strategy over heterogeneous image and video datasets, X2SAM delivers strong video segmentation performance, remains competitive on image segmentation benchmarks, and preserves general image and video chat ability.
Primary: unknown
All Institutions: unknown
X2SAM proposes a unified segmentation MLLM designed to extend "any-segmentation" capabilities from images to videos, supporting both textual and visual prompts. The core methodology addresses three key challenges: comprehensive prompt integration, spatio-temporal task formulation, and temporal coherence.

1. **Comprehensive Prompt Integration:** The model augments an LLM to process interleaved textual instructions and visual prompts (V-Prompts) for both image and video inputs, using special paired tokens to demarcate object conditions within the instruction stream.
2. **Spatio-Temporal Task Formulation:** A single formulation covers generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation, so image and video tasks share one interface and one training recipe.
3. **Temporal Coherence:** A Mask Memory module stores mask- and class-guided, multi-scale vision features from previous frames in a fixed-size FIFO buffer and conditions subsequent frames on them, producing temporally consistent video masks (a schematic sketch follows below).
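Since no code accompanies this summary, the following is a minimal Python sketch of what a fixed-capacity FIFO mask memory consistent with the description above could look like. The class name, tensor shapes, the mask-gating step, and the mean-pooled read-out are illustrative assumptions, not X2SAM's actual module.

```python
# Minimal sketch of a fixed-capacity FIFO mask memory (K frames). Shapes
# and the aggregation are assumptions for illustration only.

from collections import deque
import torch

class MaskMemory:
    def __init__(self, capacity: int = 6):       # K=6 in the final model
        self.buffer = deque(maxlen=capacity)     # FIFO eviction of old frames

    def write(self, feats, mask, class_emb):
        """Store one frame: gate spatial features (C, H, W) by the predicted
        binary mask (H, W) and keep the class embedding as guidance."""
        guided = feats * mask.unsqueeze(0)       # broadcast mask over channels
        self.buffer.append((guided, class_emb))

    def read(self):
        """Aggregate stored frames into one conditioning tensor for the next
        frame's mask decoding (mean pooling is a placeholder choice)."""
        if not self.buffer:
            return None
        return torch.stack([g for g, _ in self.buffer]).mean(dim=0)

memory = MaskMemory(capacity=6)
memory.write(torch.randn(256, 64, 64), torch.ones(64, 64), torch.randn(256))
context = memory.read()                          # (256, 64, 64)
```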
The experimental evaluation is comprehensive and rigorous, covering 14 segmentation tasks across images and videos, along with out-of-domain benchmarks.

* **Task Coverage:** X2SAM is evaluated on a broad suite of tasks including generic, open-vocabulary, referring, reasoning, grounded conversation generation, interactive, and visual grounded segmentation for both images and videos.
* **Datasets:** Training involves SA-1B for agnostic segmentation, and a diverse mix of image (COCO, RefCOCO/+/g, ReasonSeg, GLaMM-derived, COCO-VGD, LLaVA-1.5) and video (VIPSeg, VSPW, YT-VIS19, YT-RefVOS21, DAVIS17-RefVOS, ReVOS, VideoGLaMM-derived, YT-VOS19, YT19-VGD, VIPSeg-VGD, VideoInstruct100K) datasets. The introduction of the Video Visual Grounded (V-VGD) segmentation benchmark (YT19-VGD and VIPSeg-VGD) is a significant contribution.
* **Performance:**
  * **Image Segmentation:** X2SAM remains competitive with image-centric generalists like X-SAM, notably improving image open-vocabulary segmentation (I-OV) from 20.9 to 31.2 PQ.
  * **Video Segmentation:** It significantly outperforms existing MLLM-based video generalists. For instance, it improves V-Ref. on Ref-YT21 and Ref-DV17 over UniPixel-7B, and achieves a +21.5 mIoU gain on V-GCG over VideoGLaMM (75.8 vs. 54.3).
  * **Reasoning Segmentation:** Achieves state-of-the-art results on both image (I-Rea. Seg.) and video (V-Rea. Seg.) reasoning tasks, outperforming HyperSeg and even the video-specialist ReferFormer-B.
  * **Out-of-Domain Generalization:** Demonstrates strong generalization on gRefCOCO, ADE20K, and YT-VIS-21, surpassing specialists and other MLLM generalists.
  * **Visual Grounded Segmentation:** Shows substantial improvements over SAM2-H in the video domain (V-VGD Seg.), with impressive AP scores on YT-VIS19 and VIPSeg.
* **Ablation Studies:** Thorough ablations validate key components:
  * **Mask Decoder:** Zero-initialization for Token-to-Image Attention is shown to be crucial for stable training and performance gains.
  * **Joint Training:** The unified joint training strategy significantly reduces training cost (3.3K vs. 5.2K GPU hours) while maintaining performance.
  * **Mask Memory:** Mask guidance, class guidance, and multi-scale features in the Mask Memory module bring consistent and substantial gains, especially for video tasks.
  * **Memory Size:** An optimal memory size of 6 frames is identified, balancing historical information with potential noise.
The paper provides a good level of detail for reproducibility.

* **Model Initialization:** Vision encoder, projector, and LLM from Qwen3-VL; mask encoder from SAM2; mask decoder from a pre-trained agnostic segmentor. LoRA is used for LLM fine-tuning.
* **Training Details:** Specifics for both agnostic segmentor training (batch size 128, LR 1e-4) and unified joint training (projectors, LoRA, encoders, decoder, memory optimized; LR 1e-5 for the mask encoder, 1e-4 for others; effective batch size 32 for video, 128 for image; AdamW optimizer, weight decay 0.05).
* **Loss Functions:** Mask loss (BCE + Dice), auto-regressive loss, and focal loss (a standard formulation of the mask loss is sketched after this list).
* **Dataset Sampling:** Consecutive frame sampling for video segmentation, global sampling for video GCG, 64 frames for video chat.
* **Memory Capacity:** Default K=8 for ablations, K=6 for the final model.

The level of detail provided in the "Implementation Details" and "More Model Details" sections is sufficient for researchers to attempt to reproduce the results, although the sheer scale of training (32 NVIDIA H800 GPUs) might be a practical barrier for some.
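The mask loss is reported only as "BCE + Dice"; the snippet below is the standard PyTorch formulation of that combination. The relative weights and the smoothing constant are assumptions, since the paper's exact coefficients are not given here.

```python
# Standard BCE + Dice mask loss; weights and smoothing are assumed values.

import torch
import torch.nn.functional as F

def mask_loss(logits, target, w_bce=1.0, w_dice=1.0, eps=1.0):
    """logits, target: (N, H, W); target is a binary ground-truth mask."""
    bce = F.binary_cross_entropy_with_logits(logits, target.float())
    prob = logits.sigmoid().flatten(1)            # (N, H*W)
    tgt = target.float().flatten(1)
    inter = (prob * tgt).sum(dim=1)
    dice = 1.0 - (2 * inter + eps) / (prob.sum(dim=1) + tgt.sum(dim=1) + eps)
    return w_bce * bce + w_dice * dice.mean()
```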
The authors candidly discuss several limitations:

1. **Computational Expense:** Unified training over heterogeneous image and video datasets remains computationally expensive, especially for video samples with high memory costs.
2. **Fixed-Size Memory:** The fixed-size FIFO memory (K=6 frames) may be insufficient for very long videos, scenarios with prolonged occlusions, large appearance changes, or sparse target reappearance, limiting long-term temporal understanding.
3. **Generalist vs. Specialist Performance:** As a unified generalist model, X2SAM may still lag behind highly specialized models on narrowly focused tasks (e.g., optimized video object segmentation or image-only segmentation).
X2SAM represents a significant step towards more generalized and intuitive multimodal AI.

* **Enhanced Human-Computer Interaction:** The conversational interface supporting both text and visual prompts for pixel-level control across images and videos could lead to more natural and powerful interaction paradigms for visual editing, content creation, and data annotation.
* **Advanced Video Understanding:** The ability to perform complex segmentation tasks with temporal consistency in videos opens doors for applications in autonomous driving, surveillance, robotics, and medical imaging, where precise spatio-temporal object understanding is critical.
* **Foundation for Future MLLMs:** By demonstrating effective unification of image and video segmentation within an MLLM, X2SAM provides a strong baseline and architectural insights for developing even more capable multimodal foundation models.
* **New Benchmarking:** The V-VGD benchmark provides a valuable tool for the community to evaluate and advance research in video visual grounded segmentation.

X2SAM introduces a unified MLLM framework that extends "any-segmentation" from images to videos, integrating a novel Mask Memory module for temporal consistency and a unified joint training strategy. This paper makes a substantial technical contribution by enabling a single model to perform a wide array of image and video segmentation tasks with both textual and visual prompts, achieving strong performance across modalities and introducing a valuable new benchmark for video visual grounded segmentation.