# Simular AI's Agent Frameworks and the Behavior Best-of-N Scaling Paradigm

***

## 🌟 Executive Summary: The Wide Scaling Paradigm 🌟

This document provides a comprehensive analysis of the evolution of Simular AI's computer-use agent (CUA) frameworks, culminating in the development of **Agent S3** and the introduction of the **Behavior Best-of-N (bBoN) scaling paradigm**. The core challenge addressed is the inherent **unreliability and high variance of CUAs** in long-horizon, complex digital tasks, where minor errors can compound and lead to catastrophic failure.

### **The Primary Innovation: Wide Scaling (bBoN)**

Instead of focusing on perfecting a single agent's execution, bBoN generates **multiple, diverse agent trajectories (rollouts) in parallel** and systematically selects the most successful one.

- This method has proven exceptionally effective, pushing the state of the art on the **OSWorld benchmark to a 69.9% success rate**, a **10% absolute improvement** over prior methods and approaching the **72% human performance level**.

### **Breakthrough Mechanisms**

This breakthrough is enabled by two novel mechanisms:

1. **Behavior Narratives:** A method for converting dense, raw trajectories into **concise, factual summaries** that describe what an agent did and how the environment changed. This filters out irrelevant details and makes different rollouts directly comparable.
2. **Comparative Judge Selection:** A **Vision-Language Model (VLM) judge** evaluates all behavior narratives simultaneously, allowing for a **principled and holistic selection** of the best overall trajectory.

The bBoN framework is built upon **Agent S3**, a simplified and more flexible agent architecture that improves upon its predecessors, Agent S and Agent S2. Agent S3 alone achieves state-of-the-art performance by replacing a rigid hierarchical structure with a **flat policy** and integrating a **native coding agent**.
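The wide-scaling loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the Agent S3 implementation: `run_rollout`, `behavior_narrative`, and `judge_mcq` are hypothetical placeholders (a real system would drive a GUI and call a VLM judge), assumed here only to show how parallel rollouts, narrative summarization, and comparative selection fit together.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_rollout(task, seed):
    """Hypothetical stand-in for one full agent rollout.

    A real rollout would interact with the GUI for many steps; this stub
    just returns a list of (action, observation_change) pairs.
    """
    rng = random.Random(seed)
    return [(f"click(element_{rng.randint(0, 9)})", "menu opened"),
            ("type('report.csv')", "filename entered")]

def behavior_narrative(trajectory):
    """Collapse a dense trajectory into a concise, factual summary.

    Each step becomes one 'fact' describing the action and how the
    environment changed; the facts are concatenated into a narrative.
    """
    facts = [f"The agent did {action}; as a result, {change}."
             for action, change in trajectory]
    return " ".join(facts)

def judge_mcq(task, narratives):
    """Hypothetical comparative judge seeing ALL narratives at once.

    The real bBoN judge is a VLM answering a multiple-choice question
    over the candidates; this stub picks the longest narrative purely
    as a placeholder selection rule.
    """
    return max(range(len(narratives)), key=lambda i: len(narratives[i]))

def behavior_best_of_n(task, n=10):
    """Wide scaling: N parallel rollouts -> narratives -> one winner."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        rollouts = list(pool.map(lambda s: run_rollout(task, s), range(n)))
    narratives = [behavior_narrative(t) for t in rollouts]
    best = judge_mcq(task, narratives)
    return rollouts[best], narratives[best]
```

The key design point the sketch preserves: the judge never sees raw trajectories, only the compact narratives, which is what makes comparing many long-horizon rollouts tractable.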
The combination of the improved Agent S3 baseline with the bBoN scaling method demonstrates a powerful and practical pathway toward building **robust, near-human-level autonomous agents** for computer use.

***

## 1. The Core Challenge: High Variance in Long-Horizon Tasks

Computer-use agents (CUAs) hold the promise of automating everyday digital tasks, but their practical application is hindered by their **unreliability**, especially on complex, multi-step workflows. The primary bottleneck is **high variance**: an agent's performance can be unpredictable, succeeding on one attempt but failing catastrophically on another.

### **Sources of Fragility**

This fragility stems from several factors inherent to real-world digital environments:

- ➡️ **Accumulation of Errors:** Small mistakes, such as a misclick or a misinterpretation of a UI element, can compound over dozens or hundreds of interactions.
- 🚦 **Environmental Noise:** Unexpected events like UI changes, pop-up notifications, and network latency can destabilize an agent's performance.
- ⏳ **Delayed Feedback:** The consequences of an incorrect action may not become apparent until several steps later, making recovery difficult.
- 🧩 **Branching Solution Paths:** Many tasks admit multiple valid solutions, and an agent might commit to a difficult or incorrect path early on.

> 📢 **Quote:** "The same agent might nail a task once and then completely blow it the next time."

This inconsistency makes CUAs unpredictable and highlights why achieving reliability in complex, everyday workflows remains a significant challenge.

***

## 2. The Agent S Series: An Evolutionary Framework

Simular AI has developed a series of open-source agentic frameworks, hosted on GitHub under `simular-ai/Agent-S`, with each iteration introducing significant architectural advancements.

### 2.1. Agent S: Foundational Hierarchical Planning and Memory

Agent S introduced a framework designed to mimic human-like computer interaction by tackling domain knowledge acquisition, long-horizon planning, and dynamic interface navigation.

- **Experience-Augmented Hierarchical Planning:** A **Manager** module decomposes complex tasks into subtasks, leveraging external web searches and internal memories. A **Worker** module executes subtasks, drawing on "Episodic Memory."
- **Agent-Computer Interface (ACI):** An abstraction layer using a **dual-input strategy** (screenshots and an accessibility tree) and a bounded, language-based action space (e.g., `click(element_id)`).
- **Continual Learning:** "Narrative" and "Episodic" memories are continually updated through self-supervised exploration and inference tasks.

> 📊 **Result:** Agent S achieved a **20.58%** success rate on the OSWorld benchmark, outperforming the baseline of 11.21%, a relative improvement of 83.6%.

### 2.2. Agent S2: A Compositional Framework with Specialist Modules

Agent S2 evolved into a compositional framework, delegating cognitive responsibilities across specialist models.

- **Mixture of Grounding (MoG):** Routes actions to a team of specialized grounding experts (visual, textual, structural) to accurately locate GUI elements using **only screenshots**, eliminating the need for bulky accessibility trees.
- **Proactive Hierarchical Planning:** The manager dynamically refines the action plan after the completion of **every subgoal**, allowing for continuous adaptation and self-correction.

> 📊 **Result:** Agent S2 achieved a **34.5%** success rate on the OSWorld 50-step evaluation. Analysis showed MoG reduced grounding errors, shifting the primary bottleneck to planning failures.

### 2.3. Agent S3: A Simplified, More Flexible Baseline

Agent S3 serves as the foundational agent for the bBoN scaling method, prioritizing simplicity, flexibility, and efficiency.
- **Flat Policy:** The rigid hierarchical manager-worker setup was removed in favor of a **"flat policy."** This streamlines the framework, reduces overhead, and allows the agent to **replan at any time**.
- **Native Coding Agent:** Integrates a coding agent natively for programmatic edits (e.g., bulk file operations, structured parsing), offering more reliable solution paths beyond direct GUI manipulation.

> 🏆 **Result:** These refinements alone boosted single-agent performance significantly. Agent S3 achieved a **62.6%** success rate on OSWorld (100-step), establishing a new state of the art **before** the application of wide scaling.

***

## 3. Behavior Best-of-N (bBoN): The Wide Scaling Paradigm

The **Behavior Best-of-N (bBoN)** framework is the core innovation, systematically scaling the number of candidate solution trajectories ($N$) to address the high variance of single-agent rollouts.

### 3.1. Core Concept: From Single Trajectory to Parallel Rollouts

The bBoN paradigm shifts the focus from optimizing a single path to selecting the best path from a **diverse portfolio of attempts**. Generating multiple rollouts in parallel dramatically increases the probability that at least one attempt will succeed.

### 3.2. Key Mechanism 1: Behavior Narrative Generation

This mechanism solves the challenge of comparing information-dense, noisy long-horizon trajectories.

1. **Fact Generation:** A VLM analyzes the before-and-after screenshots and the agent's action to derive a concise **"fact."**
2. **Narrative Construction:** These facts are concatenated to form a **behavior narrative**, a compact, task-relevant summary.

> ✅ **Validation:** An ablation study confirmed behavior narratives are a more effective representation for trajectory evaluation than using screenshots alone or a naive captioning approach.

### 3.3. Key Mechanism 2: Comparative Judge Selection

bBoN employs a VLM-based judge for principled, holistic selection using the narratives.
- **Comparative Evaluation:** The judge evaluates **all** candidate behavior narratives simultaneously in a single-round, **multiple-choice question (MCQ) format**.
- **Fact-Grounded Reasoning:** The judge is instructed to **cite and contrast facts** from the narratives to justify its selection, ensuring a robust comparison.

> 💡 **Superiority:** This approach was shown to be superior to both step-wise scaling (which risks over-committing to suboptimal paths) and independent ranking.

***

## 4. Performance, Benchmarks, and Analysis

The combination of Agent S3 and bBoN has yielded state-of-the-art results, demonstrating "unreasonable effectiveness."

### 4.1. State-of-the-Art Performance Metrics

| Benchmark | Agent / Method | Success Rate | Notes |
| --- | --- | --- | --- |
| OSWorld (100-step) | Previous SoTA (CoAct-1) | 59.9% | Previous state of the art |
| OSWorld (100-step) | Agent S3 (single run) | 62.6% | New SoTA *before* scaling |
| OSWorld (100-step) | Agent S3 w/ bBoN (N=10) | **69.9%** | **Near human performance (72%)** |
| OSWorld (100-step) | Agent S3 w/ bBoN (GPT-5 Mini) | 60.2% | Shows that scaling also benefits smaller models |
| WindowsAgentArena | Agent S3 w/ bBoN (N=3) | 56.6% | Demonstrates strong generalization |
| AndroidWorld | Agent S3 w/ bBoN (N=3) | 71.6% | Demonstrates strong generalization |

- Performance was found to generally **increase with the number of rollouts**.
- A **mixture-of-models ensemble** (GPT-5 + Gemini 2.5 Pro) achieved the highest task coverage.

### 4.2. Judge and Human Alignment: True Performance

Analysis of the bBoN judge's accuracy revealed a critical insight: automated evaluation scripts can be imperfect.
| Alignment Type | Judge Accuracy | Finding |
| --- | --- | --- |
| Benchmark alignment | 78.4% | Alignment with the rigid automated evaluation script |
| Human alignment | **92.8%** | Judge's selection found to be correct upon manual inspection of remaining failures |

> 🚀 **Conclusion:** The high degree of human alignment suggests the agent's true performance on OSWorld is closer to **76.3%**, **potentially exceeding the reported human-level score of 72%**, because the agent often found valid alternative solutions that the rigid evaluation scripts marked as incorrect.

***

## 5. The Broader CUA Landscape

Simular AI's work is situated in a competitive landscape traditionally categorized into two approaches:

1. **Monolithic (Generalist) Solutions:** Use a single large model for all task aspects.
   - *Examples:* OpenAI CUA (Operator), Anthropic's Claude Computer Use (CCU), UI-TARS, and Jedi-7B.
2. **Hierarchical and Compositional Solutions:** Decompose tasks and delegate to multiple modules.
   - *Examples:* Agent S, S2, Navi (WindowsAgentArena), MobileUse, and UI-Venus (AndroidWorld).

> 💎 **Distinct Paradigm:** **Agent S3 with bBoN's wide scaling** represents a distinct **third paradigm**, focused on leveraging multiple agent runs rather than optimizing a single execution path.

***

## 6. Open Source Implementation and Project Details

Simular AI's frameworks are developed as an open-source project to advance the field.

| Detail | Information |
| --- | --- |
| Repository | `github.com/simular-ai/Agent-S` |
| Library | Installable via Python: `pip install gui-agents` |
| Supported Platforms | Linux, Mac, and Windows (single-monitor setups) |
| Recommended Setup | **OpenAI gpt-5-2025-08-07** (main reasoning) + **UI-TARS-1.5-7B** (specialist grounding) |
| Project History | **ICLR 2025:** Agent S paper acceptance. **COLM 2025:** Agent S2 paper acceptance. **October 2, 2025:** Agent S3 paper release. |
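The recommended setup pairs a strong reasoning model with a small grounding specialist. The division of labor can be sketched as below; note that `plan_next_action` and `ground_to_coordinates` are hypothetical stubs for illustration only, not the `gui-agents` API, and the returned coordinates are placeholders.

```python
def plan_next_action(task, screenshot_desc):
    """Hypothetical reasoning-model call (a GPT-class planner):
    decides WHAT to do next, expressed in natural language."""
    return f"click the 'Save' button to progress on: {task}"

def ground_to_coordinates(action_desc, screenshot_size):
    """Hypothetical grounding-specialist call (a UI-TARS-class model):
    maps the described UI target to pixel coordinates on screen."""
    w, h = screenshot_size
    return (w // 2, h // 2)  # placeholder: center of the screen

def step(task, screenshot_desc, screenshot_size):
    """One planner -> grounder step, mirroring the recommended split:
    the large model plans, the small specialist grounds the click."""
    action = plan_next_action(task, screenshot_desc)
    x, y = ground_to_coordinates(action, screenshot_size)
    return {"action": action, "click_at": (x, y)}
```

This split keeps the expensive reasoning model out of the pixel-level localization loop, which is the motivation for pairing it with a dedicated grounding model.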
***

## CUA and Agent S Framework Index (Compact)

This list summarizes key index words and concepts from the provided sources related to Computer-Use Agents (CUAs) and Simular AI's Agent S series.

### I. Core Frameworks and Methods

- **Behavior Best-of-N (bBoN):** Wide scaling paradigm, Multiple rollouts/trajectories, Behavior narratives, Principled trajectory selection, Comparative evaluation, Behavior Narrative Generator, bBoN Judge, MCQ.
- **Agent S3:** Flat policy, Native coding agent, State-of-the-Art (SoTA).
- **Agent S2:** Compositional framework, Mixture of Grounding (MoG), Proactive Hierarchical Planning (PHP), Reactive planning.
- **Agent S:** Experience-Augmented Hierarchical Planning, Hierarchical Planning, Monolithic methods.

### II. Key Components and Technologies

- **Agent Core:** Computer-Use Agents (CUAs), GUI agents, High variance (CUA bottleneck), Agent-Computer Interface (ACI).
- **Grounding:** Visual Grounding Expert, Textual Grounding Expert, Structural Grounding Expert, GUI localization.
- **Planning & Memory:** Advanced reasoning, Memory, Narrative Memory, Episodic Memory, Online Web Knowledge, Self-Evaluator, Self-supervised Exploration, Continual Memory Update (continual learning), Retrieval-augmented generation (RAG).
- **Models & Compute:** MLLM, LLMs, Foundation models, Scaling agents/compute.

### III. Benchmarks and Evaluation

- OSWorld, WindowsAgentArena, AndroidWorld, State of the Art (SoTA), Human-level performance, Trajectory evaluation.

### IV. Technical and General Subject Index Terms

- **AI/CS Fields:** Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG).
- **General Tech:** Computer automation, Modularity, Data privacy, Cyber attacks, AIOps, Federated learning, GPT, AGI.
***

## Simular AI Agent S Paper Sources on Hugging Face

| Agent Version | Paper Title | Date / Acceptance | URL |
| --- | --- | --- | --- |
| Agent S | Agent S: An Open Agentic Framework that Uses Computers Like a Human | Accepted ICLR 2025 (released Oct 2024) | `https://huggingface.co/papers/2410.08164` (arXiv) |
| Agent S2 | Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents | Accepted COLM 2025 (released Apr 2025) | `https://arxiv.org/abs/2504.00906` (arXiv) |
| Agent S3 | The Unreasonable Effectiveness of Scaling Agents for Computer Use (relates to the bBoN paradigm) | Released Oct 2, 2025 | `https://huggingface.co/papers/2510.02250` (Paper Page) |