ACE Podcast Episode 2 - Agent S3

# Simular AI's Agent Frameworks and the Behavior Best-of-N Scaling Paradigm

***

## 🌟 Executive Summary: The Wide Scaling Paradigm 🌟

This document provides a comprehensive analysis of the evolution of Simular AI's computer-use agent (CUA) frameworks, culminating in the development of **Agent S3** and the introduction of the **Behavior Best-of-N (bBoN) scaling paradigm**.

The core challenge addressed is the inherent **unreliability and high variance of CUAs** in long-horizon, complex digital tasks, where minor errors can compound and lead to catastrophic failure.

### **The Primary Innovation: Wide Scaling (bBoN)**

Instead of focusing on perfecting a single agent's execution, bBoN generates **multiple, diverse agent trajectories (rollouts) in parallel** and systematically selects the most successful one.

- This method has proven exceptionally effective, pushing the state-of-the-art on the **OSWorld benchmark to a 69.9% success rate**, a **10% absolute improvement** over prior methods and approaching the **72% human performance level**.

### **Breakthrough Mechanisms**

This breakthrough is enabled by two novel mechanisms:

1. **Behavior Narratives:** A method for converting dense, raw trajectories into **concise, factual summaries** that describe what an agent did and how the environment changed. This filters out irrelevant details and makes different rollouts directly comparable.
2. **Comparative Judge Selection:** A **Vision-Language Model (VLM) judge** evaluates all behavior narratives simultaneously, allowing for a **principled and holistic selection** of the best overall trajectory.

The bBoN framework is built upon **Agent S3**, a simplified and more flexible agent architecture that improves upon its predecessors, Agent S and Agent S2. Agent S3 alone achieves state-of-the-art performance by replacing a rigid hierarchical structure with a **flat policy** and integrating a **native coding agent**. The combination of the improved Agent S3 baseline with the bBoN scaling method demonstrates a powerful and practical pathway toward building **robust, near-human-level autonomous agents** for computer use.

***

## 1. The Core Challenge: High Variance in Long-Horizon Tasks

Computer-use agents (CUAs) hold the promise of automating everyday digital tasks, but their practical application is hindered by their **unreliability**, especially on complex, multi-step workflows. The primary bottleneck is **high variance**: an agent's performance can be unpredictable, succeeding on one attempt but failing catastrophically on another.

### **Source of Fragility**

This fragility stems from several factors inherent to real-world digital environments:

- ➡️ **Accumulation of Errors:** Small mistakes, such as a misclick or a misinterpretation of a UI element, can compound over dozens or hundreds of interactions (see the back-of-the-envelope calculation after this list).
- 🚦 **Environmental Noise:** Unexpected events like UI changes, pop-up notifications, and network latency can destabilize an agent's performance.
- ⏳ **Delayed Feedback:** The consequences of an incorrect action may not become apparent until several steps later, making recovery difficult.
- 🧩 **Branching Solution Paths:** Many tasks admit multiple valid solutions, and an agent might commit to a difficult or incorrect path early on.
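
To make the error-compounding point concrete, here is a back-of-the-envelope calculation. The per-step reliability figures are illustrative assumptions, not measurements from the sources.

```python
# Illustrative only: how small per-step error rates compound over a
# long-horizon task. The per-step success rates below are hypothetical.
for per_step_success in (0.99, 0.95):
    for steps in (50, 100):
        end_to_end = per_step_success ** steps  # assumes independent steps
        print(f"{per_step_success:.0%} per step over {steps:3d} steps "
              f"-> ~{end_to_end:.1%} end-to-end success")
```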

> 📢 **Quote:** "The same agent might nail a task once and then completely blow it the next time." This inconsistency makes CUAs unpredictable and highlights why achieving reliability in complex, everyday workflows remains a significant challenge.

***

## 2. The Agent S Series: An Evolutionary Framework

Simular AI has developed a series of open-source agentic frameworks, hosted on GitHub under `simular-ai/Agent-S`, with each iteration introducing significant architectural advancements.

### 2.1. Agent S: Foundational Hierarchical Planning and Memory

Agent S introduced a framework designed to mimic human-like computer interaction by tackling domain knowledge acquisition, long-horizon planning, and dynamic interface navigation.

- **Experience-Augmented Hierarchical Planning:** A **Manager** module decomposes complex tasks into subtasks, leveraging external web searches and internal memories. A **Worker** module executes subtasks, drawing on "Episodic Memory."
- **Agent-Computer Interface (ACI):** An abstraction layer using a **dual-input strategy** (screenshots and an accessibility tree) and a bounded, language-based action space (e.g., `click(element_id)`). A mock-up of such an action space follows this list.
- **Continual Learning:** "Narrative" and "Episodic" memories are continually updated through self-supervised exploration and inference tasks.
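
As a rough illustration of a bounded, language-based action space, the sketch below defines a handful of actions as plain Python callables. Only `click(element_id)` appears in the sources; the other action names, signatures, and the `Action` container are hypothetical.

```python
# Hypothetical mock-up of a bounded, language-based ACI action space.
# Only click(element_id) is mentioned in the sources; everything else here
# (action names, signatures, the Action dataclass) is an illustrative assumption.
from dataclasses import dataclass

@dataclass
class Action:
    name: str
    args: dict

def click(element_id: int) -> Action:
    """Click a UI element identified by its accessibility-tree id."""
    return Action("click", {"element_id": element_id})

def type_text(element_id: int, text: str) -> Action:
    """Type text into an editable element."""
    return Action("type_text", {"element_id": element_id, "text": text})

def hotkey(*keys: str) -> Action:
    """Press a keyboard shortcut, e.g. hotkey("ctrl", "s")."""
    return Action("hotkey", {"keys": list(keys)})

# Constraining the model's output to this small set of callables keeps the
# action space bounded and easy to validate before execution.
plan = [click(42), type_text(42, "quarterly report"), hotkey("ctrl", "s")]
```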

> 📊 **Result:** Agent S achieved a **20.58%** success rate on the OSWorld benchmark, outperforming the baseline of 11.21% by a relative improvement of 83.6%.

### 2.2. Agent S2: A Compositional Framework with Specialist Modules

Agent S2 evolved into a compositional framework, delegating cognitive responsibilities across specialist models.

- **Mixture of Grounding (MoG):** Routes actions to a team of specialized grounding experts (visual, textual, structural) to accurately locate GUI elements using **only screenshots**, eliminating the need for bulky accessibility trees. A hypothetical routing sketch follows this list.
- **Proactive Hierarchical Planning:** The manager dynamically refines the action plan after the completion of **every subgoal**, allowing for continuous adaptation and self-correction.
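
To give a feel for how grounding requests might be routed to specialist experts, here is a minimal, hypothetical dispatcher. The expert function names and the keyword-based routing heuristic are assumptions for illustration, not Agent S2's actual implementation.

```python
# Hypothetical sketch of Mixture-of-Grounding-style routing.
# The expert functions and the routing heuristic are illustrative assumptions.

def visual_expert(screenshot, description):
    ...  # locate elements by appearance (icons, buttons, images)

def textual_expert(screenshot, description):
    ...  # locate elements by on-screen text

def structural_expert(screenshot, description):
    ...  # locate elements inside structured content (tables, spreadsheets)

def ground(screenshot, description: str):
    """Route a natural-language target description to a grounding expert
    and return screen coordinates for the referenced element."""
    desc = description.lower()
    if any(k in desc for k in ("cell", "row", "column", "table")):
        return structural_expert(screenshot, description)
    if any(k in desc for k in ("text", "label", "titled", "named")):
        return textual_expert(screenshot, description)
    return visual_expert(screenshot, description)
```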

> 📊 **Result:** Agent S2 achieved a **34.5%** success rate on the OSWorld 50-step evaluation. Analysis showed MoG reduced grounding errors, shifting the primary bottleneck to planning failures.

### 2.3. Agent S3: A Simplified, More Flexible Baseline

Agent S3 serves as the foundational agent for the bBoN scaling method, prioritizing simplicity, flexibility, and efficiency.

- **Flat Policy:** The rigid hierarchical manager-worker setup was removed in favor of a **"flat policy."** This streamlines the framework, reduces overhead, and allows the agent to **replan at any time**.
- **Native Coding Agent:** Natively integrates a coding agent for programmatic edits (e.g., bulk file operations, structured parsing), offering more reliable solution paths beyond direct GUI manipulation.
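
A minimal sketch of how a flat policy with a native coding agent could be wired together is shown below; every function and type here is a hypothetical stand-in, not Agent S3's real API.

```python
# Illustrative flat-policy control loop with a native coding agent.
# Decision, observe(), policy_step(), execute_gui_action(), and run_code()
# are hypothetical stand-ins, not Agent S3's real API.
from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Decision:
    kind: str                 # "gui", "code", or "done"
    action: Any = None        # GUI action to execute
    script: str = ""          # program for the native coding agent
    success: bool = False     # terminal outcome when kind == "done"

def observe() -> Any:
    """Capture the current screen (stubbed)."""
    return "screenshot"

def policy_step(instruction: str, obs: Any, history: List[Tuple]) -> Decision:
    """One model call decides the next step. With no manager/worker split,
    the policy sees the full history and can replan at any point (stubbed)."""
    return Decision(kind="done", success=True)

def execute_gui_action(action: Any) -> None:
    """Perform a GUI action such as a click or keystroke (stubbed)."""

def run_code(script: str) -> None:
    """Run a programmatic edit, e.g. bulk file operations (stubbed)."""

def run_task(instruction: str, max_steps: int = 100) -> bool:
    history: List[Tuple] = []
    for _ in range(max_steps):
        obs = observe()
        decision = policy_step(instruction, obs, history)
        if decision.kind == "done":
            return decision.success
        if decision.kind == "gui":
            execute_gui_action(decision.action)   # direct GUI manipulation
        elif decision.kind == "code":
            run_code(decision.script)             # programmatic solution path
        history.append((obs, decision))
    return False
```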

> 🏆 **Result:** These refinements alone boosted single-agent performance significantly. Agent S3 achieved a **62.6%** success rate on OSWorld (100-step), establishing a new state-of-the-art **before** the application of wide scaling.

***

## 3. Behavior Best-of-N (bBoN): The Wide Scaling Paradigm

The **Behavior Best-of-N (bBoN)** framework is the core innovation, systematically scaling the number of candidate solution trajectories ($N$) to address the high variance of single-agent rollouts.

### 3.1. Core Concept: From Single Trajectory to Parallel Rollouts

The bBoN paradigm shifts the focus from optimizing a single path to selecting the best path from a **diverse portfolio of attempts**. Generating multiple rollouts in parallel dramatically increases the probability that at least one attempt will succeed.
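
A quick calculation shows why parallel rollouts help so much, under the simplifying assumptions that rollouts succeed independently with probability $p$ and that the judge always picks a successful rollout when one exists:

```python
# If one rollout succeeds with probability p and rollouts are roughly
# independent, at least one of N rollouts succeeds with probability
# 1 - (1 - p)**N. Assumes a perfect selector; in practice the judge's
# own errors reduce the realized gain.
def at_least_one_success(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for n in (1, 3, 5, 10):
    print(f"N={n:2d}: {at_least_one_success(0.6, n):.2%}")
# p = 0.6 is illustrative (close to Agent S3's single-run rate). With N = 10,
# some rollout succeeds ~99.99% of the time, suggesting that selection
# quality, not generation, becomes the limiting factor.
```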

### 3.2. Key Mechanism 1: Behavior Narrative Generation

This solves the challenge of comparing information-dense, noisy long-horizon trajectories.

1. **Fact Generation:** A VLM analyzes the before-and-after screenshots and the agent's action to derive a concise **"fact."**
2. **Narrative Construction:** These facts are concatenated to form a **behavior narrative**—a compact, task-relevant summary.
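
A minimal sketch of these two steps, assuming a generic `vlm(prompt, images)` helper; the prompt wording and function names are illustrative, not the paper's exact implementation.

```python
# Hypothetical sketch of behavior-narrative generation.
# vlm() is an assumed helper that sends a prompt plus images to a
# vision-language model and returns text; it is not a real library call.

def vlm(prompt: str, images: list) -> str:
    """Stand-in for a vision-language model call (stubbed)."""
    return "The agent clicked 'Save' and a confirmation dialog appeared."

def step_fact(before_img, after_img, action: str) -> str:
    """Step 1: derive one concise, factual sentence per agent step."""
    prompt = (
        "Given the screenshots before and after this action, state in one "
        f"sentence what the agent did and how the screen changed.\nAction: {action}"
    )
    return vlm(prompt, [before_img, after_img])

def behavior_narrative(trajectory) -> str:
    """Step 2: concatenate per-step facts into a compact, comparable summary.

    trajectory is assumed to be a list of (before_screenshot, action,
    after_screenshot) tuples."""
    facts = [step_fact(before, after, action) for before, action, after in trajectory]
    return " ".join(f"Step {i + 1}: {fact}" for i, fact in enumerate(facts))
```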

> ✅ **Validation:** An ablation study confirmed behavior narratives are a more effective representation for trajectory evaluation than using screenshots alone or a naive captioning approach.

### 3.3. Key Mechanism 2: Comparative Judge Selection

bBoN employs a VLM-based judge for principled, holistic selection using the narratives.

- **Comparative Evaluation:** The judge evaluates **all** candidate behavior narratives simultaneously in a single-round, **multiple-choice question (MCQ) format**.
- **Fact-Grounded Reasoning:** The judge is instructed to **cite and contrast facts** from the narratives to justify its selection, ensuring a robust comparison.
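
Below is one way the single-round MCQ comparison could be framed; the prompt text and the `vlm_judge` helper are assumptions for illustration rather than the paper's exact prompt.

```python
# Hypothetical sketch of comparative judge selection over behavior narratives.
# vlm_judge() is an assumed helper, not a real library call.
import string

def vlm_judge(prompt: str) -> str:
    """Stand-in for a single VLM judge call that returns an option letter."""
    return "A"

def select_best(task: str, narratives: list[str]) -> int:
    """Present all candidate narratives at once as a multiple-choice question
    and return the index of the judge's pick."""
    options = "\n\n".join(
        f"({string.ascii_uppercase[i]}) {n}" for i, n in enumerate(narratives)
    )
    prompt = (
        f"Task: {task}\n\n"
        f"Candidate rollouts, summarized as behavior narratives:\n\n{options}\n\n"
        "Compare the candidates, citing and contrasting specific facts from the "
        "narratives, then answer with the single letter of the rollout that best "
        "completes the task."
    )
    answer = vlm_judge(prompt).strip()
    return string.ascii_uppercase.index(answer[0])
```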

> 💡 **Superiority:** This approach was shown to be superior to both step-wise scaling (risk of over-committing to suboptimal paths) and independent ranking.

***

## 4. Performance, Benchmarks, and Analysis

The combination of Agent S3 and bBoN has yielded state-of-the-art results, demonstrating "unreasonable effectiveness."

### 4.1. State-of-the-Art Performance Metrics

| Benchmark | Agent / Method | Success Rate | Notes |
| --- | --- | --- | --- |
| OSWorld (100-step) | Previous SoTA (CoAct-1) | 59.9% | Previous state-of-the-art |
| OSWorld (100-step) | Agent S3 (Single Run) | 62.6% | New SoTA *before* scaling |
| OSWorld (100-step) | Agent S3 w/ bBoN (N=10) | **69.9%** | **Near human performance (72%)** |
| OSWorld (100-step) | Agent S3 w/ bBoN (GPT-5 Mini) | 60.2% | Shows scaling benefits smaller models |
| WindowsAgentArena | Agent S3 w/ bBoN (N=3) | 56.6% | Demonstrates strong generalization |
| AndroidWorld | Agent S3 w/ bBoN (N=3) | 71.6% | Demonstrates strong generalization |

- Performance was found to generally **increase with the number of rollouts**.
- A **mixture-of-models ensemble** (GPT-5 + Gemini 2.5 Pro) achieved the highest task coverage.

### 4.2. Judge and Human Alignment: True Performance

Analysis of the bBoN judge's accuracy revealed a critical insight: automated evaluation scripts can be imperfect.

| Alignment Type | Judge Accuracy | Finding |
| --- | --- | --- |
| Benchmark Alignment | 78.4% | Alignment with the rigid automated evaluation script |
| Human Alignment | **92.8%** | Judge's selection found to be correct upon manual inspection of remaining failures |

> 🚀 **Conclusion:** The high degree of human alignment suggests the agent's true performance on OSWorld is closer to **76.3%**, **potentially exceeding the reported human-level score of 72%**, because the agent often found valid alternative solutions the rigid evaluation scripts marked as incorrect.

***

## 5. The Broader CUA Landscape

Simular AI's work is situated in a competitive landscape categorized into two traditional approaches:

1. **Monolithic (Generalist) Solutions:** Use a single large model for all task aspects.
    - *Examples:* OpenAI CUA (Operator), Anthropic's Claude Computer Use (CCU), UI-TARS, and Jedi-7B.
2. **Hierarchical and Compositional Solutions:** Decompose tasks and delegate to multiple modules.
    - *Examples:* Agent S, S2, Navi (WindowsAgentArena), MobileUse, and UI-Venus (AndroidWorld).

> 💎 **Distinct Paradigm:** **Agent S3 with bBoN's wide scaling** represents a distinct **third paradigm**, focused on leveraging multiple agent runs rather than optimizing a single execution path.

***

## 6. Open Source Implementation and Project Details

Simular AI's frameworks are developed as an open-source project to advance the field.

| Detail | Information |
| --- | --- |
| Repository | `github.com/simular-ai/Agent-S` |
| Library | Installable via Python: `pip install gui-agents` |
| Supported Platforms | Linux, Mac, and Windows (single-monitor setups) |
| Recommended Setup | **OpenAI gpt-5-2025-08-07** (main reasoning) + **UI-TARS-1.5-7B** (specialist grounding) |
| Project History | **ICLR 2025:** Agent S paper acceptance · **COLM 2025:** Agent S2 paper acceptance · **October 2, 2025:** Agent S3 paper release |

***

## CUA and Agent S Framework Index (Compact)

This list summarizes key index words and concepts from the provided sources related to Computer-Use Agents (CUAs) and Simular AI's Agent S series.

### I. Core Frameworks and Methods

| Category | Key Concepts |
| --- | --- |
| Behavior Best-of-N (bBoN) | Wide scaling paradigm, Multiple rollouts/trajectories, Behavior narratives, Principled trajectory selection, Comparative evaluation, Behavior Narrative Generator, bBoN Judge, MCQ |
| Agent S3 | Flat policy, Native coding agent, State-of-the-Art (SoTA) |
| Agent S2 | Compositional framework, Mixture of Grounding (MoG), Proactive Hierarchical Planning (PHP), Reactive planning |
| Agent S | Experience-Augmented Hierarchical Planning, Hierarchical Planning, Monolithic methods |

### II. Key Components and Technologies

| Component | Index Terms |
| --- | --- |
| Agent Core | Computer-Use Agents (CUAs), GUI agents, High variance (CUA bottleneck), Agent-Computer Interface (ACI) |
| Grounding | Visual Grounding Expert, Textual Grounding Expert, Structural Grounding Expert, GUI localization |
| Planning & Memory | Advanced reasoning, Memory, Narrative Memory, Episodic Memory, Online Web Knowledge, Self-Evaluator, Self-supervised Exploration, Continual Memory Update (continual learning), Retrieval-augmented generation (RAG) |
| Models & Compute | MLLM, LLMs, Foundation models, Scaling agents/compute |

### III. Benchmarks and Evaluation

**Index Terms**
- OSWorld, WindowsAgentArena, AndroidWorld, State of the Art (SoTA), Human-level performance, Trajectory evaluation.

### IV. Technical and General Subject Index Terms

| Subject Area | Concepts |
| --- | --- |
| AI/CS Fields | Artificial Intelligence (cs.AI), Computation and Language (cs.CL), Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG) |
| General Tech | Computer automation, Modularity, Data privacy, Cyber attacks, AIOps, Federated learning, GPT, AGI |

***

## Simular AI Agent S Paper Sources on Hugging Face


| Agent Version | Paper Title | Date / Acceptance | URL |
| --- | --- | --- | --- |
| Agent S | Agent S: An Open Agentic Framework that Uses Computers Like a Human | Released Oct 2024; accepted at ICLR 2025 | `https://huggingface.co/papers/2410.08164` |
| Agent S2 | Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents | Released Apr 2025; accepted at COLM 2025 | `https://arxiv.org/abs/2504.00906` |
| Agent S3 | The Unreasonable Effectiveness of Scaling Agents for Computer Use (relates to the bBoN paradigm) | Released Oct 2, 2025 | `https://huggingface.co/papers/2510.02250` |
