AndroidWorld AI Agent Benchmark Results 2025: Pass@1 & Pass@k
Table of Contents
- Overview
- Primary Results Table (Dataset 1)
- Legacy/Additional Results (Dataset 2)
- Notes on Data Collection & Methodology
- Images & Media
- Appendix: Data Definitions
---
Overview
AndroidWorld AI Agent Benchmark Results 2025 compiles community-submitted performance data for AndroidWorld's AI agents across mobile tasks. The dataset records release dates, agent names, model sizes, and screen representations, along with pass@1 and pass@k success rates, trajectory submissions, and contextual notes. The benchmark highlights how different models—from GPT-4o to Llama-based agents—perform on real AndroidWorld tasks, using accessibility (A11y) trees and screenshots to provide structure and visual context. This article preserves all data fields, including notes and trajectories, and includes a legacy set of results from earlier months. Readers can scan the primary 2025 results or consult the legacy results for trends and reproducibility.
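The table below reports pass@1 from the stated number of trials and, where submitters provide it, a pass@k figure (for example, Surfer 2 reports 93.1 at pass@3 and OSCAR reports 61.6 at k=4). The leaderboard does not prescribe a particular estimator, so purely as an illustration, here is a minimal Python sketch of the standard unbiased pass@k estimator applied to per-task trial outcomes; the function names and example data are hypothetical and are not the submitters' scoring code.

```python
from math import comb

def pass_at_k(num_trials: int, num_successes: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: probability that at least one of k
    sampled attempts succeeds, given num_successes out of num_trials succeeded."""
    if num_trials - num_successes < k:
        return 1.0
    return 1.0 - comb(num_trials - num_successes, k) / comb(num_trials, k)

def benchmark_pass_at_k(per_task_outcomes: list[list[bool]], k: int) -> float:
    """Average pass@k over tasks, returned as a percentage."""
    scores = [pass_at_k(len(trials), sum(trials), k) for trials in per_task_outcomes]
    return 100.0 * sum(scores) / len(scores)

# Hypothetical example: 3 tasks, 4 trials each.
outcomes = [[True, False, True, False], [False] * 4, [True] * 4]
print(round(benchmark_pass_at_k(outcomes, k=1), 1))  # 50.0
print(round(benchmark_pass_at_k(outcomes, k=4), 1))  # 66.7
```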
Primary Results Table
| Rank | Release Date | Result Source | Type | Open? | Model Size (billion) | Model | Screen Representation | Success Rate (pass@1) | Number of trials | Success Rate (pass@k) | Trajectory submissions | Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10/2025 | AGI-0 | AI agent | ✗ | - | AGI-0 | Screenshot | 97.4 | 1 | n/a | | 10/14/2025: Changed from model --> agent, based on description in blog post: https://www.theagi.company/blog/android-world |
| 2 | 10/2025 | askui AndroidVisionAgent | AI agent | ✔ | - | askui AndroidVisionAgent, Claude 4.5 Sonnet + Claude 4.0 Sonnet | Screenshot | 94.8 | 1 | n/a | | |
| 3 | 10/2025 | DroidRun | AI agent | ✔ | - | GPT-5, Gemini 2.5 Pro | Screenshot + A11y tree | 91.4 | 1 | | Trajectories | [10/6/2025]: 78.4 --> 91.4; details here. [09/19/2025]: Updated score from 63.0 --> 78.4. For this run, we used GPT-5 for reasoning in combination with Gemini-2.5-Pro for acting, while continuing to rely on a hybrid method of the Accessibility (A11y) tree plus corresponding screenshots to provide both structural and visual context to the agent. |
| 4 | 10/2025 | Surfer 2 | AI agent | ✗ | - | o3 + holo1.5-72b | Screenshot | 87.1 | 1 | 93.1 (pass@3) | https://hcompai.github.io/android-world-traces/ | Agent is based on https://github.com/hcompai/surfer-h-cli - same architecture, but with modifications to prompt and action space. |
| 5 | 10/2025 | gbox.ai | AI agent | ✔ | - | Sonnet 4.5 + Sonnet 4 | Screenshot | 86.2 | 1 | | https://github.com/babelcloud/android_world_benchmark/tree/main/GBOX/trajectory | GBOX uses Claude Code as the agent with the GBOX MCP. Link to report: https://github.com/babelcloud/android_world_benchmark/blob/main/GBOX/report.md |
| 6 | 9/2025 | mobile-use | AI agent | ✔ | - | Llama 4 Scout, Gemini 2.5 Pro, GPT-5 nano | Screenshot + A11y tree | 84.5 | 1 | | https://minitap.ai/benchmark | [10/1/2025]: 77.6% --> 84.5%. [09/12/2025]: Updated score from 74.1% --> 77.6%. [08/19/2025]: Initial trajectory labeling issue (MarkorCreateNoteFromClipboard) has been corrected by the authors. They provide Discord support for setup issues. [08/18/2025]: Repo has been reported broken by some users (not independently verified). Some trajectories (e.g., MarkorCreateNoteFromClipboard, BrowserMaze) are labeled as successful by the authors but contain incorrect actions. |
| 7 | 8/2025 | AutoGLM-Mobile | Model | ✗ | 9B | AutoGLM-Mobile | Screenshot + A11y tree | 80.2 | 1 | | autoglm-mobile-aw-ckpt.zip | [9/26/2025]: Updated score from 75.8 --> 80.2. This is an updated version of AutoGLM-Mobile; it adds more pretraining data and trains the VLM at full image resolution. |
| 8 | 9/2025 | LX-GUIAgent | AI agent | ✗ | - | LX-GUIAgent | Screenshot + A11y tree | 79.3 | 1 | | androidworld-trajectories.zip | [9/23/2025]: Updated score from 75.0 --> 79.3. Added trajectories. |
| 9 | 8/2025 | Finalrun | AI agent | ✗ | - | GPT-5 | Screenshot + A11y tree | 76.7 | 1 | | run_20250829T064517298482.zip | The technical doc/approach can be found here: https://github.com/final-run/finalrun-android-world-benchmark |
| 9 | 9/2025 | K²-Agent | AI agent | ✗ | 72B + 7B | Qwen2.5-VL-72B + Qwen2.5-VL-7B | Screenshot | 76.7 | 1 | | https://github.com/k2-agent/k2-agent/blob/main/androidworld-trajectories.zip | K²-Agent is a hierarchical approach that self-evolves a Qwen2.5-VL-72B for high-level planning and post-trains a Qwen2.5-VL-7B for low-level execution. A detailed description of the approach is in the GitHub repository: https://github.com/k2-agent/k2-agent |
| 11 | 9/2025 | MobileUse-v2 | AI agent | ✔ | 32B | Hammer-UI-32B | Screenshot | 75.0 | 1 | | MobileUse-v2-aw-ckpt.zip | Further post-training based on the GUI-Owl-32B model; optimizes the memory and knowledge modules of the MobileUse framework. |
| 12 | 8/2025 | Mobile-Agent-v3 | AI agent | ✗ | 32B | GUI-Owl-32B | Screenshot | 73.3 | 1 | | https://github.com/X-PLUG/MobileAgent/tree/main | |
| 13 | 10/2025 | Gemini 2.5 Computer Use | Model | ✗ | - | Gemini 2.5 Computer Use | Screenshot | 69.7 | 1 | | | |
| 14 | 6/2025 | JT-GUIAgent-V2 | AI agent | ✗ | - | JT-GUIAgent-V2 | Screenshot | 67.2 | 1 | | | |
| 15 | 8/2025 | GUI-Owl-7B | Model | ✗ | 7B | GUI-Owl-7B | Screenshot | 66.4 | 1 | | https://github.com/X-PLUG/MobileAgent/tree/main | |
| 16 | 8/2025 | UI-Venus | Model | ✔ | 72B | UI-Venus-Navi-72B | Screenshot | 65.9 | 1 | | https://github.com/inclusionAI/UI-Venus/blob/main/vis_androidworld/UI-Venus-androidworld.zip | https://huggingface.co/inclusionAI/UI-Venus-Navi-72B |
| 17 | 07/2025 | MobileUse | AI agent | ✔ | 72B | Qwen2.5-VL-72B | Screenshot | 62.9 | 1 | | | |
| 18 | 05/2025 | Seed1.5-VL | Model | ✔ | 20B | Seed1.5-VL | Screenshot + A11y tree | 62.1 | 1 | | | |
| 19 | 6/2025 | JT-GUIAgent-V1 | AI agent | ✗ | - | JT-GUIAgent-V1 | Screenshot | 60.0 | 1 | | | |
| 20 | 3/2025 | V-Droid (Paper) | AI agent | ✔ | 8B | V-Droid (Llama 8B) | A11y tree | 59.5 | 1 | - | - | Training data consists of apps and tasks from the AndroidWorld benchmark. Code. |
| 21 | 4/2025 | Agent S2 | AI agent | ✔ | - | Agent S2 | Screenshot | 54.3 | 1 | - | | |
| 22 | 8/2025 | UI-Venus | Model | ✔ | 7B | Venus-Navi-7B | Screenshot | 49.1 | 1 | | https://github.com/inclusionAI/UI-Venus/blob/main/vis_androidworld/UI-Venus-androidworld.zip | https://huggingface.co/inclusionAI/UI-Venus-Navi-7B |
| 23 | 05/2025 | GUI-Explorer | AI agent | ✔ | - | GPT-4o | Screenshot + A11y tree | 47.4 | 1 | | | |
| 24 | 4/2025 | AndroidGen | AI agent | ✔ | - | GPT-4o | A11y tree | 46.8 | 1 | | | |
| 25 | 1/2025 | UI-TARS | Model | ✔ | 72B | UI-TARS | Screenshot | 46.6 | 1 | - | | |
| 26 | 12/2024 | Aria-UI | Model | ✔ | - | GPT-4o + Aria-UI | Screenshot | 44.8 | 1 | - | - | |
| 27 | 4/2025 | ScaleTrack | Model | ✗ | 8B | ScaleTrack-7B | A11y tree | 44.0 | 1 | | | |
| 27 | 1/2025 | UGround | Model | ✔ | - | GPT-4o + UGround | Screenshot | 44.0 | 1 | - | Trajectories | Code for reproduction |
| 29 | 6/2025 | Mirage-1 | AI agent | ✔ | - | GPT-4o | Screenshot | 42.2 | 1 | | | With Mirage-1-O; uses OS-Atlas grounder |
| 30 | 12/2024 | Ponder & Press | AI agent | ✗ | - | GPT-4o | Screenshot | 34.5 | 1 | - | - | Code is not yet open-sourced |
| 31 | 05/2024 | AndroidWorld | AI agent | ✔ | - | GPT-4 Turbo | A11y tree | 30.6 | 1 | - | - | |
| 32 | 6/2025 | GUI-Critic-R1 | Model | ✔ | 7B | Qwen-2.5-VL-7B | Screenshot + A11y tree | 27.6 | 1 | | | |
| 32 | 05/2024 | EcoAgent | AI agent | ✗ | - | GPT-4o, OS-Atlas-Pro 4B, Qwen2-VL-2B-Instruct | Screenshot | 27.6 | 1 | | | |
| 34 | 1/2025 | InfiGUIAgent | Model | ✔ | 2B | Qwen2-VL-2B (fine-tuned) | Screenshot | 9.0 | 1 | - | - | |
| - | 10/2024 | OSCAR | AI agent | ✗ | - | GPT-4o | Screenshot | | 1 | 61.6 (k=4) | - | Code will be open-source upon publication. |

Human Performance

| Release Date | Result Source | Model | Success Rate (pass@1) | Number of trials |
|---|---|---|---|---|
| 05/2024 | AndroidWorld | Human | 80.0 | 3 |

Comment here or email crawles@gmail.com to submit your work! Please include how the data entry should look.
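Several entries in the table above (DroidRun, mobile-use, AutoGLM-Mobile, LX-GUIAgent, and others) report a "Screenshot + A11y tree" screen representation: the accessibility tree contributes element structure and text, while the screenshot contributes visual layout. Below is a minimal sketch of how such a hybrid observation could be packaged for a single multimodal model call; the helper names and message schema are hypothetical and are not taken from any listed submission.

```python
import base64
from typing import Any

def serialize_a11y_tree(node: dict[str, Any], depth: int = 0) -> str:
    """Flatten an accessibility-tree node into indented "class 'text' @bounds" lines."""
    line = "  " * depth + f"{node.get('class', '?')} '{node.get('text', '')}' @{node.get('bounds', '')}"
    children = [serialize_a11y_tree(c, depth + 1) for c in node.get("children", [])]
    return "\n".join([line, *children])

def build_observation(goal: str, screenshot_png: bytes, a11y_root: dict[str, Any]) -> list[dict]:
    """Combine the task goal, the screenshot, and the serialized A11y tree into one message."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": f"Goal: {goal}\n\nUI hierarchy:\n{serialize_a11y_tree(a11y_root)}"},
            {"type": "image", "data": base64.b64encode(screenshot_png).decode()},
        ],
    }]
```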
Legacy/Additional Results
| Release Date | Result Source | Open? | Model Size (billion) | Model | Screen Representation | Success Rate (pass@1) | Number of trials | Success Rate (pass@k) | Trajectory submissions | [Optional] Note |
|---|---|---|---|---|---|---|---|---|---|---|
| 05/2024 | AndroidWorld | ✔ | - | GPT-4 Turbo | Set-of-Mark | 67.6 | 1 | - | | |
| 6/2025 | Mirage-1 | ✔ | - | GPT-4o | Screenshot | 62 | 1 | | | With Mirage-1-A; uses Aria-UI grounder |
| 4/2025 | ScaleTrack | ✔ | 8B | ScaleTrack-7B | Screenshot | 61.0 | 1 | | | |
| 12/2024 | Aria-UI | ✔ | 3.9B | GPT-4o + Aria-UI | A11y tree | 60.4 | 1 | - | | |

Notes on Data Collection & Methodology
The dataset is community-submitted and self-reported. Details, dates, and trajectories reflect user-provided information and may be updated over time.
Images & Media
[Image: AndroidWorld benchmark results]
Appendix: Data Definitions
- Model: A relatively simple prompt involving one LLM/VLM call.
- AI agent: A multi-agent architecture involving several LLM calls and a protocol to coordinate the agents, or a single LLM wrapped in an advanced agent with memory, subgoal planning, etc.
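To make the distinction concrete, a "Model" entry maps one observation to one action with a single call, while an "AI agent" entry wraps calls in a control loop with planning and memory. The sketch below is purely schematic; call_vlm is a hypothetical stand-in for whichever model API a given entry actually uses.

```python
def call_vlm(prompt: str) -> str:
    """Hypothetical single LLM/VLM call; replace with a real model client."""
    raise NotImplementedError

# "Model" entry: one call per step, no surrounding machinery.
def model_step(goal: str, observation: str) -> str:
    return call_vlm(f"Goal: {goal}\nScreen: {observation}\nNext action:")

# "AI agent" entry: multiple coordinated calls with memory and subgoal planning.
def agent_episode(goal: str, get_observation, execute, max_steps: int = 30) -> None:
    memory: list[str] = []
    plan = call_vlm(f"Break the goal into subgoals: {goal}")  # planner call
    for _ in range(max_steps):
        obs = get_observation()
        action = call_vlm(  # actor call conditioned on plan, memory, and current screen
            f"Goal: {goal}\nPlan: {plan}\nHistory: {memory[-5:]}\nScreen: {obs}\nNext action:"
        )
        if action.strip() == "DONE":
            break
        execute(action)
        memory.append(f"{obs[:80]} -> {action}")  # memory update
```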