Evaluation examples

Here we provide the data examples used to benchmark the ability of agents to interact with GUIs. The examples are stored in ./examples, where each data item is formatted as follows:

{
    "id": "uid", # unique id of the example
    "snapshot": "snapshot_id", # the snapshot of the environment to start from, with some data already prepared and apps already opened, or just the desktop
    "instruction": "natural_language_instruction", # the natural language instruction describing what we want the agent to do
    "source": "website_url", # where this example comes from, e.g., a forum, website, or paper
    "config": {xxx}, # the scripts that set up the initial state of the task, e.g., downloading and opening files
    # (coming in the next project) "trajectory": "trajectory_directory", # the trajectory directory, containing the action sequence file, the screenshots, and the recording video
    "related_apps": ["app1", "app2", ...], # the apps involved in the task
    "evaluator": "evaluation_dir", # the directory of the evaluator, containing the evaluation script for this example
…
}
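
As a quick sanity check, the minimal sketch below loads each example and verifies that the fields described above are present. It assumes each example is a standalone JSON file somewhere under ./examples and treats the field list above as the required set; both are illustrative assumptions, not part of the official tooling.

```python
# Minimal sketch: walk ./examples, load each task JSON, and check the fields
# described in the README. The directory layout (JSON files nested under
# ./examples) and the required-field set are assumptions for illustration only.
import json
from pathlib import Path

REQUIRED_FIELDS = {"id", "snapshot", "instruction", "source", "config",
                   "related_apps", "evaluator"}

def check_examples(root: str = "./examples") -> None:
    for path in sorted(Path(root).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            item = json.load(f)
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            print(f"{path}: missing fields {sorted(missing)}")
        else:
            print(f"{path}: ok (id={item['id']}, apps={item.get('related_apps')})")

if __name__ == "__main__":
    check_examples()
```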

The ./trajectories directory contains the annotated trajectories for finishing each task in ./examples.

For now, this is still under construction and has only been tested on Windows 10. Please:

  • Modify the paths accordingly to run the evaluation;
  • Let us know if some parts are overfitted to our environment.