OSWorld/evaluation_examples/examples/multi_apps
MillanK cbc3b590ff
Task fix batch (#383)
* update 873cafdd-a581-47f6-8b33-b9696ddb7b05 task eval

* c1fa57f3-c3db-4596-8f09-020701085416 fix, add tolerance to url matching

* 8df7e444-8e06-4f93-8a1a-c5c974269d82 add clearer instruction about the filename for compress

* add address string normalization for 6f4073b8-d8ea-4ade-8a18-c5d1d5d5aa9a

---------

Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>
2025-11-19 17:24:25 +08:00
0c825995-5b70-4526-b663-113f4c999dd2.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
0e5303d4-8820-42f6-b18d-daf7e633de21.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
00fa164e-2612-4439-992e-157d019a8436.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
1f18aa87-af6f-41ef-9853-cdb8f32ebdea.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
2b9493d7-49b8-493a-a71b-56cd1f4d6908.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
2c9fc0de-3ee7-45e1-a5df-c86206ad78b5.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
02ce9a50-7af2-47ed-8596-af0c230501f8.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
2fe4b718-3bd7-46ec-bdce-b184f5653624.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
3a93cae4-ad3e-403e-8c12-65303b271818.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
3c8f201a-009d-4bbe-8b65-a6f8b35bb57f.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
3e3fc409-bff3-4905-bf16-c968eee3f807.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
3f05f3b9-29ba-4b6b-95aa-2204697ffc06.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
4c26e3f3-3a14-4d86-b44a-d3cedebbb487.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
4e9f0faf-2ecc-4ae8-a804-28c9a75d1ddc.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
5bc63fb9-276a-4439-a7c1-9dc76401737f.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
5df7b33a-9f77-4101-823e-02f863e1c1ae.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
6d72aad6-187a-4392-a4c4-ed87269c51cf.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
6f4073b8-d8ea-4ade-8a18-c5d1d5d5aa9a.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
7e287123-70ca-47b9-8521-47db09b69b14.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
7f35355e-02a6-45b5-b140-f0be698bcf85.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
7ff48d5b-2df2-49da-b500-a5150ffc7f18.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
8df7e444-8e06-4f93-8a1a-c5c974269d82.json Task fix batch (#383) 2025-11-19 17:24:25 +08:00
8e116af7-7db7-4e35-a68b-b0939c066c78.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
09a37c51-e625-49f4-a514-20a773797a8a.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
9f3bb592-209d-43bc-bb47-d77d9df56504.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
22a4636f-8179-4357-8e87-d1743ece1f81.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
42d25c08-fb87-4927-8b65-93631280a26f.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
42f4d1c7-4521-4161-b646-0a8934e36081.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
47f7c0ce-a5fb-4100-a5e6-65cd0e7429e5.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
48c46dc7-fe04-4505-ade7-723cba1aa6f6.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
48d05431-6cd5-4e76-82eb-12b60d823f7d.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
51f5801c-18b3-4f25-b0c3-02f85507a078.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
68a25bd4-59c7-4f4d-975e-da0c8509c848.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
69acbb55-d945-4927-a87b-8480e1a5bb7e.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
74d5859f-ed66-4d3e-aa0e-93d7a592ce41.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
78aed49a-a710-4321-a793-b611a7c5b56b.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
81c425f5-78f3-4771-afd6-3d2973825947.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
82e3c869-49f6-4305-a7ce-f3e64a0618e7.json Calc eval fix (#272) 2025-07-18 21:28:48 +08:00
98e8e339-5f91-4ed2-b2b2-12647cb134f4.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
185f29bd-5da0-40a6-b69c-ba7f4e0324ef.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
227d2f97-562b-4ccb-ae47-a5ec9e142fbb.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
337d318b-aa07-4f4f-b763-89d9a2dd013f.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
415ef462-bed3-493a-ac36-ca8c6d23bf1b.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
510f64c8-9bcc-4be1-8d30-638705850618.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
716a6079-22da-47f1-ba73-c9d58f986a38.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
778efd0a-153f-4842-9214-f05fc176b877.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
788b3701-3ec9-4b67-b679-418bfa726c22.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
869de13e-bef9-4b91-ba51-f6708c40b096.json Calc eval fix (#272) 2025-07-18 21:28:48 +08:00
873cafdd-a581-47f6-8b33-b9696ddb7b05.json Task fix batch (#383) 2025-11-19 17:24:25 +08:00
881deb30-9549-4583-a841-8270c65f2a17.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
897e3b53-5d4d-444b-85cb-2cdc8a97d903.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
2373b66a-092d-44cb-bfd7-82e86e7a3b4d.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
3680a5ee-6870-426a-a997-eba929a0d25c.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
26660ad1-6ebb-4f59-8cba-a8432dfe8d38.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
67890eb6-6ce5-4c00-9e3d-fb4972699b06.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
236833a3-5704-47fc-888c-4f298f09f799.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
937087b6-f668-4ba6-9110-60682ee33441.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
5990457f-2adb-467b-a4af-5c857c92d762.json Add AutoGLM-OS agent (#309) 2025-08-17 12:08:40 +08:00
9219480b-3aed-47fc-8bac-d2cffc5849f7.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
20236825-b5df-46e7-89bf-62e1d640a897.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
26150609-0da3-4a7d-8868-0faf9c5f01bb.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
36037439-2044-4b50-b9d1-875b5a332143.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
46407397-a7d5-4c6b-92c6-dbe038b1457b.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
58565672-7bfe-48ab-b828-db349231de6b.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
91190194-f406-4cd6-b3f9-c43fac942b22.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
a0b9dc9c-fc07-4a88-8c5d-5e3ecad91bcb.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
a74b607e-6bb5-4ea8-8a7c-5d97c7bbcd2a.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
a82b78bb-7fde-4cb3-94a4-035baf10bcf0.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
a503b07f-9119-456b-b75d-f5146737d24f.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
aad10cd7-9337-4b62-b704-a857848cedf2.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
acb0f96b-e27c-44d8-b55f-7cb76609dfcd.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
aceb0368-56b8-4073-b70e-3dc9aee184e0.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
b52b40a5-ad70-4c53-b5b0-5650a8387052.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
b337d106-053f-4d37-8da0-7f9c4043a66b.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
b5062e3e-641c-4e3a-907b-ac864d2e7652.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
bb83cab4-e5c7-42c7-a67b-e46068032b86.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
bc2b57f3-686d-4ec9-87ce-edf850b7e442.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
c7c1e4c3-9e92-4eba-a4b8-689953975ea4.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
c867c42d-a52d-4a24-8ae3-f75d256b5618.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
c2751594-0cd5-4088-be1b-b5f2f9ec97c4.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
ce2b64a2-ddc1-4f91-8c7d-a88be7121aac.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
d1acdb87-bb67-4f30-84aa-990e56a09c92.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
d9b7c649-c975-4f53-88f5-940b29c47247.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
d68204bf-11c1-4b13-b48b-d303c73d4bf6.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
da52d699-e8d2-4dc5-9191-a2199e0b6a9b.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
da922383-bfa4-4cd3-bbad-6bebab3d7742.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
dd60633f-2c72-42ba-8547-6f2c8cb0fdb0.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
deec51c9-3b1e-4b9e-993c-4776f20e8bb2.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
df67aebb-fb3a-44fd-b75b-51b6012df509.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
e1fc0df3-c8b9-4ee7-864c-d0b590d3aa56.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
e135df7c-7687-4ac0-a5f0-76b74438b53e.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
e2392362-125e-4f76-a2ee-524b183a3412.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
e8172110-ec08-421b-a6f5-842e6451911f.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
eb303e01-261e-4972-8c07-c9b4e7a4922a.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
ee9a3c83-f437-4879-8918-be5efbb9fac7.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
f5c13cdd-205c-4719-a562-348ae5cd1d91.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
f7dfbef3-7697-431c-883a-db8583a4e4f9.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
f8cfa149-d1c1-4215-8dac-4a0932bad3c2.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
f918266a-b3e0-4914-865d-4faa564f1aef.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00
f8369178-fafe-40c2-adc4-b9b08a125456.json feat: standardize configuration fields across all evaluation examples 2025-07-16 13:45:34 +00:00