OSWorld

Commit Graph

Author	SHA1	Message	Date
Jiaqi	23b81993fa	os task fix: set the default dim screen time to be 300s	2025-07-24 08:13:02 +00:00
ChenYXxxx	873f8a0359	Update 10a730d5-d414-4b40-b479-684bed1ae522.json change the ight 2 the night	2025-07-24 15:44:52 +08:00
Yuan Mengqi	d128edbbc1	add nogdrive json (#281 ) * add uitars agent code * improve claude * improve claude * improve claude * improve claude * improve claude * add nogdrive json	2025-07-23 19:12:42 +08:00
yuanmengqi	d6f2190a9f	fix: refine instruction in OS evaluation example to clarify restrictions on logging out or shutting down the machine	2025-07-18 18:51:01 +00:00
Danyang Zhang	53ffc05042	Calc eval fix (#272 ) * ver Jun17th updating annotations * ver Jun17th corrected annotation of 1d17 added check for cell merge * ver Jun17th updated several annotations * ver Jun20th fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08 * fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations. * ver Jun21st updating calc evals * ver Jun22nd fixed an impress task * ver Jun22ndv2 adjusted several calc tasks * Clean scalfolds * ver Jul18th added two try-excepts to handle possible formula parsing and calculation failures --------- Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk> Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-18 21:28:48 +08:00
shenzhennan	c7017a476d	fix impress instruction 0a211154	2025-07-18 07:14:35 +00:00
yuanmengqi	0fb625e4fd	Update instruction in OS evaluation example to include a restriction against shutting down the machine.	2025-07-18 05:28:43 +00:00
yuanmengqi	2c51950e73	feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.	2025-07-17 10:50:10 +00:00
yuanmengqi	0939226020	feat: enhance evaluator configuration with post-execution commands for Chrome - Added a series of postconfig commands to the evaluator section in the JSON file. - Commands include executing a refresh in Chrome, managing Chrome processes, launching Chrome with remote debugging, and opening specific settings tabs. - Introduced sleep intervals to ensure proper execution timing between commands. This update improves the automation capabilities of the evaluation examples while maintaining existing logic.	2025-07-16 17:37:37 +00:00
yuanmengqi	e433f35c1f	feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.	2025-07-16 13:45:34 +00:00
yuanmengqi	7912880d16	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-15 07:24:38 +00:00
yuanmengqi	451bbf5fc2	Update multi_apps JSON examples: refined instructions for image processing in GIMP, replaced an open command with a launch command for VLC, and corrected assignment modification instruction in LibreOffice Calc example.	2025-07-15 07:24:33 +00:00
Yuan Mengqi	af47ed8fb1	fix infeasible&chrome tasks (#258 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks * Merge branch 'fix_chrome' * fix insensible&chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-15 13:02:42 +08:00
ChenYXxxx	698483390a	"Could you turn my image into CYMK mode?" add "within GIMP"	2025-07-14 23:53:20 +08:00
ChenYXxxx	9242becd87	"Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually." add "within GIMP"	2025-07-14 23:52:41 +08:00
ChenYXxxx	56b2fe9cc4	Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json	2025-07-14 23:23:11 +08:00
ChenYXxxx	b481c794c5	Update 72f83cdc-bf76-4531-9a1b-eb893a13f8aa.json	2025-07-14 23:22:50 +08:00
ChenYXxxx	7f973a391c	Update f723c744-e62c-4ae6-98d1-750d3cd7d79d.json	2025-07-14 23:22:14 +08:00
shenzhennan	7f96cc0633	Merge branch 'main' of https://github.com/xlang-ai/OSWorld	2025-07-14 12:35:00 +00:00
shenzhennan	53983db9cb	fix impress eval : extending sleep time to ensure save	2025-07-14 12:34:43 +00:00
Danyang Zhang	2339db20ca	ver Jul7th (#255 ) pip-installing directly from PyPI fails misteriously in postconfig execution, possible owing to proxy configuration in the VM, adjusted strategy by downloading the wheel on host and pip-installing it locally on VM in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77	2025-07-14 20:26:29 +08:00
shenzhennan	60e26d2d0d	fix impress compare use gold file	2025-07-14 11:35:06 +00:00
Yuan Mengqi	38a30734a6	Improve code logic for password & resolution (#252 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 21:04:07 +08:00
yuanmengqi	97ed6f99b0	Final review multi_apps fix the rest part	2025-07-12 20:28:55 +00:00
yuanmengqi	dbecf46057	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-12 16:35:02 +00:00
yuanmengqi	877e75a013	Final review multi_apps fix Xinzhuang part	2025-07-12 16:34:55 +00:00
Yuan Mengqi	27319ce1e3	fix password&resolution (#251 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 00:25:37 +08:00
yuanmengqi	6f0382c0c2	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-10 22:35:42 +00:00
yuanmengqi	6897e5320d	Enhance image text comparison functionality with detailed logging - Added logging for OCR results and text matching outcomes in compare_image_text function. - Updated JSON examples to support multiple expected results and improved structure for evaluator functions. - Enhanced handling of expected text rules to include multiple variations for better matching accuracy.	2025-07-10 22:32:53 +00:00
st2rb8g	61f265a082	fix some multi_apps tasks (#245 ) * fix chrome * fix some multi_apps tasks. * fix some multiapps tasks * fix some multiapps tasks --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-11 06:32:13 +08:00
Shenzhennan	29caebb765	Impress check and fix (all font compare issue) (#247 ) * Enhance PPTX comparison logic in slides.py - Improved alignment comparison to treat None and LEFT as equivalent. - Added special handling for font bold and italic properties to consider None and False as equivalent. - Introduced a new bullet comparison function that allows for minor differences and tolerates formatting variations. - Updated JSON examples to support multiple file comparisons and results. * fix all fonts json file f23ac * fix clean the shape examination in unrelevatn part-top position check * Refactor JSON structure for PPTX comparison - Updated the instruction formatting for clarity. - Modified the comparison logic to support multiple expected and result files, enhancing flexibility in evaluations. - Changed the function key to an array to accommodate multiple comparison functions. - Introduced a conjunction key to specify logical relationships between comparisons. * fix impress-e4ef0baf by adding all fonts gold file * update impress bf4e9888 task ins * fix impress b8adbc24 font size * Enhance PPTX comparison functionality in slides.py - Introduced a debug logger for detailed output during PPTX comparisons. - Added a new function to recursively retrieve all text shapes, including those within groups. - Enabled debug logging to provide insights on slide and shape comparisons. - Updated JSON examples to support multiple expected and result files for enhanced evaluation flexibility. * Enable debug logging by default in PPTX comparison and enhance debug output for shape mismatches. Updated JSON examples to support multiple expected and result files for improved evaluation consistency. * fix impress all fons compare file * Refactor PPTX comparison logic and JSON examples for height modification tasks - Added critical notes in slides.py to clarify the execution order of shape examination and height modification checks. - Updated JSON examples to support multiple expected and result files, enhancing evaluation consistency. - Ensured that examine_shape must be set to False for examine_modify_height to function correctly, preventing premature termination of comparisons. * Enhance debug logging in PPTX comparison for detailed font attribute mismatches - Added debug logging for differences in font color, bold, italic, and underline attributes during table cell comparisons. - Improved clarity of debug output by including specific slide, shape, and cell indices for mismatches. - Ensured that existing comparison logic remains intact while enhancing debugging capabilities. * Enhance debug logging for font attribute mismatches in PPTX comparison - Added detailed debug logging for font name and size mismatches during PPTX comparisons, including specific slide, shape, and paragraph indices. - Updated JSON examples to support multiple expected and result files, improving evaluation consistency. - Maintained existing comparison logic while enhancing debugging capabilities. * fix impress 3161de json file --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-10 00:36:32 +08:00
Yuan Mengqi	093679b90d	fix some multi_apps task (#243 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-08 18:59:00 +08:00
yuanmengqi	7d0ad02706	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-06 19:38:26 +00:00
yuanmengqi	a68d6f7ab6	Enhance GIMP metrics evaluator with logging and transparency handling - Replaced print statements with logging for better traceability in gimp.py. - Added handling for transparent images in structure checks and size evaluations. - Updated JSON examples to include delays in pyautogui commands for improved execution reliability. - Changed image URL in example to a more accessible source.	2025-07-06 19:38:22 +00:00
shuyhere	3afc01f1fe	fix chrome examples (#240 )	2025-07-07 02:25:59 +08:00
MillanK	8facb285a1	VS Code Task Fix (#237 ) * vscode fix * vscode task fix	2025-07-06 16:47:23 +08:00
yuanmengqi	a1891f7d88	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-06 07:52:42 +00:00
yuanmengqi	9be6fcd688	Check and fix on Chrome tasks - Added `pytz` dependency to `requirements.txt` for timezone handling. - Introduced `get_macys_product_url_parse` function to replace the old `get_url_path_parse` for better clarity and maintain backward compatibility. - Enhanced logging throughout the `get_active_tab_html_parse` and `get_rule_relativeTime` functions for improved debugging and traceability. - Updated JSON examples to reflect changes in expected keys and added new fields for better evaluation context. - Removed deprecated execution commands from JSON examples to streamline the evaluation process.	2025-07-06 07:52:37 +00:00
XXZ	c8a6a22aad	Fix VLC task design (#238 ) * fix: fix multiapp tasks * fix: update instructions for VLC evaluation examples --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-04 20:39:48 +08:00
Shenzhennan	1b40a458de	Impress eval fix (#226 ) * fix compare_pptx * Fix impress-4ed5abd0-8b5d-47bd-839f-cacfa15ca37a eval script:Fix temporarily by ignoring the contaminated To fix completely, compare source file needs to be updated * fix impress domain * fix a53 by changing gold * fix impress a53 * fix impress b8d origin file * add table font color check * fix left pane check --------- Co-authored-by: chenjix <3107760494@qq.com> Co-authored-by: moonshot <moonshot@moonshotznshenMacBook-Pro.local> Co-authored-by: Shen Zhennan <shenzhennan@moonshot.cn>	2025-07-04 13:32:02 +08:00
Zilong Zhou	1308a80029	Update 5990457f-2adb-467b-a4af-5c857c92d762.json (#235 )	2025-07-04 13:31:18 +08:00
yuanmengqi	3cd79c9830	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-03 16:57:49 +00:00
yuanmengqi	a651b04e49	Update AWS AMI ID, enhance directory creation logic in file upload, modify osworld service configuration, and refine JSON evaluation examples for improved clarity and functionality.	2025-07-03 16:57:41 +00:00
Danyang Zhang	adc9ad88c2	Thunderbird eval fix (#233 ) * ver Jul2nd updated task requiring set up new email account * ver Jul3rd fixed several tasks	2025-07-03 21:55:55 +08:00
XXZ	ac24ccce99	fix: fix multiapp tasks (#229 ) Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-03 21:53:58 +08:00
yuanmengqi	7b2120c843	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-03 13:50:35 +00:00
yuanmengqi	cb4bed20a0	Refactor compare_python_pure_text function for improved normalization and error handling. Update JSON example to clarify instruction for extracting Python code from Colab, changing output file names for consistency.	2025-07-03 13:50:21 +00:00
Yuan Mengqi	b2fb8b4222	fix chrome tasks (#230 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-03 21:32:41 +08:00
ChenYXxxx	bdaf37e0e5	fix_os&gimp (#220 ) * Update ec4e3f68-9ea4-4c18-a5c9-69f89d1178b3.json * Update c288e301-e626-4b98-a1ab-159dcb162af5.json * Update 3ce045a0-877b-42aa-8d2c-b4a863336ab8.json * Update b3d4a89c-53f2-4d6b-8b6a-541fb5d205fa.json * Update 2e6f678f-472d-4c55-99cc-8e7c5c402a71.json Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually. * Update 5ca86c6f-f317-49d8-b6a7-b527541caae8.json * Update a746add2-cab0-4740-ac36-c3769d9bfb46.json * Update a746add2-cab0-4740-ac36-c3769d9bfb46.json * Update 62f7fd55-0687-4a43-b6e1-3eda16fc6252.json * Update d52d6308-ec58-42b7-a2c9-de80e4837b2b.json * Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json * Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json * Update 58d3eeeb-e9d0-499f-962e-fd0db2a744d8.json	2025-07-03 16:59:05 +08:00
Tianbao Xie	bba367b8bc	fix: fix multiapps tasks (#231 ) * Update JSON example for multi_apps: change snapshot name and specify presenter in instructions for clarity. * Enhance PDF image comparison in chrome.py by adding existence checks for input files and improving image extraction logic. Introduce image hashing for similarity scoring with a configurable threshold. Update docs.py to support fuzzy matching in DOCX file comparisons, allowing for similarity scoring based on text content. Modify example JSON to enable fuzzy matching option. --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-03 16:58:43 +08:00

529 Commits