OSWorld

Commit Graph

Author	SHA1	Message	Date
yuanmengqi	2c51950e73	feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.	2025-07-17 10:50:10 +00:00
yuanmengqi	0939226020	feat: enhance evaluator configuration with post-execution commands for Chrome - Added a series of postconfig commands to the evaluator section in the JSON file. - Commands include executing a refresh in Chrome, managing Chrome processes, launching Chrome with remote debugging, and opening specific settings tabs. - Introduced sleep intervals to ensure proper execution timing between commands. This update improves the automation capabilities of the evaluation examples while maintaining existing logic.	2025-07-16 17:37:37 +00:00
yuanmengqi	e433f35c1f	feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.	2025-07-16 13:45:34 +00:00
yuanmengqi	7912880d16	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-15 07:24:38 +00:00
yuanmengqi	451bbf5fc2	Update multi_apps JSON examples: refined instructions for image processing in GIMP, replaced an open command with a launch command for VLC, and corrected assignment modification instruction in LibreOffice Calc example.	2025-07-15 07:24:33 +00:00
Yuan Mengqi	af47ed8fb1	fix infeasible&chrome tasks (#258 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks * Merge branch 'fix_chrome' * fix insensible&chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-15 13:02:42 +08:00
ChenYXxxx	698483390a	"Could you turn my image into CYMK mode?" add "within GIMP"	2025-07-14 23:53:20 +08:00
ChenYXxxx	9242becd87	"Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually." add "within GIMP"	2025-07-14 23:52:41 +08:00
ChenYXxxx	56b2fe9cc4	Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json	2025-07-14 23:23:11 +08:00
ChenYXxxx	b481c794c5	Update 72f83cdc-bf76-4531-9a1b-eb893a13f8aa.json	2025-07-14 23:22:50 +08:00
ChenYXxxx	7f973a391c	Update f723c744-e62c-4ae6-98d1-750d3cd7d79d.json	2025-07-14 23:22:14 +08:00
shenzhennan	7f96cc0633	Merge branch 'main' of https://github.com/xlang-ai/OSWorld	2025-07-14 12:35:00 +00:00
shenzhennan	53983db9cb	fix impress eval : extending sleep time to ensure save	2025-07-14 12:34:43 +00:00
Danyang Zhang	2339db20ca	ver Jul7th (#255 ) pip-installing directly from PyPI fails misteriously in postconfig execution, possible owing to proxy configuration in the VM, adjusted strategy by downloading the wheel on host and pip-installing it locally on VM in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77	2025-07-14 20:26:29 +08:00
shenzhennan	60e26d2d0d	fix impress compare use gold file	2025-07-14 11:35:06 +00:00
Yuan Mengqi	38a30734a6	Improve code logic for password & resolution (#252 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution * Improve code logic for password & resolution * edit * Merge branch 'main' into fix_chrome * fix chrome tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 21:04:07 +08:00
yuanmengqi	97ed6f99b0	Final review multi_apps fix the rest part	2025-07-12 20:28:55 +00:00
yuanmengqi	dbecf46057	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-12 16:35:02 +00:00
yuanmengqi	877e75a013	Final review multi_apps fix Xinzhuang part	2025-07-12 16:34:55 +00:00
Yuan Mengqi	27319ce1e3	fix password&resolution (#251 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks * fix password&resolution * fix password&resolution --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-13 00:25:37 +08:00
yuanmengqi	6f0382c0c2	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-10 22:35:42 +00:00
yuanmengqi	6897e5320d	Enhance image text comparison functionality with detailed logging - Added logging for OCR results and text matching outcomes in compare_image_text function. - Updated JSON examples to support multiple expected results and improved structure for evaluator functions. - Enhanced handling of expected text rules to include multiple variations for better matching accuracy.	2025-07-10 22:32:53 +00:00
st2rb8g	61f265a082	fix some multi_apps tasks (#245 ) * fix chrome * fix some multi_apps tasks. * fix some multiapps tasks * fix some multiapps tasks --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-11 06:32:13 +08:00
Shenzhennan	29caebb765	Impress check and fix (all font compare issue) (#247) * Enhance PPTX comparison logic in slides.py - Improved alignment comparison to treat None and LEFT as equivalent. - Added special handling for font bold and italic properties to consider None and False as equivalent. - Introduced a new bullet comparison function that allows for minor differences and tolerates formatting variations. - Updated JSON examples to support multiple file comparisons and results. * fix all fonts json file f23ac * fix clean the shape examination in unrelevatn part-top position check * Refactor JSON structure for PPTX comparison - Updated the instruction formatting for clarity. - Modified the comparison logic to support multiple expected and result files, enhancing flexibility in evaluations. - Changed the function key to an array to accommodate multiple comparison functions. - Introduced a conjunction key to specify logical relationships between comparisons. * fix impress-e4ef0baf by adding all fonts gold file * update impress bf4e9888 task ins * fix impress b8adbc24 font size * Enhance PPTX comparison functionality in slides.py - Introduced a debug logger for detailed output during PPTX comparisons. - Added a new function to recursively retrieve all text shapes, including those within groups. - Enabled debug logging to provide insights on slide and shape comparisons. - Updated JSON examples to support multiple expected and result files for enhanced evaluation flexibility. * Enable debug logging by default in PPTX comparison and enhance debug output for shape mismatches. Updated JSON examples to support multiple expected and result files for improved evaluation consistency. * fix impress all fons compare file * Refactor PPTX comparison logic and JSON examples for height modification tasks - Added critical notes in slides.py to clarify the execution order of shape examination and height modification checks. - Updated JSON examples to support multiple expected and result files, enhancing evaluation consistency. - Ensured that examine_shape must be set to False for examine_modify_height to function correctly, preventing premature termination of comparisons. * Enhance debug logging in PPTX comparison for detailed font attribute mismatches - Added debug logging for differences in font color, bold, italic, and underline attributes during table cell comparisons. - Improved clarity of debug output by including specific slide, shape, and cell indices for mismatches. - Ensured that existing comparison logic remains intact while enhancing debugging capabilities. * Enhance debug logging for font attribute mismatches in PPTX comparison - Added detailed debug logging for font name and size mismatches during PPTX comparisons, including specific slide, shape, and paragraph indices. - Updated JSON examples to support multiple expected and result files, improving evaluation consistency. - Maintained existing comparison logic while enhancing debugging capabilities. * fix impress 3161de json file --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-10 00:36:32 +08:00
Yuan Mengqi	093679b90d	fix some multi_apps task (#243 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix multiapps * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 * fix some multi_apps tasks * fix some multi_apps tasks --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-08 18:59:00 +08:00
yuanmengqi	7d0ad02706	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-06 19:38:26 +00:00
yuanmengqi	a68d6f7ab6	Enhance GIMP metrics evaluator with logging and transparency handling - Replaced print statements with logging for better traceability in gimp.py. - Added handling for transparent images in structure checks and size evaluations. - Updated JSON examples to include delays in pyautogui commands for improved execution reliability. - Changed image URL in example to a more accessible source.	2025-07-06 19:38:22 +00:00
shuyhere	3afc01f1fe	fix chrome examples (#240)	2025-07-07 02:25:59 +08:00
MillanK	8facb285a1	VS Code Task Fix (#237) * vscode fix * vscode task fix	2025-07-06 16:47:23 +08:00
yuanmengqi	a1891f7d88	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-06 07:52:42 +00:00
yuanmengqi	9be6fcd688	Check and fix on Chrome tasks - Added `pytz` dependency to `requirements.txt` for timezone handling. - Introduced `get_macys_product_url_parse` function to replace the old `get_url_path_parse` for better clarity and maintain backward compatibility. - Enhanced logging throughout the `get_active_tab_html_parse` and `get_rule_relativeTime` functions for improved debugging and traceability. - Updated JSON examples to reflect changes in expected keys and added new fields for better evaluation context. - Removed deprecated execution commands from JSON examples to streamline the evaluation process.	2025-07-06 07:52:37 +00:00
XXZ	c8a6a22aad	Fix VLC task design (#238 ) * fix: fix multiapp tasks * fix: update instructions for VLC evaluation examples --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-04 20:39:48 +08:00
Shenzhennan	1b40a458de	Impress eval fix (#226 ) * fix compare_pptx * Fix impress-4ed5abd0-8b5d-47bd-839f-cacfa15ca37a eval script:Fix temporarily by ignoring the contaminated To fix completely, compare source file needs to be updated * fix impress domain * fix a53 by changing gold * fix impress a53 * fix impress b8d origin file * add table font color check * fix left pane check --------- Co-authored-by: chenjix <3107760494@qq.com> Co-authored-by: moonshot <moonshot@moonshotznshenMacBook-Pro.local> Co-authored-by: Shen Zhennan <shenzhennan@moonshot.cn>	2025-07-04 13:32:02 +08:00
Zilong Zhou	1308a80029	Update 5990457f-2adb-467b-a4af-5c857c92d762.json (#235)	2025-07-04 13:31:18 +08:00
yuanmengqi	3cd79c9830	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-03 16:57:49 +00:00
yuanmengqi	a651b04e49	Update AWS AMI ID, enhance directory creation logic in file upload, modify osworld service configuration, and refine JSON evaluation examples for improved clarity and functionality.	2025-07-03 16:57:41 +00:00
Danyang Zhang	adc9ad88c2	Thunderbird eval fix (#233) * ver Jul2nd updated task requiring set up new email account * ver Jul3rd fixed several tasks	2025-07-03 21:55:55 +08:00
XXZ	ac24ccce99	fix: fix multiapp tasks (#229) Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-03 21:53:58 +08:00
yuanmengqi	7b2120c843	Merge branch 'main' of github.com:xlang-ai/OSWorld	2025-07-03 13:50:35 +00:00
yuanmengqi	cb4bed20a0	Refactor compare_python_pure_text function for improved normalization and error handling. Update JSON example to clarify instruction for extracting Python code from Colab, changing output file names for consistency.	2025-07-03 13:50:21 +00:00
Yuan Mengqi	b2fb8b4222	fix chrome tasks (#230 ) * fix chrome * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example * fix tasks * fix chrome finished * fix * clean chrome_fix code * clean chrome_fix code --------- Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>	2025-07-03 21:32:41 +08:00
ChenYXxxx	bdaf37e0e5	fix_os&gimp (#220 ) * Update ec4e3f68-9ea4-4c18-a5c9-69f89d1178b3.json * Update c288e301-e626-4b98-a1ab-159dcb162af5.json * Update 3ce045a0-877b-42aa-8d2c-b4a863336ab8.json * Update b3d4a89c-53f2-4d6b-8b6a-541fb5d205fa.json * Update 2e6f678f-472d-4c55-99cc-8e7c5c402a71.json Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually. * Update 5ca86c6f-f317-49d8-b6a7-b527541caae8.json * Update a746add2-cab0-4740-ac36-c3769d9bfb46.json * Update a746add2-cab0-4740-ac36-c3769d9bfb46.json * Update 62f7fd55-0687-4a43-b6e1-3eda16fc6252.json * Update d52d6308-ec58-42b7-a2c9-de80e4837b2b.json * Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json * Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json * Update 58d3eeeb-e9d0-499f-962e-fd0db2a744d8.json	2025-07-03 16:59:05 +08:00
Tianbao Xie	bba367b8bc	fix: fix multiapps tasks (#231 ) * Update JSON example for multi_apps: change snapshot name and specify presenter in instructions for clarity. * Enhance PDF image comparison in chrome.py by adding existence checks for input files and improving image extraction logic. Introduce image hashing for similarity scoring with a configurable threshold. Update docs.py to support fuzzy matching in DOCX file comparisons, allowing for similarity scoring based on text content. Modify example JSON to enable fuzzy matching option. --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-03 16:58:43 +08:00
Tianbao Xie	e9c657b714	fix: Libreoffice writer fix (#232 ) * Refactor LibreOffice Writer example JSON to support multiple expected and result files for line spacing comparison, enhancing evaluation flexibility. Updated function calls and added additional expected file paths. * Update source link in LibreOffice Writer example JSON to a more relevant help page for inserting tables, improving instructional clarity. --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-07-03 16:58:08 +08:00
Zilong Zhou	595a704aff	fix: fix proxy setup (#227) * fix: fix proxy setup * feat&fix: add proxy support in setup and remove hardcoded proxy from example	2025-07-02 01:36:32 +08:00
Danyang Zhang	d4273d992e	Calc eval fix (#225 ) * ver Jun17th updating annotations * ver Jun17th corrected annotation of 1d17 added check for cell merge * ver Jun17th updated several annotations * ver Jun20th fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08 * fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations. * ver Jun21st updating calc evals * ver Jun22nd fixed an impress task * ver Jun22ndv2 adjusted several calc tasks * Clean scalfolds --------- Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk> Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-06-30 18:23:09 +08:00
Tianbao Xie	30138c5db1	VLC fix (#224 ) * Enhance SetupController with improved logging and error handling during setup and file upload processes. Update instance type to t3.xlarge and AMI ID for AWS configuration. Add download progress logging and exception handling for better debugging. * Enhance VLC status evaluation by adding multiple paths for file and URL information extraction, improving robustness against varying VLC XML structures. Implement detailed logging for better debugging and error handling in case of mismatches or missing data. Update example JSON for VLC evaluation to use a valid HLS stream URL. * Improve audio comparison robustness in VLC evaluator by adding error handling for audio file loading and extraction. Implement detailed logging for empty or corrupt files, and normalize DTW distance calculation for more accurate similarity scoring. Remove deprecated audio fingerprint comparison function. --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-06-29 20:18:44 +08:00
Tianbao Xie	0cc93543a8	Environment is_used flag; OS domain fix (#219 ) * Refactor evaluator structure in LibreOffice Writer example JSON to support multiple expected and result files, enhancing evaluation flexibility. * Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities. * Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities. * Update time format in get_vm_file function to include hours, minutes, and seconds for more precise file naming with time suffix. * More delay for 936321ce-5236-426a-9a20-e0e3c5dc536f; support one more potential solutions. * Enhance SetupController with configurable retry limit and improved error handling for file opening requests. Introduce new function to compare unique training records, and update logging for better debugging. Adjust JSON examples for evaluation to support multiple expected and result files. * Clean debug code * Enhance DesktopEnv to track environment usage for optimized snapshot management. Introduce is_environment_used flag to determine if a snapshot revert is necessary based on provider type. Update setup and step methods to mark environment usage appropriately. Add new execute_with_verification method in SetupController for command execution with result verification, improving reliability. Change AWS instance type to m5.large for better performance and update AMI ID for compatibility. Update file opening logic in main.py to handle both file paths and application commands more effectively. --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-06-28 00:45:53 +08:00
MillanK	48ac57697a	VSCode fix (#222)	2025-06-24 17:08:09 +08:00
Tianbao Xie	4e11eafd1d	Robust Evaluation, Blocking File Open, Grader Sensitivity, and LibreOffice Writer Fixes (#217 ) * Refactor evaluator structure in LibreOffice Writer example JSON to support multiple expected and result files, enhancing evaluation flexibility. * Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities. * Update instance type to t3.large and add VNC access URL logging for allocated VMs, enhancing remote access capabilities. * Update time format in get_vm_file function to include hours, minutes, and seconds for more precise file naming with time suffix. * More delay for 936321ce-5236-426a-9a20-e0e3c5dc536f; support one more potential solutions. * Enhance SetupController with configurable retry limit and improved error handling for file opening requests. Introduce new function to compare unique training records, and update logging for better debugging. Adjust JSON examples for evaluation to support multiple expected and result files. * Clean debug code --------- Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>	2025-06-16 21:37:19 +08:00


496 Commits