Compare commits


236 Commits

Author SHA1 Message Date
蘑菇先生 5ef8bdfa35
EvoCUA Update (2026.01.05) (#412)
* evocua init

* setup max_token

* evocua update

---------

Co-authored-by: xuetaofeng <xuetaofeng@meituan.com>
Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>
2026-01-05 16:14:53 +08:00
Bowen Yang 439e178a2e
fix(os_symphony_evaluation) (#410)
* fix(os_symphony)

* Update desktop_env_os_symphony.py

* fix(os_symphony_desktop)

* fix(os_symphony_start)

* Add docstring to run_multienv_os_symphony.py

Added documentation header for the evaluation script.
2026-01-04 15:56:51 +08:00
Bowen Yang 951e1928c8
fix(desktop_os_symphony):support aws (#406)
* fix(os_symphony)

* Update desktop_env_os_symphony.py
2026-01-01 11:27:34 +08:00
Bowen Yang 02a35be067
fix(os_symphony) (#405) 2025-12-30 22:43:47 +08:00
Bowen Yang 662826f57e
fix(os_symphony):prompt (#402)
* add_os_symphony

* fix(os_symphony)

* fix(os_symphony):prompt

---------

Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>
2025-12-29 20:45:36 +08:00
xuetf 410ec63a89
Add EvoCUA Support (#401)
* evocua init

* setup max_token

---------

Co-authored-by: xuetaofeng <xuetaofeng@meituan.com>
Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>
2025-12-23 20:46:23 +08:00
Bowen Yang 031696e83c
fix os_symphony (#400)
* add_os_symphony

* fix(os_symphony)

---------

Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>
2025-12-23 20:45:30 +08:00
Bowen Yang f593f35b1c
add_os_symphony (#399) 2025-12-23 14:30:44 +08:00
Ubuntu ac31778ee3 Update: requirements.txt for seed agent 2025-12-15 11:47:56 +00:00
Ubuntu 60caa52fc4 Update: requirements.txt for seed agent 2025-12-15 11:47:40 +00:00
Ubuntu 41477a9c40 Update: seed agent 2025-12-15 11:45:57 +00:00
Ubuntu 78433ecfcf Add agent: seed agent 2025-12-12 05:35:20 +00:00
Meshal Nayim 9540454b0a
Fix demo agent (PromptAgent) reset(): add vm_ip and kwargs for compatibility with lib_run_single.py (#388) 2025-12-09 15:59:25 +08:00
MillanK cbc3b590ff
Task fix batch (#383)
* update 873cafdd-a581-47f6-8b33-b9696ddb7b05 task eval

* c1fa57f3-c3db-4596-8f09-020701085416 fix, add tolerance to url matching

* 8df7e444-8e06-4f93-8a1a-c5c974269d82 add a clearer instruction about the filename for compress

* add address string normalization for 6f4073b8-d8ea-4ade-8a18-c5d1d5d5aa9a

---------

Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>
2025-11-19 17:24:25 +08:00
Qichen Fu 903ed36715
Add Claude Sonnet 4.5 support and improve action handling (#362)
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>
2025-11-14 13:54:32 +08:00
Subash Shibu 3167339e45
Add hosted GBOX agent for OSWorld evaluation (#376) 2025-11-13 13:13:31 +08:00
Pengxiang-Li 00b6468eb7
feat/dart_gui (#371) 2025-11-07 21:50:01 +08:00
yiqilin 6d43dbc532
Update GIMP evaluation examples to replace local file paths with cloud file URLs for consistency and accessibility. (#372) 2025-11-07 21:49:49 +08:00
Timothyxxx 8365edc975 Add new section in README for OSWorld-MCP project 2025-10-30 06:06:48 +00:00
Daphne Barretto 21c2b7629b
Add consistent scores validation (#368)
* Add consistent scores validation

* revert osworld_run_maestro.py changes
2025-10-29 01:44:48 +08:00
Timothyxxx 3bf54c92a9 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-10-23 14:28:14 +08:00
Timothyxxx a484f2e484 Update setup.py for version bump and dependency adjustments
- Bump version from 1.0.0 to 1.0.1
- Update numpy dependency to allow versions >=1.26 and <3
- Adjust pandas dependency to allow versions >=2.2 and <2.3
- Add new __init__.py file in the docker provider directory
2025-10-23 14:27:52 +08:00
Atharva Gundawar 9f97535ef9
osworld agent wrapper for trained v7 (#360) 2025-10-18 02:29:15 +08:00
ludunjie.ldj afd29115da support aliyun eval of qwen3vl 2025-10-16 16:20:54 +08:00
Dunjie Lu 55372c4432
Fix API base URLs for OpenAI and DashScope
Updated the base URLs for OpenAI and DashScope API calls.
2025-10-14 12:57:00 +08:00
Dunjie Lu d25464c203
Djlu/qwen3vl dash (#356)
* support dashscope sdk to call qwen3-vl-plus

* support dashscope sdk to call qwen3-vl-plus

---------

Co-authored-by: Timothyxxx <Timothyxxx@users.noreply.github.com>
2025-10-13 16:31:06 +08:00
Xinyuan Wang f9e9273b3b
OpenCUA-72B (#354)
* use aws pub ip

* os task fix: set the default dim screen time to be 300s

* OpenCUA-72B

* update password

* update

* update

* update opencua72b agent

* change provider ip

---------

Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>
2025-10-13 10:39:33 +08:00
Yan98 ddb8372a6c
init public release (#350) 2025-10-06 22:16:31 +08:00
eun2ce 5eff00a9e3
Fix #347: Fix NameError in open_file timeout message (#351)
- Fix undefined 'timeout' variable in error message
- Use defined TIMEOUT constant instead of undefined timeout variable
- Prevents NameError when LibreOffice crashes during file opening
2025-10-06 22:14:15 +08:00
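
The bug pattern this fixes is a classic Python scoping slip; a minimal sketch of it (the launcher helper is a stub invented here for illustration, not the evaluator's actual code):

```python
TIMEOUT = 60  # module-level constant, as described in the commit

def launch_libreoffice(path: str, timeout_s: int) -> None:
    """Stub standing in for the real launcher, which can time out."""
    raise TimeoutError

def open_file(path: str) -> None:
    try:
        launch_libreoffice(path, timeout_s=TIMEOUT)
    except TimeoutError:
        # The pre-fix message referenced an undefined local `timeout`, so a
        # LibreOffice crash raised NameError instead of this intended error;
        # using the module-level TIMEOUT constant fixes it.
        raise RuntimeError(f"Opening {path} timed out after {TIMEOUT}s")
```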
Timothyxxx ff6285cfbb Add safe browsing feature to Chrome evaluator
- Implemented `get_enable_safe_browsing` function to retrieve safe browsing settings based on the operating system.
- Updated the `__init__.py` to include the new function.
- Modified JSON examples to reflect the change from enabling enhanced safety browsing to enabling safe browsing.
- Added necessary commands in the JSON examples for setting up preferences for safe browsing.
2025-10-05 04:56:08 +00:00
Danyang Zhang afd5952e44
ver Oct3rd (#349)
updated a series of instructions to ask the agent not to do any
unnecessary actions.
2025-10-04 00:13:29 +08:00
Timothyxxx 1572068035 Refactor evaluator functions in JSON examples to use URL pattern matching. Update expected URL formats to regex patterns for better validation in chrome evaluation examples. 2025-10-01 19:20:06 +00:00
Timothyxxx 9be518435c Update GIMP evaluation examples to replace local file paths with cloud file URLs for consistency and accessibility. 2025-10-01 09:54:52 +00:00
Timothyxxx bfb467da18 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-10-01 06:56:43 +00:00
Timothyxxx 4c685bed99 Update run_maestro.py to run in headless mode with a single environment and specify result directory. Adjust default TTL for AWS instances from 60 to 180 minutes in config.py. Enhance AWSProvider to handle missing security groups, subnet IDs, and instance types with fallbacks, and improve termination logic to skip already terminated instances while logging relevant information. 2025-10-01 06:56:33 +00:00
eun2ce 5eb5417188
fix #210: add a11y_tree support to UITARSAgent (#346) 2025-09-26 18:25:28 +08:00
Yanxiao Zhao 6827949418
fix _update_browse_history_setup (#345) 2025-09-25 13:22:40 +08:00
Yanxiao Zhao a4f8fe2f00
Add autoglm-os-9b-v (#344)
* update for autoglm-v

* Update run_autoglm.py

---------

Co-authored-by: hanyullai <hanyullai@outlook.com>
2025-09-24 19:43:28 +08:00
alexandruilie7 f59cf00cae
Add ui agent (#343)
* add uipath agent

* readme update
2025-09-24 19:42:46 +08:00
Long Chen 088e68798c
update aworldguiAgent code (#342) 2025-09-23 16:50:29 +08:00
Timothyxxx 584c7a9875 Enhance AWSProvider instance handling with fallback mechanisms for security groups, subnet IDs, and instance types. Implement checks to skip termination of instances already in 'shutting-down' or 'terminated' states, and handle potential termination errors gracefully. 2025-09-18 07:16:10 +00:00
molanhand 7213eca069
support mano agent (#338)
Co-authored-by: Fei Hu <molanhand@users.noreply.github.com>
2025-09-16 18:10:29 +08:00
ZhangZuhao dc7e46e7aa
Refactor platform detection for VM image download (#337)
Sometimes the platform detection for VM image download is wrong
2025-09-15 21:00:15 +08:00
Dunjie Lu b012301609
support qwen3vl agent (#336)
Co-authored-by: root <ludunjie1219@github.com>
2025-09-15 16:04:29 +08:00
Hiroid a668670349
fix(maestro): Fixed the debug logging level (#334)
Co-authored-by: Liangxuan Guo <guoliangxuan@deepmatrix.com.cn>
2025-09-11 01:03:59 +08:00
Hiroid 3a4b67304f
Add multiple new modules and tools to enhance the functionality and extensibility of the Maestro project (#333)
* Added a **pyproject.toml** file to define project metadata and dependencies.
* Added **run_maestro.py** and **osworld_run_maestro.py** to provide the main execution logic.
* Introduced multiple new modules, including **Evaluator**, **Controller**, **Manager**, and **Sub-Worker**, supporting task planning, state management, and data analysis.
* Added a **tools module** containing utility functions and tool configurations to improve code reusability.
* Updated the **README** and documentation with usage examples and module descriptions.

These changes lay the foundation for expanding the Maestro project’s functionality and improving the user experience.

Co-authored-by: Hiroid <guoliangxuan@deepmatrix.com>
2025-09-08 16:07:21 +09:00
Timothyxxx 029885e78c Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-09-05 15:36:39 +00:00
Timothyxxx 640f3fcd96 Update default path_to_vm argument to None in quickstart.py for improved flexibility 2025-09-05 15:36:31 +00:00
Timothyxxx 756923beea Update instruction wording in LibreOffice Impress example to clarify text color change requirements. Address https://github.com/xlang-ai/OSWorld/issues/324 2025-09-01 23:29:47 +08:00
Timothyxxx 0c681b91e0 Fix README update 2025-09-01 15:15:50 +00:00
aneeshprasad1 8513e8c89e
Add quickstart script and update README (#325)
Co-authored-by: Aneesh Prasad <aneeshprasad@Aneeshs-MacBook-Pro.local>
2025-09-01 23:14:24 +08:00
Howie 756e006af6
add support for mobile agent v3 (#328)
* add support for mobile agent v3

* add mobile_agent

* add support for mobile agent v3
2025-08-31 22:58:41 +08:00
hanyullai 54a14cbc07
fix multienv bug (#327) 2025-08-30 11:10:53 +08:00
Howie 3344abd641
Add support for GUI-Owl agent (#318)
* add run_multienv_owl.py

* add owl_agent.py
2025-08-27 18:03:39 +08:00
Timothyxxx ef2f35de22 Add resource group ID support for Aliyun VM allocation
- Introduced ALIYUN_RESOURCE_GROUP_ID environment variable to manage resource group assignments during VM allocation.
- Updated the _allocate_vm function to include resource group ID in the request if specified.
- Modified VNC URL logging to use public IP when available, enhancing clarity in access information.
- Maintained existing code logic while improving functionality for resource management and logging.
2025-08-26 13:28:23 +08:00
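
A minimal sketch of the env-driven wiring this describes, factored out of `_allocate_vm` for readability; the helper name and the `resource_group_id` attribute on the Aliyun request object are assumptions, not the repository's exact code:

```python
import os

def _apply_resource_group(request) -> None:
    """Attach a resource group to a CreateInstance-style request when configured."""
    resource_group_id = os.environ.get("ALIYUN_RESOURCE_GROUP_ID")
    if resource_group_id:
        # Only included in the request when the variable is set.
        request.resource_group_id = resource_group_id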
Timothyxxx 4c773f6f7c Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-08-22 23:29:21 +08:00
Timothyxxx ebda4d8b3f Add Aliyun SDK dependencies and implement TTL configuration for ECS instances
- Added new dependencies for Aliyun ECS SDK in requirements.txt and setup.py to support instance management features.
- Introduced a new config module to handle TTL settings for ECS instances, allowing for auto-termination based on environment variables.
- Updated the manager to utilize TTL settings, including scheduling instance termination with proper error handling and logging.
- Maintained existing code logic while enhancing functionality for improved instance lifecycle management.
2025-08-22 23:28:58 +08:00
Timothyxxx 15d9ddb612 update coact: add autogen/cache 2025-08-21 19:03:35 +00:00
Timothyxxx b14f1c7345 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-08-21 09:38:37 +00:00
Timothyxxx ead564c92b Update dependencies and refactor DesktopEnv initialization
- Removed specific versioning for the 'requests' library in requirements.txt and setup.py to allow for more flexible updates.
- Refactored the DesktopEnv class to streamline the emulator initialization process, enhancing error handling and logging during startup.
- Improved retry logic for file uploads in SetupController, ensuring robust handling of network issues and providing clearer error messages.
- Maintained existing code logic while enhancing clarity and reliability in the DesktopEnv and SetupController classes.
2025-08-21 09:38:28 +00:00
Timothyxxx b3e1c0344d Update OpenCV dependency to headless version in requirements and setup files
- Replaced 'opencv-python' with 'opencv-python-headless' in both requirements.txt and setup.py to reduce unnecessary GUI dependencies.
- Added a new .gitkeep file in the logs directory to ensure it is tracked in version control.
- Maintained existing code logic while improving dependency management.
2025-08-20 01:26:24 +08:00
Timothyxxx 492c910e94 Refactor AWS scheduler role handling in scheduler_utils.py
- Improved error handling and logging for role resolution and creation.
- Added checks to ensure the trust policy allows for AWS EventBridge Scheduler to assume the role.
- Implemented retry logic for scheduling EC2 termination to handle IAM eventual consistency.
- Maintained existing code logic while enhancing robustness and clarity in role management.
2025-08-18 17:57:31 +00:00
Timothyxxx 3a96fd5046 Add TTL configuration for AWS instance management
- Introduced a new config module to manage TTL settings for EC2 instances, allowing for auto-termination based on environment variables.
- Updated the AWSProvider and manager to utilize the new TTL settings, including scheduling instance termination via EventBridge Scheduler.
- Added utility functions for resolving the scheduler role ARN and creating termination schedules, ensuring robust error handling and logging.
- Maintained existing code logic while integrating new features for improved instance lifecycle management.
2025-08-18 17:30:49 +00:00
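
A sketch of how such a TTL schedule can be created with EventBridge Scheduler's universal EC2 target; the function name and the surrounding role-ARN handling are assumptions, not the repository's actual implementation:

```python
import datetime
import json
import boto3

def schedule_ec2_termination(instance_id: str, ttl_minutes: int, role_arn: str) -> None:
    """Create a one-off schedule that terminates the instance after the TTL."""
    fire_at = datetime.datetime.now(datetime.timezone.utc) + datetime.timedelta(minutes=ttl_minutes)
    boto3.client("scheduler").create_schedule(
        Name=f"terminate-{instance_id}",
        # at() expressions take a UTC timestamp without a timezone suffix.
        ScheduleExpression=f"at({fire_at.strftime('%Y-%m-%dT%H:%M:%S')})",
        FlexibleTimeWindow={"Mode": "OFF"},
        Target={
            # Universal target: Scheduler calls ec2:TerminateInstances directly.
            "Arn": "arn:aws:scheduler:::aws-sdk:ec2:terminateInstances",
            "RoleArn": role_arn,
            "Input": json.dumps({"InstanceIds": [instance_id]}),
        },
    )
```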
Adam Yanxiao Zhao 75f00fea62
Fix AutoGLM-OS custom env reset func (#312)
* Add AWS config for autoglm-os agent script

* update default password

* fix autoglm-os reset
2025-08-18 18:12:09 +08:00
Timothyxxx a5dc64c943 Update Aliyun guidelines to include SSH and VNC password setup script 2025-08-18 07:24:39 +00:00
Adam Yanxiao Zhao deff1fe385
Add AWS config for autoglm-os agent script (#311)
* Add AWS config for autoglm-os agent script

* update default password
2025-08-17 22:54:23 +08:00
Adam Yanxiao Zhao 2664eba23b
fix_run_autoglm (#310) 2025-08-17 18:32:46 +08:00
Adam Yanxiao Zhao aa05f6cc26
Add AutoGLM-OS agent (#309)
* autoglm-os initialize

* clean code

* chore: use proxy for download setup

* feat(autoglm-os): add parameter to toggle images

* fix: use temporary directory for files pulled from the vm to prevent potential collision when running multiple instances of the same task in parallel

* update

* add client_password

* update multienv

* fix

* fix prompt

* fix prompt

* fix prompt

* fix sys prompt

* feat: use proxy in file evaluator

* fix client_password

* fix note_prompt

* fix autoglm agent cmd type

* fix

* revert: fix: use temporary directory for files pulled from the vm to prevent potential collision when running multiple instances of the same task in parallel

reverts commit bab5473eea

* feat(autoglm): setup tools

* fix(autoglm): remove second time of get a11y tree

* add osworld server restart

* Revert "add osworld server restart"

This reverts commit 7bd9d84122.

* fix _launch_setup

* fix autoglm agent tools & xml tree

* fix desktop_env

* fix bug for tool name capitalization

* fix: always use proxy for setup download

* add fail after exceeding max turns

* fix(autoglm): avoid adding image to message when screenshot is empty

* fix maximize_window

* fix maximize_window

* fix maximize_window

* fix import browsertools module bug

* fix task proxy config bug

* restore setup

* refactor desktop env

* restore image in provider

* restore file.py

* refactor desktop_env

* quick fix

* refactor desktop_env.step

* fix our env reset

* add max turns constraint

* clean run script

* clean lib_run_single.py

---------

Co-authored-by: hanyullai <hanyullai@outlook.com>
Co-authored-by: JingBh <jingbohao@yeah.net>
2025-08-17 12:08:40 +08:00
SaiLong Li c833d03a4b
feat: Update EIP charge type to 'PayByTraffic' for Volcengine. (#308)
Co-authored-by: lisailong <lisailong.ze@bytedance.com>
2025-08-15 20:17:52 +08:00
SaiLong Li cc6eddb466
feat: Add Volcengine provider support for desktop environment. (#307)
Co-authored-by: lisailong <lisailong.ze@bytedance.com>
2025-08-15 18:53:13 +08:00
Timothyxxx 6ecbcf006b chore: add ag2 dependency to requirements and setup files for CoACT-1 support
- Included ag2 version 0.9.7 in requirements.txt and setup.py to ensure proper package installation.
- Maintained existing code logic while enhancing dependency management.
2025-08-13 09:25:49 +00:00
Timothyxxx 50388cfe61 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-08-13 09:04:17 +00:00
Timothyxxx 7fb5860da0 feat: enhance run_coact.py and related agents with improved task handling and configuration
- Updated TASK_DESCRIPTION in run_coact.py to clarify task-solving steps and requirements.
- Modified configuration parameters for provider name and client password for better security and flexibility.
- Enhanced OrchestratorUserProxyAgent to include user instruction in the auto-reply and improved screenshot handling.
- Adjusted coding_agent.py to ensure proper verification of results before saving changes.
- Improved CUA agent prompts to maintain application state and handle user instructions more effectively.
- Ensured existing code logic remains unchanged while enhancing functionality and usability.
2025-08-13 09:04:09 +00:00
Quyu Kong 893b059e55
feat: Add Aliyun provider support for desktop environment (#304)
* Adding support for aliyun as a provider

* feat: enhance Aliyun provider support

- Added Aliyun as a new provider in the desktop environment.
- Updated the environment configuration guidelines for Aliyun, including prerequisites and environment variables.
- Implemented instance allocation and management functions for Aliyun ECS, including signal handling for graceful termination.
- Improved logging and error handling during instance creation and status checks.
- Adjusted the provider's methods to utilize the new instance management functions.
2025-08-12 14:31:08 +08:00
Timothyxxx d2ae0f697d feat: enhance AnthropicAgent with start_coordinate handling and modifier key support
- Added support for an optional start_coordinate parameter to facilitate drag actions from a specified starting point.
- Implemented validation for start_coordinate to ensure it is a tuple of two integers.
- Enhanced click actions to handle modifier keys, allowing for more complex interactions.
- Ensured existing code logic remains unchanged while improving functionality and usability.
2025-08-12 05:34:18 +00:00
Timothyxxx 7418f5cf2f chore: add traceback import for enhanced error handling
- Introduced the traceback module to improve error reporting and debugging capabilities.
- Ensured that existing code logic remains unchanged while preparing for future enhancements.
2025-08-12 05:15:54 +00:00
Timothyxxx 9e4d717cde fix: update AMI mappings in AWS manager
- Changed the AMI ID for the ap-east-1 region to a new value for better compatibility.
- Added comments to clarify the usage of AMIs for CoACT-1 and the need for manual transfer from us-east-1.
- Ensured existing logic remains unchanged while improving documentation for future reference.
2025-08-11 12:19:18 +00:00
Timothyxxx e2d1887662 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-08-10 14:40:19 +00:00
Timothyxxx bd6efcfc4d fix: enhance screenshot retrieval in PythonController
- Added a static method to validate image responses for PNG and JPEG formats using magic bytes.
- Improved error handling in the get_screenshot method to log invalid payloads and retry attempts.
- Updated the requests call to include a timeout for better reliability.
2025-08-10 14:40:18 +00:00
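
The magic-byte check described here is small; a sketch of what such a validator might look like on the controller (method name assumed, class name from the commit subject):

```python
class PythonController:
    """Sketch of the controller's screenshot validation only."""

    @staticmethod
    def _is_valid_image(payload: bytes) -> bool:
        # Accept only payloads that start with the PNG signature or the
        # JPEG start-of-image marker.
        return payload.startswith(b"\x89PNG\r\n\x1a\n") or payload.startswith(b"\xff\xd8\xff")
```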
Timothyxxx bc1db8d623 chore: update setup.py for version 1.0.0 release
- Bumped version to 1.0.0.
- Updated Python requirement to >=3.10.
- Upgraded dependencies: numpy, Pillow, pandas, torch, and added new dependencies including pygame, backoff, openai, dashscope, google-generativeai, wandb, gdown, tiktoken, groq, docker, loguru, dotenv, tldextract, and anthropic.
- Ensured existing logic remains intact while enhancing package capabilities.
2025-08-05 22:19:42 +08:00
Danyang Zhang 7364a720a6
Calc eval fix (#297)
* ver Jun17th

updating annotations

* ver Jun17th

corrected annotation of 1d17
added check for cell merge

* ver Jun17th

updated several annotations

* ver Jun20th

fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08

* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.

* ver Jun21st

updating calc evals

* ver Jun22nd

fixed an impress task

* ver Jun22ndv2

adjusted several calc tasks

* Clean scaffolds

* ver Jul18th

added two try-excepts to handle possible formula parsing and calculation
failures

* ver Jul19th

added support for cellIs and some other new types of conditional
formatting for calc evaluation

* ver Aug4th

updated some instructions

* ver Aug4thv2

fixed a typo

---------

Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-08-04 12:39:35 +08:00
yuanmengqi 84f407afdd feat: enhance run_coact.py with logging and configuration options
- Added logging configuration to capture runtime logs in both file and console with adjustable log levels.
- Introduced new command-line arguments for provider name, region, and client password to improve flexibility and security.
- Updated process_task function to accommodate new parameters, ensuring compatibility with existing logic.
- Modified prompt templates in coding_agent.py and cua_agent.py to use the client password placeholder for enhanced security.
2025-07-31 05:47:58 +00:00
yuanmengqi a5b51e8010 refactor: update command in JSON example to use placeholder for client password
- Replaced the hardcoded password in the command with a placeholder `{CLIENT_PASSWORD}` for improved security and flexibility.
- Ensured that the overall structure of the JSON remains unchanged while enhancing the example's usability.
2025-07-31 05:20:04 +00:00
yuanmengqi 5e24d72da6 fix: correct IP address return logic in AWSProvider
- Reverted the return value in the AWSProvider class to use private IP address instead of public IP address.
- Ensured that the logic remains intact while addressing the specific requirement for VNC access.
2025-07-31 05:14:00 +00:00
yuanmengqi b081c328bf Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-31 04:16:42 +00:00
yuanmengqi acd75476d8 docs: add acknowledgements section in README.md
- Included a new section to acknowledge institutions and students who contributed feedback and participated in fixes.
- Enhanced recognition of collaborative efforts in the project while maintaining the existing structure of the README.
2025-07-31 04:16:35 +00:00
Yuan Mengqi 239dd37d2e
clean claude run code (#293)
* add uitars agent code

* improve claude

* improve claude

* improve claude

* improve claude

* improve claude

* add nogdrive json

* merge claude code

* clean code claude run

* clean code claude run

* clean code claude run
2025-07-31 12:09:08 +08:00
Linxin Song b968155757
CoACT initialize (#292) 2025-07-31 10:35:20 +08:00
Xinyuan Wang 862d704b8c
Wxy/opencua (#290)
* OpenCUA Agent code base

* update url

* debug, modify url input

* debug opencua

* show result

* debug agent history overlap

* modify opencua agent; add comment lines

* update parallel; clean code; use sleep 3s

* ui-tars-0717

* update detail

* add system password to system prompt

* add running command
2025-07-31 08:53:49 +08:00
Xinyuan Wang 3d32556085
Uitars/dev (#291)
* use aws pub ip

* os task fix: set the default dim screen time to be 300s

* add all the uitars agents:
1. run_multienv_uitars.py: Qwen2VL-based UITARS models
2. run_multienv_uitars15_v1.py: UITARS1.5-7B
3. run_multienv_uitars15_v2.py: SeedVL1.5 thinking/non-thinking

---------

Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>
2025-07-31 08:52:27 +08:00
yuanmengqi dd488c7294 feat: enhance image comparison functionality in gimp.py
- Added resizing logic to handle images of different sizes before comparison, ensuring consistent evaluation.
- Implemented mode conversion to ensure both images are in the same format for accurate comparison.
- Enhanced structure check by MSE to support conversion of numpy arrays to PIL Images, improving compatibility.
- Maintained existing logic while improving robustness and accuracy of image comparison methods.
2025-07-30 06:07:49 +00:00
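
A rough sketch of the alignment steps this describes (numpy-to-PIL coercion, resizing, mode conversion) ahead of an MSE-style comparison; the function name is illustrative, not gimp.py's actual API:

```python
import numpy as np
from PIL import Image

def aligned_mse(a, b) -> float:
    """Mean squared error after coercing inputs to comparable PIL images."""
    if isinstance(a, np.ndarray):
        a = Image.fromarray(a)
    if isinstance(b, np.ndarray):
        b = Image.fromarray(b)
    if a.size != b.size:    # resize so different-sized images can be compared
        b = b.resize(a.size)
    if a.mode != b.mode:    # ensure both images share a pixel format
        b = b.convert(a.mode)
    x = np.asarray(a, dtype=np.float64)
    y = np.asarray(b, dtype=np.float64)
    return float(((x - y) ** 2).mean())
```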
MillanK0817 4ae9d41da4 feat: update jedi agent with support for o3 as planner 2025-07-30 14:06:37 +08:00
yuanmengqi 99fa3b7cb9 docs: refine proxy configuration note in README.md for clarity
- Updated the proxy configuration section to specify that some tasks may require proxy settings to function properly, depending on website defenses.
- Enhanced user guidance by clarifying the importance of proper proxy configuration for task execution.
- Maintained existing content while improving clarity and user understanding of configuration requirements.
2025-07-29 09:59:31 +00:00
yuanmengqi c3469835f2 docs: update README.md with important configuration requirements for tasks
- Added a section detailing essential configuration requirements for Google Account Tasks and proxy settings.
- Highlighted the impact of missing configurations on task execution and evaluation scores.
- Maintained existing content while enhancing user guidance and clarity in setup instructions.
2025-07-29 09:57:04 +00:00
yuanmengqi 00804f8118 feat: update provider and action space in DesktopEnv class
- Changed the default provider name from "aws" to "vmware" to reflect new environment requirements.
- Updated the action space from "computer_13" to "pyautogui" for improved interaction capabilities.
- Maintained existing class structure and logic while implementing these updates for better functionality.
2025-07-29 06:48:41 +00:00
yuanmengqi af64f4ef49 docs: update README.md with font download link and VSCode trust settings
- Replaced the font download link for LibreOffice with a new source.
- Added instructions for configuring VSCode to disable workspace trust prompts, enhancing user experience.
- Maintained existing content while improving clarity and providing additional setup guidance.
2025-07-28 15:13:37 +00:00
yuanmengqi 70cf3e6982 docs: enhance AWS section in README.md for clarity and efficiency
- Updated the AWS support section to emphasize the benefits of using cloud services for parallel evaluation, including potential time reductions.
- Improved clarity in the username and password information for virtual machines, ensuring security measures are highlighted.
- Maintained existing content while enhancing the overall readability and user guidance in the documentation.
2025-07-28 15:12:18 +00:00
yuanmengqi 0eb3a3d6d7 docs: update README.md with new evaluation sections and guidelines
- Added a new section for Local Evaluation, clarifying the import process for `run_multienv.py`.
- Introduced a Public Evaluation section detailing the process for verifying results on the leaderboard and requirements for sharing agent implementations.
- Included links to the Public Evaluation Guideline for user reference.
- Maintained existing content while enhancing clarity and providing additional resources for users.
2025-07-28 08:39:09 +00:00
yuanmengqi 0dc78937d0 docs: update README.md with enhanced OSWorld-Verified details and AWS support
- Expanded the OSWorld-Verified update entry to include new model results and a comparison with previous benchmarks.
- Added a new section on AWS support, detailing the benefits of using cloud services for parallel evaluation and providing links to setup guides.
- Corrected the baseline agent command example to reflect the updated model name and added a new example for parallel execution.
- Clarified the username and password information for virtual machines, emphasizing security measures for cloud services.
- Maintained existing content while enhancing clarity and providing additional resources for users.
2025-07-28 08:22:25 +00:00
yuanmengqi a37fe86925 feat: enhance logging and signal handling in run_multienv_claude.py
- Refactored logging configuration to support dynamic log levels via command-line arguments, allowing for better control over log verbosity.
- Introduced a new signal handler for graceful shutdown of environments and processes, improving robustness during termination.
- Added functionality to save command-line arguments to a JSON file for better traceability of execution parameters.
- Maintained existing logic while enhancing the overall structure and error handling capabilities of the script.
2025-07-28 07:43:13 +00:00
yuanmengqi 78651040e7 docs: update README.md with new OSWorld-Verified announcement and minor text corrections
- Added a new update entry for the introduction of **OSWorld-Verified** highlighting major updates and community fixes.
- Corrected the spelling of "VirtualBox" in the environment refactor entry.
- Enhanced clarity in the Docker section title for better readability.
2025-07-28 07:19:39 +00:00
yuanmengqi 0f00788c4d feat: add run_multienv_o3.py script for multi-environment evaluation
- Introduced a new script `run_multienv_o3.py` to facilitate end-to-end evaluation across multiple environments.
- Implemented command-line argument parsing for various configurations including environment settings, logging levels, and AWS parameters.
- Integrated signal handling for graceful shutdown of environments and processes.
- Enhanced logging capabilities for better traceability during execution.
- Maintained existing logic from previous scripts while introducing new functionalities for improved evaluation processes.
2025-07-27 16:47:24 +00:00
yuanmengqi 1342bfe5ce delete: remove show_result_opencua.py file and its associated functions 2025-07-27 16:37:40 +00:00
yuanmengqi 5fa490adf4 fix: update Flask port configuration to support environment variable
- Modified the Flask application to allow the port to be set via the `FLASK_PORT` environment variable, defaulting to 8080 if not specified.
- Ensured existing application logic remains unchanged while enhancing configurability for deployment environments.
2025-07-27 16:14:07 +00:00
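
The change amounts to a one-line environment lookup at startup; a minimal sketch, with the host binding an assumption:

```python
import os
from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # FLASK_PORT overrides the listening port; fall back to 8080 when unset.
    port = int(os.environ.get("FLASK_PORT", "8080"))
    app.run(host="0.0.0.0", port=port)
```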
yuanmengqi 523d553e88 feat: add client password argument to multiple agents and scripts
- Introduced `--client_password` argument in `run_multienv_aguvis.py`, `run_multienv_claude.py`, and `run_multienv_gta1.py` for enhanced security and flexibility.
- Updated agent classes (`PromptAgent`, `AguvisAgent`, `GTA1Agent`) to accept and utilize `client_password` for improved configuration.
- Modified evaluation guidelines to reflect the new client password requirement.
- Ensured existing logic remains intact while enhancing functionality for better user experience.
2025-07-27 16:11:23 +00:00
yuanmengqi 122b16742b fix: improve EPUB processing by checking for file existence before reading
- Added checks for the presence of "toc.ncx" and "content.opf" in the EPUB file before attempting to process them.
- Introduced debug logging to notify when these files are not found, enhancing error handling and traceability.
- Maintained existing logic while improving robustness of the EPUB processing function.
2025-07-26 20:42:18 +00:00
yuanmengqi b25854edba feat: introduce DummyAgent class for enhanced coordinate handling
- Added DummyAgent class to facilitate coordinate generation and action assignment.
- Updated GTA1Agent to utilize DummyAgent for improved planning and execution.
- Increased max_steps and N_SEQ parameters for better performance.
- Enhanced logging for planning and execution processes.
- Maintained existing logic while integrating new functionality.
2025-07-26 08:26:23 +00:00
yuanmengqi d49ca9cc2d fix: enhance handling of '<' characters in pyautogui commands
- Refactor _fix_pyautogui_less_than_bug to improve handling of press('<') and typewrite calls.
- Introduce Unicode escape decoding for typewrite content to ensure proper '<' character processing.
- Maintain existing logic while enhancing functionality for better compatibility.
2025-07-26 07:59:37 +00:00
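
One possible shape of the workaround, shown purely as a sketch: the real `_fix_pyautogui_less_than_bug` may differ, and the shift+comma substitution assumes a US keyboard layout:

```python
import re

def _fix_pyautogui_less_than_bug(cmd: str) -> str:
    """Rewrite generated pyautogui code so '<' is typed correctly."""
    # press('<') misbehaves on some layouts; emit shift+comma instead.
    cmd = cmd.replace("pyautogui.press('<')", "pyautogui.hotkey('shift', ',')")
    cmd = cmd.replace('pyautogui.press("<")', 'pyautogui.hotkey("shift", ",")')

    def _decode(match: re.Match) -> str:
        # Decode escape sequences in the payload so an escaped '<' survives.
        body = match.group(1).encode().decode("unicode_escape")
        return f"pyautogui.typewrite({body!r})"

    return re.sub(r"pyautogui\.typewrite\('([^']*)'\)", _decode, cmd)
```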
yuanmengqi 123f51ea4a Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-26 07:28:39 +00:00
yuanmengqi 73caf53880 delete: remove img_utils.py and update imports in jedi_3b_agent.py and jedi_7b_agent.py to use qwen_vl_utils 2025-07-26 07:28:31 +00:00
张逸群 2ed0436c21
fix: update DockerVMManager method signatures for interface compatibility (#287)
- Fix delete_vm() method to accept region and **kwargs parameters
- Fix occupy_vm() method to accept pid, region and **kwargs parameters
- Ensures consistency with base VMManager interface and other providers
- Resolves runtime argument mismatch errors when calling these methods

This maintains backward compatibility while fixing the interface contract.
2025-07-26 01:18:00 +08:00
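
In sketch form, the signature alignment looks like this; the parameter lists are inferred from the message, not copied from the repo:

```python
class VMManager:
    """Stub standing in for the real base interface."""

class DockerVMManager(VMManager):
    def delete_vm(self, vm_path, region=None, **kwargs):
        # `region` and **kwargs are accepted purely for interface
        # compatibility with other providers; Docker ignores them.
        ...

    def occupy_vm(self, vm_path, pid, region=None, **kwargs):
        # Same idea: `pid` and `region` match the base VMManager contract.
        ...
```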
yuanmengqi 40fdc6266f chore: update default AWS instance type from t3.xlarge to t3.medium 2025-07-25 15:56:42 +00:00
yuanmengqi 39e5baf5ae fix: remove unnecessary sleep and observation retrieval in run_single_example function 2025-07-25 15:51:20 +00:00
yuanmengqi f5595df71c delete: remove gat1_agent.py file 2025-07-25 07:11:55 +00:00
Zilong Zhou b8b9e9b166
feat: add proxy handling logic and clean up imports (#285) 2025-07-24 16:27:56 +08:00
Zilong Zhou cbe650d0bb
refactor&delete: simplify AWS VM allocation and remove proxy support (#284) 2025-07-24 16:27:18 +08:00
Jiaqi 23b81993fa os task fix: set the default dim screen time to be 300s 2025-07-24 08:13:02 +00:00
ChenYXxxx 873f8a0359
Update 10a730d5-d414-4b40-b479-684bed1ae522.json
change "the ight" to "the night"
2025-07-24 15:44:52 +08:00
Jiarui Yao 4fd8b5be0a
support docker without kvm (#282) 2025-07-24 12:31:43 +08:00
张逸群 bf78b6d05e
Add OPENAI_BASE_URL support for custom OpenAI-compatible endpoints (#283)
Enables GPT models to use custom API endpoints through OPENAI_BASE_URL environment variable. This addresses the limitation where only Azure OpenAI supported custom endpoints while standard GPT models were hardcoded to api.openai.com.

- Add intelligent URL handling to avoid duplicate /v1 paths
- Maintain backward compatibility with default OpenAI API
- Update README with configuration instructions
- Non-breaking change preserving existing functionality

Fixes API integration issues for users with custom OpenAI-compatible services.
2025-07-24 12:31:08 +08:00
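
A sketch of the "intelligent URL handling" this describes, under the assumption that it reduces to normalizing a trailing `/v1` segment (helper name is illustrative):

```python
import os

def resolve_openai_base_url(default: str = "https://api.openai.com/v1") -> str:
    """Honor OPENAI_BASE_URL while avoiding duplicate /v1 path segments."""
    base = os.environ.get("OPENAI_BASE_URL")
    if not base:
        return default  # backward compatible: default OpenAI API
    base = base.rstrip("/")
    return base if base.endswith("/v1") else base + "/v1"
```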
yuanmengqi 0b0f8413df Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-23 17:19:54 +00:00
yuanmengqi 0a2929137b Simplify the logic for Docker provider 2025-07-23 17:19:47 +00:00
Yan98 2f3a6c48f6
Fix Typos (#275)
* init

* init

* fix typo
2025-07-24 00:06:04 +08:00
yuanmengqi fd7381210e Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-23 16:05:42 +00:00
yuanmengqi 5d219e7a5b Clean code 2025-07-23 16:05:39 +00:00
Yuan Mengqi d128edbbc1
add nogdrive json (#281)
* add uitars agent code

* improve claude

* improve claude

* improve claude

* improve claude

* improve claude

* add nogdrive json
2025-07-23 19:12:42 +08:00
张逸群 4d6e0fd031
Add --provider_name parameter to run.py and fix Docker provider initialization (#277)
- Add command-line argument --provider_name to support flexible provider selection
- Default provider remains vmware for backward compatibility
- Fix Docker provider controller initialization issue with delayed setup
- Add safety checks for controller existence in error handling

This enables users to specify different virtualization providers directly
from the command line and resolves Docker container lifecycle issues.
2025-07-23 04:09:36 +08:00
yuanmengqi 73de48af75 Update Public Evaluation Guidelines and README to require Python 3.10 and enhance installation instructions. Added troubleshooting tips for environment issues and clarified access key creation process in AWS for better security practices. 2025-07-22 19:57:55 +00:00
yuanmengqi 82c3cdd590 feat: refactor run_multienv_qwen25vl.py and qwen25vl_agent.py for improved logging and task management
- Introduced signal handling for graceful shutdown of environments and processes.
- Enhanced logging configuration to support dynamic log levels and structured output.
- Updated argument parsing to include new parameters for model selection and task execution.
- Refactored task distribution logic to streamline environment task management.
- Improved error handling during task execution and environment cleanup.
- Adjusted Qwen25VLAgent initialization to support new model and thought prefix options.
- Reduced max tries for LLM calls to optimize performance.
2025-07-22 19:46:42 +00:00
张逸群 4a5d48000f
feat: add HuggingFace mirror support for VM providers (#278)
Add support for HuggingFace mirror (hf-mirror.com) to improve download
speeds in regions where huggingface.co access is slow.

- Support HF_ENDPOINT environment variable detection
- Automatically switch to hf-mirror.com when HF_ENDPOINT is set
- Apply to Docker, VMware, and VirtualBox providers
- Maintain backward compatibility with default huggingface.co URLs

Users can now set HF_ENDPOINT=https://hf-mirror.com to use the mirror.
2025-07-23 03:40:35 +08:00
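
The mirror switch reduces to rewriting the download host; a minimal sketch, with the URL assembly illustrative rather than the providers' exact code:

```python
import os

def hf_download_url(repo_path: str) -> str:
    """Build a download URL, honoring HF_ENDPOINT (e.g. https://hf-mirror.com)."""
    endpoint = os.environ.get("HF_ENDPOINT", "https://huggingface.co").rstrip("/")
    return f"{endpoint}/{repo_path.lstrip('/')}"
```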
Yuan Mengqi 0a37cccd53
update claude (#280)
* add uitars agent code

* improve claude

* improve claude

* improve claude

* improve claude

* improve claude
2025-07-23 03:35:49 +08:00
Dunjie Lu 53fb96298a
support_qwen25vl (#276)
Co-authored-by: root <ludunjie1219@github.com>
2025-07-22 16:33:03 +08:00
yuanmengqi 921321c5df Update Public Evaluation Guidelines to clarify proxy settings. Added information on automatic proxy wrapping for proxy-sensitive tasks and retained the recommendation for users to disable the proxy if not needed. Ensured existing content structure remains intact. 2025-07-22 05:59:57 +00:00
yuanmengqi 2727696835 Enhance Public Evaluation Guidelines by adding new images for AWS setup and monitoring instructions. Included additional contact information for leaderboard updates and error reporting. Ensured clarity and usability for users while preserving existing content structure. 2025-07-22 05:53:33 +00:00
yuanmengqi 05e25ba1b7 Enhance Public Evaluation Guidelines with detailed AWS setup instructions and security configurations. Added new sections for host and client machine setup, including recommended instance types, storage considerations, and security group rules. Updated existing content for clarity and added a new image for Google Drive authentication. Ensure all changes maintain original logic while improving usability for users with varying AWS experience. 2025-07-22 05:35:58 +00:00
yuanmengqi feaebbc2ec Update AWS guidance 2025-07-20 16:42:14 +00:00
yuanmengqi 46c9407879 Clean elder version of opencua experiment runner 2025-07-20 07:57:27 +00:00
yuanmengqi 91bc6bb6ce Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-20 07:55:57 +00:00
yuanmengqi 88d5639a2a Compatible with agents that cannot use runtime log 2025-07-20 07:55:53 +00:00
Xinyuan Wang e10dd9267c
Wxy/opencua (#274)
* OpenCUA Agent code base

* update url

* debug, modify url input

* debug opencua

* show result

* debug agent history overlap

* modify opencua agent; add comment lines

* update parallel; clean code; use sleep 3s

* ui-tars-0717
2025-07-20 15:52:23 +08:00
Danyang Zhang bec7129fff
Calc eval fix (#273)
* ver Jun17th

updating annotations

* ver Jun17th

corrected annotation of 1d17
added check for cell merge

* ver Jun17th

updated several annotations

* ver Jun20th

fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08

* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.

* ver Jun21st

updating calc evals

* ver Jun22nd

fixed an impress task

* ver Jun22ndv2

adjusted several calc tasks

* Clean scaffolds

* ver Jul18th

added two try-excepts to handle possible formula parsing and calculation
failures

* ver Jul19th

added supports for cellIs and some other new types of conditional
formatting for calc evaluation

---------

Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-19 17:15:40 +08:00
yuanmengqi c6c62c52d7 feat: add X11 image handling and enhanced OCR processing
- Introduced a new function `read_x11_image` to read and convert X11 (XWD) format images to PIL Image, supporting both 24-bit and 32-bit formats.
- Enhanced the `compare_image_text` function to include checks for X11 image formats, with multiple conversion attempts using PIL, a custom reader, and netpbm tools.
- Improved error handling and logging for OCR processing, providing detailed feedback on conversion attempts and potential issues with X11 images.
- Maintained existing logic while expanding functionality for better image processing reliability.
2025-07-18 19:26:29 +00:00
yuanmengqi d6f2190a9f fix: refine instruction in OS evaluation example to clarify restrictions on logging out or shutting down the machine 2025-07-18 18:51:01 +00:00
yuanmengqi d52f3b1fca feat: add safe image opening function with retry mechanism
- Introduced a new function `safe_open_image_with_retry` to handle image file opening with retries for truncated or corrupted files.
- Enhanced error handling and logging for image processing in `check_palette_and_structure_sim`.
- Updated the logic to safely open both source and target images, ensuring robust evaluation without altering existing functionality.

These changes improve the reliability of image handling in the GIMP evaluator while maintaining the original code logic.
2025-07-18 18:36:09 +00:00
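
A sketch of what `safe_open_image_with_retry` plausibly does per the message; the retry count and delay are assumptions:

```python
import time
from PIL import Image, ImageFile

def safe_open_image_with_retry(path: str, retries: int = 3, delay: float = 1.0):
    """Open an image, tolerating truncated files and retrying corrupted reads."""
    ImageFile.LOAD_TRUNCATED_IMAGES = True  # let PIL accept truncated data
    for attempt in range(retries):
        try:
            img = Image.open(path)
            img.load()  # force a full decode so corruption surfaces here
            return img
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```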
yuanmengqi 4fa59ebba2 feat: enhance URL comparison logic and Chrome debugging configuration
- Added a new function to ensure URLs have a scheme, defaulting to 'http://' if missing.
- Integrated tldextract to normalize URLs by extracting domain parts and handling 'www' subdomains.
- Updated the compare_urls function to include logging for better traceability during URL comparisons.
- Added tldextract to requirements.txt to support the new functionality.
- Updated the AWS manager with a new AMI ID for the specified resolution.
- Modified Chrome desktop launcher to include --remote-debugging-port=1337 for GUI debugging support.

These changes improve the robustness of URL handling and enable consistent Chrome debugging capabilities without altering existing logic.
2025-07-18 17:55:45 +00:00
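
In sketch form, the scheme-defaulting and tldextract-based normalization might look like this (function names are illustrative):

```python
from urllib.parse import urlparse
import tldextract

def ensure_scheme(url: str) -> str:
    """Default to http:// when the URL carries no scheme."""
    return url if urlparse(url).scheme else "http://" + url

def normalized_host(url: str) -> str:
    """Extract domain parts and drop a bare 'www' subdomain before comparing."""
    ext = tldextract.extract(ensure_scheme(url))
    subdomain = "" if ext.subdomain == "www" else ext.subdomain
    return ".".join(part for part in (subdomain, ext.domain, ext.suffix) if part)
```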
yuanmengqi 1ade6fe439 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-18 14:17:37 +00:00
yuanmengqi 44bd66fc9a Increase timeout for page load stability in Chrome evaluator
- Updated the timeout for the page load state from 10 seconds to 60 seconds to ensure better stability during page processing.
- Removed redundant retry mechanisms from the active tab checks to streamline the code while maintaining existing functionality.
- Enhanced logging to provide clearer insights into the page loading process.

These changes aim to improve the reliability of the Chrome evaluator without altering the core logic.
2025-07-18 14:16:16 +00:00
Danyang Zhang 53ffc05042
Calc eval fix (#272)
* ver Jun17th

updating annotations

* ver Jun17th

corrected annotation of 1d17
added check for cell merge

* ver Jun17th

updated several annotations

* ver Jun20th

fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08

* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.

* ver Jun21st

updating calc evals

* ver Jun22nd

fixed an impress task

* ver Jun22ndv2

adjusted several calc tasks

* Clean scaffolds

* ver Jul18th

added two try-excepts to handle possible formula parsing and calculation
failures

---------

Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
2025-07-18 21:28:48 +08:00
Zilong Zhou 7fb1cee575
fix: img path error (#271)
* feat&style: add task status configuration and clear cache functionality; enhance UI styles

* feat&refactor: enhance current configuration API and improve cache clearing logic

* refactor&style: simplify task status update logic and improve page refresh mechanism

* refactor&feat: streamline default configuration retrieval and enhance cache initialization logic

* feat&refactor: add caching to default configuration retrieval and streamline task status logic

* feat&style: add collapsible section for additional model parameters and enhance styling for config items

* refactor&style: remove floating action button and clean up related styles

* fix: update video and screenshot sources to include action space, observation type, and model name parameters
2025-07-18 19:52:03 +08:00
shenzhennan 1378c745e1 Merge branch 'main' of https://github.com/xlang-ai/OSWorld 2025-07-18 07:15:18 +00:00
shenzhennan c7017a476d fix impress instruction 0a211154 2025-07-18 07:14:35 +00:00
yuanmengqi fcaefe7bb4 Enhance Chrome evaluator with improved error handling and retry mechanisms
- Added robust error handling for page processing, including checks for closed pages and HTTP status codes.
- Implemented retry logic for page loads and active tab checks to improve reliability.
- Enhanced logging throughout the process to capture detailed information about failures and successes.
- Preserved existing logic while ensuring better maintainability and robustness in the Chrome evaluator functions.
2025-07-18 07:13:13 +00:00
yuanmengqi 0fb625e4fd Update instruction in OS evaluation example to include a restriction against shutting down the machine. 2025-07-18 05:28:43 +00:00
yuanmengqi b9a646c11d Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-17 18:19:19 +00:00
yuanmengqi d1ddd3eacd feat: enhance VM wallpaper retrieval and image similarity checks
- Added logging to the VM wallpaper retrieval function to capture errors and warnings related to content retrieval and file creation.
- Implemented checks for None, empty, and invalid content types to ensure robustness in wallpaper handling.
- Enhanced the SSIM structure check function with size validation and improved error handling for image processing.
- Added logging for image size discrepancies and exceptions during SSIM computation to aid in debugging.

These changes improve error handling and logging, ensuring better maintainability and reliability of the evaluators.
2025-07-17 18:19:09 +00:00
Zilong Zhou 66694c663d
Feat/monitor cache (#267)
* feat&style: add task status configuration and clear cache functionality; enhance UI styles

* feat&refactor: enhance current configuration API and improve cache clearing logic

* refactor&style: simplify task status update logic and improve page refresh mechanism

* refactor&feat: streamline default configuration retrieval and enhance cache initialization logic

* feat&refactor: add caching to default configuration retrieval and streamline task status logic

* feat&style: add collapsible section for additional model parameters and enhance styling for config items

* refactor&style: remove floating action button and clean up related styles
2025-07-18 01:58:20 +08:00
yuanmengqi e70cf0bd93 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-17 11:15:53 +00:00
yuanmengqi 9d04624e41 feat: enhance Chrome evaluator with improved retry logic and logging
- Implemented retry mechanism for connecting to Chrome instances, allowing up to two attempts before failure.
- Increased timeout settings for page navigation and loading to enhance reliability.
- Added detailed logging for connection attempts, page loading status, and error handling to improve debugging and user experience.
- Ensured existing logic is preserved while enhancing error handling and operational robustness.

These changes improve the overall reliability and maintainability of the Chrome evaluator functions.
2025-07-17 11:15:47 +00:00
yuanmengqi 2c51950e73 feat: enhance evaluator configuration for Chrome with post-execution commands
- Added postconfig commands to multiple JSON files for Chrome evaluation examples.
- Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing.
- Updated logging messages in the AWS manager to improve clarity and user experience.

These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
2025-07-17 10:50:10 +00:00
Yuan Mengqi 5ca516ac7a
add uitars agent code (#265) 2025-07-17 18:17:13 +08:00
Xinyuan Wang 24fbad9015
Merge pull request #264 from yuanmengqi/main
Improve the parallel logic
2025-07-17 12:28:48 +08:00
yuanmengqi fe40011b5d Improve the parallel logic 2025-07-17 04:21:42 +00:00
yuanmengqi 6788c58aa3 Improve the parallel logic 2025-07-17 04:20:59 +00:00
yuanmengqi bb8b0b2582 Improve the parallel logic 2025-07-17 04:19:44 +00:00
yuanmengqi 150234307e Merge branch 'fix_chrome' 2025-07-17 04:14:47 +00:00
yuanmengqi 9eeabfc52d Improve the parallel logic 2025-07-17 04:14:20 +00:00
yuanmengqi 2a48500691 refactor: improve code readability and error handling in table.py
- Reformatted import statements for better organization.
- Enhanced error handling in file reading functions to provide clearer logging.
- Introduced a new `_safe_read_file` function to handle multiple encoding attempts when reading files.
- Updated the `compare_csv` function to utilize the new file reading method, improving robustness.
- Ensured consistent return values across functions, replacing `0.` with `0.0` for clarity.

These changes maintain existing logic while improving code maintainability and readability.
2025-07-16 18:11:05 +00:00
yuanmengqi a9c1c6135a Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-16 17:38:39 +00:00
yuanmengqi 0939226020 feat: enhance evaluator configuration with post-execution commands for Chrome
- Added a series of postconfig commands to the evaluator section in the JSON file.
- Commands include executing a refresh in Chrome, managing Chrome processes, launching Chrome with remote debugging, and opening specific settings tabs.
- Introduced sleep intervals to ensure proper execution timing between commands.

This update improves the automation capabilities of the evaluation examples while maintaining existing logic.
2025-07-16 17:37:37 +00:00
Zilong Zhou dc164d5269
feat&fix: update configuration management to save model arguments and enhance UI display for model args (#262) 2025-07-16 21:46:35 +08:00
yuanmengqi e433f35c1f feat: standardize configuration fields across all evaluation examples
- Add `fixed_ip` field to all 369 JSON files in examples directory
  - Set to `true` for 8 files listed in google_chrome.json multi_apps
  - Set to `false` for remaining 361 files
- Add `possibility_of_env_change` field to 363 JSON files missing this field
  - Set to "low" for newly added fields
  - Preserve existing values (4 medium, 2 high) for 6 files that already had this field

This ensures consistent configuration schema across all evaluation examples
while maintaining backward compatibility with existing settings.
2025-07-16 13:45:34 +00:00
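
A migration of this shape can be scripted in a few lines; a sketch assuming the examples live under `evaluation_examples/examples` and that the eight fixed-IP task IDs (not listed in the message) are known:

```python
import json
from pathlib import Path

FIXED_IP_IDS = set()  # the 8 task IDs from google_chrome.json multi_apps go here

for path in Path("evaluation_examples/examples").rglob("*.json"):
    data = json.loads(path.read_text(encoding="utf-8"))
    data["fixed_ip"] = path.stem in FIXED_IP_IDS
    # Only add the field when absent, preserving the 6 pre-existing values.
    data.setdefault("possibility_of_env_change", "low")
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
```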
yuanmengqi b9df320f31 Enhance check_python_file_by_test_suite function with robust error handling and logging. Added validation for file existence, module loading, and function execution. Improved resource cleanup and working directory management to ensure stability and reliability during test execution. 2025-07-16 11:44:46 +00:00
Xinyuan Wang 0f2655249c
Wxy/opencua (#260)
* OpenCUA Agent code base

* update url

* debug, modify url input

* debug opencua

* show result

* debug agent history overlap

* modify opencua agent; add comment lines
2025-07-16 17:53:12 +08:00
yuanmengqi 5e5058c1f2 fix: standardize provider interface parameters across all implementations
- Add screen_size parameter to get_vm_path() for all providers (with default 1920x1080)
- Add os_type parameter to start_emulator() for Azure and VirtualBox providers
- Add region parameter to stop_emulator() for VMware, Docker, and VirtualBox providers
- Use *args, **kwargs for better extensibility and parameter consistency
- Add documentation comments explaining ignored parameters for interface consistency
- Prevents TypeError exceptions when AWS-specific parameters are passed to other providers

This ensures all providers can handle the same parameter sets while maintaining
backward compatibility and avoiding interface fragmentation.
2025-07-15 21:38:34 +00:00
yuanmengqi cb070307ee merge code 2025-07-15 14:57:14 +00:00
yuanmengqi 175b4b46c2 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-15 14:50:48 +00:00
yuanmengqi 7912880d16 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-15 07:24:38 +00:00
yuanmengqi 451bbf5fc2 Update multi_apps JSON examples: refined instructions for image processing in GIMP, replaced an open command with a launch command for VLC, and corrected assignment modification instruction in LibreOffice Calc example. 2025-07-15 07:24:33 +00:00
Yuan Mengqi af47ed8fb1
fix infeasible&chrome tasks (#258)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

* Improve code logic for password & resolution

* edit

* Merge branch 'main' into fix_chrome

* fix chrome tasks

* Merge branch 'fix_chrome'

* fix infeasible&chrome tasks

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-15 13:02:42 +08:00
yuanmengqi 68a9f647f4 fix: address https://github.com/xlang-ai/OSWorld/issues/257 by implementing a fix for the PyAutoGUI '<' character bug in command execution. Introduced a new function to handle typewrite and press calls, ensuring correct behavior when using '<' in commands. Updated command execution logic to apply this fix before executing user commands. 2025-07-15 04:17:34 +00:00
yuanmengqi 8e18b7839a fix infeasible&chrome tasks 2025-07-15 02:19:27 +00:00
yuanmengqi 1bf0730dce Merge remote-tracking branch 'upstream/main' 2025-07-15 02:14:41 +00:00
yuanmengqi 756ef96850 Merge branch 'fix_chrome' 2025-07-15 02:13:58 +00:00
yuanmengqi 08b4cf2c2f fix infeasible&chrome tasks 2025-07-15 02:09:40 +00:00
ChenYXxxx 698483390a
"Could you turn my image into CYMK mode?" add "within GIMP" 2025-07-14 23:53:20 +08:00
ChenYXxxx 9242becd87
"Please batch process all images on the desktop by increasing their brightness to 50, instead of adjusting them individually." add "within GIMP" 2025-07-14 23:52:41 +08:00
ChenYXxxx 56b2fe9cc4
Update d16c99dc-2a1e-46f2-b350-d97c86c85c15.json 2025-07-14 23:23:11 +08:00
ChenYXxxx b481c794c5
Update 72f83cdc-bf76-4531-9a1b-eb893a13f8aa.json 2025-07-14 23:22:50 +08:00
ChenYXxxx 7f973a391c
Update f723c744-e62c-4ae6-98d1-750d3cd7d79d.json 2025-07-14 23:22:14 +08:00
shenzhennan 7f96cc0633 Merge branch 'main' of https://github.com/xlang-ai/OSWorld 2025-07-14 12:35:00 +00:00
shenzhennan 53983db9cb fix impress eval: extend sleep time to ensure save 2025-07-14 12:34:43 +00:00
Xinyuan Wang db83b9cb2c
Wxy/opencua (#256)
* OpenCUA Agent code base

* update url

* debug, modify url input
2025-07-14 20:26:39 +08:00
Danyang Zhang 2339db20ca
ver Jul7th (#255)
pip-installing directly from PyPI fails mysteriously in postconfig
execution, possibly owing to proxy configuration in the VM; adjusted the
strategy by downloading the wheel on the host and pip-installing it locally
on the VM in thunderbird/d38192b0-17dc-4e1d-99c3-786d0117de77
2025-07-14 20:26:29 +08:00
shenzhennan 60e26d2d0d fix impress compare to use gold file 2025-07-14 11:35:06 +00:00
yuanmengqi 90c4e894a4 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-14 07:14:19 +00:00
yuanmengqi 5d90faa548 run operator 2025-07-14 07:13:17 +00:00
Zilong Zhou 74b7c189af
Feat/monitor (#254)
* feat: add claude support

* feat: add script for end-to-end evaluation with logging and task distribution

* feat&fix: add tool result handling and update model default in evaluation script

* chore: remove run_test_env.py script

* feat&fix: implement action parsing for tool calls and update default action space

* fix: update text formatting in action parsing and replace logger import

* feat&fix: implement action parsing for tool calls and add screen size handling

* feat: add setup instructions for Anthropic API integration

* feat: add notice about image size limitations for Anthropic API

* Delete test_env/logger.py

* Delete test_env/utils.py

* fix: update logger usage to use global logger and improve error handling

* feat&fix: add configuration management API endpoints and update UI for configuration selection

* feat&fix: update environment configuration, enhance task statistics, and improve UI responsiveness

* feat&fix: add configuration toggle button in UI and improve task loading performance

* feat&fix: add accuracy percentage display to score and style updates for UI
2025-07-14 13:43:41 +08:00
yuanmengqi 0651495d88 fix: Enhance error handling and logging across multiple evaluators
- Added logging for file retrieval and error handling in file.py, improving robustness during file operations.
- Implemented checks for file existence and parsing errors in general.py, enhancing reliability in JSON/YAML processing.
- Improved table comparison logic in table.py with detailed error logging for sheet loading and cell value reading.
- Enhanced metrics evaluation in slides.py with additional checks for paragraph and run counts, ensuring thorough comparison.
- Updated utils.py to include file existence checks and detailed error logging during cell value reading.
2025-07-14 05:43:17 +00:00
yuanmengqi b8b026f817 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-13 13:16:48 +00:00
yuanmengqi 7c807d4f3e Merge remote-tracking branch 'upstream/main' 2025-07-13 13:12:26 +00:00
Zilong Zhou 349f2fd9fe
Feat/claude cua support (#253)
* feat: add claude support

* feat: add script for end-to-end evaluation with logging and task distribution

* feat&fix: add tool result handling and update model default in evaluation script

* chore: remove run_test_env.py script

* feat&fix: implement action parsing for tool calls and update default action space

* fix: update text formatting in action parsing and replace logger import

* feat&fix: implement action parsing for tool calls and add screen size handling

* feat: add setup instructions for Anthropic API integration

* feat: add notice about image size limitations for Anthropic API

* Delete test_env/logger.py

* Delete test_env/utils.py
2025-07-13 21:10:49 +08:00
Yuan Mengqi 38a30734a6
Improve code logic for password & resolution (#252)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

* Improve code logic for password & resolution

* edit

* Merge branch 'main' into fix_chrome

* fix chrome tasks

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-13 21:04:07 +08:00
yuanmengqi 9f806d425d fix chrome final 2025-07-13 12:46:00 +00:00
yuanmengqi 7279469d23 fix chrome tasks 2025-07-13 12:41:27 +00:00
yuanmengqi 572a94b6df Merge branch 'main' into fix_chrome 2025-07-13 10:16:08 +00:00
yuanmengqi a16b54c175 edit 2025-07-13 10:14:41 +00:00
yuanmengqi 94ea30cb45 Merge remote-tracking branch 'upstream/main' 2025-07-13 07:05:54 +00:00
yuanmengqi d3bf4823cb Improve code logic for password & resolution 2025-07-13 07:01:28 +00:00
yuanmengqi a070ddda7e Improve code logic for password & resolution 2025-07-13 06:59:45 +00:00
yuanmengqi 97ed6f99b0 Final review multi_apps fix the rest part 2025-07-12 20:28:55 +00:00
yuanmengqi dbecf46057 Merge branch 'main' of github.com:xlang-ai/OSWorld 2025-07-12 16:35:02 +00:00
yuanmengqi 877e75a013 Final review multi_apps fix Xinzhuang part 2025-07-12 16:34:55 +00:00
Yuan Mengqi 27319ce1e3
fix password&resolution (#251)
* fix chrome

* fix: fix proxy setup

* feat&fix: add proxy support in setup and remove hardcoded proxy from example

* fix tasks

* fix chrome finished

* fix

* clean chrome_fix code

* clean chrome_fix code

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix multiapps

* fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774

* fix some multi_apps tasks

* fix some multi_apps tasks

* fix password&resolution

* fix password&resolution

---------

Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
2025-07-13 00:25:37 +08:00
yuanmengqi 3b698aa3c0 Merge branch 'fix_chrome' 2025-07-12 15:13:44 +00:00
yuanmengqi 08bbf77511 fix password&resolution 2025-07-12 15:11:42 +00:00
yuanmengqi fb0c301e14 Merge branch 'fix_chrome' 2025-07-11 12:17:42 +00:00
yuanmengqi 37c56533f0 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-11 12:16:23 +00:00
yuanmengqi fe3bb2fd92 fix password&resolution 2025-07-11 12:15:03 +00:00
yuanmengqi bc1248f73e Merge branch 'fix_chrome' 2025-07-08 10:49:26 +00:00
yuanmengqi 349c31fa55 Merge remote-tracking branch 'upstream/main' into fix_chrome 2025-07-08 10:45:57 +00:00
yuanmengqi 5778078596 fix some multi_apps tasks 2025-07-08 10:35:47 +00:00
yuanmengqi 8d670df32d fix some multi_apps tasks 2025-07-08 10:31:10 +00:00
yuanmengqi 67b68abaec fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 2025-07-04 07:22:36 +00:00
yuanmengqi 8aa686a2a9 fix multiapps 2025-07-04 07:18:15 +00:00
yuanmengqi 66e669b50b fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774 2025-07-04 07:14:54 +00:00
yuanmengqi 5b9be93acd clean chrome_fix code 2025-07-02 11:26:19 +00:00
yuanmengqi 3fd9fa94e6 clean chrome_fix code 2025-07-02 11:24:45 +00:00
yuanmengqi 3d7be9f216 Merge remote-tracking branch 'upstream/main' 2025-07-02 11:04:52 +00:00
yuanmengqi 1a5c4b0e10 fix 2025-07-02 10:24:29 +00:00
yuanmengqi ca24d308bd fix chrome finished 2025-07-02 09:22:42 +00:00
yuanmengqi 144a87fd9a Merge remote-tracking branch 'upstream/fix/aws-proxy' 2025-07-01 16:07:41 +00:00
yuanmengqi 2e3a4a5ba9 fix tasks 2025-07-01 15:57:14 +00:00
adlsdztony 2f9ae1f3ad feat&fix: add proxy support in setup and remove hardcoded proxy from example 2025-07-01 13:47:16 +00:00
adlsdztony 64f47d1a32 fix: fix proxy setup 2025-07-01 13:20:26 +00:00
yuanmengqi b48c69a2fb Merge remote-tracking branch 'upstream/main' 2025-06-30 08:20:45 +00:00
yuanmengqi ea51f5264a fix chrome 2025-06-30 08:07:24 +00:00
926 changed files with 143224 additions and 4615 deletions

16
.gitignore vendored

@ -197,4 +197,18 @@ vmware_vm_data
.vscode
dataimpulse_proxy_config.json
## reference and draft and debug
reference/
draft/
manual_examine.py
run_human_examine.sh
quick_start.py
result_multi_apps_pengxiang_transformers12
evaluation_examples/settings/proxy/dataimpulse.json
# Local test configurations (not for public repo)
evaluation_examples/spiderman.json
evaluation_examples/test_50_random_proportional.json
evaluation_examples/test_chrome.json


@ -1,30 +1,35 @@
# Public Evaluation Platform User Guide
We have built an AWS-based platform for large-scale parallel evaluation of OSWorld tasks.
The system follows a Host-Client architecture:
- **Host Instance**: The central controller that stores code, configurations, and manages task execution.
- **Client Instances**: Worker nodes automatically launched to perform tasks in parallel.
All instances use a preconfigured AMI to ensure a consistent environment.
The architecture consists of a host machine (where you git clone and set up the OSWorld host environment) that controls multiple virtual machines for testing and potential training purposes.
Each virtual machine serves as an OSWorld client environment using pre-configured AMI images and runs `osworld.service` to execute actions and commands from the host machine.
To prevent security breaches, proper security groups and subnets must be configured for both the host and virtual machines.
## 1. Platform Deployment & Connection
Below, we assume you have no prior AWS configuration experience.
You may freely skip or replace any graphical operations with API calls if you know how to do so.
### 1.1 Launch the Host Instance
Please create an instance in the AWS EC2 graphical interface to build the Host Machine.
Create an EC2 instance in the AWS Console with the following settings:
Our recommended instance type settings are as follows: if you want to run fewer than 5 VM environments in parallel (`--num_envs` < 5), `t3.medium` will be sufficient; if you want to run fewer than 15 VM environments in parallel (`--num_envs` < 15), `t3.large` will be sufficient; however, if you want to use more than 15 VM environments in parallel, it's better to choose a machine with more vCPUs and memory, such as `c4.8xlarge`.
| Configuration Item | Value |
| -------------------------- | ------------------------------------------------------------ |
| AMI ID | `ami-0e49e0a70044dde43` |
| Instance Type | - `t3.medium` (Recommended for ≤5 parallel tasks)<br />- ` t3.large ` (Recommended for ≤15 parallel tasks)<br /><br /> - These numbers are based on using VSCode over SSH. You can save resources by running via CLI—`t3.large` supports up to 20 tasks that way.<br /> - For higher parallelism, use a more powerful instance. |
| VPC | `vpc-0f207282fe145bcda` |
| Subnet | `subnet-0a4b0c5b8f6066712` |
| Firewall (security groups) | `sg-05f8e79c10a7768e4` |
| Storage | 50GB<br /> - Consider increasing if storing multiple results to avoid crashes. |
For the AMI, we recommend using Ubuntu Server 24.04 LTS (HVM), SSD Volume Type, though other options will also work since the host machine doesn't have strict requirements.
For storage space, please consider the number of experiments you plan to run.
We recommend at least 50GB or more.
For security group configuration, please configure according to your specific requirements.
We provide a monitor service that runs on port 8080 by default.
You need to open this port to use this functionality.
Set the VPC as default, and we will return to it later to configure the virtual machines with the same setting.
Once launched, you will receive an instance ID like `i-xxxxxx`.
### 1.2 Connect to the Host Instance
@ -66,7 +71,7 @@ Once launched, you will receive an instance ID like `i-xxxxxx`.
ssh -i <your_key_path> ubuntu@<your_public_dns>
```
* VSCode Remote SSH configuration:
* VSCode/Cursor Remote SSH configuration:
```
Host host_example
@ -75,6 +80,94 @@ Once launched, you will receive an instance ID like `i-xxxxxx`.
IdentityFile <your_key_path>
```
#### Step 3: Set up the host machine
After you connect to the host machine, clone the latest OSWorld and set up the environment.
Please ensure that the version of Python is >= 3.10.
```
# Clone the OSWorld repository
git clone https://github.com/xlang-ai/OSWorld
# Change directory into the cloned repository
cd OSWorld
# Optional: Create a Conda environment for OSWorld
# conda create -n osworld python=3.10
# conda activate osworld
# Install required dependencies
pip install -r requirements.txt
```
When installing the requirements you may run into general environment issues, but they are solvable.
You may need `apt install` to install and configure some system-level dependencies.
Such issues can usually be fixed quickly with the help of AI tools like Claude Code.
Then it is almost done for the host machine part!
### 1.3 Set up the virtual machine
We need to programmatically scale virtual machines.
Therefore, we will use the graphical interface to configure and obtain the necessary environment variables, then set these environment variables on the host machine so that the OSWorld code can read them to automatically scale and run experiments.
For the client machine environments, we have already prepared pre-configured client machine environments in different regions, stored in https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/providers/aws/manager.py:
```
IMAGE_ID_MAP = {
"us-east-1": {
(1920, 1080): "ami-0d23263edb96951d8"
},
"ap-east-1": {
(1920, 1080): "ami-0c092a5b8be4116f5"
}
}
# Tell us if you need more; we can migrate images from one region to another.
```
Therefore, you don't need to configure the virtual machine environments and related variables.
If you need to add additional functionality, you can configure it based on these images.
If you want to reconfigure from scratch, please refer to the files and instructions under https://github.com/xlang-ai/OSWorld/tree/main/desktop_env/server.
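For clarity, here is a minimal sketch of how such a lookup resolves an AMI from a region and screen size (the `resolve_ami` helper is illustrative, not part of the repository):
```python
# Mirror of the IMAGE_ID_MAP structure shown above.
IMAGE_ID_MAP = {
    "us-east-1": {(1920, 1080): "ami-0d23263edb96951d8"},
    "ap-east-1": {(1920, 1080): "ami-0c092a5b8be4116f5"},
}

def resolve_ami(region: str, screen_size: tuple = (1920, 1080)) -> str:
    """Return the prebuilt client AMI for a region/resolution pair."""
    try:
        return IMAGE_ID_MAP[region][screen_size]
    except KeyError:
        raise ValueError(f"No prebuilt AMI for {region} at {screen_size}")

print(resolve_ami("us-east-1"))  # ami-0d23263edb96951d8
```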
#### Step 1: Security Group for OSWorld Virtual Machines
OSWorld requires certain ports to be open, such as port 5000 for backend connections to OSWorld services, port 5910 for VNC visualization, port 9222 for Chrome control, etc.
The `AWS_SECURITY_GROUP_ID` variable represents the security group configuration for virtual machines serving as OSWorld environments.
Please complete the configuration and set this environment variable to the ID of the configured security group.
**⚠️ Important**: Please strictly follow the port settings below to prevent OSWorld tasks from failing due to connection issues:
##### Inbound Rules (8 rules required)
| Type | Protocol | Port Range | Source | Description |
|------|----------|------------|--------|-------------|
| SSH | TCP | 22 | 0.0.0.0/0 | SSH access |
| HTTP | TCP | 80 | 172.31.0.0/16 | HTTP traffic |
| Custom TCP | TCP | 5000 | 172.31.0.0/16 | OSWorld backend service |
| Custom TCP | TCP | 5910 | 0.0.0.0/0 | NoVNC visualization port |
| Custom TCP | TCP | 8006 | 172.31.0.0/16 | VNC service port |
| Custom TCP | TCP | 8080 | 172.31.0.0/16 | VLC service port |
| Custom TCP | TCP | 8081 | 172.31.0.0/16 | Additional service port |
| Custom TCP | TCP | 9222 | 172.31.0.0/16 | Chrome control port |
Once finished, record the security group ID; you will need to set it as the environment variable `AWS_SECURITY_GROUP_ID` on the host machine before starting the client code.
##### Outbound Rules (1 rule required)
| Type | Protocol | Port Range | Destination | Description |
|------|----------|------------|-------------|-------------|
| All traffic | All | All | 0.0.0.0/0 | Allow all outbound traffic |
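If you prefer to script this step instead of clicking through the console, a boto3 sketch along these lines should work (the group name is illustrative; the ports and CIDRs mirror the tables above):
```python
import boto3

ec2 = boto3.client("ec2")  # region/credentials come from your environment

sg_id = ec2.create_security_group(
    GroupName="osworld-clients",          # illustrative name
    Description="OSWorld client VM ports",
    VpcId="vpc-xxxx",                     # your VPC ID from Step 2
)["GroupId"]

# (port, cidr) pairs mirroring the inbound rules table above (all TCP).
rules = [
    (22, "0.0.0.0/0"), (80, "172.31.0.0/16"), (5000, "172.31.0.0/16"),
    (5910, "0.0.0.0/0"), (8006, "172.31.0.0/16"), (8080, "172.31.0.0/16"),
    (8081, "172.31.0.0/16"), (9222, "172.31.0.0/16"),
]
for port, cidr in rules:
    ec2.authorize_security_group_ingress(
        GroupId=sg_id,
        IpPermissions=[{"IpProtocol": "tcp", "FromPort": port,
                        "ToPort": port, "IpRanges": [{"CidrIp": cidr}]}],
    )
print("AWS_SECURITY_GROUP_ID =", sg_id)
```
A newly created security group already allows all outbound traffic by default, which matches the single outbound rule above.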
#### Step 2: Record VPC Configuration for Client Machines from Host Machine
To isolate the entire evaluation stack, we run both the host machine and all client virtual machines inside a dedicated VPC.
The setup is straightforward:
1. Launch the host instance from the EC2 console and note the **VPC ID** and **Subnet ID** shown in its network settings.
2. Record the **Subnet ID** as you will need to set it as the environment variable `AWS_SUBNET_ID` on the host machine before starting the client code.
<p align="center">
<img src="./assets/pubeval_subnet.png" alt="pubeval_subnet" style="width: 80%;" />
</p>
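If you would rather read these values programmatically than from the console, here is a small boto3 sketch (replace `i-xxxxxx` with your host instance ID):
```python
import boto3

ec2 = boto3.client("ec2")
inst = ec2.describe_instances(InstanceIds=["i-xxxxxx"])["Reservations"][0]["Instances"][0]
print("VPC ID        =", inst["VpcId"])
print("AWS_SUBNET_ID =", inst["SubnetId"])
```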
### 1.3 Get AWS Access Keys & Secret Access Key
Click on **Security Credentials** from the drop-down menu under your account in the top-right corner.
@ -89,16 +182,40 @@ In the **Access keys** section, click **"Create access key"** to generate your o
<img src="./assets/pubeval5.png" alt="pubeval5" style="width: 100%;" />
</p>
If this method doesn't work, please go to **IAM** → **Users** → select your username → **Security credentials** tab → **Create access key**.
Alternatively, you can create access keys through IAM for better security practices:
1. Navigate to **IAM** in the AWS Console
2. Click on **Users** in the left sidebar
3. Select your username or create a new IAM user
4. Go to the **Security credentials** tab
5. Click **Create access key**
6. Choose the appropriate use case (e.g., "Command Line Interface (CLI)")
7. Download or copy the Access Key ID and Secret Access Key
**Note**: For production environments, it's recommended to use IAM roles instead of access keys when possible, or create dedicated IAM users with minimal required permissions rather than using root account credentials.
Similarly, you will later need to set these as environment variables on the host machine.
## 2. Environment Setup
### 2.1 Google Drive Integration
Great! Now back to the **host machine**, we can start running experiments!
All the following operations are performed on the host machine environment, under the OSWorld path.
### 2.1 Google Drive Integration (Optional)
Follow the instructions in [ACCOUNT_GUIDELINE](./ACCOUNT_GUIDELINE.md), specifically the section "Generating `credentials.json` for Public Eval". This part is necessary if using public evaluation.
You can skip this step at the debugging stage, since only 8 tasks involve Google Drive, and their policy makes the setup increasingly cumbersome.
<p align="center">
<img src="./assets/pubeval_gdrive_auth.jpg" alt="pubeval_gdrive_auth" style="width:80%;" />
</p>
### 2.2 Proxy Setup
- Register at [DataImpulse](https://dataimpulse.com/).
- Purchase a US residential IP package (approximately $1 per 1GB).
- Configure your credentials in `OSWorld/evaluation_examples/settings/proxy/dataimpulse.json`:
```json
@ -117,18 +234,35 @@ Follow the instructions in [ACCOUNT_GUIDELINE](./ACCOUNT_GUIDELINE.md), specific
]
```
We have set `proxy` to `true` in the config JSON files for proxy-sensitive tasks. OSWorld automatically wraps these tasks with a proxy when DesktopEnv's `enable_proxy=True`; other tasks are not affected.
We recommend using a proxy.
If you don't need it at all, please set `enable_proxy=False` in the experiment's `.py` file:
```
env = DesktopEnv(
...
enable_proxy=False,
...
)
```
(We have not documented the DesktopEnv interface in much detail; please read through the code to understand it.)
Note that disabling the proxy will cause some tasks under the Chrome domain to fail.
### 2.3 Set Environment Variables
```bash
export OPENAI_API_KEY_CUA="your_api_key"
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_security_access_key"
export AWS_REGION="your_aws_region" # e.g. us-east-1
export AWS_SUBNET_ID="subnet-0a4b0c5b8f6066712"
export AWS_SECURITY_GROUP_ID="sg-08a53433e9b4abde6"
# export OPENAI_API_KEY_CUA="your_openai_api_key"   # if you use the OpenAI API
# export ANTHROPIC_API_KEY="your_anthropic_api_key" # if you use the Anthropic API
# export DASHSCOPE_API_KEY="your_dashscope_api_key" # if you use the DashScope API from Alibaba Qwen
# export DOUBAO_API_KEY="your_doubao_api_key"       # if you use the Doubao Seed API from ByteDance (UI-TARS)
# export DOUBAO_API_URL="your_doubao_api_url"
export AWS_ACCESS_KEY_ID="your_access_key"          # key we mentioned before
export AWS_SECRET_ACCESS_KEY="your_security_access_key" # key we mentioned before
export AWS_REGION="your_aws_region"                 # e.g. us-east-1; if unset, defaults to us-east-1
export AWS_SECURITY_GROUP_ID="sg-xxxx"              # the security group we mentioned before
export AWS_SUBNET_ID="subnet-xxxx"                  # the subnet we mentioned before
```
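As a quick sanity check that the keys above are actually picked up, you can run this small snippet (not part of the OSWorld codebase) before launching anything:
```python
import boto3

# Raises an error if AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are missing or invalid.
print(boto3.client("sts").get_caller_identity()["Arn"])
```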
## 3. Running Evaluations
Use the `run_multienv_xxx.py` scripts to launch tasks in parallel.
@ -136,15 +270,31 @@ Use the `run_multienv_xxx.py` scripts to launch tasks in parallel.
Example (with the OpenAI CUA agent):
```bash
# --client_password: the password you configured for the client machines
# Run OpenAI CUA
python run_multienv_openaicua.py \
--headless \
--observation_type screenshot \
--model computer-use-preview \
--result_dir ./results_all \
--result_dir ./results_operator \
--test_all_meta_path evaluation_examples/test_all.json \
--region us-east-1 \
--max_steps 150 \
--num_envs 5
--max_steps 50 \
--num_envs 5 \
--client_password osworld-public-evaluation
# Run Anthropic via AWS Bedrock; modify the agent if you want to use the Anthropic API endpoint directly
python run_multienv_claude.py \
--headless \
--observation_type screenshot \
--action_space claude_computer_use \
--model claude-4-sonnet-20250514 \
--result_dir ./results_claude \
--test_all_meta_path evaluation_examples/test_all.json \
--max_steps 50 \
--num_envs 5 \
--provider_name aws \
--client_password osworld-public-evaluation
```
Key Parameters:
@ -155,6 +305,8 @@ Key Parameters:
- `--test_all_meta_path`: Path to the test set metadata
- `--region`: AWS region
Usually the entry scripts are named `run_multienv_xxx.py` under the main folder, and the agent implementations live under the `mm_agents` folder.
Add parameters according to your needs.
## 4. Viewing Results
@ -170,6 +322,23 @@ Then, open your Host's **public IP** on port `8080` in a browser. (e.g. `http://<
For more, see: [MONITOR_README](./monitor/README.md)
<p align="center">
<img src="./assets/pubeval_monitor1.jpg" alt="pubeval_monitor" style="width:80%;" />
</p>
<p align="center">
<img src="./assets/pubeval_monitor2.jpg" alt="pubeval_monitor" style="width:80%;" />
</p>
### 4.2 VNC Remote Desktop Access
We pre-install VNC on every virtual machine so you can watch it while experiments are running.
You can access it via VNC at `http://<client-public-ip>:5910/vnc.html`.
The default password in our AMI is `osworld-public-evaluation`, set to prevent attacks.
## 5. Contact the team to update leaderboard and fix errors (optional)
If you want your results to be displayed on the leaderboard, please send a message to the OSWorld leaderboard maintainers (tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) and open a pull request. We can update the results in the self-reported section.
If you want your results to be verified and displayed in the verified leaderboard section, we need you to schedule a meeting with us to run your agent code on our side to obtain results and have us report them. Alternatively, if you are from a trusted institution, you can share your monitor and trajectories with us.
If you discover new errors or the environment has undergone some changes, please contact us via GitHub issues or email.

147
README.md

@ -33,9 +33,10 @@
## 📢 Updates
- 2025-07-28: Introducing **OSWorld-Verified**! We have made major updates, fixed several issues reported by the community, added more AWS support (parallelization can reduce evaluation time to within 1 hour!), and made the benchmark signals more effective. Check out more in the [report](https://xlang.ai/blog/osworld-verified). We have run new model results on the latest version and updated them on the [official website](https://os-world.github.io/). Please compare your OSWorld results with the new benchmark results when running the latest version.
- 2025-05-01: If you need pre-downloaded files for init state setup, we downloaded for you [here](https://drive.google.com/file/d/1XlEy49otYDyBlA3O9NbR0BpPfr2TXgaD/view?usp=drive_link).
- 2024-10-22: We supported Docker🐳 for hosting virtual machines on virtualized platforms. Check below for detailed instructions!
- 2024-06-15: We refactor the code of environment part to decompose VMware Integration, and start to support other platforms such as VitualBox, AWS, Azure, etc. Hold tight!
- 2024-06-15: We refactor the code of environment part to decompose VMware Integration, and start to support other platforms such as VirtualBox, AWS, Azure, etc. Hold tight!
- 2024-04-11: We released our [paper](https://arxiv.org/abs/2404.07972), [environment and benchmark](https://github.com/xlang-ai/OSWorld), and [project page](https://os-world.github.io/). Check it out!
## 💾 Installation
@ -43,7 +44,7 @@
Suppose you are operating on a system that has not been virtualized (e.g. your desktop, laptop, bare metal machine), meaning you are not utilizing a virtualized environment like AWS, Azure, or k8s.
If this is the case, proceed with the instructions below. However, if you are on a virtualized platform, please refer to the [Docker](https://github.com/xlang-ai/OSWorld?tab=readme-ov-file#docker-server-with-kvm-support-for-the-better) section.
1. First, clone this repository and `cd` into it. Then, install the dependencies listed in `requirements.txt`. It is recommended that you use the latest version of Conda to manage the environment, but you can also choose to manually install the dependencies. Please ensure that the version of Python is >= 3.9.
1. First, clone this repository and `cd` into it. Then, install the dependencies listed in `requirements.txt`. It is recommended that you use the latest version of Conda to manage the environment, but you can also choose to manually install the dependencies. Please ensure that the version of Python is >= 3.10.
```bash
# Clone the OSWorld repository
git clone https://github.com/xlang-ai/OSWorld
@ -52,7 +53,7 @@ git clone https://github.com/xlang-ai/OSWorld
cd OSWorld
# Optional: Create a Conda environment for OSWorld
# conda create -n osworld python=3.9
# conda create -n osworld python=3.10
# conda activate osworld
# Install required dependencies
@ -64,7 +65,7 @@ Alternatively, you can install the environment without any benchmark tasks:
pip install desktop-env
```
2. Install [VMware Workstation Pro](https://www.vmware.com/products/workstation-pro/workstation-pro-evaluation.html) (for systems with Apple Chips, you should install [VMware Fusion](https://support.broadcom.com/group/ecx/productdownloads?subfamily=VMware+Fusion)) and configure the `vmrun` command. The installation process can refer to [How to install VMware Worksation Pro](desktop_env/providers/vmware/INSTALL_VMWARE.md). Verify the successful installation by running the following:
2. Install [VMware Workstation Pro](https://www.vmware.com/products/workstation-pro/workstation-pro-evaluation.html) (for systems with Apple Chips, you should install [VMware Fusion](https://support.broadcom.com/group/ecx/productdownloads?subfamily=VMware+Fusion)) and configure the `vmrun` command. The installation process can refer to [How to install VMware Workstation Pro](desktop_env/providers/vmware/INSTALL_VMWARE.md). Verify the successful installation by running the following:
```bash
vmrun -T ws list
```
@ -73,7 +74,7 @@ If the installation along with the environment variable set is successful, you w
All set! Our setup script will automatically download the necessary virtual machines and configure the environment for you.
### Docker (Server (with KVM Support for the better))
### Docker (Server with KVM Support for Better Performance)
If you are running on a non-bare metal server, or prefer not to use VMware and VirtualBox platforms, we recommend using our Docker support.
#### Prerequisite: Check if your machine supports KVM
@ -93,6 +94,11 @@ Add the following arguments when initializing `DesktopEnv`:
- `os_type`: `Ubuntu` or `Windows`, depending on the OS of the VM
> **Note**: If the experiment is interrupted abnormally (e.g., by interrupting signals), there may be residual docker containers which could affect system performance over time. Please run `docker stop $(docker ps -q) && docker rm $(docker ps -a -q)` to clean up.
### AWS
Using cloud services for parallel evaluation can significantly accelerate evaluation efficiency (can reduce evaluation time to within 1 hour through parallelization!) and can even be used as infrastructure for training.
We provide comprehensive AWS support with a Host-Client architecture that enables large-scale parallel evaluation of OSWorld tasks.
For detailed setup instructions, see [Public Evaluation Guideline](https://github.com/xlang-ai/OSWorld/blob/main/PUBLIC_EVALUATION_GUIDELINE.md) and [AWS Configuration Guide](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/providers/aws/AWS_GUIDELINE.md).
### Others
We are working on supporting more 👷. Please hold tight!
@ -100,91 +106,99 @@ We are working on supporting more 👷. Please hold tight!
## 🚀 Quick Start
Run the following minimal example to interact with the environment:
```python
from desktop_env.desktop_env import DesktopEnv

example = {
    "id": "94d95f96-9699-4208-98ba-3c3119edf9c2",
    "instruction": "I want to install Spotify on my current system. Could you please help me?",
    "config": [
        {
            "type": "execute",
            "parameters": {
                "command": [
                    "python",
                    "-c",
                    "import pyautogui; import time; pyautogui.click(960, 540); time.sleep(0.5);"
                ]
            }
        }
    ],
    "evaluator": {
        "func": "check_include_exclude",
        "result": {
            "type": "vm_command_line",
            "command": "which spotify"
        },
        "expected": {
            "type": "rule",
            "rules": {
                "include": ["spotify"],
                "exclude": ["not found"]
            }
        }
    }
}

env = DesktopEnv(action_space="pyautogui")
obs = env.reset(task_config=example)
obs, reward, done, info = env.step("pyautogui.rightClick()")
```
```bash
# Basic usage with default settings
python quickstart.py

# Customize provider and VM path
python quickstart.py --provider_name vmware --path_to_vm "path/to/your/vm.vmx"
```
You will see all the logs of the system running normally, including the successful creation of the environment, completion of setup, and successful execution of actions. In the end, you will observe a successful right-click on the screen, which means you are ready to go.
## 🧪 Experiments
### Agent Baselines
> **⚠️ Important Configuration Requirements:**
>
> * **Google Account Tasks**: Some tasks require Google account access and OAuth2.0 configuration. Please refer to [Google Account Guideline](ACCOUNT_GUIDELINE.md) for detailed setup instructions.
> * **Proxy Configuration**: Some tasks may require proxy settings to function properly (this depends on the strength of website defenses against your network location). Please refer to your system's proxy configuration documentation.
> * **Impact of Missing Configuration**: If these configurations are not properly set up, the corresponding tasks will fail to execute correctly, leading to lower evaluation scores.
If you wish to run the baseline agent used in our paper, you can execute the following command as an example under the GPT-4o pure-screenshot setting:
Set the **OPENAI_API_KEY** environment variable to your API key:
```bash
export OPENAI_API_KEY='changeme'
```
Optionally, set **OPENAI_BASE_URL** to use a custom OpenAI-compatible API endpoint
```bash
export OPENAI_BASE_URL='http://your-custom-endpoint.com/v1' # Optional: defaults to https://api.openai.com
```
Single-threaded execution (deprecated, using `vmware` provider as example)
```bash
python run.py \
--provider_name vmware \
--path_to_vm Ubuntu/Ubuntu.vmx \
--headless \
--observation_type screenshot \
--model gpt-4o \
--sleep_after_execution 3 \
--max_steps 15 \
--result_dir ./results \
--client_password password
```
Parallel execution (example showing switching provider to `docker`)
```bash
python run_multienv.py \
--provider_name docker \
--headless \
--observation_type screenshot \
--model gpt-4o \
--sleep_after_execution 3 \
--max_steps 15 \
--num_envs 10 \
--client_password password
```
The results, which include screenshots, actions, and video recordings of the agent's task completion, will be saved in the `./results` (or other `result_dir` you specified) directory in this case.
You can then run the following command to obtain the result:
```bash
python show_result.py
```
## Evaluation
### Local Evaluation
Please start by reading through the [agent interface](https://github.com/xlang-ai/OSWorld/blob/main/mm_agents/README.md) and the [environment interface](https://github.com/xlang-ai/OSWorld/blob/main/desktop_env/README.md).
Correctly implement the agent interface and import your customized version in the `run.py` or `run_multienv.py` file.
Afterward, you can execute a command similar to the one in the previous section to run the benchmark on your agent.
### Public Evaluation
If you want your results to be verified and displayed on the verified leaderboard, you need to schedule a meeting with us (current maintainer: tianbaoxiexxx@gmail.com, yuanmengqi732@gmail.com) to run your agent code on our side and have us report the results.
You need to upload and allow us to disclose your agent implementation under the OSWorld framework (you may choose not to expose your model API to the public), along with a report that allows the public to understand what's happening behind the scenes.
Alternatively, if you are from a trusted institution, you can share your monitoring data and trajectories with us.
Please carefully follow the [Public Evaluation Guideline](https://github.com/xlang-ai/OSWorld/blob/main/PUBLIC_EVALUATION_GUIDELINE.md) to get results.
## ❓ FAQ
### What is the username and password for the virtual machines?
The username and password for the virtual machines are as follows (for provider `vmware`, `virtualbox` and `docker`): we set the account credentials for Ubuntu as `user` / `password`.
For cloud service providers like `aws`, to prevent attacks due to weak passwords, we default to `osworld-public-evaluation`.
If you make further modifications, remember to set the client_password variable and pass it to DesktopEnv and Agent (if supported) when running experiments.
Some features like setting up proxy require the environment to have the client VM password to obtain sudo privileges, and for some OSWorld tasks, the agent needs the password to obtain sudo privileges to complete them.
### How to setup the account and credentials for Google and Google Drive?
See [Account Guideline](ACCOUNT_GUIDELINE.md).
### How can I configure a proxy for the VM (if I'm behind the GFW, or I don't want some of my tasks to be identified as a bot and scored lower)?
See [Proxy Guideline](PROXY_GUIDELINE.md).
### What are the running times and costs under different settings?
| Setting | Expected Time* | Budget Cost (Full Test Set/Small Test Set) |
| ------------------------------ | -------------- | ------------------------------------------ |
| GPT-4V (screenshot) | 10h | $100 ($10) |
| Gemini-ProV (screenshot) | 15h | $0 ($0) |
| Claude-3 Opus (screenshot) | 15h | $150 ($15) |
| GPT-4V (a11y tree, SoM, etc.) | 30h | $500 ($50) |
\*No environment parallelism. Calculated in April 2024.
If you want to set it up yourself, please refer to [Proxy Guideline](PROXY_GUIDELINE.md).
We also provide a pre-configured solution based on dataimpulse, please refer to [proxy-setup section in PUBLIC_EVALUATION_GUIDELINE](https://github.com/xlang-ai/OSWorld/blob/main/PUBLIC_EVALUATION_GUIDELINE.md#22-proxy-setup).
### Open Source Contributors
@ -207,3 +221,14 @@ If you find this environment useful, please consider citing our work:
primaryClass={cs.AI}
}
```
## Acknowledgement for OSWorld-Verified
Special thanks to the following institutions that provided feedback and participated in the fixes (as well as institutions that provided feedback during the process): [MoonShot AI, a.k.a. Kimi](https://www.moonshot.ai/), [Human Data](https://www.hud.so/), [OpenAI](https://openai.com/), [ByteDance Seed TARS](https://seed-tars.com/), [Anthropic](https://www.anthropic.com/), [Simular](https://www.simular.ai/), [HKU Data Intelligence Lab](https://sites.google.com/view/chaoh).
Special thanks to the following students who participated in the specific fixes: [Mengqi Yuan](https://yuanmengqi.github.io/), [Danyang Zhang](https://zdy023.github.io/), [Xinzhuang Xiong](https://thisisxxz.com/), [Zhennan Shen](https://scholar.google.com/citations?user=JPwg5MwAAAAJ&hl=en), [Zilong Zhou](https://github.com/adlsdztony), Yanxu Chen, [Jiaqi Deng](https://millank0817.github.io/), [Tianbao Xie](https://tianbaoxie.com/), Junda Chen, [Jixuan Chen](https://chenjix.github.io/), [Haoyuan Wu](https://www.linkedin.com/in/haoyuan-wu-240878291/).
Special thanks to the following students who participated in running the re-evaluation: [Mengqi Yuan](https://yuanmengqi.github.io/), [Zilong Zhou](https://github.com/adlsdztony), [Xinyuan Wang](https://xinyuanwangcs.github.io/), [Bowen Wang](https://bowenbryanwang.github.io/).
## You might also be interested
- **OSWorld-MCP**: Benchmarking MCP Tool Invocation in Computer-Use Agents. [Website](https://osworld-mcp.github.io/)

Binary files added (previews not shown): one image (168 KiB), assets/pubeval_monitor1.jpg (1015 KiB), assets/pubeval_monitor2.jpg (737 KiB), assets/pubeval_subnet.png (456 KiB).


@ -3,6 +3,7 @@ import logging
import random
from typing import Any, Dict, Optional
import time
import traceback
import requests
from desktop_env.actions import KEYBOARD_KEYS
@ -20,17 +21,41 @@ class PythonController:
self.retry_times = 3
self.retry_interval = 5
@staticmethod
def _is_valid_image_response(content_type: str, data: Optional[bytes]) -> bool:
"""Quick validation for PNG/JPEG payload using magic bytes; Content-Type is advisory.
Returns True only when bytes look like a real PNG or JPEG.
"""
if not isinstance(data, (bytes, bytearray)) or not data:
return False
# PNG magic
if len(data) >= 8 and data[:8] == b"\x89PNG\r\n\x1a\n":
return True
# JPEG magic
if len(data) >= 3 and data[:3] == b"\xff\xd8\xff":
return True
# If server explicitly marks as image, accept as a weak fallback (some environments strip magic)
if content_type and ("image/png" in content_type or "image/jpeg" in content_type or "image/jpg" in content_type):
return True
return False
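# Illustrative behavior (editor's sketch, not part of this change):
#   PythonController._is_valid_image_response("", b"\x89PNG\r\n\x1a\n" + b"\x00" * 8)  # True: PNG magic
#   PythonController._is_valid_image_response("", b"\xff\xd8\xff\xe0rest")             # True: JPEG magic
#   PythonController._is_valid_image_response("image/png", b"")                        # False: empty payload
#   PythonController._is_valid_image_response("text/html", b"<html>oops</html>")       # False: not an image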
def get_screenshot(self) -> Optional[bytes]:
"""
Gets a screenshot (including the cursor) from the server. Returns None when no screenshot could be retrieved or an unexpected error occurred.
"""
for _ in range(self.retry_times):
for attempt_idx in range(self.retry_times):
try:
response = requests.get(self.http_server + "/screenshot")
response = requests.get(self.http_server + "/screenshot", timeout=10)
if response.status_code == 200:
logger.info("Got screenshot successfully")
return response.content
content_type = response.headers.get("Content-Type", "")
content = response.content
if self._is_valid_image_response(content_type, content):
logger.info("Got screenshot successfully")
return content
else:
logger.error("Invalid screenshot payload (attempt %d/%d).", attempt_idx + 1, self.retry_times)
logger.info("Retrying to get screenshot.")
else:
logger.error("Failed to get screenshot. Status code: %d", response.status_code)
logger.info("Retrying to get screenshot.")
@ -136,13 +161,94 @@ class PythonController:
logger.error("Failed to execute command.")
return None
def run_python_script(self, script: str) -> Optional[Dict[str, Any]]:
"""
Executes a python script on the server.
"""
payload = json.dumps({"code": script})
for _ in range(self.retry_times):
try:
response = requests.post(self.http_server + "/run_python", headers={'Content-Type': 'application/json'},
data=payload, timeout=90)
if response.status_code == 200:
return response.json()
else:
return {"status": "error", "message": "Failed to execute command.", "output": None, "error": response.json()["error"]}
except requests.exceptions.ReadTimeout:
break
except Exception:
logger.error("An error occurred while trying to execute the command: %s", traceback.format_exc())
logger.info("Retrying to execute command.")
time.sleep(self.retry_interval)
logger.error("Failed to execute command.")
return {"status": "error", "message": "Failed to execute command.", "output": "", "error": "Retry limit reached."}
def run_bash_script(self, script: str, timeout: int = 30, working_dir: Optional[str] = None) -> Optional[Dict[str, Any]]:
"""
Executes a bash script on the server.
:param script: The bash script content (can be multi-line)
:param timeout: Execution timeout in seconds (default: 30)
:param working_dir: Working directory for script execution (optional)
:return: Dictionary with status, output, error, and returncode, or None if failed
"""
payload = json.dumps({
"script": script,
"timeout": timeout,
"working_dir": working_dir
})
for _ in range(self.retry_times):
try:
response = requests.post(
self.http_server + "/run_bash_script",
headers={'Content-Type': 'application/json'},
data=payload,
timeout=timeout + 100 # Add buffer to HTTP timeout
)
if response.status_code == 200:
result = response.json()
logger.info("Bash script executed successfully with return code: %d", result.get("returncode", -1))
return result
else:
logger.error("Failed to execute bash script. Status code: %d, response: %s",
response.status_code, response.text)
logger.info("Retrying to execute bash script.")
except requests.exceptions.ReadTimeout:
logger.error("Bash script execution timed out")
return {
"status": "error",
"output": "",
"error": f"Script execution timed out after {timeout} seconds",
"returncode": -1
}
except Exception as e:
logger.error("An error occurred while trying to execute the bash script: %s", e)
logger.info("Retrying to execute bash script.")
time.sleep(self.retry_interval)
logger.error("Failed to execute bash script after %d retries.", self.retry_times)
return {
"status": "error",
"output": "",
"error": f"Failed to execute bash script after {self.retry_times} retries",
"returncode": -1
}
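# Usage sketch (editor's note; the IP and port are hypothetical):
#   controller = PythonController("10.0.0.5", 5000)
#   result = controller.run_bash_script("echo hello && uname -r", timeout=10)
#   if result and result.get("returncode") == 0:
#       print(result["output"])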
def execute_action(self, action):
"""
Executes an action on the server computer.
"""
# Handle string actions
if action in ['WAIT', 'FAIL', 'DONE']:
return
# Handle dictionary actions
if isinstance(action, dict) and action.get('action_type') in ['WAIT', 'FAIL', 'DONE']:
return
action_type = action["action_type"]
parameters = action["parameters"] if "parameters" in action else {param: action[param] for param in action if param != 'action_type'}


@ -27,7 +27,7 @@ import dotenv
# Load environment variables from .env file
dotenv.load_dotenv()
CLIENT_PASSWORD = os.getenv("CLIENT_PASSWORD", "osworld-public-evaluation") # Default password for sudo operations
PROXY_CONFIG_FILE = os.getenv("PROXY_CONFIG_FILE", "evaluation_examples/settings/proxy/dataimpulse.json") # Default proxy config file
logger = logging.getLogger("desktopenv.setup")
@ -39,7 +39,7 @@ init_proxy_pool(PROXY_CONFIG_FILE) # initialize the global proxy pool
MAX_RETRIES = 20
class SetupController:
def __init__(self, vm_ip: str, server_port: int = 5000, chromium_port: int = 9222, vlc_port: int = 8080, cache_dir: str = "cache"):
def __init__(self, vm_ip: str, server_port: int = 5000, chromium_port: int = 9222, vlc_port: int = 8080, cache_dir: str = "cache", client_password: str = "", screen_width: int = 1920, screen_height: int = 1080):
self.vm_ip: str = vm_ip
self.server_port: int = server_port
self.chromium_port: int = chromium_port
@ -48,6 +48,9 @@ class SetupController:
self.http_server_setup_root: str = f"http://{vm_ip}:{server_port}/setup"
self.cache_dir: str = cache_dir
self.use_proxy: bool = False
self.client_password: str = client_password
self.screen_width: int = screen_width
self.screen_height: int = screen_height
def reset_cache_dir(self, cache_dir: str):
self.cache_dir = cache_dir
@ -196,26 +199,62 @@ class SetupController:
path: str = f["path"]
if not os.path.exists(local_path):
logger.error(f"Setup Upload - Invalid local path ({local_path}).")
return
raise Exception(f"Setup Upload - Invalid local path ({local_path}).")
form = MultipartEncoder({
"file_path": path,
"file_data": (os.path.basename(path), open(local_path, "rb"))
})
headers = {"Content-Type": form.content_type}
logger.debug(form.content_type)
# send request to server to upload file
file_size = None
try:
logger.debug("REQUEST ADDRESS: %s", self.http_server + "/setup" + "/upload")
response = requests.post(self.http_server + "/setup" + "/upload", headers=headers, data=form)
if response.status_code == 200:
logger.info("Command executed successfully: %s", response.text)
else:
logger.error("Failed to upload file. Status code: %s", response.text)
except requests.exceptions.RequestException as e:
logger.error("An error occurred while trying to send the request: %s", e)
file_size = os.path.getsize(local_path)
except Exception:
pass
max_retries = 3
last_error: Optional[Exception] = None
for attempt in range(max_retries):
try:
logger.info(
f"Uploading {os.path.basename(local_path)}{f' ({file_size} bytes)' if file_size is not None else ''} "
f"to VM at {path} (attempt {attempt + 1}/{max_retries})"
)
logger.debug("REQUEST ADDRESS: %s", self.http_server + "/setup" + "/upload")
# Open the file inside each attempt to ensure fresh stream position
with open(local_path, "rb") as fp:
form = MultipartEncoder({
"file_path": path,
"file_data": (os.path.basename(path), fp)
})
headers = {"Content-Type": form.content_type}
logger.debug(form.content_type)
# Explicit connect/read timeout to avoid hanging forever
response = requests.post(
self.http_server + "/setup" + "/upload",
headers=headers,
data=form,
timeout=(10, 600)
)
if response.status_code == 200:
logger.info(f"File uploaded successfully: {path}")
logger.debug("Upload response: %s", response.text)
last_error = None
break
else:
msg = f"Failed to upload file {path}. Status code: {response.status_code}, Response: {response.text}"
logger.error(msg)
last_error = requests.RequestException(msg)
except requests.exceptions.RequestException as e:
last_error = e
logger.error(f"Upload attempt {attempt + 1} failed for {path}: {e}")
# Exponential backoff between retries
if attempt < max_retries - 1:
time.sleep(2 ** attempt)
if last_error is not None:
raise last_error
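# Backoff illustration (editor's note): with max_retries = 3, a failed first
# attempt sleeps 2**0 = 1 s, a failed second attempt sleeps 2**1 = 2 s, and a
# third failure re-raises `last_error` to the caller.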
def _change_wallpaper_setup(self, path: str):
if not path:
@ -298,6 +337,31 @@ class SetupController:
terminates: bool = False
nb_failings = 0
def replace_screen_env_in_command(command):
password = self.client_password
width = self.screen_width
height = self.screen_height
width_half = str(width // 2)
height_half = str(height // 2)
new_command_list = []
new_command = ""
if isinstance(command, str):
new_command = command.replace("{CLIENT_PASSWORD}", password)
new_command = new_command.replace("{SCREEN_WIDTH_HALF}", width_half)
new_command = new_command.replace("{SCREEN_HEIGHT_HALF}", height_half)
new_command = new_command.replace("{SCREEN_WIDTH}", str(width))
new_command = new_command.replace("{SCREEN_HEIGHT}", str(height))
return new_command
else:
for item in command:
item = item.replace("{CLIENT_PASSWORD}", password)
item = item.replace("{SCREEN_WIDTH_HALF}", width_half)
item = item.replace("{SCREEN_HEIGHT_HALF}", height_half)
item = item.replace("{SCREEN_WIDTH}", str(width))
item = item.replace("{SCREEN_HEIGHT}", str(height))
new_command_list.append(item)
return new_command_list
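# Worked example (editor's note): with client_password="password" on a 1920x1080
# screen,
#   replace_screen_env_in_command("echo {CLIENT_PASSWORD} | sudo -S xrandr -s {SCREEN_WIDTH}x{SCREEN_HEIGHT}")
# returns "echo password | sudo -S xrandr -s 1920x1080"; list commands are
# substituted element by element, with {SCREEN_WIDTH_HALF}/{SCREEN_HEIGHT_HALF}
# expanding to 960/540.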
command = replace_screen_env_in_command(command)
payload = json.dumps({"command": command, "shell": shell})
headers = {"Content-Type": "application/json"}
@ -445,7 +509,7 @@ class SetupController:
except requests.exceptions.RequestException as e:
logger.error("An error occurred while trying to send the request: %s", e)
def _proxy_setup(self, client_password: str = CLIENT_PASSWORD):
def _proxy_setup(self, client_password: str = ""):
"""Setup system-wide proxy configuration using proxy pool
Args:
@ -749,106 +813,108 @@ class SetupController:
def _update_browse_history_setup(self, **config):
cache_path = os.path.join(self.cache_dir, "history_new.sqlite")
db_url = "https://drive.usercontent.google.com/u/0/uc?id=1Lv74QkJYDWVX0RIgg0Co-DUcoYpVL0oX&export=download" # google drive
db_url = "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/chrome/44ee5668-ecd5-4366-a6ce-c1c9b8d4e938/history_empty.sqlite?download=true"
if not os.path.exists(cache_path):
max_retries = 3
downloaded = False
e = None
for i in range(max_retries):
try:
response = requests.get(db_url, stream=True)
response.raise_for_status()
max_retries = 3
downloaded = False
e = None
for i in range(max_retries):
try:
response = requests.get(db_url, stream=True)
response.raise_for_status()
with open(cache_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
logger.info("File downloaded successfully")
downloaded = True
break
with open(cache_path, 'wb') as f:
for chunk in response.iter_content(chunk_size=8192):
if chunk:
f.write(chunk)
logger.info("File downloaded successfully")
downloaded = True
break
except requests.RequestException as e:
logger.error(
f"Failed to download {db_url} caused by {e}. Retrying... ({max_retries - i - 1} attempts left)")
if not downloaded:
raise requests.RequestException(f"Failed to download {db_url}. No retries left. Error: {e}")
except requests.RequestException as e:
logger.error(
f"Failed to download {db_url} caused by {e}. Retrying... ({max_retries - i - 1} attempts left)")
if not downloaded:
raise requests.RequestException(f"Failed to download {db_url}. No retries left. Error: {e}")
else:
logger.info("File already exists in cache directory")
# copy a new history file in the tmp folder
db_path = cache_path
with tempfile.TemporaryDirectory() as tmp_dir:
db_path = os.path.join(tmp_dir, "history_empty.sqlite")
shutil.copy(cache_path, db_path)
history = config['history']
history = config['history']
for history_item in history:
url = history_item['url']
title = history_item['title']
visit_time = datetime.now() - timedelta(seconds=history_item['visit_time_from_now_in_seconds'])
for history_item in history:
url = history_item['url']
title = history_item['title']
visit_time = datetime.now() - timedelta(seconds=history_item['visit_time_from_now_in_seconds'])
# Chrome use ms from 1601-01-01 as timestamp
epoch_start = datetime(1601, 1, 1)
chrome_timestamp = int((visit_time - epoch_start).total_seconds() * 1000000)
# Chrome use ms from 1601-01-01 as timestamp
epoch_start = datetime(1601, 1, 1)
chrome_timestamp = int((visit_time - epoch_start).total_seconds() * 1000000)
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
cursor.execute('''
INSERT INTO urls (url, title, visit_count, typed_count, last_visit_time, hidden)
VALUES (?, ?, ?, ?, ?, ?)
''', (url, title, 1, 0, chrome_timestamp, 0))
cursor.execute('''
INSERT INTO urls (url, title, visit_count, typed_count, last_visit_time, hidden)
VALUES (?, ?, ?, ?, ?, ?)
''', (url, title, 1, 0, chrome_timestamp, 0))
url_id = cursor.lastrowid
url_id = cursor.lastrowid
cursor.execute('''
INSERT INTO visits (url, visit_time, from_visit, transition, segment_id, visit_duration)
VALUES (?, ?, ?, ?, ?, ?)
''', (url_id, chrome_timestamp, 0, 805306368, 0, 0))
cursor.execute('''
INSERT INTO visits (url, visit_time, from_visit, transition, segment_id, visit_duration)
VALUES (?, ?, ?, ?, ?, ?)
''', (url_id, chrome_timestamp, 0, 805306368, 0, 0))
conn.commit()
conn.close()
conn.commit()
conn.close()
logger.info('Fake browsing history added successfully.')
logger.info('Fake browsing history added successfully.')
controller = PythonController(self.vm_ip, self.server_port)
controller = PythonController(self.vm_ip, self.server_port)
# get the path of the history file according to the platform
os_type = controller.get_vm_platform()
# get the path of the history file according to the platform
os_type = controller.get_vm_platform()
if os_type == 'Windows':
chrome_history_path = controller.execute_python_command(
"""import os; print(os.path.join(os.getenv('USERPROFILE'), "AppData", "Local", "Google", "Chrome", "User Data", "Default", "History"))""")[
'output'].strip()
elif os_type == 'Darwin':
chrome_history_path = controller.execute_python_command(
"""import os; print(os.path.join(os.getenv('HOME'), "Library", "Application Support", "Google", "Chrome", "Default", "History"))""")[
'output'].strip()
elif os_type == 'Linux':
if "arm" in platform.machine():
if os_type == 'Windows':
chrome_history_path = controller.execute_python_command(
"import os; print(os.path.join(os.getenv('HOME'), 'snap', 'chromium', 'common', 'chromium', 'Default', 'History'))")[
"""import os; print(os.path.join(os.getenv('USERPROFILE'), "AppData", "Local", "Google", "Chrome", "User Data", "Default", "History"))""")[
'output'].strip()
else:
elif os_type == 'Darwin':
chrome_history_path = controller.execute_python_command(
"import os; print(os.path.join(os.getenv('HOME'), '.config', 'google-chrome', 'Default', 'History'))")[
"""import os; print(os.path.join(os.getenv('HOME'), "Library", "Application Support", "Google", "Chrome", "Default", "History"))""")[
'output'].strip()
else:
raise Exception('Unsupported operating system')
form = MultipartEncoder({
"file_path": chrome_history_path,
"file_data": (os.path.basename(chrome_history_path), open(db_path, "rb"))
})
headers = {"Content-Type": form.content_type}
logger.debug(form.content_type)
# send request to server to upload file
try:
logger.debug("REQUEST ADDRESS: %s", self.http_server + "/setup" + "/upload")
response = requests.post(self.http_server + "/setup" + "/upload", headers=headers, data=form)
if response.status_code == 200:
logger.info("Command executed successfully: %s", response.text)
elif os_type == 'Linux':
if "arm" in platform.machine():
chrome_history_path = controller.execute_python_command(
"import os; print(os.path.join(os.getenv('HOME'), 'snap', 'chromium', 'common', 'chromium', 'Default', 'History'))")[
'output'].strip()
else:
chrome_history_path = controller.execute_python_command(
"import os; print(os.path.join(os.getenv('HOME'), '.config', 'google-chrome', 'Default', 'History'))")[
'output'].strip()
else:
logger.error("Failed to upload file. Status code: %s", response.text)
except requests.exceptions.RequestException as e:
logger.error("An error occurred while trying to send the request: %s", e)
raise Exception('Unsupported operating system')
self._execute_setup(["sudo chown -R user:user /home/user/.config/google-chrome/Default/History"], shell=True)
form = MultipartEncoder({
"file_path": chrome_history_path,
"file_data": (os.path.basename(chrome_history_path), open(db_path, "rb"))
})
headers = {"Content-Type": form.content_type}
logger.debug(form.content_type)
# send request to server to upload file
try:
logger.debug("REQUEST ADDRESS: %s", self.http_server + "/setup" + "/upload")
response = requests.post(self.http_server + "/setup" + "/upload", headers=headers, data=form)
if response.status_code == 200:
logger.info("Command executed successfully: %s", response.text)
else:
logger.error("Failed to upload file. Status code: %s", response.text)
except requests.exceptions.RequestException as e:
logger.error("An error occurred while trying to send the request: %s", e)
self._execute_setup(["sudo chown -R user:user /home/user/.config/google-chrome/Default/History"], shell=True)


@ -3,6 +3,7 @@ from __future__ import annotations
import logging
import os
import time
import re
from typing import Callable, Any, Optional, Tuple
from typing import List, Dict, Union
@ -19,6 +20,77 @@ Metric = Callable[[Any, Any], float]
Getter = Callable[[gym.Env, Dict[str, Any]], Any]
MAX_RETRIES = 5 # Maximum retries for environment setup
def _fix_pyautogui_less_than_bug(command: str) -> str:
"""
Fix PyAutoGUI '<' character bug by converting it to hotkey("shift", ',') calls.
This fixes the known PyAutoGUI issue where typing '<' produces '>' instead.
References:
- https://github.com/asweigart/pyautogui/issues/198
- https://github.com/xlang-ai/OSWorld/issues/257
Args:
command (str): The original pyautogui command
Returns:
str: The fixed command with '<' characters handled properly
"""
# Pattern to match press('<') or press('\u003c') calls
press_pattern = r'pyautogui\.press\(["\'](?:<|\\u003c)["\']\)'
# Handle press('<') calls
def replace_press_less_than(match):
return 'pyautogui.hotkey("shift", ",")'
# First handle press('<') calls
command = re.sub(press_pattern, replace_press_less_than, command)
# Pattern to match typewrite calls with quoted strings
typewrite_pattern = r'pyautogui\.typewrite\((["\'])(.*?)\1\)'
# Then handle typewrite calls
def process_typewrite_match(match):
quote_char = match.group(1)
content = match.group(2)
# Preprocess: Try to decode Unicode escapes like \u003c to actual '<'
# This handles cases where '<' is represented as escaped Unicode
try:
# Attempt to decode unicode escapes
decoded_content = content.encode('utf-8').decode('unicode_escape')
content = decoded_content
except UnicodeDecodeError:
# If decoding fails, proceed with original content to avoid breaking existing logic
pass  # Graceful degradation: fall back to the original content if decoding fails
# Check if content contains '<'
if '<' not in content:
return match.group(0)
# Split by '<' and rebuild
parts = content.split('<')
result_parts = []
for i, part in enumerate(parts):
if i == 0:
# First part
if part:
result_parts.append(f"pyautogui.typewrite({quote_char}{part}{quote_char})")
else:
# Add hotkey for '<' and then typewrite for the rest
result_parts.append('pyautogui.hotkey("shift", ",")')
if part:
result_parts.append(f"pyautogui.typewrite({quote_char}{part}{quote_char})")
return '; '.join(result_parts)
command = re.sub(typewrite_pattern, process_typewrite_match, command)
return command
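As a quick sanity check of the helper above (a sketch; the inputs are illustrative, not taken from this diff), the rewrite replaces each '<' with a shift-comma hotkey, which produces '<' on a US keyboard layout:
# Hypothetical inputs and the rewrites _fix_pyautogui_less_than_bug produces:
print(_fix_pyautogui_less_than_bug("pyautogui.press('<')"))
# -> pyautogui.hotkey("shift", ",")
print(_fix_pyautogui_less_than_bug("pyautogui.typewrite('a<b')"))
# -> pyautogui.typewrite('a'); pyautogui.hotkey("shift", ","); pyautogui.typewrite('b')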
class DesktopEnv(gym.Env):
"""
@@ -30,14 +102,15 @@ class DesktopEnv(gym.Env):
region: str = None,
path_to_vm: str = None,
snapshot_name: str = "init_state",
action_space: str = "computer_13",
action_space: str = "pyautogui",
cache_dir: str = "cache",
screen_size: Tuple[int] = (1920, 1080),
screen_size: Tuple[int] = (int(os.environ.get("SCREEN_WIDTH", 1920)), int(os.environ.get("SCREEN_HEIGHT", 1080))),
headless: bool = False,
require_a11y_tree: bool = True,
require_terminal: bool = False,
os_type: str = "Ubuntu",
enable_proxy: bool = False,
client_password: str = "",
):
"""
Args:
@@ -59,6 +132,16 @@
self.region = region
self.provider_name = provider_name
self.enable_proxy = enable_proxy # Store proxy enablement setting
if client_password == "":
if self.provider_name == "aws":
self.client_password = "osworld-public-evaluation"
else:
self.client_password = "password"
else:
self.client_password = client_password
self.screen_width = screen_size[0]
self.screen_height = screen_size[1]
# Default
self.server_port = 5000
@@ -75,7 +158,7 @@
# Track whether environment has been used (step/setup) to optimize snapshot revert
# docker, aws, gcp, azure are always unused as the emulator starts from a clean state
# vmware, virtualbox are always used as the emulator starts from a dirty state
if self.provider_name in {"docker", "aws", "gcp", "azure"}:
if self.provider_name in {"docker", "aws", "gcp", "azure", "aliyun", "volcengine"}:
self.is_environment_used = False
elif self.provider_name in {"vmware", "virtualbox"}:
self.is_environment_used = True
@@ -87,56 +170,54 @@
self.path_to_vm = os.path.abspath(os.path.expandvars(os.path.expanduser(path_to_vm))) \
if provider_name in {"vmware", "virtualbox"} else path_to_vm
else:
self.path_to_vm = self.manager.get_vm_path(os_type=self.os_type, region=region, screen_size=(self.screen_width, self.screen_height))
self.snapshot_name = snapshot_name
self.cache_dir_base: str = cache_dir
# todo: add the logic to get the screen size from the VM
self.headless = headless
self.require_a11y_tree = require_a11y_tree
self.require_terminal = require_terminal
self.path_to_vm = self.manager.get_vm_path(os_type=self.os_type, region=region)
try:
self.snapshot_name = snapshot_name
self.cache_dir_base: str = cache_dir
# todo: add the logic to get the screen size from the VM
self.headless = headless
self.require_a11y_tree = require_a11y_tree
self.require_terminal = require_terminal
# Initialize emulator and controller
logger.info("Initializing...")
self._start_emulator()
# Initialize emulator and controller
if provider_name != "docker": # Check if this is applicable to other VM providers
logger.info("Initializing...")
self._start_emulator()
# mode: human or machine
self.instruction = None
assert action_space in ["computer_13", "pyautogui", "claude_computer_use", "autoglm_computer_use"]
self.action_space = action_space # todo: refactor it to the ActType
# mode: human or machine
self.instruction = None
assert action_space in ["computer_13", "pyautogui"]
self.action_space = action_space # todo: refactor it to the ActType
# episodic state, such as counters, will be updated or reset
# when calling self.reset()
self._traj_no: int = -1
self._step_no: int = 0
self.action_history: List[Dict[str, any]] = []
# episodic state, such as counters, will be updated or reset
# when calling self.reset()
self._traj_no: int = -1
self._step_no: int = 0
self.action_history: List[Dict[str, any]] = []
except Exception as e:
logger.error(f"Failed to initialize DesktopEnv: {e}")
# If initialization fails, we should clean up the VM
try:
self.close()
self.manager.delete_vm(self.path_to_vm, self.region)
logger.info(f"Cleaned up VM {self.path_to_vm}.")
except Exception as cleanup_error:
logger.error(f"Failed to clean up VM {self.path_to_vm}: {cleanup_error}")
raise
def _start_emulator(self):
# Power on the virtual machine
self.provider.start_emulator(self.path_to_vm, self.headless, self.os_type)
try:
# Power on the virtual machine
self.provider.start_emulator(self.path_to_vm, self.headless, self.os_type)
# Get the ip from the virtual machine, and setup the controller
vm_ip_ports = self.provider.get_ip_address(self.path_to_vm).split(':')
self.vm_ip = vm_ip_ports[0]
if len(vm_ip_ports) > 1:
self.server_port = int(vm_ip_ports[1])
self.chromium_port = int(vm_ip_ports[2])
self.vnc_port = int(vm_ip_ports[3])
self.vlc_port = int(vm_ip_ports[4])
self.controller = PythonController(vm_ip=self.vm_ip, server_port=self.server_port)
self.setup_controller = SetupController(vm_ip=self.vm_ip, server_port=self.server_port, chromium_port=self.chromium_port, vlc_port=self.vlc_port, cache_dir=self.cache_dir_base)
# Get the ip from the virtual machine, and setup the controller
vm_ip_ports = self.provider.get_ip_address(self.path_to_vm).split(':')
self.vm_ip = vm_ip_ports[0]
# Get the ports from the virtual machine (for Docker provider only)
if len(vm_ip_ports) > 1:
self.server_port = int(vm_ip_ports[1])
self.chromium_port = int(vm_ip_ports[2])
self.vnc_port = int(vm_ip_ports[3])
self.vlc_port = int(vm_ip_ports[4])
self.controller = PythonController(vm_ip=self.vm_ip, server_port=self.server_port)
self.setup_controller = SetupController(vm_ip=self.vm_ip, server_port=self.server_port, chromium_port=self.chromium_port, vlc_port=self.vlc_port, cache_dir=self.cache_dir_base, client_password=self.client_password, screen_width=self.screen_width, screen_height=self.screen_height)
except Exception as e:
try:
self.provider.stop_emulator(self.path_to_vm)
except Exception as stop_err:
logger.warning(f"Cleanup after interrupt failed: {stop_err}")
raise
def _revert_to_snapshot(self):
# Revert to a certain snapshot of the virtual machine, and refresh the path and IP of the VM
@@ -169,7 +250,10 @@
self.action_history.clear()
for attempt in range(MAX_RETRIES):
# Check and handle proxy requirement changes BEFORE starting emulator
# Only revert to snapshot if environment has been used (step/setup)
# This optimization is especially important for cloud providers like AWS
# where unnecessary snapshot operations are costly and time-consuming
if task_config is not None:
# Only consider task proxy requirement if proxy is enabled at system level
task_use_proxy = task_config.get("proxy", False) and self.enable_proxy
@@ -179,10 +263,7 @@
if task_use_proxy != self.current_use_proxy:
# kept because get_info_from_website depends on this
self.current_use_proxy = task_use_proxy
# Only revert to snapshot if environment has been used (step/setup)
# This optimization is especially important for cloud providers like AWS
# where unnecessary snapshot operations are costly and time-consuming
if self.is_environment_used:
logger.info("Environment has been used, reverting to snapshot {}...".format(self.snapshot_name))
self._revert_to_snapshot()
@@ -197,7 +278,7 @@
if task_config is not None:
if task_config.get("proxy", False) and self.enable_proxy:
# If using proxy and proxy is enabled, set up the proxy configuration
self.setup_controller._proxy_setup()
self.setup_controller._proxy_setup(self.client_password)
self._set_task_info(task_config)
self.setup_controller.reset_cache_dir(self.cache_dir)
logger.info("Setting up environment...")
@@ -307,27 +388,34 @@
reward = 0 # todo: Define reward calculation for each example
done = False # todo: Define episode termination condition for each example
info = {}
logger.info(f"Step {self._step_no} in trajectory {self._traj_no} with action: {action}")
# handle the special actions
# use .get to avoid a KeyError for dict actions without 'action_type'
if action in ['WAIT', 'FAIL', 'DONE'] or (type(action) == dict and action.get('action_type') in ['WAIT', 'FAIL', 'DONE']):
if action == 'WAIT':
if action == 'WAIT' or (type(action) == dict and action.get('action_type') == 'WAIT'):
time.sleep(pause)
elif action == 'FAIL':
elif action == 'FAIL' or (type(action) == dict and action.get('action_type') == 'FAIL'):
done = True
info = {"fail": True}
elif action == 'DONE':
elif action == 'DONE' or (type(action) == dict and action.get('action_type') == 'DONE'):
done = True
info = {"done": True}
if self.action_space == "computer_13":
# the set of all possible actions defined in the action representation
self.controller.execute_action(action)
elif self.action_space == "pyautogui":
if action in ['WAIT', 'FAIL', 'DONE']:
elif self.action_space == "pyautogui" or self.action_space == "claude_computer_use":
if action in ['WAIT', 'FAIL', 'DONE'] or (type(action) == dict and action.get('action_type') in ['WAIT', 'FAIL', 'DONE']):
self.controller.execute_action(action)
else:
# the set of all possible python commands inside `pyautogui`
self.controller.execute_python_command(action)
if type(action) == str:
# Fix PyAutoGUI '<' character bug before execution
fixed_command = _fix_pyautogui_less_than_bug(action)
self.controller.execute_python_command(fixed_command)
elif type(action) == dict:
# Fix PyAutoGUI '<' character bug before execution
fixed_command = _fix_pyautogui_less_than_bug(action['command'])
self.controller.execute_python_command(fixed_command)
time.sleep(pause)
observation = self._get_obs()
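For reference, step() in the "pyautogui" and "claude_computer_use" action spaces now accepts both command strings and dict-style actions; a minimal sketch of the accepted payloads (values are illustrative):
# Plain pyautogui command string, executed via execute_python_command
env.step("pyautogui.click(100, 200)")
# Special action as a dict; dict actions carry an 'action_type' key
env.step({"action_type": "WAIT"})
# Regular dict action: the 'command' field holds the pyautogui code to run
env.step({"action_type": "click", "command": "pyautogui.click(100, 200)"})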
@@ -340,19 +428,22 @@
"""
postconfig = self.evaluator.get("postconfig", [])
self.setup_controller.setup(postconfig)
self.setup_controller.setup(postconfig, self.enable_proxy)
# Mark environment as used if there were postconfig setup operations
if postconfig:
self.is_environment_used = True
if self.evaluator['func'] == "infeasible":
if len(self.action_history) > 0 and self.action_history[-1] == "FAIL":
return 1
else:
return 0
if len(self.action_history) > 0:
last_action = self.action_history[-1]
if last_action == "FAIL" or (type(last_action) == dict and last_action.get('action_type') == 'FAIL'):
return 1
return 0
else:
if len(self.action_history) > 0 and self.action_history[-1] == "FAIL":
return 0
if len(self.action_history) > 0:
last_action = self.action_history[-1]
if last_action == "FAIL" or (type(last_action) == dict and last_action.get('action_type') == 'FAIL'):
return 0
if type(self.metric) == list:
# Multiple metrics to evaluate whether the task is successfully completed


@@ -0,0 +1,499 @@
from __future__ import annotations
import logging
import os
import time
import re
from typing import Callable, Any, Optional, Tuple
from typing import List, Dict, Union
import gymnasium as gym
from desktop_env.controllers.python import PythonController
from desktop_env.controllers.setup import SetupController
from desktop_env.evaluators import metrics, getters
from desktop_env.providers import create_vm_manager_and_provider
logger = logging.getLogger("desktopenv.env")
Metric = Callable[[Any, Any], float]
Getter = Callable[[gym.Env, Dict[str, Any]], Any]
MAX_RETRIES = 5 # Maximum retries for environment setup
def _fix_pyautogui_less_than_bug(command: str) -> str:
"""
Fix PyAutoGUI '<' character bug by converting it to hotkey("shift", ',') calls.
This fixes the known PyAutoGUI issue where typing '<' produces '>' instead.
References:
- https://github.com/asweigart/pyautogui/issues/198
- https://github.com/xlang-ai/OSWorld/issues/257
Args:
command (str): The original pyautogui command
Returns:
str: The fixed command with '<' characters handled properly
"""
# Pattern to match press('<') or press('\u003c') calls
press_pattern = r'pyautogui\.press\(["\'](?:<|\\u003c)["\']\)'
# Handle press('<') calls
def replace_press_less_than(match):
return 'pyautogui.hotkey("shift", ",")'
# First handle press('<') calls
command = re.sub(press_pattern, replace_press_less_than, command)
# Pattern to match typewrite calls with quoted strings
typewrite_pattern = r'pyautogui\.typewrite\((["\'])(.*?)\1\)'
# Then handle typewrite calls
def process_typewrite_match(match):
quote_char = match.group(1)
content = match.group(2)
# Preprocess: Try to decode Unicode escapes like \u003c to actual '<'
# This handles cases where '<' is represented as escaped Unicode
try:
# Attempt to decode unicode escapes
decoded_content = content.encode('utf-8').decode('unicode_escape')
content = decoded_content
except UnicodeDecodeError:
# If decoding fails, proceed with original content to avoid breaking existing logic
pass  # Graceful degradation: fall back to the original content if decoding fails
# Check if content contains '<'
if '<' not in content:
return match.group(0)
# Split by '<' and rebuild
parts = content.split('<')
result_parts = []
for i, part in enumerate(parts):
if i == 0:
# First part
if part:
result_parts.append(f"pyautogui.typewrite({quote_char}{part}{quote_char})")
else:
# Add hotkey for '<' and then typewrite for the rest
result_parts.append('pyautogui.hotkey("shift", ",")')
if part:
result_parts.append(f"pyautogui.typewrite({quote_char}{part}{quote_char})")
return '; '.join(result_parts)
command = re.sub(typewrite_pattern, process_typewrite_match, command)
return command
class DesktopEnv(gym.Env):
"""
DesktopEnv with OpenAI Gym interface. It provides a desktop environment for setting up and evaluating desktop automation tasks.
"""
def __init__(
self,
provider_name: str = "vmware",
region: str = None,
path_to_vm: str = None,
snapshot_name: str = "init_state",
action_space: str = "pyautogui",
cache_dir: str = "cache",
screen_size: Tuple[int] = (int(os.environ.get("SCREEN_WIDTH", 1920)), int(os.environ.get("SCREEN_HEIGHT", 1080))),
headless: bool = False,
require_a11y_tree: bool = True,
require_terminal: bool = False,
os_type: str = "Ubuntu",
enable_proxy: bool = False,
client_password: str = "",
):
"""
Args:
provider_name (str): virtualization provider name, defaults to "vmware"
region (str): the region in which to allocate machines; applies to cloud providers, defaults to "us-east-1"
path_to_vm (str): path to the .vmx file
action_space (str): "computer_13" | "pyautogui" | "claude_computer_use" | "autoglm_computer_use"
snapshot_name (str): snapshot name to revert to, defaults to "init_state"
cache_dir (str): cache directory for task-related files such as
reference files used in evaluation
screen_size (Tuple[int]): screen size of the VM
headless (bool): whether to run the VM in headless mode
require_a11y_tree (bool): whether to require the accessibility tree
require_terminal (bool): whether to require terminal output
os_type (str): operating system type, defaults to "Ubuntu"
enable_proxy (bool): whether to enable proxy support, defaults to False
client_password (str): VM user password; when empty, defaults to "osworld-public-evaluation" on AWS and "password" otherwise
"""
# Initialize VM manager and virtualization provider
self.region = region
self.provider_name = provider_name
self.enable_proxy = enable_proxy # Store proxy enablement setting
if client_password == "":
if self.provider_name == "aws":
self.client_password = "osworld-public-evaluation"
else:
self.client_password = "password"
else:
self.client_password = client_password
self.screen_width = screen_size[0]
self.screen_height = screen_size[1]
# Default
self.server_port = 5000
self.chromium_port = 9222
self.vnc_port = 8006
self.vlc_port = 8080
# Initialize with default (no proxy) provider
self.current_use_proxy = False
self.manager, self.provider = None, None
self.os_type = os_type
self.path_to_vm = path_to_vm
# Track whether environment has been used (step/setup) to optimize snapshot revert
# docker, aws, gcp, azure, aliyun, volcengine are always unused as the emulator starts from a clean state
# vmware, virtualbox are always used as the emulator starts from a dirty state
if self.provider_name in {"docker", "aws", "gcp", "azure", "aliyun", "volcengine"}:
self.is_environment_used = False
elif self.provider_name in {"vmware", "virtualbox"}:
self.is_environment_used = True
else:
raise ValueError(f"Invalid provider name: {self.provider_name}")
self.snapshot_name = snapshot_name
self.cache_dir_base: str = cache_dir
self.headless = headless
self.require_a11y_tree = require_a11y_tree
self.require_terminal = require_terminal
# mode: human or machine
self.instruction = None
assert action_space in ["computer_13", "pyautogui", "claude_computer_use", "autoglm_computer_use"]
self.action_space = action_space # todo: refactor it to the ActType
# episodic state, such as counters, will be updated or reset
# when calling self.reset()
self._traj_no: int = -1
self._step_no: int = 0
self.action_history: List[Dict[str, any]] = []
def start(self):
# Initialize emulator and controller
if not self.manager and not self.provider:
logger.info("Initializing...")
self.manager, self.provider = create_vm_manager_and_provider(self.provider_name, self.region, use_proxy=False)
if self.path_to_vm:
self.path_to_vm = os.path.abspath(os.path.expandvars(os.path.expanduser(self.path_to_vm))) \
if self.provider_name in {"vmware", "virtualbox"} else self.path_to_vm
else:
self.path_to_vm = self.manager.get_vm_path(os_type=self.os_type, region=self.region, screen_size=(self.screen_width, self.screen_height))
self._start_emulator()
def _start_emulator(self):
try:
# Power on the virtual machine
self.provider.start_emulator(self.path_to_vm, self.headless, self.os_type)
# Get the ip from the virtual machine, and setup the controller
vm_ip_ports = self.provider.get_ip_address(self.path_to_vm).split(':')
self.vm_ip = vm_ip_ports[0]
# Get the ports from the virtual machine (for Docker provider only)
if len(vm_ip_ports) > 1:
self.server_port = int(vm_ip_ports[1])
self.chromium_port = int(vm_ip_ports[2])
self.vnc_port = int(vm_ip_ports[3])
self.vlc_port = int(vm_ip_ports[4])
self.controller = PythonController(vm_ip=self.vm_ip, server_port=self.server_port)
self.setup_controller = SetupController(vm_ip=self.vm_ip, server_port=self.server_port, chromium_port=self.chromium_port, vlc_port=self.vlc_port, cache_dir=self.cache_dir_base, client_password=self.client_password, screen_width=self.screen_width, screen_height=self.screen_height)
except Exception as e:
try:
self.provider.stop_emulator(self.path_to_vm)
except Exception as stop_err:
logger.warning(f"Cleanup after interrupt failed: {stop_err}")
raise
def _revert_to_snapshot(self):
# Revert to a certain snapshot of the virtual machine, and refresh the path and IP of the VM,
# since these may change when the environment is backed by cloud services
path_to_vm = self.provider.revert_to_snapshot(self.path_to_vm, self.snapshot_name)
if path_to_vm and not path_to_vm == self.path_to_vm:
# path_to_vm has to be a new path
self.manager.delete_vm(self.path_to_vm, self.region)
self.manager.add_vm(path_to_vm, self.region)
self.manager.occupy_vm(path_to_vm, os.getpid(), self.region)
self.path_to_vm = path_to_vm
def _save_state(self, snapshot_name=None):
# Save the current virtual machine state to a certain snapshot name
self.provider.save_state(self.path_to_vm, snapshot_name)
def close(self):
# Close (release) the virtual machine
self.provider.stop_emulator(self.path_to_vm)
def reset(self, task_config: Optional[Dict[str, Any]] = None, seed=None, options=None) -> Dict[str, Any]:
# Reset to certain task in OSWorld
logger.info("Resetting environment...")
logger.info("Switching task...")
logger.info("Setting counters...")
self._traj_no += 1
self._step_no = 0
self.action_history.clear()
for attempt in range(MAX_RETRIES):
# Only revert to snapshot if environment has been used (step/setup)
# This optimization is especially important for cloud providers like AWS
# where unnecessary snapshot operations are costly and time-consuming
if task_config is not None:
# Only consider task proxy requirement if proxy is enabled at system level
task_use_proxy = task_config.get("proxy", False) and self.enable_proxy
if not self.enable_proxy and task_config.get("proxy", False):
logger.info("Task requires proxy but proxy is disabled at system level, ignoring proxy requirement.")
if task_use_proxy != self.current_use_proxy:
# kept because get_info_from_website depends on this
self.current_use_proxy = task_use_proxy
if self.is_environment_used:
logger.info("Environment has been used, reverting to snapshot {}...".format(self.snapshot_name))
self._revert_to_snapshot()
logger.info("Starting emulator...")
self._start_emulator()
logger.info("Emulator started.")
# Reset the usage flag after reverting
self.is_environment_used = False
else:
logger.info("Environment is clean, skipping snapshot revert (provider: {}).".format(self.provider_name))
if task_config is not None:
if task_config.get("proxy", False) and self.enable_proxy:
# If using proxy and proxy is enabled, set up the proxy configuration
self.setup_controller._proxy_setup(self.client_password)
self._set_task_info(task_config)
self.setup_controller.reset_cache_dir(self.cache_dir)
logger.info("Setting up environment...")
success = self.setup_controller.setup(self.config, task_config.get("proxy", False) and self.enable_proxy)
if success:
# Mark environment as used when setup is successfully executed
if self.config: # Only mark as used if there were actual setup operations
self.is_environment_used = True
break
else:
logger.error(
"Environment setup failed, retrying (%d/%d)...",
attempt + 1,
MAX_RETRIES,
)
time.sleep(5)
else:
break
logger.info("Environment setup complete.")
observation = self._get_obs()
return observation
def _get_obs(self):
# We provide screenshot, accessibility_tree (optional), terminal (optional), and instruction.
# can be customized and scaled
return {
"screenshot": self.controller.get_screenshot(),
"accessibility_tree": self.controller.get_accessibility_tree() if self.require_a11y_tree else None,
"terminal": self.controller.get_terminal_output() if self.require_terminal else None,
"instruction": self.instruction
}
@property
def vm_platform(self):
return self.controller.get_vm_platform()
@property
def vm_screen_size(self):
return self.controller.get_vm_screen_size()
def _set_task_info(self, task_config: Dict[str, Any]):
"""Set task info (proxy logic is handled in reset method)"""
self.task_id: str = task_config["id"]
self.cache_dir: str = os.path.join(self.cache_dir_base, self.task_id)
os.makedirs(self.cache_dir, exist_ok=True)
self.instruction = task_config["instruction"]
self.config = task_config["config"] if "config" in task_config else []
self._set_evaluator_info(task_config)
def _set_evaluator_info(self, task_config: Dict[str, Any]):
"""Set evaluator information from task config"""
if "evaluator" not in task_config:
return
# evaluator dict
# func -> metric function string, or list of metric function strings
# conj -> conjunction of multiple metrics if func is a list with length > 1, "and"/"or"
# result -> result getter config, or list of result getter configs
# expected (optional) -> expected getter config, or list of expected getter configs
# options (optional) -> metric options, or list of metric options
# if func is a list of strings, then result, expected (if present), and options (if present) must also be lists of the same length
# even if one of the metrics does not need an expected or options field, it must still appear in its list as None
self.evaluator = task_config["evaluator"]
self.metric: Metric = [getattr(metrics, func) for func in self.evaluator["func"]] \
if isinstance(self.evaluator["func"], list) \
else getattr(metrics, self.evaluator["func"])
self.metric_conj: str = self.evaluator.get("conj", "and") # take conjunction of multiple metrics
if "result" in self.evaluator and len(self.evaluator["result"]) > 0:
self.result_getter: Getter = [getattr(getters, "get_{:}".format(res["type"])) for res in
self.evaluator["result"]] \
if isinstance(self.evaluator["result"], list) \
else getattr(getters, "get_{:}".format(self.evaluator["result"]["type"]))
else:
self.result_getter = [None] * len(self.metric) \
if isinstance(self.metric, list) \
else None
if "expected" in self.evaluator and len(self.evaluator["expected"]) > 0:
self.expected_getter: Getter = [getattr(getters, "get_{:}".format(exp["type"])) if exp else None for exp in
self.evaluator["expected"]] \
if isinstance(self.evaluator["expected"], list) \
else getattr(getters, "get_{:}".format(self.evaluator["expected"]["type"]))
else:
self.expected_getter = [None] * len(self.metric) \
if isinstance(self.metric, list) \
else None
self.metric_options: Union[List[Dict[str, Any]], Dict[str, Any]] = [opt if opt else {} for opt in
self.evaluator["options"]] \
if isinstance(self.evaluator.get("options", {}), list) \
else self.evaluator["options"] \
if "options" in self.evaluator \
else [{}] * len(self.metric) \
if isinstance(self.metric, list) \
else {}
assert (not isinstance(self.evaluator["func"], list)
or (len(self.metric) == len(self.result_getter) == len(self.expected_getter) == len(
self.metric_options)))
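A hypothetical evaluator block that follows the layout described in the comment above (the getter type and field names are illustrative, not taken from this diff):
# Sketch only: "func" names a metric function, "result" configures a getter
evaluator = {
    "func": "check_json",
    "conj": "and",
    "result": {"type": "vm_file", "path": "/home/user/settings.json", "dest": "settings.json"},
    "options": {},
}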
def step(self, action, pause=2):
self._step_no += 1
self.action_history.append(action)
# Mark environment as used when step is called
self.is_environment_used = True
reward = 0 # todo: Define reward calculation for each example
done = False # todo: Define episode termination condition for each example
info = {}
logger.info(f"Step {self._step_no} in trajectory {self._traj_no} with action: {action}")
# handle the special actions
# use .get to avoid a KeyError for dict actions without 'action_type'
if action in ['WAIT', 'FAIL', 'DONE'] or (type(action) == dict and action.get('action_type') in ['WAIT', 'FAIL', 'DONE']):
if action == 'WAIT' or (type(action) == dict and action.get('action_type') == 'WAIT'):
time.sleep(pause)
elif action == 'FAIL' or (type(action) == dict and action.get('action_type') == 'FAIL'):
done = True
info = {"fail": True}
elif action == 'DONE' or (type(action) == dict and action.get('action_type') == 'DONE'):
done = True
info = {"done": True}
if self.action_space == "computer_13":
# the set of all possible actions defined in the action representation
self.controller.execute_action(action)
elif self.action_space == "pyautogui" or self.action_space == "claude_computer_use":
if action in ['WAIT', 'FAIL', 'DONE'] or (type(action) == dict and action.get('action_type') in ['WAIT', 'FAIL', 'DONE']):
self.controller.execute_action(action)
else:
# the set of all possible python commands inside `pyautogui`
if type(action) == str:
# Fix PyAutoGUI '<' character bug before execution
fixed_command = _fix_pyautogui_less_than_bug(action)
self.controller.execute_python_command(fixed_command)
elif type(action) == dict:
# Fix PyAutoGUI '<' character bug before execution
fixed_command = _fix_pyautogui_less_than_bug(action['command'])
self.controller.execute_python_command(fixed_command)
time.sleep(pause)
observation = self._get_obs()
return observation, reward, done, info
def evaluate(self):
"""
Evaluate whether the task is successfully completed.
"""
postconfig = self.evaluator.get("postconfig", [])
self.setup_controller.setup(postconfig, self.enable_proxy)
# Mark environment as used if there were postconfig setup operations
if postconfig:
self.is_environment_used = True
if self.evaluator['func'] == "infeasible":
if len(self.action_history) > 0:
last_action = self.action_history[-1]
if last_action == "FAIL" or (type(last_action) == dict and last_action.get('action_type') == 'FAIL'):
return 1
return 0
else:
if len(self.action_history) > 0:
last_action = self.action_history[-1]
if last_action == "FAIL" or (type(last_action) == dict and last_action.get('action_type') == 'FAIL'):
return 0
if type(self.metric) == list:
# Multiple metrics to evaluate whether the task is successfully completed
results = []
assert len(self.metric) == len(self.result_getter), "The number of metrics and result getters must be the same"
if "expected" in self.evaluator:
assert len(self.metric) == len(self.expected_getter), "The number of metrics and expected getters must be the same"
for idx, metric in enumerate(self.metric):
try:
config = self.evaluator["result"][idx]
result_state = self.result_getter[idx](self, config)
except FileNotFoundError:
logger.error("File not found!")
if self.metric_conj == 'and':
return 0
if "expected" in self.evaluator and self.expected_getter and self.evaluator["expected"]:
expected_state = self.expected_getter[idx](self, self.evaluator["expected"][idx])
metric: int = metric(result_state, expected_state, **self.metric_options[idx])
else:
metric: int = metric(result_state, **self.metric_options[idx])
if self.metric_conj == 'and' and float(metric) == 0.0:
return 0
elif self.metric_conj == 'or' and float(metric) == 1.0:
return 1
else:
results.append(metric)
return sum(results) / len(results) if self.metric_conj == 'and' else max(results)
else:
# Single metric to evaluate whether the task is successfully completed
try:
result_state = self.result_getter(self, self.evaluator["result"])
except FileNotFoundError:
logger.error("File not found!")
return 0
if "expected" in self.evaluator and self.expected_getter and self.evaluator["expected"]:
expected_state = self.expected_getter(self, self.evaluator["expected"])
metric: float = self.metric(result_state, expected_state, **self.metric_options)
else:
metric: float = self.metric(result_state, **self.metric_options)
return metric
def render(self, mode='rgb_array'):
if mode == 'rgb_array':
return self.controller.get_screenshot()
else:
raise ValueError('Unsupported render mode: {}'.format(mode))
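Putting the refactor together, a minimal lifecycle sketch (the VM path and task config are hypothetical; it assumes a locally configured provider):
# start() now provisions the manager/provider before the first reset()
env = DesktopEnv(provider_name="vmware", path_to_vm="/vms/ubuntu.vmx", action_space="pyautogui")
env.start()
task_config = {
    "id": "demo-task",
    "instruction": "Open a terminal.",
    "config": [],
    "evaluator": {"func": "infeasible"},
}
obs = env.reset(task_config=task_config)
obs, reward, done, info = env.step("pyautogui.hotkey('ctrl', 'alt', 't')")
score = env.evaluate()
env.close()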


@@ -16,6 +16,7 @@ from .chrome import (
get_active_tab_info,
get_enable_do_not_track,
get_enable_enhanced_safety_browsing,
get_enable_safe_browsing,
get_new_startup_page,
get_find_unpacked_extension_path,
get_data_delete_automacally,

File diff suppressed because it is too large.


@@ -1,10 +1,13 @@
import os
import logging
from typing import Dict, List, Set
from typing import Optional, Any, Union
from datetime import datetime
import requests
import pandas as pd
logger = logging.getLogger("desktopenv.getter.file")
def get_content_from_vm_file(env, config: Dict[str, Any]) -> Any:
"""
@@ -101,16 +104,42 @@ def get_vm_file(env, config: Dict[str, Any]) -> Union[Optional[str], List[Option
for i, (p, d) in enumerate(zip(paths, dests)):
_path = os.path.join(env.cache_dir, d)
file = env.controller.get_file(p)
if file is None:
try:
# Try to get file from VM
file = env.controller.get_file(p)
if file is None:
logger.warning(f"Failed to get file from VM: {p}")
if i in gives:
cache_paths.append(None)
continue
if i in gives:
cache_paths.append(_path)
# Write file with robust error handling
try:
# Ensure cache directory exists
os.makedirs(env.cache_dir, exist_ok=True)
with open(_path, "wb") as f:
f.write(file)
logger.info(f"Successfully saved file: {_path} ({len(file)} bytes)")
except IOError as e:
logger.error(f"IO error writing file {_path}: {e}")
if i in gives:
cache_paths[-1] = None # Replace the path we just added with None
except Exception as e:
logger.error(f"Unexpected error writing file {_path}: {e}")
if i in gives:
cache_paths[-1] = None
except Exception as e:
logger.error(f"Error processing file {p}: {e}")
if i in gives:
cache_paths.append(None)
continue
if i in gives:
cache_paths.append(_path)
with open(_path, "wb") as f:
f.write(file)
return cache_paths[0] if len(cache_paths)==1 else cache_paths
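A sketch of how this getter is typically driven; the config field names ("path", "dest", "gives") are inferred from the loop variables above and should be treated as hypothetical:
# "path" lists VM-side files, "dest" their cache-relative names,
# and "gives" the indices whose local paths are returned
config = {"path": ["/home/user/report.pdf"], "dest": ["report.pdf"], "gives": [0]}
local_path = get_vm_file(env, config)  # None entries now mark failed downloads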


@@ -1,6 +1,9 @@
import os
import logging
from typing import Union
logger = logging.getLogger("desktopenv.getters.info")
def get_vm_screen_size(env, config: dict) -> dict:
return env.controller.get_vm_screen_size()
@@ -14,6 +17,29 @@ def get_vm_wallpaper(env, config: dict) -> Union[str, bytes]:
_path = os.path.join(env.cache_dir, config["dest"])
content = env.controller.get_vm_wallpaper()
# Check if content is None or empty
if content is None:
logger.error("Failed to get VM wallpaper: controller returned None")
# Create an empty file to prevent downstream errors
with open(_path, "wb") as f:
f.write(b"")
return _path
if not isinstance(content, bytes):
logger.error(f"Invalid wallpaper content type: {type(content)}, expected bytes")
# Create an empty file to prevent downstream errors
with open(_path, "wb") as f:
f.write(b"")
return _path
if len(content) == 0:
logger.warning("VM wallpaper content is empty")
# Create an empty file to prevent downstream errors
with open(_path, "wb") as f:
f.write(b"")
return _path
with open(_path, "wb") as f:
f.write(content)


@@ -3,6 +3,7 @@ import os
import re
import shutil
import io
import time
from itertools import product
from typing import Any, Dict, List, Union
@@ -29,8 +30,8 @@ def is_expected_active_tab(active_tab_info: Dict[str, str], rule: Dict[str, Any]
actual_url = active_tab_info.get('url', None)
else:
actual_url = active_tab_info
print("expected_url: {}".format(expected_url))
print("actual_url: {}".format(actual_url))
logger.info("expected_url: {}".format(expected_url))
logger.info("actual_url: {}".format(actual_url))
return 1 if compare_urls(expected_url, actual_url) else 0
else:
logger.error(f"Unknown type: {match_type}")
@@ -74,25 +75,36 @@ def is_expected_url_pattern_match(result, rules) -> float:
if not result:
return 0.
if type(result) == dict:
result_url = result["url"]
print("result url: {}".format(result_url))
else:
# Extract URL from result parameter - result can be either a string URL or a dict with 'url' field
if isinstance(result, str):
result_url = result
logger.info("result url: {}".format(result_url))
elif isinstance(result, dict) and 'url' in result:
result_url = result['url']
logger.info("result url: {}".format(result_url))
else:
logger.error(f"Invalid result format: {type(result)}, expected string URL or dict with 'url' field")
return 0.
logger.info(f"Result URL to match: {result_url}")
# expect_regex = re.compile(rules["expected"])
patterns = rules["expected"]
print("expected_regex: {}".format(patterns))
logger.info("expected_regex: {}".format(patterns))
for pattern in patterns:
match = re.search(pattern, result_url)
print(match)
logger.info("match: {}".format(match))
if not match:
return 0.
return 1.
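Since every regex in rules["expected"] must match for a score of 1.0, a small illustrative call (values are made up):
score = is_expected_url_pattern_match(
    {"url": "https://example.com/search?q=osworld"},
    {"expected": [r"example\.com", r"q=osworld"]},
)  # -> 1.0; any non-matching pattern yields 0.0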
def is_expected_installed_extensions(installed_extensions, expected) -> float:
print("installed_extensions: ")
print(installed_extensions)
if not installed_extensions:
return 0.
logger.info("installed_extensions: ")
logger.info(installed_extensions)
expected_extensions = expected["expected"]
# whether the expected extensions are installed
@@ -109,6 +121,8 @@ def is_expected_tabs(open_tabs: List[Dict[str, str]], rule: Dict[str, Any]) -> f
"""
Checks if the expected tabs are open in Chrome.
"""
if not open_tabs:
return 0.
match_type = rule['type']
@@ -146,8 +160,10 @@ def is_expected_bookmarks(bookmarks: List[str], rule: Dict[str, Any]) -> float:
bookmark['type'] == 'folder' and bookmark['name'] == 'Liked Authors'), None)
if liked_authors_folder:
# Check if it contains the specified URLs
logger.info("'Liked Authors' folder exists")
liked_authors_urls = [bookmark['url'] for bookmark in liked_authors_folder['children'] if
bookmark['type'] == 'url']
logger.info("Here is the 'Liked Authors' folder's urls: {}".format(liked_authors_urls))
urls = rule['urls']
@@ -168,6 +184,9 @@ def is_expected_bookmarks(bookmarks: List[str], rule: Dict[str, Any]) -> float:
def is_expected_search_query(active_tab_info: Dict[str, str], rules: Dict[str, Any]) -> float:
if not active_tab_info:
return 0.
expected = rules['expect']
pattern = expected['pattern']
matched = re.search(pattern, active_tab_info['url'])
@@ -210,6 +229,7 @@ import imagehash
from pathlib import Path
import typing
import time
def compare_pdf_images(pdf1_path: str, pdf2_path: str, **kwargs) -> float:


@@ -3,6 +3,10 @@ import os
import re
import xml.etree.ElementTree as ET
import zipfile
import tempfile
import subprocess
import struct
import numpy as np
from io import BytesIO
from typing import List, Dict, Any
@@ -21,6 +25,77 @@ from skimage.color import rgb2lab
logger = logging.getLogger("desktopenv.metric.docs")
def read_x11_image(filepath):
"""
Pure Python X11 (XWD) format reader that converts to PIL Image.
No external dependencies required.
Args:
filepath: Path to the X11/XWD format image file
Returns:
PIL.Image: Converted image in RGB format
Raises:
ValueError: If the format is not supported
IOError: If file cannot be read
"""
with open(filepath, 'rb') as f:
# Read X11 header
header_data = f.read(100)
# Parse header (big endian format)
header_size = struct.unpack('>I', header_data[0:4])[0]
version = struct.unpack('>I', header_data[4:8])[0]
pixmap_format = struct.unpack('>I', header_data[8:12])[0]
pixmap_depth = struct.unpack('>I', header_data[12:16])[0]
pixmap_width = struct.unpack('>I', header_data[16:20])[0]
pixmap_height = struct.unpack('>I', header_data[20:24])[0]
logger.debug(f"X11 image info: {pixmap_width}x{pixmap_height}, depth={pixmap_depth}")
# Skip to the end of header
f.seek(header_size)
# Read pixel data based on depth
if pixmap_depth == 32:
# 32-bit RGBA format
bytes_per_pixel = 4
total_pixels = pixmap_width * pixmap_height
pixel_data = f.read(total_pixels * bytes_per_pixel)
# Convert to numpy array
pixels = np.frombuffer(pixel_data, dtype=np.uint8)
# Reshape to image dimensions
pixels = pixels.reshape((pixmap_height, pixmap_width, bytes_per_pixel))
# X11 format is typically BGRA, convert to RGB
# Swap B and R channels, ignore alpha
rgb_pixels = pixels[:, :, [2, 1, 0]] # BGR -> RGB
# Create PIL image
image = Image.fromarray(rgb_pixels, 'RGB')
return image
elif pixmap_depth == 24:
# 24-bit RGB format
bytes_per_pixel = 3
total_pixels = pixmap_width * pixmap_height
pixel_data = f.read(total_pixels * bytes_per_pixel)
# Convert to numpy array and reshape
pixels = np.frombuffer(pixel_data, dtype=np.uint8)
pixels = pixels.reshape((pixmap_height, pixmap_width, bytes_per_pixel))
# Create PIL image (assuming RGB order)
image = Image.fromarray(pixels, 'RGB')
return image
else:
raise ValueError(f'Unsupported X11 pixel depth: {pixmap_depth}. Only 24-bit and 32-bit formats are supported.')
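A hypothetical round-trip through the reader, converting an XWD screen dump to PNG (the paths are illustrative):
img = read_x11_image("/tmp/screen.xwd")  # raises ValueError for depths other than 24/32-bit
img.save("/tmp/screen.png", "PNG")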
def find_default_font(config_file_path, rules):
"""Find the default font in LibreOffice Writer."""
default_font = None
@@ -294,29 +369,136 @@ def compare_docx_images(docx_file1, docx_file2):
def compare_image_text(image_path, rule):
if not image_path:
return 0
reader = easyocr.Reader(['en'])
result = reader.readtext(image_path)
extracted_text = ' '.join([entry[1] for entry in result])
# Log OCR results
logger.info(f"OCR extracted texts: {[entry[1] for entry in result]}")
logger.info(f"Combined extracted text: {extracted_text}")
# Check if the image file exists
if not os.path.exists(image_path):
logger.error(f"Image file not found: {image_path}")
return 0
if rule['type'] == 'text':
target_text = rule['text']
match_found = target_text in extracted_text
# Check image format and convert if necessary
temp_image_path = None
actual_image_path = image_path
try:
# First, try to identify the file format
result = subprocess.run(['file', image_path], capture_output=True, text=True)
file_info = result.stdout.lower()
# Log matching results
logger.info(f"Target text: '{target_text}'")
logger.info(f"Match found: {match_found}")
if match_found:
logger.info("✅ Text matching successful!")
else:
logger.info("❌ Text matching failed!")
# If it's an X11 screen dump, we need to convert it
if 'x-window screen dump' in file_info or 'xwd' in file_info:
logger.info(f"Detected X11 screen dump format in {image_path}, attempting conversion...")
# Create a temporary file for the converted image
temp_fd, temp_image_path = tempfile.mkstemp(suffix='.png')
os.close(temp_fd)
# Try to convert using PIL with xwd support or other methods
try:
# First try with PIL directly (sometimes it can handle xwd)
img = Image.open(image_path)
img.save(temp_image_path, 'PNG')
actual_image_path = temp_image_path
logger.info(f"Successfully converted X11 image using PIL")
except Exception as pil_error:
logger.warning(f"PIL conversion failed: {pil_error}")
# Try our custom X11 reader (pure Python solution)
try:
logger.info("Attempting conversion using custom X11 reader...")
x11_image = read_x11_image(image_path)
x11_image.save(temp_image_path, 'PNG')
actual_image_path = temp_image_path
logger.info(f"✅ Successfully converted X11 image using custom reader")
except Exception as custom_error:
logger.warning(f"Custom X11 conversion failed: {custom_error}")
# Try with netpbm tools if available (fallback)
try:
result = subprocess.run(['which', 'xwdtopnm'], capture_output=True)
if result.returncode == 0:
# Use netpbm tools chain: xwdtopnm -> pnmtopng
with open(temp_image_path, 'wb') as f:
    result = subprocess.run(['xwdtopnm', image_path],
                            stdout=subprocess.PIPE, check=True)
    subprocess.run(['pnmtopng'], input=result.stdout,
                   stdout=f, check=True)
actual_image_path = temp_image_path
logger.info("Successfully converted X11 image using netpbm tools")
else:
raise Exception("netpbm tools not available")
except Exception as netpbm_error:
logger.warning(f"netpbm conversion failed: {netpbm_error}")
# All conversions failed
logger.error(
f"❌ All X11 conversion methods failed.\n"
f"Attempted: PIL → Custom Python reader → netpbm tools\n"
f"💡 The image might be corrupted or in an unsupported X11 variant"
)
# If all conversions fail, try to use the original file anyway
# Sometimes easyocr might handle it better than PIL
if temp_image_path:
os.unlink(temp_image_path)
temp_image_path = None
actual_image_path = image_path
logger.info(f"Will attempt OCR on original file format (likely to fail)")
return 1 if match_found else 0
else:
raise ValueError("Unsupported rule type")
# Now attempt OCR with error handling
try:
reader = easyocr.Reader(['en'])
result = reader.readtext(actual_image_path)
extracted_text = ' '.join([entry[1] for entry in result])
# Log OCR results
logger.info(f"OCR extracted texts: {[entry[1] for entry in result]}")
logger.info(f"Combined extracted text: {extracted_text}")
if rule['type'] == 'text':
target_text = rule['text']
match_found = target_text in extracted_text
# Log matching results
logger.info(f"Target text: '{target_text}'")
logger.info(f"Match found: {match_found}")
if match_found:
logger.info("✅ Text matching successful!")
else:
logger.info("❌ Text matching failed!")
return 1 if match_found else 0
else:
raise ValueError("Unsupported rule type")
except Exception as ocr_error:
logger.error(f"OCR processing failed for {actual_image_path}: {ocr_error}")
# Check if this is specifically an X11 format issue
if 'x-window screen dump' in file_info or 'xwd' in file_info:
logger.error(
f"🚨 OCR failed on X11 screen dump after all conversion attempts.\n"
f"This might indicate:\n"
f" 1. The X11 file is corrupted or in an unsupported variant\n"
f" 2. Missing dependencies (numpy, PIL)\n"
f" 3. Insufficient memory for large images"
)
return 0
except Exception as e:
logger.error(f"Error processing image {image_path}: {e}")
return 0
finally:
# Clean up temporary file if created
if temp_image_path and os.path.exists(temp_image_path):
try:
os.unlink(temp_image_path)
except:
pass
def compare_line_spacing(docx_file1, docx_file2):

View File

@@ -298,34 +298,84 @@ def check_json(result: str, rules: Dict[str, List[Dict[str, Union[List[str], str
"""
if result is None:
logger.warning("Result file path is None, returning 0.0")
return 0.
# Check if file exists
if not os.path.exists(result):
logger.warning(f"Result file does not exist: {result}, returning 0.0")
return 0.
try:
with open(result, 'r', encoding='utf-8') as f:
if is_yaml:
try:
# Use SafeLoader instead of Loader for better security and error handling
result_data: Dict[str, Any] = yaml.safe_load(f)
if result_data is None:
logger.warning(f"YAML file {result} is empty or contains only null values, returning 0.0")
return 0.
except yaml.YAMLError as e:
logger.error(f"YAML parsing error in file {result}: {e}")
logger.error(f"File content might be corrupted or have invalid YAML syntax")
return 0.
except Exception as e:
logger.error(f"Unexpected error parsing YAML file {result}: {e}")
return 0.
else:
try:
result_data: Dict[str, Any] = json.load(f)
except json.JSONDecodeError as e:
logger.error(f"JSON parsing error in file {result}: {e}")
return 0.
except Exception as e:
logger.error(f"Unexpected error parsing JSON file {result}: {e}")
return 0.
except IOError as e:
logger.error(f"IO error reading file {result}: {e}")
return 0.
except Exception as e:
logger.error(f"Unexpected error reading file {result}: {e}")
return 0.
with open(result) as f:
if is_yaml:
result: Dict[str, Any] = yaml.load(f, Loader=yaml.Loader)
else:
result: Dict[str, Any] = json.load(f)
expect_rules = rules.get("expect", {})
unexpect_rules = rules.get("unexpect", {})
metric = True
for r in expect_rules:
value = result
for k in r["key"]:
try:
value = value[k]
except KeyError:
return 0.
metric = metric and _match_value_to_rule(value, r)
value = result_data
try:
for k in r["key"]:
try:
value = value[k]
except KeyError:
logger.debug(f"Key '{k}' not found in result data, returning 0.0")
return 0.
except TypeError:
logger.debug(f"Cannot access key '{k}' - value is not a dictionary, returning 0.0")
return 0.
metric = metric and _match_value_to_rule(value, r)
except Exception as e:
logger.error(f"Error processing expect rule {r}: {e}")
return 0.
for r in unexpect_rules:
value = result
for k in r["key"]:
try:
value = value[k]
except KeyError:
value = None
break
metric = metric and not _match_value_to_rule(value, r)
value = result_data
try:
for k in r["key"]:
try:
value = value[k]
except KeyError:
value = None
break
except TypeError:
value = None
break
metric = metric and not _match_value_to_rule(value, r)
except Exception as e:
logger.error(f"Error processing unexpect rule {r}: {e}")
return 0.
return float(metric)
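An illustrative invocation (the rule fields beyond "key" are consumed by _match_value_to_rule, which is not shown in this diff, so "type" and "rule" here are assumptions, as is treating is_yaml as an ordinary keyword flag):
score = check_json(
    "/home/user/.config/Code/User/settings.json",
    {"expect": [{"key": ["editor", "fontSize"], "type": "match", "rule": "14"}],
     "unexpect": []},
    is_yaml=False,
)  # -> 1.0 only if every expect rule matches and no unexpect rule does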


@@ -17,6 +17,16 @@ def compare_image_list(pred_img_path_list: Union[str, List[str]],
return 0.0
pred_img = Image.open(pred_img_path)
gold_img = Image.open(gold_img_path)
# Check if images have different sizes and resize if necessary
if pred_img.size != gold_img.size:
logging.debug(f"Images have different sizes: {pred_img.size} vs {gold_img.size}, resizing predicted image to match gold image")
pred_img = pred_img.resize(gold_img.size, Image.Resampling.LANCZOS)
# Ensure both images are in the same mode for comparison
if pred_img.mode != gold_img.mode:
pred_img = pred_img.convert(gold_img.mode)
diff = ImageChops.difference(pred_img, gold_img)
if diff.getbbox():
return 0.0
@@ -190,6 +200,27 @@ def calculate_image_sharpness(image_path):
def structure_check_by_mse(img1, img2, threshold=0.03):
"""Check if two images are approximately the same by MSE"""
# Ensure both images are PIL Image objects
if not hasattr(img1, 'size') or not hasattr(img2, 'size'):
# Convert numpy arrays to PIL Images if needed
if hasattr(img1, 'shape'):
img1 = Image.fromarray(img1)
if hasattr(img2, 'shape'):
img2 = Image.fromarray(img2)
# Check if images have different sizes and resize if necessary
if img1.size != img2.size:
logging.debug(f"Images have different sizes: {img1.size} vs {img2.size}, resizing first image to match second")
img1 = img1.resize(img2.size, Image.Resampling.LANCZOS)
# Ensure both images are in RGB mode for consistent comparison
if img1.mode != 'RGB':
img1 = img1.convert('RGB')
if img2.mode != 'RGB':
img2 = img2.convert('RGB')
# Now calculate MSE with properly sized images
mse = np.mean(
(np.array(img1, dtype=np.float32) / 255
- np.array(img2, dtype=np.float32) / 255) ** 2)
@@ -200,7 +231,55 @@ def structure_check_by_mse(img1, img2, threshold=0.03):
def structure_check_by_ssim(img1, img2, threshold=0.9):
"""Check if two images are approximately the same by SSIM"""
similarity = ssim(np.array(img1), np.array(img2), multichannel=True, channel_axis=-1)
min_size = 7
if img1.width < min_size or img1.height < min_size or \
img2.width < min_size or img2.height < min_size:
logging.warning(f"image too small for ssim: {img1.size} vs {img2.size}")
return False
if img1.mode != 'RGB':
img1 = img1.convert('RGB')
if img2.mode != 'RGB':
img2 = img2.convert('RGB')
# Now both images are in RGB mode, so they should have the same number of channels (3)
# But we still need to check the size (though the caller should have checked)
if img1.size != img2.size:
# If the sizes are different, we cannot compare, return False
logging.debug(f"Images have different sizes: {img1.size} vs {img2.size}")
return False
array1 = np.array(img1)
array2 = np.array(img2)
# They should have the same shape now, but double check
if array1.shape != array2.shape:
logging.debug(f"Images have different shapes after conversion: {array1.shape} vs {array2.shape}")
return False
# Determine the window size for SSIM
min_dim = min(array1.shape[0], array1.shape[1])
if min_dim < 7:
# If the smallest dimension is less than 7, set win_size to the next smaller odd number
win_size = min_dim if min_dim % 2 == 1 else min_dim - 1
if win_size < 1:
logging.debug("Image too small for SSIM computation (min dimension < 1)")
return False
else:
win_size = 7 # default
try:
# For newer versions of skimage, we use channel_axis, for older versions, multichannel
# We try to use the newer way first, then fall back to the old way
try:
# Newer versions (channel_axis is available)
similarity = ssim(array1, array2, win_size=win_size, channel_axis=2)
except TypeError:
# Older versions use multichannel
similarity = ssim(array1, array2, win_size=win_size, multichannel=True)
except Exception as e:
logging.error(f"SSIM computation failed: {e}")
return False
logging.debug("SSIM: %s", similarity)
return similarity >= threshold
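For instance (hypothetical files), the guarded version degrades to False instead of raising on tiny or mismatched inputs:
from PIL import Image
same = structure_check_by_ssim(Image.open("before.png"), Image.open("after.png"), threshold=0.9)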
@@ -345,13 +424,20 @@ def check_structure_sim(src_path, tgt_path):
if src_path is None or tgt_path is None:
return 0.
img_src = Image.open(src_path)
img_tgt = Image.open(tgt_path)
structure_same = structure_check_by_ssim(img_src, img_tgt)
if structure_same:
return 1.
else:
return 0.
try:
img_src = Image.open(src_path)
img_tgt = Image.open(tgt_path)
if img_src.size != img_tgt.size:
logging.debug(f"size different: src_path: {src_path}, tgt_path: {tgt_path}")
return 0.0
structure_same = structure_check_by_ssim(img_src, img_tgt)
return 1.0 if structure_same else 0.0
except Exception as e:
logging.error(f"check_structure_sim error: {str(e)}")
return 0.0
def check_structure_sim_resized(src_path, tgt_path):
@@ -396,7 +482,10 @@ def check_structure_sim_resized(src_path, tgt_path):
# Check if the structure is similar
structure_same = structure_check_by_ssim(img_src_resized, img_tgt)
return structure_same
if structure_same:
return 1.
else:
return 0.
def check_contrast_increase_and_structure_sim(src_path, tgt_path):
@@ -509,27 +598,119 @@ def check_image_size(src_path, rule):
return 0.
def safe_open_image_with_retry(file_path, max_retries=3, retry_delay=0.5):
"""
Safely open an image file with retry mechanism for handling truncated files
"""
import os
import time
import logging
logger = logging.getLogger(__name__)
if not file_path or not os.path.exists(file_path):
logger.error(f"File does not exist: {file_path}")
return None
for attempt in range(max_retries):
try:
# Check file size first
file_size = os.path.getsize(file_path)
if file_size == 0:
logger.warning(f"File is empty: {file_path}")
if attempt < max_retries - 1:
time.sleep(retry_delay)
continue
return None
logger.info(f"Opening image: {file_path} (size: {file_size} bytes, attempt: {attempt + 1})")
# Try to open with PIL
image = Image.open(file_path)
# Verify image can be loaded (trigger actual parsing)
image.load()
logger.info(f"Successfully opened image: {image.format} {image.mode} {image.size}")
return image
except (OSError, IOError) as e:
if "truncated" in str(e).lower() or "cannot identify" in str(e).lower():
logger.warning(f"Attempt {attempt + 1}: Image file appears truncated or corrupted: {e}")
if attempt < max_retries - 1:
logger.info(f"Retrying in {retry_delay} seconds...")
time.sleep(retry_delay)
continue
else:
logger.error(f"IO error opening image: {e}")
break
except Exception as e:
logger.error(f"Unexpected error opening image: {e}")
break
logger.error(f"Failed to open image after {max_retries} attempts: {file_path}")
return None
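A small usage sketch (the path and timings are illustrative); the retry loop targets files that an earlier step may still be flushing to disk:
img = safe_open_image_with_retry("/tmp/gimp_export.png", max_retries=5, retry_delay=1.0)
if img is None:
    score = 0.0  # treat an unreadable file as an evaluation failure rather than crashing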
def check_palette_and_structure_sim(src_path, tgt_path):
"""
Check if the src image is palette-based and the structure of the two images are similar
Enhanced with robust error handling for file format issues and truncated files
gimp:06ca5602-62ca-47f6-ad4f-da151cde54cc
"""
import logging
logger = logging.getLogger(__name__)
logger.info(f"Evaluating palette and structure similarity: src={src_path}, tgt={tgt_path}")
if src_path is None or tgt_path is None:
logger.warning("Source or target path is None")
return 0.
# Check if the source image is palette-based
source_image = Image.open(src_path)
palette_based = source_image.mode == 'P'
# Check structure
target_image = Image.open(tgt_path)
source_image = source_image.convert('RGB')
structure_same = structure_check_by_ssim(source_image, target_image)
if palette_based and structure_same:
return 1.
else:
# Safely open source image with retry mechanism
source_image = safe_open_image_with_retry(src_path)
if source_image is None:
logger.error("Failed to open source image")
return 0.
try:
# Check if the source image is palette-based
palette_based = source_image.mode == 'P'
logger.info(f"Source image mode: {source_image.mode}, palette-based: {palette_based}")
# Safely open target image
target_image = safe_open_image_with_retry(tgt_path)
if target_image is None:
logger.error("Failed to open target image")
source_image.close()
return 0.
try:
# Convert source image to RGB for comparison
source_rgb = source_image.convert('RGB')
logger.info(f"Source converted to RGB: {source_rgb.mode} {source_rgb.size}")
# Check structure
structure_same = structure_check_by_ssim(source_rgb, target_image)
logger.info(f"Structure similarity check: {structure_same}")
# Evaluation logic
if palette_based and structure_same:
result = 1.0
else:
result = 0.0
logger.info(f"Evaluation result: {result} (palette_based={palette_based}, structure_same={structure_same})")
return result
finally:
target_image.close()
except Exception as e:
logger.error(f"Error during evaluation: {e}")
return 0.
finally:
source_image.close()
def check_textbox_on_leftside(src_path):
"""


@@ -23,22 +23,34 @@ def process_epub(filename: str) -> List[str]:
try:
with zipfile.ZipFile(filename, "r") as z_f:
with z_f.open("toc.ncx") as in_f \
, open(os.path.join(base_dir, "toc.ncx"), "w") as out_f:
contents: str = in_f.read().decode()
contents = contents.splitlines()
for l in contents:
if "navPoint" not in l:
out_f.write(l + "\n")
file_list.append(os.path.join(base_dir, "toc.ncx"))
with z_f.open("content.opf") as in_f \
, open(os.path.join(base_dir, "content.opf"), "w") as out_f:
contents: str = in_f.read().decode()
contents = contents.splitlines()
for l in contents:
if "dc:identifier" not in l:
out_f.write(l + "\n")
file_list.append(os.path.join(base_dir, "content.opf"))
# Get list of all files in the zip archive
zip_file_list = z_f.namelist()
# Process toc.ncx if it exists
if "toc.ncx" in zip_file_list:
with z_f.open("toc.ncx") as in_f \
, open(os.path.join(base_dir, "toc.ncx"), "w") as out_f:
contents: str = in_f.read().decode()
contents = contents.splitlines()
for l in contents:
if "navPoint" not in l:
out_f.write(l + "\n")
file_list.append(os.path.join(base_dir, "toc.ncx"))
else:
logger.debug("toc.ncx not found in epub file: %s", filename)
# Process content.opf if it exists
if "content.opf" in zip_file_list:
with z_f.open("content.opf") as in_f \
, open(os.path.join(base_dir, "content.opf"), "w") as out_f:
contents: str = in_f.read().decode()
contents = contents.splitlines()
for l in contents:
if "dc:identifier" not in l:
out_f.write(l + "\n")
file_list.append(os.path.join(base_dir, "content.opf"))
else:
logger.debug("content.opf not found in epub file: %s", filename)
for f_n in z_f.namelist():
if f_n.endswith(".html"):
with z_f.open(f_n) as in_f \


@@ -73,6 +73,9 @@ def check_image_stretch_and_center(modified_ppt, original_ppt):
original_slide_images = [shape for shape in original_slide.shapes if shape.shape_type == 13]
modified_slide_images = [shape for shape in modified_slide.shapes if shape.shape_type == 13]
if not original_slide_images:
return 0.
the_image = original_slide_images[0]
the_modified_image = None
@@ -395,12 +398,38 @@ def compare_pptx_files(file1_path, file2_path, **options):
table2 = shape2.table
if enable_debug:
debug_logger.debug(f" Shape {shape_idx} - Comparing TABLE with {len(table1.rows)} rows and {len(table1.columns)} columns")
debug_logger.debug(f" Shape {shape_idx} - Table2 has {len(table2.rows)} rows and {len(table2.columns)} columns")
# Check if tables have the same dimensions
if len(table1.rows) != len(table2.rows) or len(table1.columns) != len(table2.columns):
if enable_debug:
debug_logger.debug(f" MISMATCH: Slide {slide_idx}, Shape {shape_idx} (TABLE) - Table dimensions differ:")
debug_logger.debug(f" Table1: {len(table1.rows)} rows x {len(table1.columns)} columns")
debug_logger.debug(f" Table2: {len(table2.rows)} rows x {len(table2.columns)} columns")
return 0
for row_idx in range(len(table1.rows)):
for col_idx in range(len(table1.columns)):
cell1 = table1.cell(row_idx, col_idx)
cell2 = table2.cell(row_idx, col_idx)
# Check if cells have the same number of paragraphs
if len(cell1.text_frame.paragraphs) != len(cell2.text_frame.paragraphs):
if enable_debug:
debug_logger.debug(f" MISMATCH: Slide {slide_idx}, Shape {shape_idx} (TABLE) - Cell [{row_idx},{col_idx}] - Different number of paragraphs:")
debug_logger.debug(f" Cell1 paragraphs: {len(cell1.text_frame.paragraphs)}")
debug_logger.debug(f" Cell2 paragraphs: {len(cell2.text_frame.paragraphs)}")
return 0
for para_idx, (para1, para2) in enumerate(zip(cell1.text_frame.paragraphs, cell2.text_frame.paragraphs)):
# Check if paragraphs have the same number of runs
if len(para1.runs) != len(para2.runs):
if enable_debug:
debug_logger.debug(f" MISMATCH: Slide {slide_idx}, Shape {shape_idx} (TABLE) - Cell [{row_idx},{col_idx}], Para {para_idx} - Different number of runs:")
debug_logger.debug(f" Para1 runs: {len(para1.runs)}")
debug_logger.debug(f" Para2 runs: {len(para2.runs)}")
return 0
for run_idx, (run1, run2) in enumerate(zip(para1.runs, para2.runs)):
# Check font color
if hasattr(run1.font.color, "rgb") and hasattr(run2.font.color, "rgb"):
@ -451,6 +480,14 @@ def compare_pptx_files(file1_path, file2_path, **options):
if shape1.text.strip() != shape2.text.strip() and examine_text:
return 0
# check if the number of paragraphs are the same
if len(shape1.text_frame.paragraphs) != len(shape2.text_frame.paragraphs):
if enable_debug:
debug_logger.debug(f" MISMATCH: Slide {slide_idx}, Shape {shape_idx} - Different number of paragraphs:")
debug_logger.debug(f" Shape1 paragraphs: {len(shape1.text_frame.paragraphs)}")
debug_logger.debug(f" Shape2 paragraphs: {len(shape2.text_frame.paragraphs)}")
return 0
# check if the paragraphs are the same
para_idx = 0
for para1, para2 in zip(shape1.text_frame.paragraphs, shape2.text_frame.paragraphs):
@ -487,6 +524,14 @@ def compare_pptx_files(file1_path, file2_path, **options):
if para1.level != para2.level and examine_indent:
return 0
# check if the number of runs are the same
if len(para1.runs) != len(para2.runs):
if enable_debug:
debug_logger.debug(f" MISMATCH: Slide {slide_idx}, Shape {shape_idx}, Para {para_idx} - Different number of runs:")
debug_logger.debug(f" Para1 runs: {len(para1.runs)}")
debug_logger.debug(f" Para2 runs: {len(para2.runs)}")
return 0
for run1, run2 in zip(para1.runs, para2.runs):
# check if the font properties are the same
@ -634,6 +679,12 @@ def compare_pptx_files(file1_path, file2_path, **options):
debug_logger.debug(f" MISMATCH: Text differs - '{tshape1.text.strip()}' vs '{tshape2.text.strip()}'")
return 0
# Check if text shapes have the same number of paragraphs
if len(tshape1.text_frame.paragraphs) != len(tshape2.text_frame.paragraphs):
if enable_debug:
debug_logger.debug(f" MISMATCH: Different number of paragraphs - {len(tshape1.text_frame.paragraphs)} vs {len(tshape2.text_frame.paragraphs)}")
return 0
# Compare alignment of each paragraph
for para_idx, (para1, para2) in enumerate(zip(tshape1.text_frame.paragraphs, tshape2.text_frame.paragraphs)):
from pptx.enum.text import PP_ALIGN

View File

@ -2,6 +2,9 @@ import functools
import itertools
import logging
import os.path
import re
import unicodedata
# import operator
from numbers import Number
from typing import Any, Union, cast, Callable, Iterable
@ -17,9 +20,19 @@ from openpyxl.worksheet.datavalidation import DataValidation
from openpyxl.worksheet.worksheet import Worksheet
from rapidfuzz import fuzz
from desktop_env.evaluators.metrics.utils import _match_value_to_rule, _read_cell_style, read_cell_value
from desktop_env.evaluators.metrics.utils import load_charts, load_sparklines, load_rows_or_cols, load_xlsx_styles \
, load_filters, load_pivot_tables
from desktop_env.evaluators.metrics.utils import (
_match_value_to_rule,
_read_cell_style,
read_cell_value,
)
from desktop_env.evaluators.metrics.utils import (
load_charts,
load_sparklines,
load_rows_or_cols,
load_xlsx_styles,
load_filters,
load_pivot_tables,
)
# from openpyxl.utils import coordinate_to_tuple
@ -28,16 +41,26 @@ logger = logging.getLogger("desktopenv.metric.table")
BOOK = Union[pd.ExcelFile, Workbook, str]
def _parse_sheet_idx(sheet_idx: Union[int, str]
, result: BOOK, expected: BOOK
, result_sheet_names: List[str]
, expected_sheet_names: List[str]
) -> Tuple[BOOK, str]:
# function _parse_sheet_idx {{{ #
def _parse_sheet_idx(
sheet_idx: Union[int, str],
result: BOOK,
expected: BOOK,
result_sheet_names: List[str],
expected_sheet_names: List[str],
) -> Tuple[BOOK, str]:
# function _parse_sheet_idx {{{ #
if isinstance(sheet_idx, int):
try:
index: str = result_sheet_names[sheet_idx]
except:
if not result_sheet_names or sheet_idx >= len(result_sheet_names):
logger.error(
f"Sheet index {sheet_idx} out of range. Available sheets: {result_sheet_names}"
)
index = ""
else:
index: str = result_sheet_names[sheet_idx]
logger.debug(f"Sheet index {sheet_idx} resolved to sheet: {index}")
except Exception as e:
logger.error(f"Error resolving sheet index {sheet_idx}: {e}")
index = ""
book: BOOK = result
elif sheet_idx.startswith("RI"):
@ -62,27 +85,31 @@ def _parse_sheet_idx(sheet_idx: Union[int, str]
logger.error("Unrecognized sheet index")
raise ValueError("Unrecognized sheet index")
return book, index
# }}} function _parse_sheet_idx #
# }}} function _parse_sheet_idx #
SHEET = Union[pd.DataFrame, Worksheet, List[str]]
def _load_sheet(book: BOOK, index: str) -> SHEET:
# function _load_sheet {{{ #
# function _load_sheet {{{ #
try:
if isinstance(book, str):
book: str = cast(str, book)
csv_name: str = "{:}-{:}.csv".format(os.path.splitext(book)[0], index)
with open(csv_name) as f:
csv_lines: List[str] = list(itertools.dropwhile(lambda l: len(l) == 0
, map(lambda l: l.strip()
, reversed(f.read().splitlines())
)
)
)
return csv_lines
try:
all_lines: List[str] = _safe_read_file(csv_name)
csv_lines: List[str] = list(
itertools.dropwhile(
lambda l: len(l) == 0,
map(lambda l: l.strip(), reversed(all_lines)),
)
)
return csv_lines
except (FileNotFoundError, IOError) as e:
logger.error(f"Failed to read CSV file {csv_name}: {e}")
return None
if isinstance(book, pd.ExcelFile):
return pd.read_excel(book, index)
if isinstance(book, Workbook):
@ -93,11 +120,124 @@ def _load_sheet(book: BOOK, index: str) -> SHEET:
raise e
except:
return None
# }}} function _load_sheet #
# }}} function _load_sheet #
def _safe_read_file(file_path: str) -> List[str]:
"""
Safely read a file with multiple encoding attempts.
Args:
file_path: Path to the file to read
Returns:
List of lines from the file
Raises:
FileNotFoundError: If file doesn't exist
IOError: If file cannot be read with any encoding
"""
# Common encodings to try in order of preference
encodings = [
"utf-8", # Most common modern encoding
"utf-8-sig", # UTF-8 with BOM
"latin-1", # ISO-8859-1, works with any byte sequence
"windows-1252", # Common Windows encoding
"gbk", # Chinese encoding
"cp1251", # Cyrillic encoding
"iso-8859-1", # Alternative latin-1
]
last_error = None
for encoding in encodings:
try:
with open(file_path, "r", encoding=encoding) as f:
lines = f.read().splitlines()
logger.debug(
f"Successfully read file {file_path} with encoding {encoding}"
)
return lines
except UnicodeDecodeError as e:
last_error = e
logger.debug(f"Failed to read {file_path} with encoding {encoding}: {e}")
continue
except (FileNotFoundError, IOError) as e:
# These are non-encoding related errors, re-raise immediately
raise e
# If all encodings fail, try with error handling as last resort
try:
with open(file_path, "r", encoding="utf-8", errors="replace") as f:
lines = f.read().splitlines()
logger.warning(f"Read file {file_path} with UTF-8 and error replacement")
return lines
except Exception as e:
logger.error(
f"Failed to read file {file_path} with any encoding. Last error: {last_error}"
)
raise IOError(
f"Cannot read file {file_path} with any supported encoding"
) from last_error
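As a quick sketch of the fallback chain above (the file name and contents here are hypothetical):

```python
# Write a file in a non-UTF-8 encoding, then read it back through _safe_read_file.
# utf-8 and utf-8-sig fail on the 0xE9 byte; latin-1 then decodes it successfully.
with open("legacy.csv", "w", encoding="windows-1252") as f:
    f.write("Montréal;100\n")

lines = _safe_read_file("legacy.csv")
print(lines)  # ['Montréal;100']
```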
def compare_csv(result: str, expected: Union[str, List[str]], **options) -> float:
"""
Compare CSV files. If expected is a list, returns 1.0 if result matches any of the expected files.
Args:
result: Path to result CSV file
expected: Path to expected CSV file or list of paths to expected CSV files
options: Additional options (strict, ignore_case)
Returns:
1.0 if result matches expected (or any file in expected list), 0.0 otherwise
"""
if result is None:
return 0.0
try:
result_lines: List[str] = _safe_read_file(result)
except (FileNotFoundError, IOError) as e:
logger.error(f"Failed to read result file {result}: {e}")
return 0.0
# Convert expected to list if it's a single string (for backward compatibility)
if isinstance(expected, str):
expected_files = [expected]
else:
expected_files = expected
# Try to match against each expected file
for expected_file in expected_files:
try:
expected_lines: List[str] = _safe_read_file(expected_file)
# Process lines based on options
current_result_lines = result_lines
current_expected_lines = expected_lines
if not options.get("strict", True):
current_result_lines = map(str.strip, current_result_lines)
current_expected_lines = map(str.strip, current_expected_lines)
if options.get("ignore_case", False):
current_result_lines = map(str.lower, current_result_lines)
current_expected_lines = map(str.lower, current_expected_lines)
# Check if this expected file matches
if list(current_result_lines) == list(current_expected_lines):
return 1.0
except (FileNotFoundError, IOError):
# If this expected file doesn't exist, continue to next one
continue
# No match found
return 0.0
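A minimal usage sketch of the list-valued `expected` behavior described above (the file names are hypothetical):

```python
# Returns 1.0 if "out.csv" matches either gold variant after stripping
# surrounding whitespace (strict=False) and lowercasing (ignore_case=True).
score = compare_csv(
    "out.csv",
    ["gold_variant_a.csv", "gold_variant_b.csv"],
    strict=False,
    ignore_case=True,
)
```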
def compare_table(result: str, expected: str = None, **options) -> float:
# function compare_table {{{ #
# function compare_table {{{ #
"""
Args:
result (str): path to result xlsx
@ -114,13 +254,24 @@ def compare_table(result: str, expected: str = None, **options) -> float:
"""
if result is None:
return 0.
logger.error("Result file path is None")
return 0.0
# Check if result file exists
if not os.path.exists(result):
logger.error(f"Result file not found: {result}")
return 0.0
try:
logger.info(f"Loading result file: {result}")
xlworkbookr: Workbook = openpyxl.load_workbook(filename=result)
pdworkbookr = pd.ExcelFile(result)
except:
return 0.
logger.info(
f"Successfully loaded result file with sheets: {pdworkbookr.sheet_names}"
)
except Exception as e:
logger.error(f"Failed to load result file {result}: {e}")
return 0.0
worksheetr_names: List[str] = pdworkbookr.sheet_names
if expected is not None:
@ -132,32 +283,42 @@ def compare_table(result: str, expected: str = None, **options) -> float:
pdworkbooke = None
worksheete_names: List[str] = None
parse_idx: Callable[[Union[str, int], BOOK, BOOK], Tuple[BOOK, str]] = \
parse_idx: Callable[[Union[str, int], BOOK, BOOK], Tuple[BOOK, str]] = (
functools.partial(
_parse_sheet_idx,
result_sheet_names=worksheetr_names,
expected_sheet_names=worksheete_names
expected_sheet_names=worksheete_names,
)
)
passes = True
for r in options["rules"]:
if r["type"] == "sheet_name":
# Compare Sheet Names {{{ #
# Compare Sheet Names {{{ #
metric: bool = worksheetr_names == worksheete_names
logger.debug("Assertion: %s.sheet_names == %s.sheet_names - %s", result, expected, metric)
# }}} Compare Sheet Names #
logger.debug(
"Assertion: %s.sheet_names == %s.sheet_names - %s",
result,
expected,
metric,
)
# }}} Compare Sheet Names #
elif r["type"] == "sheet_data":
# Compare Sheet Data by Internal Value {{{ #
# Compare Sheet Data by Internal Value {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
# precision: int as number of decimal digits, default to 4
error_limit: int = r.get("precision", 4)
sheet1: pd.DataFrame = _load_sheet(*parse_idx(r["sheet_idx0"], pdworkbookr, pdworkbooke))
sheet1: pd.DataFrame = _load_sheet(
*parse_idx(r["sheet_idx0"], pdworkbookr, pdworkbooke)
)
if sheet1 is None:
return 0.
sheet2: pd.DataFrame = _load_sheet(*parse_idx(r["sheet_idx1"], pdworkbookr, pdworkbooke))
return 0.0
sheet2: pd.DataFrame = _load_sheet(
*parse_idx(r["sheet_idx1"], pdworkbookr, pdworkbooke)
)
if sheet2 is None:
return 0.0
sheet1 = sheet1.round(error_limit)
sheet2 = sheet2.round(error_limit)
@ -168,28 +329,36 @@ def compare_table(result: str, expected: str = None, **options) -> float:
logger.debug("Sheet1 =v= Sheet2: \n%s", str(sheet1 == sheet2))
except:
logger.debug("Sheet1 =/v= Sheet2")
logger.debug("Assertion: %s =v= %s - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Compare Sheet Data by Internal Value #
logger.debug(
"Assertion: %s =v= %s - %s", r["sheet_idx0"], r["sheet_idx1"], metric
)
# }}} Compare Sheet Data by Internal Value #
elif r["type"] == "sheet_print":
# Compare Sheet Data by Printed Value {{{ #
# Compare Sheet Data by Printed Value {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
# ignore_case: optional, defaults to False
sheet1: List[str] = _load_sheet(*parse_idx(r["sheet_idx0"], result, expected))
sheet1: List[str] = _load_sheet(
*parse_idx(r["sheet_idx0"], result, expected)
)
if sheet1 is None:
return 0.
sheet2: List[str] = _load_sheet(*parse_idx(r["sheet_idx1"], result, expected))
return 0.0
sheet2: List[str] = _load_sheet(
*parse_idx(r["sheet_idx1"], result, expected)
)
if sheet2 is None:
return 0.0
if r.get("ignore_case", False):
sheet1 = [l.lower() for l in sheet1]
sheet2 = [l.lower() for l in sheet2]
metric: bool = sheet1 == sheet2
logger.debug("Assertion: %s =p= %s - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Compare Sheet Data by Printed Value #
logger.debug(
"Assertion: %s =p= %s - %s", r["sheet_idx0"], r["sheet_idx1"], metric
)
# }}} Compare Sheet Data by Printed Value #
elif r["type"] == "sheet_fuzzy":
# Fuzzy Match for Ranges {{{ #
# Fuzzy Match for Ranges {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
# rules: list of dict, each dict is like
@ -209,7 +378,9 @@ def compare_table(result: str, expected: str = None, **options) -> float:
for rl in r["rules"]:
for rng in MultiCellRange(rl["range"]):
for cdn in rng.cells:
coordinate: str = "{:}{:d}".format(get_column_letter(cdn[1]), cdn[0])
coordinate: str = "{:}{:d}".format(
get_column_letter(cdn[1]), cdn[0]
)
value1: str = str(read_cell_value(*sheet1, coordinate))
value2: str = str(read_cell_value(*sheet2, coordinate))
logger.debug("%s: %s vs %s", cdn, value1, value2)
@ -225,8 +396,12 @@ def compare_table(result: str, expected: str = None, **options) -> float:
value2 = value2.rstrip(rl["trim_trailings"])
if "ignore_chars" in rl:
ignore_chars: Set[str] = set(rl["ignore_chars"])
value1 = "".join(filter(lambda ch: ch not in ignore_chars, value1))
value2 = "".join(filter(lambda ch: ch not in ignore_chars, value2))
value1 = "".join(
filter(lambda ch: ch not in ignore_chars, value1)
)
value2 = "".join(
filter(lambda ch: ch not in ignore_chars, value2)
)
if rl.get("ignore_case", False):
value1 = value1.lower()
value2 = value2.lower()
@ -236,91 +411,141 @@ def compare_table(result: str, expected: str = None, **options) -> float:
elif rl["type"] == "included_by":
metric: bool = value1 in value2
elif rl["type"] == "fuzzy_match":
metric: bool = fuzz.ratio(value1, value2) >= rl.get("threshold", 85.)
metric: bool = fuzz.ratio(value1, value2) >= rl.get(
"threshold", 85.0
)
elif rl["type"] == "exact_match":
metric: bool = value1 == value2
total_metric = total_metric and metric
metric: bool = total_metric
logger.debug("Assertion: %s =~= %s - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Fuzzy Match for Ranges #
logger.debug(
"Assertion: %s =~= %s - %s", r["sheet_idx0"], r["sheet_idx1"], metric
)
# }}} Fuzzy Match for Ranges #
elif r["type"] == "sparkline":
# Compare Sparklines {{{ #
# Compare Sparklines {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
sparkline1: Dict[str, str] = load_sparklines(*parse_idx(r["sheet_idx0"], result, expected))
sparkline2: Dict[str, str] = load_sparklines(*parse_idx(r["sheet_idx1"], result, expected))
sparkline1: Dict[str, str] = load_sparklines(
*parse_idx(r["sheet_idx0"], result, expected)
)
sparkline2: Dict[str, str] = load_sparklines(
*parse_idx(r["sheet_idx1"], result, expected)
)
metric: bool = sparkline1 == sparkline2
logger.debug("Assertion: %s.sp == %.sp - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Compare Sparklines #
logger.debug(
"Assertion: %s.sp == %.sp - %s",
r["sheet_idx0"],
r["sheet_idx1"],
metric,
)
# }}} Compare Sparklines #
elif r["type"] == "chart":
# Compare Charts {{{ #
# Compare Charts {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
# chart_props: list of str, see utils.load_charts
charts1: Dict[str, Any] = load_charts(*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke), **r)
charts2: Dict[str, Any] = load_charts(*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke), **r)
charts1: Dict[str, Any] = load_charts(
*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke), **r
)
charts2: Dict[str, Any] = load_charts(
*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke), **r
)
metric: bool = charts1 == charts2
logger.debug("Assertion: %s[chart] == %s[chart] - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Compare Charts #
logger.debug(
"Assertion: %s[chart] == %s[chart] - %s",
r["sheet_idx0"],
r["sheet_idx1"],
metric,
)
# }}} Compare Charts #
elif r["type"] == "style":
# Compare Style (Also Conditional Formatting) {{{ #
# Compare Style (Also Conditional Formatting) {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
# props: list of str indicating concerned styles, see utils._read_cell_style
sheet_idx1: Tuple[BOOK, str] = parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke)
sheet_idx1: Tuple[BOOK, str] = parse_idx(
r["sheet_idx0"], xlworkbookr, xlworkbooke
)
book_name1: str = parse_idx(r["sheet_idx0"], result, expected)[0]
styles1: Dict[str, List[Any]] = load_xlsx_styles(*sheet_idx1, book_name1, **r)
styles1: Dict[str, List[Any]] = load_xlsx_styles(
*sheet_idx1, book_name1, **r
)
sheet_idx2: Tuple[BOOK, str] = parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke)
sheet_idx2: Tuple[BOOK, str] = parse_idx(
r["sheet_idx1"], xlworkbookr, xlworkbooke
)
book_name2: str = parse_idx(r["sheet_idx1"], result, expected)[0]
styles2: Dict[str, List[Any]] = load_xlsx_styles(*sheet_idx2, book_name2, **r)
styles2: Dict[str, List[Any]] = load_xlsx_styles(
*sheet_idx2, book_name2, **r
)
# number_formats1: List[str] = [c.number_format.lower() for col in sheet1.iter_cols() for c in col if c.value is not None and c.data_type=="n"]
# number_formats2: List[str] = [c.number_format.lower() for col in sheet2.iter_cols() for c in col if c.value is not None and c.data_type=="n"]
metric: bool = styles1 == styles2
logger.debug("Assertion: %s.style == %s.style - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Compare Style (Also Conditional Formatting) #
logger.debug(
"Assertion: %s.style == %s.style - %s",
r["sheet_idx0"],
r["sheet_idx1"],
metric,
)
# }}} Compare Style (Also Conditional Formatting) #
elif r["type"] == "freeze":
# Compare Freezing {{{ #
# Compare Freezing {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
sheet1: Worksheet = _load_sheet(*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke))
sheet1: Worksheet = _load_sheet(
*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke)
)
if sheet1 is None:
return 0.
sheet2: Worksheet = _load_sheet(*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke))
return 0.0
sheet2: Worksheet = _load_sheet(
*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke)
)
if sheet2 is None:
return 0.0
metric: bool = sheet1.freeze_panes == sheet2.freeze_panes
logger.debug("Assertion: %s.freeze(%s) == %s.freeze(%s) - %s"
, r["sheet_idx0"], sheet1.freeze_panes
, r["sheet_idx1"], sheet2.freeze_panes
, metric
)
# }}} Compare Freezing #
logger.debug(
"Assertion: %s.freeze(%s) == %s.freeze(%s) - %s",
r["sheet_idx0"],
sheet1.freeze_panes,
r["sheet_idx1"],
sheet2.freeze_panes,
metric,
)
# }}} Compare Freezing #
elif r["type"] == "zoom":
# Check Zooming {{{ #
# Check Zooming {{{ #
# sheet_idx: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# method: str
# ref: value
sheet: Worksheet = _load_sheet(*parse_idx(r["sheet_idx"], xlworkbookr, xlworkbooke))
sheet: Worksheet = _load_sheet(
*parse_idx(r["sheet_idx"], xlworkbookr, xlworkbooke)
)
if sheet is None:
return 0.
zoom_scale: Number = sheet.sheet_view.zoomScale or 100.
return 0.0
zoom_scale: Number = sheet.sheet_view.zoomScale or 100.0
metric: bool = _match_value_to_rule(zoom_scale, r)
logger.debug("Assertion: %s.zoom(%.1f) %s %.1f - %s", r["sheet_idx"], zoom_scale, r["method"], r["ref"],
metric)
# }}} Check Zooming #
logger.debug(
"Assertion: %s.zoom(%.1f) %s %.1f - %s",
r["sheet_idx"],
zoom_scale,
r["method"],
r["ref"],
metric,
)
# }}} Check Zooming #
elif r["type"] == "data_validation":
# Check Data Validation {{{ #
# Check Data Validation {{{ #
# sheet_idx: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# dv_props: list of dict like {attribute: {"method": str, "ref": anything}}
# available attributes:
@ -340,146 +565,197 @@ def compare_table(result: str, expected: str = None, **options) -> float:
# * promptTitle
# * imeMode
sheet: Worksheet = _load_sheet(*parse_idx(r["sheet_idx"], xlworkbookr, xlworkbooke))
sheet: Worksheet = _load_sheet(
*parse_idx(r["sheet_idx"], xlworkbookr, xlworkbooke)
)
if sheet is None:
return 0.
data_validators: List[DataValidation] = sheet.data_validations.dataValidation
return 0.0
data_validators: List[DataValidation] = (
sheet.data_validations.dataValidation
)
total_metric = len(data_validators) >= len(r["dv_props"])
for dat_vldt in data_validators:
metric = False
for prpt in r["dv_props"]:
metric = metric or all(_match_value_to_rule(getattr(dat_vldt, attrbt)
, mr
) \
for attrbt, mr in prpt.items()
)
metric = metric or all(
_match_value_to_rule(getattr(dat_vldt, attrbt), mr)
for attrbt, mr in prpt.items()
)
if metric:
break
total_metric = total_metric and metric
if not total_metric:
break
logger.debug("Assertion: %s.data_validation - %s", r["sheet_idx"], total_metric)
logger.debug(
"Assertion: %s.data_validation - %s", r["sheet_idx"], total_metric
)
metric: bool = total_metric
# }}} Check Data Validation #
# }}} Check Data Validation #
elif r["type"] == "row_props":
# Check Row Properties {{{ #
# Check Row Properties {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
# props: list of str, see utils.load_rows_or_cols
rows1: Dict[str, Any] = load_rows_or_cols(*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke)
, obj="row"
, **r
)
rows2: Dict[str, Any] = load_rows_or_cols(*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke)
, obj="row"
, **r
)
rows1: Dict[str, Any] = load_rows_or_cols(
*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke), obj="row", **r
)
rows2: Dict[str, Any] = load_rows_or_cols(
*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke), obj="row", **r
)
logger.debug("Rows1: %s", repr(rows1))
logger.debug("Rows2: %s", repr(rows2))
metric: bool = rows1 == rows2
logger.debug("Assertion: %s[rows] == %s[rows] - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Check Row Properties #
logger.debug(
"Assertion: %s[rows] == %s[rows] - %s",
r["sheet_idx0"],
r["sheet_idx1"],
metric,
)
# }}} Check Row Properties #
elif r["type"] == "col_props":
# Check Column Properties {{{ #
# Check Column Properties {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
# props: list of str, see utils.load_rows_or_cols
cols1: Dict[str, Any] = load_rows_or_cols(*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke)
, obj="column"
, **r
)
cols2: Dict[str, Any] = load_rows_or_cols(*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke)
, obj="column"
, **r
)
cols1: Dict[str, Any] = load_rows_or_cols(
*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke), obj="column", **r
)
cols2: Dict[str, Any] = load_rows_or_cols(
*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke), obj="column", **r
)
metric: bool = cols1 == cols2
logger.debug("Assertion: %s[cols] == %s[cols] - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Check Column Properties #
logger.debug(
"Assertion: %s[cols] == %s[cols] - %s",
r["sheet_idx0"],
r["sheet_idx1"],
metric,
)
# }}} Check Column Properties #
elif r["type"] == "filter":
# Compare Filters {{{ #
# Compare Filters {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
filters1: Dict[str, Any] = load_filters(*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke), **r)
filters2: Dict[str, Any] = load_filters(*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke), **r)
filters1: Dict[str, Any] = load_filters(
*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke), **r
)
filters2: Dict[str, Any] = load_filters(
*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke), **r
)
metric: bool = filters1 == filters2
logger.debug("Assertion: %s[filter] == %s[filter] - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Compare Filters #
logger.debug(
"Assertion: %s[filter] == %s[filter] - %s",
r["sheet_idx0"],
r["sheet_idx1"],
metric,
)
# }}} Compare Filters #
elif r["type"] == "pivot_table":
# Compare Pivot Tables {{{ #
# Compare Pivot Tables {{{ #
# sheet_idx0: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# sheet_idx1: as sheet_idx0
# pivot_props: list of str, see utils.load_pivot_tables
pivots1: Dict[str, Any] = load_pivot_tables(*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke), **r)
pivots2: Dict[str, Any] = load_pivot_tables(*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke), **r)
pivots1: Dict[str, Any] = load_pivot_tables(
*parse_idx(r["sheet_idx0"], xlworkbookr, xlworkbooke), **r
)
pivots2: Dict[str, Any] = load_pivot_tables(
*parse_idx(r["sheet_idx1"], xlworkbookr, xlworkbooke), **r
)
metric: bool = pivots1 == pivots2
logger.debug("Assertion: %s[pivot]==%s[pivot] - %s", r["sheet_idx0"], r["sheet_idx1"], metric)
# }}} Compare Pivot Tables #
logger.debug(
"Assertion: %s[pivot]==%s[pivot] - %s",
r["sheet_idx0"],
r["sheet_idx1"],
metric,
)
# }}} Compare Pivot Tables #
elif r["type"] == "check_cell":
# Check Cell Properties {{{ #
# Check Cell Properties {{{ #
# sheet_idx: 0 == "RI0" == "RNSheet1" | "EI0" == "ENSheet1"
# coordinate: str, "E3"
# props: dict like {attribute: {"method": str, "ref": anything}}
# supported attributes: value & those supported by utils._read_cell_style
sheet: Worksheet = _load_sheet(*parse_idx(r["sheet_idx"], xlworkbookr, xlworkbooke))
if sheet is None:
return 0.
# data_frame: pd.DataFrame = _load_sheet(*parse_idx(r["sheet_idx"], pdworkbookr, pdworkbooke))
cell: Cell = sheet[r["coordinate"]]
metric: bool = True
for prpt, rule in r["props"].items():
if prpt == "value":
val = read_cell_value(*parse_idx(r["sheet_idx"], result, expected), r["coordinate"])
else:
val = _read_cell_style(prpt, cell)
try:
sheet: Worksheet = _load_sheet(
*parse_idx(r["sheet_idx"], xlworkbookr, xlworkbooke)
)
if sheet is None:
logger.error(
f"Failed to load sheet for sheet_idx: {r['sheet_idx']}"
)
return 0.0
# data_frame: pd.DataFrame = _load_sheet(*parse_idx(r["sheet_idx"], pdworkbookr, pdworkbooke))
cell: Cell = sheet[r["coordinate"]]
metric: bool = True
for prpt, rule in r["props"].items():
if prpt == "value":
try:
parsed_result = parse_idx(r["sheet_idx"], result, expected)
logger.debug(f"parse_idx result: {parsed_result}")
val = read_cell_value(*parsed_result, r["coordinate"])
logger.debug(f"Cell {r['coordinate']} value: {val}")
except Exception as e:
logger.error(
f"Failed to read cell value at {r['coordinate']}: {e}"
)
val = None
else:
try:
val = _read_cell_style(prpt, cell)
except Exception as e:
logger.error(
f"Failed to read cell style {prpt} at {r['coordinate']}: {e}"
)
val = None
metric = metric and _match_value_to_rule(val, rule)
metric = metric and _match_value_to_rule(val, rule)
except Exception as e:
logger.error(f"Error in check_cell processing: {e}")
return 0.0
logger.debug("Assertion: %s[%s] :%s - %s"
, r["sheet_idx"], r["coordinate"]
, repr(r["props"]), metric
)
# }}} Check Cell Properties #
logger.debug(
"Assertion: %s[%s] :%s - %s",
r["sheet_idx"],
r["coordinate"],
repr(r["props"]),
metric,
)
# }}} Check Cell Properties #
else:
raise NotImplementedError("Unimplemented sheet check: {:}".format(r["type"]))
raise NotImplementedError(
"Unimplemented sheet check: {:}".format(r["type"])
)
passes = passes and metric
if not passes:
break
return float(passes)
# }}} function compare_table #
# }}} function compare_table #
def compare_csv(result: str, expected: str, **options) -> float:
if result is None:
return 0.
with open(result) as f:
result_lines: Iterable[str] = f.read().splitlines()
with open(expected) as f:
expected_lines: Iterable[str] = f.read().splitlines()
if not options.get("strict", True):
result_lines = map(str.strip, result_lines)
expected_lines = map(str.strip, expected_lines)
if options.get("ignore_case", False):
result_lines = map(str.lower, result_lines)
expected_lines = map(str.lower, expected_lines)
metric: bool = list(result_lines) == list(expected_lines)
return float(metric)
def _normalize_city_string(value: Any) -> str:
"""Lowercase, strip punctuation, and remove accents for tolerant matching."""
if value is None:
return ""
if not isinstance(value, str):
value = str(value)
normalized = unicodedata.normalize("NFKD", value)
normalized = "".join(ch for ch in normalized if not unicodedata.combining(ch))
normalized = re.sub(r"[^a-z0-9]+", " ", normalized.lower())
return normalized.strip()
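For illustration, the normalization above collapses accents, case, and punctuation (these example values are assumptions, not taken from any task):

```python
assert _normalize_city_string("São Paulo") == "sao paulo"
assert _normalize_city_string(" Montréal,  QC ") == "montreal qc"
assert _normalize_city_string(None) == ""
```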
def compare_conference_city_in_order(actual_city_list_path, expected_city):
@ -490,28 +766,35 @@ def compare_conference_city_in_order(actual_city_list_path, expected_city):
for row in sheet["C2:C22"]:
for cell in row:
actual_city_list.append(cell.value)
# expected_city is the city that we want to compare with the actual city list
# the comparison must match in order, index by index
try:
for i in range(len(actual_city_list)):
if isinstance(expected_city_list[i], str):
if expected_city_list[i] not in actual_city_list[i]:
logger.debug(f"Expected city {expected_city_list[i]}; Actual city {actual_city_list[i]}")
print(f"Expected city {expected_city_list[i]}; Actual city {actual_city_list[i]}")
return 0.
elif isinstance(expected_city_list[i], List):
if not any(possible_str in actual_city_list[i] for possible_str in expected_city_list[i]):
logger.debug(f"Expected city {expected_city_list[i]}; Actual city {actual_city_list[i]}")
print(f"Expected city {expected_city_list[i]}; Actual city {actual_city_list[i]}")
return 0.
for i, actual_city in enumerate(actual_city_list):
actual_normalized = _normalize_city_string(actual_city)
expected_entry = expected_city_list[i]
if isinstance(expected_entry, str):
expected_candidates = [expected_entry]
elif isinstance(expected_entry, List):
expected_candidates = expected_entry
else:
raise TypeError("Expected city should be a string or a list of strings")
except:
return 0.
matched = False
for candidate in expected_candidates:
normalized_candidate = _normalize_city_string(candidate)
if normalized_candidate and normalized_candidate in actual_normalized:
matched = True
break
return 1.
if not matched:
logger.debug(
f"Expected city {expected_entry}; Actual city {actual_city}"
)
print(f"Expected city {expected_entry}; Actual city {actual_city}")
return 0.0
except Exception as exc:
logger.error(f"Error comparing conference cities: {exc}")
return 0.0
return 1.0

View File

@ -4,12 +4,13 @@ import functools
import itertools
import logging
import operator
import os
import re
import zipfile
#import pandas as pd
from typing import Any, TypeVar, Union, Iterable, Optional, Callable
from typing import Dict, List, Set, Match, Tuple, Pattern
from urllib.parse import urlparse, urlunparse
from urllib.parse import urlparse, urlunparse, ParseResult
import formulas
import lxml.cssselect
@ -28,15 +29,17 @@ from openpyxl.worksheet.cell_range import MultiCellRange, CellRange
from openpyxl.worksheet.dimensions import DimensionHolder
from openpyxl.worksheet.filters import AutoFilter, SortState
from openpyxl.worksheet.worksheet import Worksheet
import tldextract
V = TypeVar("Value")
logger = logging.getLogger("desktopenv.metrics.utils")
_xlsx_namespaces = [("oo", "http://schemas.openxmlformats.org/spreadsheetml/2006/main")
, ("x14", "http://schemas.microsoft.com/office/spreadsheetml/2009/9/main")
, ("xm", "http://schemas.microsoft.com/office/excel/2006/main")
]
_xlsx_namespaces = [
("oo", "http://schemas.openxmlformats.org/spreadsheetml/2006/main"),
("x14", "http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"),
("xm", "http://schemas.microsoft.com/office/excel/2006/main")
]
_xlsx_ns_mapping = dict(_xlsx_namespaces)
_xlsx_ns_imapping = dict(map(lambda itm: (itm[1], itm[0]), _xlsx_namespaces))
_xlsx_ns_imapping["http://schemas.openxmlformats.org/spreadsheetml/2006/main"] = None
@ -282,6 +285,13 @@ _shared_str_value_selector = lxml.cssselect.CSSSelector("oo|t", namespaces=_xlsx
def read_cell_value(xlsx_file: str, sheet_name: str, coordinate: str) -> Any:
# read_cell_value {{{ #
logger.debug(f"Reading cell value from {xlsx_file}, sheet: {sheet_name}, coordinate: {coordinate}")
# Check if file exists
if not os.path.exists(xlsx_file):
logger.error(f"Excel file not found: {xlsx_file}")
return None
try:
with zipfile.ZipFile(xlsx_file, "r") as z_f:
try:
@ -308,9 +318,17 @@ def read_cell_value(xlsx_file: str, sheet_name: str, coordinate: str) -> Any:
, namespaces=_xlsx_ns_mapping
)(sheet)
if len(cells) == 0:
logger.debug(f"Cell {coordinate} not found in sheet {sheet_name}")
return None
cell: _Element = cells[0]
except zipfile.BadZipFile:
except zipfile.BadZipFile as e:
logger.error(f"Bad zip file {xlsx_file}: {e}")
return None
except KeyError as e:
logger.error(f"Sheet {sheet_name} not found in {xlsx_file}: {e}")
return None
except Exception as e:
logger.error(f"Error reading {xlsx_file}: {e}")
return None
cell: Dict[str, str] = xmltodict.parse(lxml.etree.tostring(cell, encoding="unicode")
@ -328,6 +346,8 @@ def read_cell_value(xlsx_file: str, sheet_name: str, coordinate: str) -> Any:
return cell["c"]["v"]
if cell["c"]["@t"] == "inlineStr":
return cell["c"]["is"]["t"]
if cell["c"]["@t"] == "e":
return cell["c"]["v"]
except (KeyError, ValueError):
return None
# }}} read_cell_value #
@ -391,6 +411,43 @@ def _read_cell_style(style_name: str, cell: Union[Cell, MergedCell], diff_style:
else:
raise NotImplementedError("Unsupported Style: {:}".format(style_name))
def _process_xlsx_cf_operator(operator: str, value: Any, ref: List[Any]) -> bool:
# function _process_xlsx_cf_operator {{{ #
# "containsText", "lessThanOrEqual", "notBetween", "lessThan", "notContains", "beginsWith", "equal", "greaterThanOrEqual", "between", "endsWith", "notEqual", "greaterThan"
try:
if operator=="lessThanOrEqual":
result: bool = value<=ref[0]
elif operator=="lessThan":
result: bool = value<ref[0]
elif operator=="equal":
result: bool = value==ref[0]
elif operator=="greaterThanOrEqual":
result: bool = value>=ref[0]
elif operator=="notEqual":
result: bool = value!=ref[0]
elif operator=="greaterThan":
result: bool = value>ref[0]
elif operator=="between":
small_one: float
large_one: float
small_one, large_one = min(ref), max(ref)
result: bool = value>=small_one and value<=large_one
elif operator=="notBetween":
small_one: float
large_one: float
small_one, large_one = min(ref), max(ref)
result: bool = value<small_one or value>large_one
else:
#raise NotImplementedError("Not Implemented CondFormat Operator: {:}".format(operator))
logger.exception("Not Implemented CondFormat Operator: {:}".format(operator))
result = False
return result
except TypeError:
logger.exception("Unmatched type of %s and %s. Auto to False", repr(value), repr(ref))
return False
except IndexError:
logger.exception("ref array doesn't have enough elements. Auto to False: %s", repr(ref))
return False
# }}} function _process_xlsx_cf_operator #
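A few illustrative calls for the operator semantics above (the values are assumptions):

```python
assert _process_xlsx_cf_operator("between", 5, [1, 10]) is True
assert _process_xlsx_cf_operator("notBetween", 5, [1, 10]) is False
# A type mismatch is caught and coerced to False rather than raised:
assert _process_xlsx_cf_operator("greaterThan", "a", [3]) is False
```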
_absolute_range_pattern: Pattern[str] = re.compile(r"""\$(?P<col1>[A-Z]{1,3})\$(?P<row1>\d+) # coord1
(?::
@ -441,12 +498,23 @@ def load_xlsx_styles(xlsx_file: Workbook, sheet_name: str, book_name: str, **opt
for fmt in conditional_formattings:
for r in fmt.rules:
active_cells: List[Cell] = []
if r.type == "expression":
condition: Callable[[str], bool] = formula_parser.ast("=" + r.formula[0])[1].compile()
logger.debug("Expression condition: %s", r.formula[0])
# Process CF Formulae {{{ #
formulae: List[Callable[[Any], Any]] = []
argument_lists: List[List[Any]] = []
has_error = False
for fml in r.formula:
try:
formula_func: Callable[[Any], Any] =\
formula_parser.ast("=" + fml)[1].compile()
logger.debug("CondFormat rule formula: %s", fml)
except:
logger.exception("Formula parsing error: %s. Skipping.", repr(fml))
has_error = True
break
arguments: List[Any] = []
absolute_range_match: List[Tuple[str, str, str, str]] = _absolute_range_pattern.findall(r.formula[0])
absolute_range_match: List[Tuple[str, str, str, str]] = _absolute_range_pattern.findall(fml)
for m in absolute_range_match:
logger.debug("Absolute ranges: %s", repr(m))
if m[2] is None and m[3] is None:
@ -462,25 +530,65 @@ def load_xlsx_styles(xlsx_file: Workbook, sheet_name: str, book_name: str, **opt
)
logger.debug("Absolute range arguments: %s", repr(arguments))
nb_contiguous_nothings = 0
for rge in fmt.cells:
for c in rge.cells:
cell: Cell = worksheet.cell(row=c[0], column=c[1])
cell_value = read_cell_value(book_name, sheet_name
, coordinate="{:}{:d}".format(get_column_letter(c[1])
, c[0]
)
)
if cell_value is None:
nb_contiguous_nothings += 1
if nb_contiguous_nothings>50:
break
continue
elif condition(cell_value, *arguments):
logger.debug("Active Cell %s(%s) for %s", repr(cell), str(cell_value), r.formula[0])
active_cells.append(cell)
formulae.append(formula_func)
argument_lists.append(arguments)
if has_error:
continue
# }}} Process CF Formulae #
# Process Condition According to Type {{{ #
if r.type in { "expression"
, "containsText", "notContainsText"
, "endsWith", "beginsWith"
, "containsErrors", "notContainsErrors"
}:
condition: Callable[[Any], bool] = formulae[0]
arguments: List[Any] = argument_lists[0]
is_active: Callable[[Any], bool] = lambda v: condition(v, *arguments)
elif r.type == "cellIs":
operator: str = r.operator
try:
references: List[Any] = [fml() for fml in formulae]
except:
logger.exception("Error occurs while calculating reference values for cellIs condition formatting.")
continue
is_active: Callable[[Any], bool] =\
lambda v: _process_xlsx_cf_operator(operator, v, references)
else:
raise NotImplementedError("Not Implemented Condition Type: {:}".format(r.type))
#raise NotImplementedError("Not Implemented Condition Type: {:}".format(r.type))
# e.g., type=top10 (rank=number, percent=bool, bottom=bool)
# type=aboveAverage (equalAverage=bool, aboveAverage=bool)
# type=duplicateValues / type=uniqueValues
logger.exception("Not Implemented Condition Type: {:}".format(r.type))
# }}} Process Condition According to Type #
# Test Each Cell {{{ #
nb_contiguous_nothings = 0
for rge in fmt.cells:
for c in rge.cells:
cell: Cell = worksheet.cell(row=c[0], column=c[1])
cell_value = read_cell_value(book_name, sheet_name
, coordinate="{:}{:d}".format(get_column_letter(c[1])
, c[0]
)
)
if cell_value is None:
nb_contiguous_nothings += 1
if nb_contiguous_nothings>50:
break
continue
else:
try:
satisfies_condition: bool = is_active(cell_value)
except:
logger.exception("Error in formula calculation with cell value %d", repr(cell_value))
satisfies_condition = False
if satisfies_condition:
logger.debug("Active Cell %s(%s) for %s", repr(cell), repr(cell_value), r.formula[0])
active_cells.append(cell)
# }}} Test Each Cell #
for c in active_cells:
style_dict[c.coordinate] = [_read_cell_style(st, c, r.dxf) for st in concerned_styles]
@ -672,29 +780,59 @@ def are_lists_equal(list1, list2, comparison_func):
return True
def compare_urls(url1, url2):
def compare_urls(url1, url2, full=True):
if url1 is None or url2 is None:
return url1 == url2
logger.info(f"compare_urls. url1: {url1}; url2: {url2}")
def parse_with_default_scheme(url):
"""
Ensure the URL has a scheme. If not, prepend 'http://'
so it parses as host + path instead of just a path.
"""
# Regex to check if URL has scheme like 'http://', 'https://', etc.
if not re.match(r'^[a-zA-Z][a-zA-Z0-9+\-.]*://', url):
url = f"http://{url}"
return urlparse(url)
def normalize_url(url):
# Parse the URL
parsed_url = urlparse(url)
# Parse the URL; if no scheme is present, assume 'http'
parsed_url = parse_with_default_scheme(url)
scheme = parsed_url.scheme.lower()
# If no scheme is present, assume 'http'
scheme = parsed_url.scheme if parsed_url.scheme else 'http'
# Extract the domain parts using tldextract
extracted = tldextract.extract(parsed_url.netloc.lower())
# e.g., extracted = TLDExtractResult(subdomain='www', domain='airbnb', suffix='com.sg')
# Drop 'www' if it's the only subdomain
subdomain = extracted.subdomain
if subdomain == 'www':
subdomain = ''
# Lowercase the scheme and netloc, remove 'www.', and handle trailing slash
normalized_netloc = parsed_url.netloc.lower().replace("www.", "")
# Instead of using the suffix (e.g., 'com', 'com.sg'), ignore it completely
# so that both 'airbnb.com' and 'airbnb.com.sg' become just 'airbnb' or 'www.airbnb'
if subdomain:
normalized_netloc = f"{subdomain}.{extracted.domain}"
else:
normalized_netloc = extracted.domain
# Handle trailing slash in the path
normalized_path = parsed_url.path if parsed_url.path != '/' else ''
# Reassemble the URL with normalized components
normalized_parsed_url = parsed_url._replace(scheme=scheme.lower(), netloc=normalized_netloc,
path=normalized_path)
normalized_url = urlunparse(normalized_parsed_url)
# Reassemble the URL with the normalized components
normalized_parsed_url = ParseResult(
scheme=scheme.lower(),
netloc=normalized_netloc,
path=normalized_path,
params=parsed_url.params if full else '',  # keep params only when full=True
query=parsed_url.query if full else '',  # keep the query string only when full=True
fragment=parsed_url.fragment if full else '',  # keep the fragment only when full=True
)
return urlunparse(normalized_parsed_url)
return normalized_url
# Normalize both URLs for comparison
logger.info(f"After normalization. url1: {normalize_url(url1)}; url2: {normalize_url(url2)}")
# Normalize both URLs
norm_url1 = normalize_url(url1)
norm_url2 = normalize_url(url2)
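The tail of this hunk is truncated; assuming it ends by comparing the two normalized strings, the intent can be illustrated as follows (the example URLs are assumptions):

```python
# "www." and the public suffix are dropped, and a missing scheme defaults to
# "http", so both sides normalize to the same URL and should compare equal.
compare_urls("http://www.airbnb.com.sg/rooms", "airbnb.com/rooms")  # -> truthy
```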

View File

@ -208,27 +208,151 @@ def is_extension_installed(actual: str, rules: Dict, **options):
def check_python_file_by_test_suite(actual_files, test_file, **options) -> float:
"""Check the python file by running the test suite in the given test file."""
"""Check the python file by running the test suite in the given test file.
This function is now more robust and handles various error conditions:
- File existence validation
- Module loading errors
- Function execution errors
- Proper resource cleanup
- Working directory management
"""
import os
import uuid
import logging
from pathlib import Path
logger = logging.getLogger(__name__)
test_function_name = options.get('test_function_name', 'test')
# Create a unique module name, it can be arbitrary but must be unique in the current runtime environment
module_name = 'dynamic_module'
# Load the module from the given file path
spec = importlib.util.spec_from_file_location(module_name, test_file)
module = importlib.util.module_from_spec(spec)
sys.modules[module_name] = module # Add the loaded module to sys.modules
spec.loader.exec_module(module) # Execute the module to make its content available
# Retrieve the function by name from the loaded module and execute it
test_function = getattr(module, test_function_name)
try:
if test_function():
return 1.0
else:
return 0.0
except Exception as e:
# Validate inputs
if not test_file:
logger.error("test_file is None or empty")
return 0.0
# Convert to absolute path and check existence
test_file_path = Path(test_file).resolve()
if not test_file_path.exists():
logger.error(f"Test file does not exist: {test_file_path}")
return 0.0
if not test_file_path.is_file():
logger.error(f"Test file path is not a file: {test_file_path}")
return 0.0
# Create unique module name to avoid conflicts
module_name = f'dynamic_test_module_{uuid.uuid4().hex[:8]}'
# Store original working directory and sys.path
original_cwd = os.getcwd()
original_sys_path = sys.path.copy()
try:
# Change to the directory containing the test file
test_dir = test_file_path.parent
os.chdir(test_dir)
logger.debug(f"Changed working directory to: {test_dir}")
# Add test directory to Python path if not already present
if str(test_dir) not in sys.path:
sys.path.insert(0, str(test_dir))
logger.debug(f"Added {test_dir} to sys.path")
# Try to load the module
try:
spec = importlib.util.spec_from_file_location(module_name, test_file_path)
if spec is None:
logger.error(f"Could not create module spec for {test_file_path}")
return 0.0
if spec.loader is None:
logger.error(f"Module spec has no loader for {test_file_path}")
return 0.0
module = importlib.util.module_from_spec(spec)
if module is None:
logger.error(f"Could not create module from spec for {test_file_path}")
return 0.0
# Add to sys.modules temporarily
sys.modules[module_name] = module
# Execute the module
spec.loader.exec_module(module)
logger.debug(f"Successfully loaded test module: {module_name}")
except SyntaxError as e:
logger.error(f"Syntax error in test file: {e}")
return 0.0
except ImportError as e:
logger.error(f"Import error loading test file: {e}")
return 0.0
except Exception as e:
logger.error(f"Error loading test module: {e}")
return 0.0
# Try to get the test function
try:
if not hasattr(module, test_function_name):
logger.error(f"Test function '{test_function_name}' not found in {test_file_path}")
return 0.0
test_function = getattr(module, test_function_name)
if not callable(test_function):
logger.error(f"'{test_function_name}' is not callable in {test_file_path}")
return 0.0
logger.debug(f"Found test function: {test_function_name}")
except Exception as e:
logger.error(f"Error getting test function: {e}")
return 0.0
# Execute the test function
try:
result = test_function()
logger.debug(f"Test function returned: {result} (type: {type(result)})")
# Handle different return types
if isinstance(result, bool):
return 1.0 if result else 0.0
elif isinstance(result, (int, float)):
# Normalize to 0.0-1.0 range
normalized = max(0.0, min(1.0, float(result)))
if normalized != result:
logger.warning(f"Test result {result} normalized to {normalized}")
return normalized
else:
# For any other type, treat as True if truthy
bool_result = bool(result)
logger.warning(f"Test returned non-boolean/numeric value {result}, treating as {bool_result}")
return 1.0 if bool_result else 0.0
except Exception as e:
logger.error(f"Error executing test function: {e}")
return 0.0
except Exception as e:
logger.error(f"Unexpected error in check_python_file_by_test_suite: {e}")
return 0.0
finally:
# Cleanup: remove the module from sys.modules
if module_name in sys.modules:
del sys.modules[module_name]
logger.debug(f"Cleaned up module: {module_name}")
# Restore original working directory
try:
os.chdir(original_cwd)
logger.debug(f"Restored working directory to: {original_cwd}")
except Exception as e:
logger.warning(f"Could not restore working directory: {e}")
# Restore original sys.path
sys.path[:] = original_sys_path
logger.debug("Restored sys.path")
def check_python_file_by_gold_file(actual_files, gold_file: str, **options) -> float:

View File

@ -31,5 +31,13 @@ def create_vm_manager_and_provider(provider_name: str, region: str, use_proxy: b
from desktop_env.providers.docker.manager import DockerVMManager
from desktop_env.providers.docker.provider import DockerProvider
return DockerVMManager(), DockerProvider(region)
elif provider_name == "aliyun":
from desktop_env.providers.aliyun.manager import AliyunVMManager
from desktop_env.providers.aliyun.provider import AliyunProvider
return AliyunVMManager(), AliyunProvider()
elif provider_name == "volcengine":
from desktop_env.providers.volcengine.manager import VolcengineVMManager
from desktop_env.providers.volcengine.provider import VolcengineProvider
return VolcengineVMManager(), VolcengineProvider()
else:
raise NotImplementedError(f"{provider_name} not implemented!")
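A hedged usage sketch for the new providers (the `use_proxy` parameter is truncated in this hunk and presumably has a default, so it is omitted; the region value is illustrative, and for "aliyun" the provider takes no region argument, reading `ALIYUN_REGION` per the guide below):

```python
# Requires the ALIYUN_* environment variables described in the guide below.
manager, provider = create_vm_manager_and_provider("aliyun", region="eu-central-1")
```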

View File

@ -0,0 +1,80 @@
# Aliyun ECS Provider Configuration Guide
This guide explains how to configure and use the Aliyun ECS provider for OSWorld desktop environments.
## Configuration Process
1. **Aliyun Account**: You need an active Aliyun Cloud account. This script uses pay-as-you-go billing by default, so ensure your account balance is above 100.
2. **Access Keys**: Create an AccessKey ID and AccessKey Secret in the Aliyun RAM Access Control console and grant them ECS control permissions
3. **VPC Setup**: Create a VPC, VSwitch, and Security Group in your target region
4. **Custom Images**: Create OSWorld custom images
5. It is recommended to manually complete the ECS creation process once to record all required environment variable information.
## Environment Variables
Set the following environment variables in your `.env` file:
```bash
# Aliyun Access Credentials
ALIYUN_ACCESS_KEY_ID=your_access_key_id
ALIYUN_ACCESS_KEY_SECRET=your_access_key_secret
# ECS Configuration Information
ALIYUN_REGION=eu-central-1
ALIYUN_IMAGE_ID=your_image_id
ALIYUN_INSTANCE_TYPE=ecs.e-c1m2.large
ALIYUN_VSWITCH_ID=vsw-xxxxxxxxx
ALIYUN_SECURITY_GROUP_ID=sg-xxxxxxxxx
```
## Required Aliyun Resources
### 1. VPC and VSwitch
- Create a VPC in your target region
- Create a VSwitch within the VPC
- Ensure the VSwitch has internet access for VNC connectivity
### 2. Security Group
**⚠️ Important**: Please strictly follow the port settings below to prevent OSWorld tasks from failing due to connection issues:
#### Inbound Rules (8 rules required)
| Type | Protocol | Port Range | Source | Description |
|------|----------|------------|--------|-------------|
| SSH | TCP | 22 | 0.0.0.0/0 | SSH access |
| HTTP | TCP | 80 | 172.31.0.0/16 | HTTP traffic |
| Custom TCP | TCP | 5000 | 172.31.0.0/16 | OSWorld backend service |
| Custom TCP | TCP | 5910 | 0.0.0.0/0 | NoVNC visualization port |
| Custom TCP | TCP | 8006 | 172.31.0.0/16 | VNC service port |
| Custom TCP | TCP | 8080 | 172.31.0.0/16 | VLC service port |
| Custom TCP | TCP | 8081 | 172.31.0.0/16 | Additional service port |
| Custom TCP | TCP | 9222 | 172.31.0.0/16 | Chrome control port |
#### Outbound Rules (1 rule required)
| Type | Protocol | Port Range | Destination | Description |
|------|----------|------------|-------------|-------------|
| All traffic | All | All | 0.0.0.0/0 | Allow all outbound traffic |
### 3. Custom Images
You need to create a custom OSWorld image for Aliyun ECS. Please follow the instructions in the "Creating Custom ECS Images for OSWorld" section.
## Creating Custom ECS Images for OSWorld
This section provides guidance on how to create the custom ECS images required for OSWorld desktop environments. The process involves setting up a base instance with desktop environment and VNC server, then creating a custom image from it.
### Step-by-Step Image Creation Process
#### Step 1: Upload existing qcow2 image to Aliyun
- Download the provided qcow2 image from the link in `desktop_env/providers/docker/manager.py`: https://huggingface.co/datasets/xlangai/ubuntu_osworld/resolve/main/Ubuntu.qcow2.zip
- Unzip the downloaded file and upload it to Aliyun Object Storage Service (OSS). Make sure the OSS bucket is in the same region where you plan to launch the ECS instance.
- In your ECS dashboard, go to "Images" and you will see the "Import Image" button. Click it and follow the instructions to import the qcow2 image from OSS.
- After the import is complete, you will see the imported image in the "Images" list.
#### Step 2: Create a new image
Note that the image you created in Step 1 will have a different resolution than the one you want to use for OSWorld (1920x1080). We need to customize the image to use the correct resolution and set up noVNC.
- Go to `Instances` tab and create a new instance with the imported image.
- Connect to the running instance via VNC.
- After connecting to the instance, please open the terminal and download this configuration script: `https://gist.githubusercontent.com/qykong/bea58ff98f20057d3a69921276dd4553/raw/cd1a91a0840c4192d793f43cfb90553370343b08/config.sh`.
- If you also want SSH and VNC passwords to be set up, use this script instead: `https://huggingface.co/datasets/xlangai/ubuntu_osworld/resolve/main/aliyun_config.sh?download=true`.
- Run the script and reboot your instance.
- After rebooting, the instance will have the correct resolution and noVNC setup. You can connect to the instance via "http://<your_instance_public_ip>:5910/vnc.html" (make sure your security group allows port 5910).
- Save the running instance as a new image. The new image will be used as the OSWorld image.

View File

@ -0,0 +1,82 @@
# Aliyun ECS Provider Configuration Guide
This guide explains how to configure and use Aliyun ECS for OSWorld desktop environments.
## Configuration Process
1. **Aliyun Account**: You need a valid Aliyun account. This script launches ECS instances with pay-as-you-go billing by default, so keep your account balance above 100.
2. **Access Keys**: Create an AccessKey ID and AccessKey Secret in the Aliyun RAM Access Control console and grant them ECS control permissions
3. **VPC Setup**: Create a VPC, VSwitch, and security group in the target region
4. **Custom Images**: Create OSWorld custom images.
5. It is recommended to manually complete the ECS creation process once and record all required environment variable information.
## Environment Variables
Set the following environment variables in your `.env` file:
```bash
# Aliyun access credentials
ALIYUN_ACCESS_KEY_ID=your_access_key_id
ALIYUN_ACCESS_KEY_SECRET=your_access_key_secret
# ECS configuration
ALIYUN_REGION=eu-central-1
ALIYUN_IMAGE_ID=your_image_id
ALIYUN_INSTANCE_TYPE=ecs.e-c1m2.large
ALIYUN_VSWITCH_ID=vsw-xxxxxxxxx
ALIYUN_SECURITY_GROUP_ID=sg-xxxxxxxxx
```
## Required Aliyun Resources
### 1. VPC and VSwitch
- Create a VPC in the target region
- Create a VSwitch within the VPC
- Ensure the VSwitch has internet access to support VNC connectivity
### 2. Security Group
**⚠️ Important**: Please strictly follow the port settings below to prevent OSWorld tasks from failing due to connection issues:
#### Inbound Rules (8 rules required)
| Type | Protocol | Port Range | Source | Description |
|------|----------|------------|--------|-------------|
| SSH | TCP | 22 | 0.0.0.0/0 | SSH access |
| HTTP | TCP | 80 | 172.31.0.0/16 | HTTP traffic |
| Custom TCP | TCP | 5000 | 172.31.0.0/16 | OSWorld backend service |
| Custom TCP | TCP | 5910 | 0.0.0.0/0 | NoVNC visualization port |
| Custom TCP | TCP | 8006 | 172.31.0.0/16 | VNC service port |
| Custom TCP | TCP | 8080 | 172.31.0.0/16 | VLC service port |
| Custom TCP | TCP | 8081 | 172.31.0.0/16 | Additional service port |
| Custom TCP | TCP | 9222 | 172.31.0.0/16 | Chrome control port |
#### Outbound Rules (1 rule required)
| Type | Protocol | Port Range | Destination | Description |
|------|----------|------------|-------------|-------------|
| All traffic | All | All | 0.0.0.0/0 | Allow all outbound traffic |
### 3. Custom Images
You need to create a custom OSWorld image for Aliyun ECS. Please follow the instructions in the "Creating Custom ECS Images for OSWorld" section.
## Creating Custom ECS Images for OSWorld
This section explains how to create the custom ECS images required for OSWorld desktop environments. The process involves setting up a base instance with a desktop environment and VNC server, then creating a custom image from it.
### Step-by-Step Image Creation Process
#### Step 1: Upload the existing qcow2 image to Aliyun
- Download the provided qcow2 image from the link in `desktop_env/providers/docker/manager.py`: https://huggingface.co/datasets/xlangai/ubuntu_osworld/resolve/main/Ubuntu.qcow2.zip
- Unzip the downloaded file and upload it to Aliyun Object Storage Service (OSS). Make sure the OSS bucket is in the same region where you plan to launch the ECS instance.
- In your ECS console, go to the "Images" page, where you will see the "Import Image" button. Click it and follow the instructions to import the qcow2 image from OSS.
- After the import is complete, you will see the imported image in the "Images" list.
#### Step 2: Create a new image
Note that the image created in Step 1 has a different resolution than the one you want to use for OSWorld (1920x1080). We need to customize the image to use the correct resolution and set up noVNC.
- Go to the "Instances" tab and create a new instance from the imported image.
- Connect to the running instance via VNC.
- After connecting to the instance, open the terminal and download this configuration script: `https://gist.githubusercontent.com/qykong/bea58ff98f20057d3a69921276dd4553/raw/cd1a91a0840c4192d793f43cfb90553370343b08/config.sh`.
- If you also want SSH and VNC passwords to be set up, use this script instead: `https://huggingface.co/datasets/xlangai/ubuntu_osworld/resolve/main/aliyun_config.sh?download=true`
- Run the script and reboot your instance.
- After rebooting, the instance will have the correct resolution and noVNC setup. You can connect to the instance via "http://<your_instance_public_ip>:5910/vnc.html" (make sure your security group allows port 5910).
- Save the running instance as a new image. This new image will be used as the OSWorld image.

View File

@ -0,0 +1,31 @@
import os
# Default TTL minutes for instance auto-release (Aliyun-side)
# Can be overridden via environment variable DEFAULT_TTL_MINUTES
# ATTENTION: ECS requires TTL to be at least 30 minutes (if TTL > 0)
MIN_TTL_MINUTES: int = 30
_ttl_env_str = os.getenv("DEFAULT_TTL_MINUTES", "60")
try:
_ttl_env_val = int(_ttl_env_str)
except Exception:
_ttl_env_val = 60
# If TTL is positive but less than Aliyun minimum, clamp to 30 minutes
if _ttl_env_val > 0 and _ttl_env_val < MIN_TTL_MINUTES:
DEFAULT_TTL_MINUTES: int = MIN_TTL_MINUTES
else:
DEFAULT_TTL_MINUTES: int = _ttl_env_val
# Master switch for TTL feature
ENABLE_TTL: bool = os.getenv("ENABLE_TTL", "true").lower() == "true"
def compute_ttl_seconds(ttl_minutes: int) -> int:
try:
return max(0, int(ttl_minutes) * 60)
except Exception:
return 0
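For instance, the clamping above behaves like this (a sketch; it assumes the module is imported after the environment variable is set and that the package import has no other side effects):

```python
import os

os.environ["DEFAULT_TTL_MINUTES"] = "10"  # below the 30-minute ECS minimum

from desktop_env.providers.aliyun import config

print(config.DEFAULT_TTL_MINUTES)  # 30, clamped up to MIN_TTL_MINUTES
print(config.compute_ttl_seconds(config.DEFAULT_TTL_MINUTES))  # 1800
```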

View File

@ -0,0 +1,325 @@
import os
import logging
import dotenv
import time
import signal
import requests
from datetime import datetime, timedelta, timezone
from alibabacloud_ecs20140526.client import Client as ECSClient
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_ecs20140526 import models as ecs_models
from alibabacloud_tea_util.client import Client as UtilClient
from desktop_env.providers.base import VMManager
from desktop_env.providers.aliyun.config import ENABLE_TTL, DEFAULT_TTL_MINUTES
dotenv.load_dotenv()
for env_name in [
"ALIYUN_REGION",
"ALIYUN_VSWITCH_ID",
"ALIYUN_SECURITY_GROUP_ID",
"ALIYUN_IMAGE_ID",
"ALIYUN_ACCESS_KEY_ID",
"ALIYUN_ACCESS_KEY_SECRET",
"ALIYUN_INSTANCE_TYPE",
]:
if not os.getenv(env_name):
raise EnvironmentError(f"{env_name} must be set in the environment variables.")
logger = logging.getLogger("desktopenv.providers.aliyun.AliyunVMManager")
logger.setLevel(logging.INFO)
ALIYUN_INSTANCE_TYPE = os.getenv("ALIYUN_INSTANCE_TYPE")
ALIYUN_ACCESS_KEY_ID = os.getenv("ALIYUN_ACCESS_KEY_ID")
ALIYUN_ACCESS_KEY_SECRET = os.getenv("ALIYUN_ACCESS_KEY_SECRET")
ALIYUN_REGION = os.getenv("ALIYUN_REGION")
ALIYUN_IMAGE_ID = os.getenv("ALIYUN_IMAGE_ID")
ALIYUN_SECURITY_GROUP_ID = os.getenv("ALIYUN_SECURITY_GROUP_ID")
ALIYUN_VSWITCH_ID = os.getenv("ALIYUN_VSWITCH_ID")
ALIYUN_RESOURCE_GROUP_ID = os.getenv("ALIYUN_RESOURCE_GROUP_ID")
WAIT_DELAY = 20
MAX_ATTEMPTS = 15
def _allocate_vm(screen_size=(1920, 1080)):
"""
Allocate a new Aliyun ECS instance
"""
assert screen_size == (1920, 1080), "Only 1920x1080 screen size is supported"
config = open_api_models.Config(
access_key_id=ALIYUN_ACCESS_KEY_ID,
access_key_secret=ALIYUN_ACCESS_KEY_SECRET,
region_id=ALIYUN_REGION,
)
client = ECSClient(config)
instance_id = None
original_sigint_handler = signal.getsignal(signal.SIGINT)
original_sigterm_handler = signal.getsignal(signal.SIGTERM)
def signal_handler(sig, frame):
if instance_id:
signal_name = "SIGINT" if sig == signal.SIGINT else "SIGTERM"
logger.warning(
f"Received {signal_name} signal, terminating instance {instance_id}..."
)
try:
delete_request = ecs_models.DeleteInstancesRequest(
region_id=ALIYUN_REGION,
instance_ids=UtilClient.to_jsonstring([instance_id]),
force=True,
)
client.delete_instances(delete_request)
logger.info(
f"Successfully terminated instance {instance_id} after {signal_name}."
)
except Exception as cleanup_error:
logger.error(
f"Failed to terminate instance {instance_id} after {signal_name}: {str(cleanup_error)}"
)
# Restore original signal handlers
signal.signal(signal.SIGINT, original_sigint_handler)
signal.signal(signal.SIGTERM, original_sigterm_handler)
# Raise appropriate exception based on signal type
if sig == signal.SIGINT:
raise KeyboardInterrupt
else:
# For SIGTERM, exit gracefully
import sys
sys.exit(0)
try:
# Set up signal handlers for both SIGINT and SIGTERM
signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)
logger.info(
f"Creating new ECS instance in region {ALIYUN_REGION} with image {ALIYUN_IMAGE_ID}"
)
# TTL configuration
ttl_enabled = ENABLE_TTL
ttl_minutes = DEFAULT_TTL_MINUTES
ttl_seconds = max(0, int(ttl_minutes) * 60)
# Aliyun constraints: at least 30 minutes in the future, ISO8601 UTC, seconds must be 00
now_utc = datetime.now(timezone.utc)
min_eta = now_utc + timedelta(minutes=30)
raw_eta = now_utc + timedelta(seconds=ttl_seconds)
effective_eta = raw_eta if raw_eta > min_eta else min_eta
# round up to the next full minute, zero seconds
effective_eta = (effective_eta + timedelta(seconds=59)).replace(second=0, microsecond=0)
auto_release_str = effective_eta.strftime('%Y-%m-%dT%H:%M:%SZ')
logger.info(
f"TTL config: enabled={ttl_enabled}, minutes={ttl_minutes}, seconds={ttl_seconds}, ETA(UTC)={auto_release_str}"
)
# Create instance request (attempt with auto_release_time first when TTL enabled)
def _build_request(with_ttl: bool) -> ecs_models.RunInstancesRequest:
kwargs = dict(
region_id=ALIYUN_REGION,
image_id=ALIYUN_IMAGE_ID,
instance_type=ALIYUN_INSTANCE_TYPE,
security_group_id=ALIYUN_SECURITY_GROUP_ID,
v_switch_id=ALIYUN_VSWITCH_ID,
instance_name=f"OSWorld-Desktop-{int(time.time())}",
description="OSWorld Desktop Environment Instance",
internet_max_bandwidth_out=10,
internet_charge_type="PayByTraffic",
instance_charge_type="PostPaid",
system_disk=ecs_models.RunInstancesRequestSystemDisk(
size="50",
category="cloud_essd",
),
deletion_protection=False,
)
if ALIYUN_RESOURCE_GROUP_ID:
kwargs["resource_group_id"] = ALIYUN_RESOURCE_GROUP_ID
if with_ttl and ttl_enabled and ttl_seconds > 0:
kwargs["auto_release_time"] = auto_release_str
return ecs_models.RunInstancesRequest(**kwargs)
try:
request = _build_request(with_ttl=True)
response = client.run_instances(request)
except Exception as create_err:
# Retry without auto_release_time if creation-time TTL is rejected
logger.warning(
f"RunInstances with auto_release_time failed: {create_err}. Retrying without TTL field..."
)
request = _build_request(with_ttl=False)
response = client.run_instances(request)
instance_ids = response.body.instance_id_sets.instance_id_set
if not instance_ids:
raise RuntimeError(
"Failed to create ECS instance - no instance ID returned"
)
instance_id = instance_ids[0]
logger.info(f"ECS instance {instance_id} created successfully")
# Wait for the instance to be running
logger.info(f"Waiting for instance {instance_id} to be running...")
_wait_for_instance_running(client, instance_id)
logger.info(f"Instance {instance_id} is now running and ready")
except KeyboardInterrupt:
logger.warning("VM allocation interrupted by user (SIGINT).")
if instance_id:
logger.info(f"Terminating instance {instance_id} due to interruption.")
try:
delete_request = ecs_models.DeleteInstancesRequest(
region_id=ALIYUN_REGION,
instance_ids=UtilClient.to_jsonstring([instance_id]),
force=True,
)
client.delete_instances(delete_request)
except Exception as cleanup_error:
logger.error(
f"Failed to cleanup instance {instance_id}: {str(cleanup_error)}"
)
raise
except Exception as e:
logger.error(f"Failed to allocate ECS instance: {str(e)}")
if instance_id:
logger.info(f"Terminating instance {instance_id} due to an error.")
try:
delete_request = ecs_models.DeleteInstancesRequest(
region_id=ALIYUN_REGION,
instance_ids=UtilClient.to_jsonstring([instance_id]),
force=True,
)
client.delete_instances(delete_request)
except Exception as cleanup_error:
logger.error(
f"Failed to cleanup instance {instance_id}: {str(cleanup_error)}"
)
raise
finally:
# Restore original signal handlers
signal.signal(signal.SIGINT, original_sigint_handler)
signal.signal(signal.SIGTERM, original_sigterm_handler)
return instance_id
def _wait_for_instance_running(
client: ECSClient, instance_id: str, max_attempts: int = MAX_ATTEMPTS
):
"""Wait for instance to reach Running state"""
for _ in range(max_attempts):
try:
req = ecs_models.DescribeInstancesRequest(
region_id=ALIYUN_REGION,
instance_ids=UtilClient.to_jsonstring([instance_id]),
)
response = client.describe_instances(req)
if response.body.instances.instance:
instance = response.body.instances.instance[0]
status = instance.status
logger.info(f"Instance {instance_id} status: {status}")
if status == "Running":
return
elif status in ["Stopped", "Stopping"]:
start_req = ecs_models.StartInstanceRequest(instance_id=instance_id)
client.start_instance(start_req)
logger.info(f"Started instance {instance_id}")
time.sleep(WAIT_DELAY)
except Exception as e:
logger.warning(f"Error checking instance status: {e}")
time.sleep(WAIT_DELAY)
raise TimeoutError(
f"Instance {instance_id} did not reach Running state within {max_attempts * WAIT_DELAY} seconds"
)
def _wait_until_server_ready(public_ip: str):
"""Wait until the server is ready"""
for _ in range(MAX_ATTEMPTS):
try:
logger.info(f"Checking server status on {public_ip}...")
response = requests.get(f"http://{public_ip}:5000/", timeout=2)
if response.status_code == 404:
logger.info(f"Server {public_ip} is ready")
return
except Exception:
pass
# Sleep between attempts even on non-404 responses, so unexpected status codes
# don't burn through all attempts in a tight retry loop
time.sleep(WAIT_DELAY)
raise TimeoutError(
f"Server {public_ip} did not respond within {MAX_ATTEMPTS * WAIT_DELAY} seconds"
)
class AliyunVMManager(VMManager):
"""
Aliyun ECS VM Manager for managing virtual machines on Aliyun Cloud.
Aliyun ECS does not need to maintain a registry of VMs, as it can dynamically allocate and deallocate VMs.
"""
def __init__(self, **kwargs):
self.initialize_registry()
def initialize_registry(self, **kwargs):
pass
def add_vm(self, vm_path, lock_needed=True, **kwargs):
pass
def _add_vm(self, vm_path):
pass
def delete_vm(self, vm_path, lock_needed=True, **kwargs):
pass
def _delete_vm(self, vm_path):
pass
def occupy_vm(self, vm_path, pid, lock_needed=True, **kwargs):
pass
def _occupy_vm(self, vm_path, pid):
pass
def check_and_clean(self, lock_needed=True, **kwargs):
pass
def _check_and_clean(self):
pass
def list_free_vms(self, lock_needed=True, **kwargs):
pass
def _list_free_vms(self):
pass
def get_vm_path(self, screen_size=(1920, 1080), **kwargs):
"""Get a VM path (instance ID) for use"""
logger.info(
f"Allocating new ECS instance in region {ALIYUN_REGION} with screen size {screen_size}"
)
try:
instance_id = _allocate_vm(screen_size)
logger.info(f"Successfully allocated instance {instance_id}")
return instance_id
except Exception as e:
logger.error(f"Failed to allocate instance: {str(e)}")
raise
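A minimal usage sketch of the manager; note that every call allocates a fresh ECS instance (and therefore incurs charges), and it assumes all the `ALIYUN_*` variables validated at import time are set:
```python
# Minimal sketch: allocate a 1920x1080 OSWorld ECS instance via the manager.
from desktop_env.providers.aliyun.manager import AliyunVMManager

manager = AliyunVMManager()
instance_id = manager.get_vm_path(screen_size=(1920, 1080))
print(f"Allocated instance: {instance_id}")
```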

View File

@ -0,0 +1,224 @@
import os
import logging
from datetime import datetime
from alibabacloud_ecs20140526.client import Client as ECSClient
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_ecs20140526 import models as ecs_models
from alibabacloud_tea_util.client import Client as UtilClient
from desktop_env.providers.base import Provider
from desktop_env.providers.aliyun.manager import (
_allocate_vm,
_wait_for_instance_running,
_wait_until_server_ready,
)
logger = logging.getLogger("desktopenv.providers.aliyun.AliyunProvider")
logger.setLevel(logging.INFO)
class AliyunProvider(Provider):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.region = os.getenv("ALIYUN_REGION", "eu-central-1")
self.client = self._create_client()
# Whether to use private IP instead of public IP. Default: enabled.
# Priority: explicit kwarg > env var ALIYUN_USE_PRIVATE_IP > default True
env_use_private = os.getenv("ALIYUN_USE_PRIVATE_IP", "1").lower() in {"1", "true", "yes", "on"}
kw_flag = kwargs.get("use_private_ip", None)
self.use_private_ip = env_use_private if kw_flag is None else bool(kw_flag)
def _create_client(self) -> ECSClient:
config = open_api_models.Config(
access_key_id=os.getenv("ALIYUN_ACCESS_KEY_ID"),
access_key_secret=os.getenv("ALIYUN_ACCESS_KEY_SECRET"),
region_id=self.region,
)
return ECSClient(config)
def start_emulator(self, path_to_vm: str, headless: bool, *args, **kwargs):
logger.info("Starting Aliyun ECS instance...")
try:
# Check the current state of the instance
response = self._describe_instance(path_to_vm)
if not response.body.instances.instance:
logger.error(f"Instance {path_to_vm} not found")
return
instance = response.body.instances.instance[0]
state = instance.status
logger.info(f"Instance {path_to_vm} current state: {state}")
if state == "Running":
# If the instance is already running, skip starting it
logger.info(
f"Instance {path_to_vm} is already running. Skipping start."
)
return
if state == "Stopped":
# Start the instance if it's currently stopped
req = ecs_models.StartInstanceRequest(instance_id=path_to_vm)
self.client.start_instance(req)
logger.info(f"Instance {path_to_vm} is starting...")
# Wait until the instance reaches 'Running' state
_wait_for_instance_running(self.client, path_to_vm)
logger.info(f"Instance {path_to_vm} is now running.")
else:
# For all other states (Pending, Starting, etc.), log a warning
logger.warning(
f"Instance {path_to_vm} is in state '{state}' and cannot be started."
)
except Exception as e:
logger.error(
f"Failed to start the Aliyun ECS instance {path_to_vm}: {str(e)}"
)
raise
def get_ip_address(self, path_to_vm: str) -> str:
logger.info("Getting Aliyun ECS instance IP address...")
try:
response = self._describe_instance(path_to_vm)
if not response.body.instances.instance:
logger.error(f"Instance {path_to_vm} not found")
return ""
instance = response.body.instances.instance[0]
# Get private and public IP addresses
private_ip = ""
public_ip = ""
if hasattr(instance, "vpc_attributes") and instance.vpc_attributes:
private_ip = (
instance.vpc_attributes.private_ip_address.ip_address[0]
if instance.vpc_attributes.private_ip_address.ip_address
else ""
)
if hasattr(instance, "public_ip_address") and instance.public_ip_address:
public_ip = (
instance.public_ip_address.ip_address[0]
if instance.public_ip_address.ip_address
else ""
)
if hasattr(instance, "eip_address") and instance.eip_address:
public_ip = instance.eip_address.ip_address or public_ip
# Select which IP to use based on configuration
ip_to_use = private_ip if (self.use_private_ip and private_ip) else public_ip
if not ip_to_use:
logger.warning("No usable IP address available (private/public both missing)")
return ""
_wait_until_server_ready(ip_to_use)
if public_ip:
vnc_url = f"http://{public_ip}:5910/vnc.html"
logger.info(f"🖥️ VNC Web Access URL: {vnc_url}")
logger.info("=" * 80)
logger.info(f"📡 Public IP: {public_ip}")
logger.info(f"🏠 Private IP: {private_ip}")
logger.info(f"🔧 Using IP: {'Private' if ip_to_use == private_ip else 'Public'} -> {ip_to_use}")
logger.info("=" * 80)
print(f"\n🌐 VNC Web Access URL: {vnc_url}")
print(
"📍 Please open the above address in the browser "
"for remote desktop access\n"
)
return ip_to_use
except Exception as e:
logger.error(
f"Failed to retrieve IP address for the instance {path_to_vm}: {str(e)}"
)
raise
def save_state(self, path_to_vm: str, snapshot_name: str):
logger.info("Saving Aliyun ECS instance state...")
try:
req = ecs_models.CreateImageRequest(
region_id=self.region,
instance_id=path_to_vm,
image_name=snapshot_name,
description=f"Snapshot created at {datetime.now().isoformat()}",
)
response = self.client.create_image(req)
image_id = response.body.image_id
logger.info(
f"Image {image_id} created successfully from instance {path_to_vm}."
)
return image_id
except Exception as e:
logger.error(
f"Failed to create image from the instance {path_to_vm}: {str(e)}"
)
raise
def revert_to_snapshot(self, path_to_vm: str, snapshot_name: str):
logger.info(
f"Reverting Aliyun ECS instance to snapshot image: {snapshot_name}..."
)
try:
# Step 1: Retrieve the original instance details
response = self._describe_instance(path_to_vm)
if not response.body.instances.instance:
logger.error(f"Instance {path_to_vm} not found")
return
# Step 2: Delete the old instance
req = ecs_models.DeleteInstancesRequest(
region_id=self.region, instance_id=[path_to_vm], force=True
)
self.client.delete_instances(req)
logger.info(f"Old instance {path_to_vm} has been deleted.")
# Step 3: Launch a new instance from the snapshot image
new_instance_id = _allocate_vm()
logger.info(f"Instance {new_instance_id} is ready.")
# Get VNC access information
self.get_ip_address(new_instance_id)
return new_instance_id
except Exception as e:
logger.error(
f"Failed to revert to snapshot {snapshot_name} for the instance {path_to_vm}: {str(e)}"
)
raise
def stop_emulator(self, path_to_vm: str, region: str = None):
logger.info(f"Stopping Aliyun ECS instance {path_to_vm}...")
try:
req = ecs_models.DeleteInstancesRequest(
region_id=self.region, instance_id=[path_to_vm], force=True
)
self.client.delete_instances(req)
logger.info(f"Instance {path_to_vm} has been deleted.")
except Exception as e:
logger.error(
f"Failed to stop the Aliyun ECS instance {path_to_vm}: {str(e)}"
)
raise
def _describe_instance(
self, instance_id: str
) -> ecs_models.DescribeInstancesResponse:
"""Get instance details"""
req = ecs_models.DescribeInstancesRequest(
region_id=self.region, instance_ids=UtilClient.to_jsonstring([instance_id])
)
return self.client.describe_instances(req)
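A minimal sketch of the provider lifecycle, assuming the instance ID came from the manager above and that the base `Provider` accepts these keyword arguments:
```python
# Minimal sketch: drive one instance through the provider lifecycle.
from desktop_env.providers.aliyun.provider import AliyunProvider

instance_id = "i-example"  # placeholder: ID returned by AliyunVMManager.get_vm_path()

provider = AliyunProvider(use_private_ip=False)      # prefer the public IP
provider.start_emulator(instance_id, headless=True)  # no-op if already running
ip = provider.get_ip_address(instance_id)            # waits until port 5000 responds
image_id = provider.save_state(instance_id, "osworld-demo-snapshot")
provider.stop_emulator(instance_id)                  # deletes the instance
```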

View File

@ -4,18 +4,62 @@
Welcome to the AWS VM Management documentation. Before you proceed with using the code to manage AWS services, please ensure the following variables are set correctly according to your AWS environment.
## Overview
The AWS cloud service architecture consists of a host machine that controls multiple virtual machines (each virtual machine serves as an OSWorld environment, for which we provide AMI images) for testing and potential training purposes. To prevent security breaches, we need to properly configure security groups for both the host machine and virtual machines, as well as configure appropriate subnets.
## Security Group Configuration
### Security Group for OSWorld Virtual Machines
OSWorld requires certain ports to be open, such as port 5000 for backend connections to OSWorld services, port 5910 for VNC visualization, port 9222 for Chrome control, etc. The `AWS_SECURITY_GROUP_ID` variable represents the security group configuration for virtual machines serving as OSWorld environments. Please complete the configuration and set this environment variable to the ID of the configured security group.
**⚠️ Important**: Please strictly follow the port settings below to prevent OSWorld tasks from failing due to connection issues:
#### Inbound Rules (8 rules required)
| Type | Protocol | Port Range | Source | Description |
|------|----------|------------|--------|-------------|
| SSH | TCP | 22 | 0.0.0.0/0 | SSH access |
| HTTP | TCP | 80 | 172.31.0.0/16 | HTTP traffic |
| Custom TCP | TCP | 5000 | 172.31.0.0/16 | OSWorld backend service |
| Custom TCP | TCP | 5910 | 0.0.0.0/0 | NoVNC visualization port |
| Custom TCP | TCP | 8006 | 172.31.0.0/16 | VNC service port |
| Custom TCP | TCP | 8080 | 172.31.0.0/16 | VLC service port |
| Custom TCP | TCP | 8081 | 172.31.0.0/16 | Additional service port |
| Custom TCP | TCP | 9222 | 172.31.0.0/16 | Chrome control port |
#### Outbound Rules (1 rule required)
| Type | Protocol | Port Range | Destination | Description |
|------|----------|------------|-------------|-------------|
| All traffic | All | All | 0.0.0.0/0 | Allow all outbound traffic |
### Host Machine Security Group Configuration
Configure according to your specific requirements. This project provides a monitor service that runs on port 8080 by default. You need to open this port to use this functionality.
## VPC Configuration
To isolate the entire evaluation stack, we run both the host machine and all client virtual machines inside a dedicated VPC. The setup is straightforward:
1. Launch the host instance via the AWS console and note the **VPC ID** and **Subnet ID** shown in its network settings.
2. Export the same **Subnet ID** as the environment variable `AWS_SUBNET_ID` before starting the client code.
```bash
export AWS_SUBNET_ID=subnet-xxxxxxxxxxxxxxxxx
```
(Both the client and host must reside in this subnet for the evaluation to work.)
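If you want to double-check that the exported subnet exists, and note which VPC it belongs to, a short boto3 sketch:
```python
# Minimal sketch: look up AWS_SUBNET_ID and print its VPC and availability zone.
import os
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # adjust to your region
subnet = ec2.describe_subnets(SubnetIds=[os.environ["AWS_SUBNET_ID"]])["Subnets"][0]
print(subnet["VpcId"], subnet["AvailabilityZone"])  # VpcId should match the host's VPC
```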
## Configuration Variables
That's essentially all the setup you need to perform. From here on, you only have to supply a few extra details and environment variables; just make sure they're all present in your environment.
You need to assign values to several variables crucial for the operation of these scripts on AWS:
- **`REGISTRY_PATH`**: Sets the file path for VM registration logging.
- Example: `'.aws_vms'`
- **`DEFAULT_REGION`**: Default AWS region where your instances will be launched.
- Example: `"us-east-1"`
- **`IMAGE_ID_MAP`**: Dictionary mapping regions to the specific AMI IDs used for instance creation. The AMI IDs here are already set to the official OSWorld Ubuntu images that we provide.
- Formatted as follows:
```python
IMAGE_ID_MAP = {
"us-east-1": "ami-00674d875de9addc1"
"us-east-1": "ami-0d23263edb96951d8"
# Add other regions and corresponding AMIs
}
```
@ -32,7 +76,6 @@ You need to assign values to several variables crucial for the operation of thes
AWS_SECURITY_GROUP_ID=sg-xxxx
```
### AWS CLI Configuration
Before using these scripts, you must configure your AWS CLI with your credentials. This can be done via the following commands:

View File

@ -0,0 +1,21 @@
import os
# Default TTL minutes for instance auto-termination (cloud-side scheduler)
# Can be overridden via environment variable DEFAULT_TTL_MINUTES
DEFAULT_TTL_MINUTES: int = int(os.getenv("DEFAULT_TTL_MINUTES", "180"))
# Master switch for TTL feature
ENABLE_TTL: bool = os.getenv("ENABLE_TTL", "true").lower() == "true"
# EventBridge Scheduler role ARN for scheduling EC2 termination
AWS_SCHEDULER_ROLE_ARN: str = os.getenv("AWS_SCHEDULER_ROLE_ARN", "").strip()
def compute_ttl_seconds(ttl_minutes: int) -> int:
try:
return max(0, int(ttl_minutes) * 60)
except Exception:
return 0
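A usage sketch of this config module: with the defaults above, TTL is enabled and instances are scheduled for termination after three hours:
```python
# Usage sketch: read the resolved TTL settings (defaults: enabled, 180 minutes).
from desktop_env.providers.aws.config import (
    ENABLE_TTL,
    DEFAULT_TTL_MINUTES,
    AWS_SCHEDULER_ROLE_ARN,
    compute_ttl_seconds,
)

if ENABLE_TTL:
    print(compute_ttl_seconds(DEFAULT_TTL_MINUTES))  # 10800 seconds by default
    print(AWS_SCHEDULER_ROLE_ARN or "<role ARN will be resolved or auto-created>")
```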

View File

@ -1,10 +1,13 @@
import os
from filelock import FileLock
import boto3
import psutil
import logging
import dotenv
import signal
from datetime import datetime, timedelta, timezone
# TTL configuration
from desktop_env.providers.aws.config import ENABLE_TTL, DEFAULT_TTL_MINUTES, AWS_SCHEDULER_ROLE_ARN
from desktop_env.providers.aws.scheduler_utils import schedule_instance_termination
INSTANCE_TYPE = "t3.xlarge"
@ -36,15 +39,25 @@ DEFAULT_REGION = "us-east-1"
# todo: Add doc for the configuration of image, security group and network interface
# todo: public the AMI images
IMAGE_ID_MAP = {
"us-east-1": "ami-09138bff939f82bd8",
"ap-east-1": "ami-0c092a5b8be4116f5",
"us-east-1": {
(1920, 1080): "ami-0d23263edb96951d8",
# For CoACT-1, uncomment to use the following AMI
# (1920, 1080): "ami-0b505e9d0d99ba88c"
},
"ap-east-1": {
(1920, 1080): "ami-06850864d18fad836"
# Please transfer AMI by yourself from AWS us-east-1 for CoACT-1
}
}
def _allocate_vm(region=DEFAULT_REGION):
def _allocate_vm(region=DEFAULT_REGION, screen_size=(1920, 1080)):
if region not in IMAGE_ID_MAP:
raise ValueError(f"Region {region} is not supported. Supported regions are: {list(IMAGE_ID_MAP.keys())}")
if screen_size not in IMAGE_ID_MAP[region]:
raise ValueError(f"Screen size {screen_size} not supported for region {region}. Supported: {list(IMAGE_ID_MAP[region].keys())}")
ami_id = IMAGE_ID_MAP[region][screen_size]
ec2_client = boto3.client('ec2', region_name=region)
instance_id = None
@ -83,12 +96,20 @@ def _allocate_vm(region=DEFAULT_REGION):
if not os.getenv('AWS_SUBNET_ID'):
raise ValueError("AWS_SUBNET_ID is not set in the environment variables.")
# TTL configuration (cloud-init removed; use cloud-side scheduler only)
ttl_enabled = ENABLE_TTL
ttl_minutes = DEFAULT_TTL_MINUTES
ttl_seconds = max(0, int(ttl_minutes) * 60)
eta_utc = datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)
logger.info(f"TTL config: minutes={ttl_minutes}, seconds={ttl_seconds}, ETA(UTC)={eta_utc.isoformat()}")
run_instances_params = {
"MaxCount": 1,
"MinCount": 1,
"ImageId": IMAGE_ID_MAP[region],
"ImageId": ami_id,
"InstanceType": INSTANCE_TYPE,
"EbsOptimized": True,
"InstanceInitiatedShutdownBehavior": "terminate",
"NetworkInterfaces": [
{
"SubnetId": os.getenv('AWS_SUBNET_ID'),
@ -115,13 +136,20 @@ def _allocate_vm(region=DEFAULT_REGION):
response = ec2_client.run_instances(**run_instances_params)
instance_id = response['Instances'][0]['InstanceId']
# Create TTL schedule immediately after instance is created, to survive early interruptions
try:
# When TTL is enabled, the helper resolves the role ARN via env or role name
if ttl_enabled:
schedule_instance_termination(region, instance_id, ttl_seconds, AWS_SCHEDULER_ROLE_ARN, logger)
except Exception as e:
logger.warning(f"Failed to create EventBridge Scheduler for {instance_id}: {e}")
waiter = ec2_client.get_waiter('instance_running')
logger.info(f"Waiting for instance {instance_id} to be running...")
waiter.wait(InstanceIds=[instance_id])
logger.info(f"Instance {instance_id} is ready.")
# Retrieve and display the VNC access URL
try:
instance_details = ec2_client.describe_instances(InstanceIds=[instance_id])
instance = instance_details['Reservations'][0]['Instances'][0]
@ -133,8 +161,8 @@ def _allocate_vm(region=DEFAULT_REGION):
logger.info(f"📡 Public IP: {public_ip}")
logger.info(f"🆔 Instance ID: {instance_id}")
logger.info("="*80)
print(f"\n🌐 VNC访问地址: {vnc_url}")
print(f"📍 请在浏览器中打开上述地址进行远程桌面访问\n")
print(f"\n🌐 VNC Web Access URL: {vnc_url}")
print(f"📍 Please open the above address in the browser for remote desktop access\n")
except Exception as e:
logger.warning(f"Failed to get VNC address for instance {instance_id}: {e}")
except KeyboardInterrupt:
@ -157,60 +185,6 @@ def _allocate_vm(region=DEFAULT_REGION):
return instance_id
def _allocate_vm_with_proxy(region=DEFAULT_REGION, proxy_config_file=None):
"""Allocate a VM with proxy configuration"""
if not PROXY_SUPPORT_AVAILABLE:
logger.warning("Proxy support not available, falling back to regular VM allocation")
return _allocate_vm(region)
from desktop_env.providers.aws.provider_with_proxy import AWSProviderWithProxy
# Initialize proxy pool if needed
if proxy_config_file:
init_proxy_pool(proxy_config_file)
# Get current proxy
proxy_pool = get_global_proxy_pool()
current_proxy = proxy_pool.get_next_proxy()
if current_proxy:
logger.info(f"Allocating VM with proxy: {current_proxy.host}:{current_proxy.port}")
# Create provider instance
provider = AWSProviderWithProxy(region=region, proxy_config_file=proxy_config_file)
# Create new instance
instance_id = provider.create_instance_with_proxy(
image_id=IMAGE_ID_MAP[region],
instance_type=INSTANCE_TYPE,
security_groups=[os.getenv('AWS_SECURITY_GROUP_ID')],
subnet_id=os.getenv('AWS_SUBNET_ID')
)
try:
ec2_client = boto3.client('ec2', region_name=region)
instance_details = ec2_client.describe_instances(InstanceIds=[instance_id])
instance = instance_details['Reservations'][0]['Instances'][0]
public_ip = instance.get('PublicIpAddress', '')
if public_ip:
vnc_url = f"http://{public_ip}:5910/vnc.html"
logger.info("="*80)
logger.info(f"🖥️ VNC Web Access URL: {vnc_url}")
logger.info(f"📡 Public IP: {public_ip}")
logger.info(f"🆔 Instance ID: {instance_id}")
if current_proxy:
logger.info(f"🌐 Proxy: {current_proxy.host}:{current_proxy.port}")
logger.info("="*80)
print(f"\n🌐 VNC Web Access URL: {vnc_url}")
if current_proxy:
print(f"🔄 Current Proxy: {current_proxy.host}:{current_proxy.port}")
print(f"📍 Please open the above address in the browser for remote desktop access\n")
except Exception as e:
logger.warning(f"Failed to get VNC address for proxy instance {instance_id}: {e}")
return instance_id
class AWSVMManager(VMManager):
"""
AWS VM Manager for managing virtual machines on AWS.
@ -218,15 +192,9 @@ class AWSVMManager(VMManager):
AWS does not need to maintain a registry of VMs, as it can dynamically allocate and deallocate VMs.
This class supports both regular VM allocation and proxy-enabled VM allocation.
"""
def __init__(self, proxy_config_file=None, **kwargs):
self.proxy_config_file = proxy_config_file
def __init__(self, **kwargs):
# self.lock = FileLock(".aws_lck", timeout=60)
self.initialize_registry()
# Initialize proxy pool if proxy configuration is provided
if proxy_config_file and PROXY_SUPPORT_AVAILABLE:
init_proxy_pool(proxy_config_file)
logger.info(f"Proxy pool initialized with config: {proxy_config_file}")
def initialize_registry(self, **kwargs):
pass
@ -261,11 +229,7 @@ class AWSVMManager(VMManager):
def _list_free_vms(self, region=DEFAULT_REGION):
pass
def get_vm_path(self, region=DEFAULT_REGION, **kwargs):
if self.proxy_config_file:
logger.info("Allocating a new VM with proxy configuration in region: {}".format(region))
new_vm_path = _allocate_vm_with_proxy(region, self.proxy_config_file)
else:
logger.info("Allocating a new VM in region: {}".format(region))
new_vm_path = _allocate_vm(region)
def get_vm_path(self, region=DEFAULT_REGION, screen_size=(1920, 1080), **kwargs):
logger.info("Allocating a new VM in region: {}".format(region))
new_vm_path = _allocate_vm(region, screen_size=screen_size)
return new_vm_path
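With the proxy path removed, allocation reduces to a single call; a minimal usage sketch (assumes AWS credentials plus `AWS_SUBNET_ID` and `AWS_SECURITY_GROUP_ID` are configured):
```python
# Minimal sketch: allocate an OSWorld EC2 instance for a supported screen size.
from desktop_env.providers.aws.manager import AWSVMManager

manager = AWSVMManager()
instance_id = manager.get_vm_path(region="us-east-1", screen_size=(1920, 1080))
print(f"Allocated instance: {instance_id}")
```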

View File

@ -2,11 +2,14 @@ import boto3
from botocore.exceptions import ClientError
import logging
from desktop_env.providers.base import Provider
from datetime import datetime
import os
import time
from datetime import datetime, timedelta, timezone
from desktop_env.providers.base import Provider
# TTL configuration
from desktop_env.providers.aws.config import ENABLE_TTL, DEFAULT_TTL_MINUTES, AWS_SCHEDULER_ROLE_ARN
from desktop_env.providers.aws.scheduler_utils import schedule_instance_termination
logger = logging.getLogger("desktopenv.providers.aws.AWSProvider")
logger.setLevel(logging.INFO)
@ -78,6 +81,7 @@ class AWSProvider(Provider):
logger.warning("No public IP address available for VNC access")
return private_ip_address
# return public_ip_address
return '' # Return an empty string if no IP address is found
except ClientError as e:
logger.error(f"Failed to retrieve IP address for the instance {path_to_vm}: {str(e)}")
@ -104,23 +108,68 @@ class AWSProvider(Provider):
# Step 1: Retrieve the original instance details
instance_details = ec2_client.describe_instances(InstanceIds=[path_to_vm])
instance = instance_details['Reservations'][0]['Instances'][0]
security_groups = [sg['GroupId'] for sg in instance['SecurityGroups']]
subnet_id = instance['SubnetId']
instance_type = instance['InstanceType']
# Resolve security groups with fallbacks
security_groups = [sg['GroupId'] for sg in instance.get('SecurityGroups', []) if 'GroupId' in sg]
if not security_groups:
env_sg = os.getenv('AWS_SECURITY_GROUP_ID')
if env_sg:
security_groups = [env_sg]
logger.info("SecurityGroups missing on instance; using AWS_SECURITY_GROUP_ID from env")
else:
raise ValueError("No security groups found on instance and AWS_SECURITY_GROUP_ID not set")
# Resolve subnet with fallbacks
subnet_id = instance.get('SubnetId')
if not subnet_id:
nis = instance.get('NetworkInterfaces', []) or []
if nis and isinstance(nis, list):
for ni in nis:
if isinstance(ni, dict) and ni.get('SubnetId'):
subnet_id = ni.get('SubnetId')
break
if not subnet_id:
env_subnet = os.getenv('AWS_SUBNET_ID')
if env_subnet:
subnet_id = env_subnet
logger.info("SubnetId missing on instance; using AWS_SUBNET_ID from env")
else:
raise ValueError("SubnetId not available on instance, NetworkInterfaces, or environment")
# Resolve instance type with fallbacks
instance_type = instance.get('InstanceType') or os.getenv('AWS_INSTANCE_TYPE') or 't3.large'
if instance.get('InstanceType') is None:
logger.info(f"InstanceType missing on instance; using '{instance_type}' from env/default")
# Step 2: Terminate the old instance
ec2_client.terminate_instances(InstanceIds=[path_to_vm])
logger.info(f"Old instance {path_to_vm} has been terminated.")
# Step 2: Terminate the old instance (skip if already terminated/shutting-down)
state = (instance.get('State') or {}).get('Name')
if state in ['shutting-down', 'terminated']:
logger.info(f"Old instance {path_to_vm} is already in state '{state}', skipping termination.")
else:
try:
ec2_client.terminate_instances(InstanceIds=[path_to_vm])
logger.info(f"Old instance {path_to_vm} has been terminated.")
except ClientError as e:
error_code = e.response.get('Error', {}).get('Code') if hasattr(e, 'response') else None
if error_code in ['InvalidInstanceID.NotFound', 'IncorrectInstanceState']:
logger.info(f"Ignore termination error for {path_to_vm}: {error_code}")
else:
raise
# Step 3: Launch a new instance from the snapshot (AMI) with performance optimization
logger.info(f"Launching a new instance from AMI {snapshot_name}...")
# TTL configuration follows the same env flags as allocation (centralized)
enable_ttl = ENABLE_TTL
default_ttl_minutes = DEFAULT_TTL_MINUTES
ttl_seconds = max(0, default_ttl_minutes * 60)
run_instances_params = {
"MaxCount": 1,
"MinCount": 1,
"ImageId": snapshot_name,
"InstanceType": instance_type,
"EbsOptimized": True,
"InstanceInitiatedShutdownBehavior": "terminate",
"NetworkInterfaces": [
{
"SubnetId": subnet_id,
@ -150,7 +199,40 @@ class AWSProvider(Provider):
ec2_client.get_waiter('instance_running').wait(InstanceIds=[new_instance_id])
logger.info(f"Instance {new_instance_id} is ready.")
# Schedule cloud-side termination via EventBridge Scheduler (auto-resolve role ARN)
try:
if enable_ttl:
schedule_instance_termination(self.region, new_instance_id, ttl_seconds, AWS_SCHEDULER_ROLE_ARN, logger)
except Exception as e:
logger.warning(f"Failed to create EventBridge Scheduler for {new_instance_id}: {e}")
# Schedule cloud-side termination via EventBridge Scheduler (same as allocation path)
try:
if enable_ttl and os.getenv('AWS_SCHEDULER_ROLE_ARN'):
scheduler_client = boto3.client('scheduler', region_name=self.region)
schedule_name = f"osworld-ttl-{new_instance_id}-{int(time.time())}"
eta_scheduler = datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)
schedule_expression = f"at({eta_scheduler.strftime('%Y-%m-%dT%H:%M:%S')})"
target_arn = "arn:aws:scheduler:::aws-sdk:ec2:terminateInstances"
input_payload = '{"InstanceIds":["' + new_instance_id + '"]}'
scheduler_client.create_schedule(
Name=schedule_name,
ScheduleExpression=schedule_expression,
FlexibleTimeWindow={"Mode": "OFF"},
Target={
"Arn": target_arn,
"RoleArn": os.getenv('AWS_SCHEDULER_ROLE_ARN'),
"Input": input_payload
},
State='ENABLED',
Description=f"OSWorld TTL terminate for {new_instance_id}"
)
logger.info(f"Scheduled EC2 termination via EventBridge Scheduler for snapshot revert: name={schedule_name}, when={eta_scheduler.isoformat()} (UTC)")
else:
logger.info("TTL enabled but AWS_SCHEDULER_ROLE_ARN not set; skipping scheduler for snapshot revert.")
except Exception as e:
logger.warning(f"Failed to create EventBridge Scheduler for {new_instance_id}: {e}")
try:
instance_details = ec2_client.describe_instances(InstanceIds=[new_instance_id])
instance = instance_details['Reservations'][0]['Instances'][0]

View File

@ -1,315 +0,0 @@
import boto3
from botocore.exceptions import ClientError
import base64
import logging
import json
from typing import Optional
from desktop_env.providers.base import Provider
from desktop_env.providers.aws.proxy_pool import get_global_proxy_pool, init_proxy_pool, ProxyInfo
logger = logging.getLogger("desktopenv.providers.aws.AWSProviderWithProxy")
logger.setLevel(logging.INFO)
WAIT_DELAY = 15
MAX_ATTEMPTS = 10
class AWSProviderWithProxy(Provider):
def __init__(self, region: str = None, proxy_config_file: str = None):
super().__init__(region)
self.current_proxy: Optional[ProxyInfo] = None
# Initialize the proxy pool
if proxy_config_file:
init_proxy_pool(proxy_config_file)
logger.info(f"Initialized proxy pool from {proxy_config_file}")
# Get the next available proxy
self._rotate_proxy()
def _rotate_proxy(self):
"""轮换到下一个可用代理"""
proxy_pool = get_global_proxy_pool()
self.current_proxy = proxy_pool.get_next_proxy()
if self.current_proxy:
logger.info(f"Switched to proxy: {self.current_proxy.host}:{self.current_proxy.port}")
else:
logger.warning("No proxy available, using direct connection")
def _generate_proxy_user_data(self) -> str:
"""生成包含代理配置的user data脚本"""
if not self.current_proxy:
return ""
proxy_url = self._format_proxy_url(self.current_proxy)
user_data_script = f"""#!/bin/bash
# Configure system proxy
echo 'export http_proxy={proxy_url}' >> /etc/environment
echo 'export https_proxy={proxy_url}' >> /etc/environment
echo 'export HTTP_PROXY={proxy_url}' >> /etc/environment
echo 'export HTTPS_PROXY={proxy_url}' >> /etc/environment
# Configure apt proxy
cat > /etc/apt/apt.conf.d/95proxy << EOF
Acquire::http::Proxy "{proxy_url}";
Acquire::https::Proxy "{proxy_url}";
EOF
# Configure chrome/chromium proxy
mkdir -p /etc/opt/chrome/policies/managed
cat > /etc/opt/chrome/policies/managed/proxy.json << EOF
{{
"ProxyMode": "fixed_servers",
"ProxyServer": "{self.current_proxy.host}:{self.current_proxy.port}"
}}
EOF
# Configure chromium proxy (Ubuntu default)
mkdir -p /etc/chromium/policies/managed
cat > /etc/chromium/policies/managed/proxy.json << EOF
{{
"ProxyMode": "fixed_servers",
"ProxyServer": "{self.current_proxy.host}:{self.current_proxy.port}"
}}
EOF
# Configure firefox proxy - support multiple possible paths
for firefox_dir in /etc/firefox/policies /usr/lib/firefox/distribution/policies /etc/firefox-esr/policies; do
if [ -d "$(dirname "$firefox_dir")" ]; then
mkdir -p "$firefox_dir"
cat > "$firefox_dir/policies.json" << EOF
{{
"policies": {{
"Proxy": {{
"Mode": "manual",
"HTTPProxy": "{self.current_proxy.host}:{self.current_proxy.port}",
"HTTPSProxy": "{self.current_proxy.host}:{self.current_proxy.port}",
"UseHTTPProxyForAllProtocols": true
}}
}}
}}
EOF
break
fi
done
# Reload environment variables
source /etc/environment
# Log proxy configuration
echo "$(date): Configured proxy {self.current_proxy.host}:{self.current_proxy.port}" >> /var/log/proxy-setup.log
"""
return base64.b64encode(user_data_script.encode()).decode()
def _format_proxy_url(self, proxy: ProxyInfo) -> str:
"""格式化代理URL"""
if proxy.username and proxy.password:
return f"{proxy.protocol}://{proxy.username}:{proxy.password}@{proxy.host}:{proxy.port}"
else:
return f"{proxy.protocol}://{proxy.host}:{proxy.port}"
def start_emulator(self, path_to_vm: str, headless: bool, *args, **kwargs):
logger.info("Starting AWS VM with proxy configuration...")
ec2_client = boto3.client('ec2', region_name=self.region)
try:
# If the instance already exists, start it directly
ec2_client.start_instances(InstanceIds=[path_to_vm])
logger.info(f"Instance {path_to_vm} is starting...")
# Wait for the instance to be in the 'running' state
waiter = ec2_client.get_waiter('instance_running')
waiter.wait(InstanceIds=[path_to_vm], WaiterConfig={'Delay': WAIT_DELAY, 'MaxAttempts': MAX_ATTEMPTS})
logger.info(f"Instance {path_to_vm} is now running.")
except ClientError as e:
logger.error(f"Failed to start the AWS VM {path_to_vm}: {str(e)}")
raise
def create_instance_with_proxy(self, image_id: str, instance_type: str,
security_groups: list, subnet_id: str) -> str:
"""创建带有代理配置的新实例"""
ec2_client = boto3.client('ec2', region_name=self.region)
user_data = self._generate_proxy_user_data()
run_instances_params = {
"MaxCount": 1,
"MinCount": 1,
"ImageId": image_id,
"InstanceType": instance_type,
"EbsOptimized": True,
"NetworkInterfaces": [
{
"SubnetId": subnet_id,
"AssociatePublicIpAddress": True,
"DeviceIndex": 0,
"Groups": security_groups
}
]
}
if user_data:
run_instances_params["UserData"] = user_data
try:
response = ec2_client.run_instances(**run_instances_params)
instance_id = response['Instances'][0]['InstanceId']
logger.info(f"Created new instance {instance_id} with proxy configuration")
logger.info(f"Waiting for instance {instance_id} to be running...")
ec2_client.get_waiter('instance_running').wait(InstanceIds=[instance_id])
logger.info(f"Instance {instance_id} is ready.")
try:
instance_details = ec2_client.describe_instances(InstanceIds=[instance_id])
instance = instance_details['Reservations'][0]['Instances'][0]
public_ip = instance.get('PublicIpAddress', '')
if public_ip:
vnc_url = f"http://{public_ip}:5910/vnc.html"
logger.info("="*80)
logger.info(f"🖥️ VNC Web Access URL: {vnc_url}")
logger.info(f"📡 Public IP: {public_ip}")
logger.info(f"🆔 Instance ID: {instance_id}")
if self.current_proxy:
logger.info(f"🌐 Proxy: {self.current_proxy.host}:{self.current_proxy.port}")
logger.info("="*80)
print(f"\n🌐 VNC Web Access URL: {vnc_url}")
if self.current_proxy:
print(f"🔄 Current Proxy: {self.current_proxy.host}:{self.current_proxy.port}")
print(f"📍 Please open the above address in the browser for remote desktop access\n")
except Exception as e:
logger.warning(f"Failed to get VNC address for instance {instance_id}: {e}")
return instance_id
except ClientError as e:
logger.error(f"Failed to create instance with proxy: {str(e)}")
if self.current_proxy:
proxy_pool = get_global_proxy_pool()
proxy_pool.mark_proxy_failed(self.current_proxy)
self._rotate_proxy()
raise
def get_ip_address(self, path_to_vm: str) -> str:
logger.info("Getting AWS VM IP address...")
ec2_client = boto3.client('ec2', region_name=self.region)
try:
response = ec2_client.describe_instances(InstanceIds=[path_to_vm])
for reservation in response['Reservations']:
for instance in reservation['Instances']:
private_ip_address = instance.get('PrivateIpAddress', '')
public_ip_address = instance.get('PublicIpAddress', '')
if public_ip_address:
vnc_url = f"http://{public_ip_address}:5910/vnc.html"
logger.info("="*80)
logger.info(f"🖥️ VNC Web Access URL: {vnc_url}")
logger.info(f"📡 Public IP: {public_ip_address}")
logger.info(f"🏠 Private IP: {private_ip_address}")
if self.current_proxy:
logger.info(f"🌐 Proxy: {self.current_proxy.host}:{self.current_proxy.port}")
logger.info("="*80)
print(f"\n🌐 VNC Web Access URL: {vnc_url}")
if self.current_proxy:
print(f"🔄 Current Proxy: {self.current_proxy.host}:{self.current_proxy.port}")
print(f"📍 Please open the above address in the browser for remote desktop access\n")
else:
logger.warning("No public IP address available for VNC access")
return private_ip_address
return ''
except ClientError as e:
logger.error(f"Failed to retrieve IP address for the instance {path_to_vm}: {str(e)}")
raise
def save_state(self, path_to_vm: str, snapshot_name: str):
logger.info("Saving AWS VM state...")
ec2_client = boto3.client('ec2', region_name=self.region)
try:
image_response = ec2_client.create_image(InstanceId=path_to_vm, Name=snapshot_name)
image_id = image_response['ImageId']
logger.info(f"AMI {image_id} created successfully from instance {path_to_vm}.")
return image_id
except ClientError as e:
logger.error(f"Failed to create AMI from the instance {path_to_vm}: {str(e)}")
raise
def revert_to_snapshot(self, path_to_vm: str, snapshot_name: str):
logger.info(f"Reverting AWS VM to snapshot: {snapshot_name}...")
ec2_client = boto3.client('ec2', region_name=self.region)
try:
# Get original instance details for config.
instance_details = ec2_client.describe_instances(InstanceIds=[path_to_vm])
instance = instance_details['Reservations'][0]['Instances'][0]
security_groups = [sg['GroupId'] for sg in instance['SecurityGroups']]
subnet_id = instance['SubnetId']
instance_type = instance['InstanceType']
# Terminate the old instance. This is a non-blocking call.
logger.info(f"Initiating termination for old instance {path_to_vm}...")
ec2_client.terminate_instances(InstanceIds=[path_to_vm])
logger.info(f"Old instance {path_to_vm} termination initiated.")
# Rotate to a new proxy
self._rotate_proxy()
# Create a new instance
new_instance_id = self.create_instance_with_proxy(
snapshot_name, instance_type, security_groups, subnet_id
)
# Note: VNC address is displayed within create_instance_with_proxy
logger.info(f"Successfully launched new instance {new_instance_id} for revert.")
return new_instance_id
except ClientError as e:
logger.error(f"Failed to revert to snapshot {snapshot_name} for the instance {path_to_vm}: {str(e)}")
raise
def stop_emulator(self, path_to_vm, region=None):
logger.info(f"Stopping AWS VM {path_to_vm}...")
ec2_client = boto3.client('ec2', region_name=self.region)
try:
ec2_client.stop_instances(InstanceIds=[path_to_vm])
waiter = ec2_client.get_waiter('instance_stopped')
waiter.wait(InstanceIds=[path_to_vm], WaiterConfig={'Delay': WAIT_DELAY, 'MaxAttempts': MAX_ATTEMPTS})
logger.info(f"Instance {path_to_vm} has been stopped.")
except ClientError as e:
logger.error(f"Failed to stop the AWS VM {path_to_vm}: {str(e)}")
raise
def get_current_proxy_info(self) -> Optional[dict]:
"""获取当前代理信息"""
if self.current_proxy:
return {
'host': self.current_proxy.host,
'port': self.current_proxy.port,
'protocol': self.current_proxy.protocol,
'failed_count': self.current_proxy.failed_count
}
return None
def force_rotate_proxy(self):
"""强制轮换代理"""
logger.info("Force rotating proxy...")
if self.current_proxy:
proxy_pool = get_global_proxy_pool()
proxy_pool.mark_proxy_failed(self.current_proxy)
self._rotate_proxy()
def get_proxy_stats(self) -> dict:
"""获取代理池统计信息"""
proxy_pool = get_global_proxy_pool()
return proxy_pool.get_stats()

View File

@ -33,7 +33,7 @@ class ProxyPool:
self.load_proxies_from_file(config_file)
def load_proxies_from_file(self, config_file: str):
"""从配置文件加载代理列表"""
"""Load proxy list from config file"""
try:
with open(config_file, 'r') as f:
proxy_configs = json.load(f)
@ -54,7 +54,7 @@ class ProxyPool:
def add_proxy(self, host: str, port: int, username: str = None,
password: str = None, protocol: str = "http"):
"""添加代理到池中"""
"""Add proxy to pool"""
proxy = ProxyInfo(host=host, port=port, username=username,
password=password, protocol=protocol)
with self.lock:
@ -62,19 +62,19 @@ class ProxyPool:
logger.info(f"Added proxy {host}:{port}")
def get_next_proxy(self) -> Optional[ProxyInfo]:
"""获取下一个可用的代理"""
"""Get next available proxy"""
with self.lock:
if not self.proxies:
return None
# 过滤掉失败次数过多的代理
# Filter out proxies with too many failures
active_proxies = [p for p in self.proxies if self._is_proxy_available(p)]
if not active_proxies:
logger.warning("No active proxies available")
return None
# 轮询选择代理
# Round-robin selection of proxy
proxy = active_proxies[self.current_index % len(active_proxies)]
self.current_index += 1
proxy.last_used = time.time()
@ -82,22 +82,22 @@ class ProxyPool:
return proxy
def _is_proxy_available(self, proxy: ProxyInfo) -> bool:
"""检查代理是否可用"""
"""Check if proxy is available"""
if not proxy.is_active:
return False
if proxy.failed_count >= self.max_failures:
# 检查是否过了冷却时间
# Check if cooldown time has passed
if time.time() - proxy.last_used < self.cooldown_time:
return False
else:
# 重置失败计数
# Reset failure count
proxy.failed_count = 0
return True
def mark_proxy_failed(self, proxy: ProxyInfo):
"""标记代理失败"""
"""Mark proxy as failed"""
with self.lock:
proxy.failed_count += 1
if proxy.failed_count >= self.max_failures:
@ -105,13 +105,13 @@ class ProxyPool:
f"(failures: {proxy.failed_count})")
def mark_proxy_success(self, proxy: ProxyInfo):
"""标记代理成功"""
"""Mark proxy as successful"""
with self.lock:
proxy.failed_count = 0
def test_proxy(self, proxy: ProxyInfo, test_url: str = "http://httpbin.org/ip",
timeout: int = 10) -> bool:
"""测试代理是否正常工作"""
"""Test if proxy is working"""
try:
proxy_url = self._format_proxy_url(proxy)
proxies = {
@ -133,14 +133,14 @@ class ProxyPool:
return False
def _format_proxy_url(self, proxy: ProxyInfo) -> str:
"""格式化代理URL"""
"""Format proxy URL"""
if proxy.username and proxy.password:
return f"{proxy.protocol}://{proxy.username}:{proxy.password}@{proxy.host}:{proxy.port}"
else:
return f"{proxy.protocol}://{proxy.host}:{proxy.port}"
def get_proxy_dict(self, proxy: ProxyInfo) -> Dict[str, str]:
"""获取requests库使用的代理字典"""
"""Get proxy dictionary for requests library"""
proxy_url = self._format_proxy_url(proxy)
return {
'http': proxy_url,
@ -148,7 +148,7 @@ class ProxyPool:
}
def test_all_proxies(self, test_url: str = "http://httpbin.org/ip"):
"""测试所有代理"""
"""Test all proxies"""
logger.info("Testing all proxies...")
working_count = 0
@ -163,7 +163,7 @@ class ProxyPool:
return working_count
def get_stats(self) -> Dict:
"""获取代理池统计信息"""
"""Get proxy pool statistics"""
with self.lock:
total = len(self.proxies)
active = len([p for p in self.proxies if self._is_proxy_available(p)])
@ -176,18 +176,18 @@ class ProxyPool:
'success_rate': active / total if total > 0 else 0
}
# 全局代理池实例
# Global proxy pool instance
_proxy_pool = None
def get_global_proxy_pool() -> ProxyPool:
"""获取全局代理池实例"""
"""Get global proxy pool instance"""
global _proxy_pool
if _proxy_pool is None:
_proxy_pool = ProxyPool()
return _proxy_pool
def init_proxy_pool(config_file: str = None):
"""初始化全局代理池"""
"""Initialize global proxy pool"""
global _proxy_pool
_proxy_pool = ProxyPool(config_file)
return _proxy_pool

View File

@ -0,0 +1,153 @@
import os
import time
import json
from datetime import datetime, timedelta, timezone
import boto3
from botocore.exceptions import ClientError
def _resolve_scheduler_role_arn(logger) -> str:
# 1) Explicit env takes precedence
role_arn = os.getenv('AWS_SCHEDULER_ROLE_ARN', '').strip()
if role_arn:
return role_arn
# 2) Derive from role name + account id
role_name = os.getenv('AWS_SCHEDULER_ROLE_NAME', 'osworld-scheduler-ec2-terminate').strip()
try:
sts = boto3.client('sts')
account_id = sts.get_caller_identity()['Account']
derived_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
iam = boto3.client('iam')
try:
role = iam.get_role(RoleName=role_name)["Role"]
except ClientError:
auto_create = os.getenv('AWS_AUTO_CREATE_SCHEDULER_ROLE', 'true').lower() == 'true'
if not auto_create:
logger.warning(f"Scheduler role '{role_name}' not found and auto-create disabled.")
return ''
try:
trust_policy = {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {"Service": "scheduler.amazonaws.com"},
"Action": "sts:AssumeRole"
}
]
}
iam.create_role(RoleName=role_name, AssumeRolePolicyDocument=json.dumps(trust_policy))
role = iam.get_role(RoleName=role_name)["Role"]
except ClientError as ce:
# If another process created it, fetch again
try:
role = iam.get_role(RoleName=role_name)["Role"]
except ClientError:
logger.warning(f"Failed to auto-create scheduler role '{role_name}': {ce}")
return ''
# Ensure trust policy allows scheduler.amazonaws.com
assume_doc = role.get("AssumeRolePolicyDocument", {})
principal_ok = False
try:
for stmt in assume_doc.get("Statement", []):
principal = stmt.get("Principal", {})
svc = principal.get("Service")
if isinstance(svc, str) and svc == "scheduler.amazonaws.com":
principal_ok = True
break
if isinstance(svc, list) and "scheduler.amazonaws.com" in svc:
principal_ok = True
break
except Exception:
principal_ok = False
if not principal_ok:
trust_policy = {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {"Service": "scheduler.amazonaws.com"},
"Action": "sts:AssumeRole"
}
]
}
iam.update_assume_role_policy(RoleName=role_name, PolicyDocument=json.dumps(trust_policy))
# Ensure minimal inline policy exists
inline_name = f"{role_name}-inline"
need_policy = False
try:
iam.get_role_policy(RoleName=role_name, PolicyName=inline_name)
except ClientError:
need_policy = True
if need_policy:
inline_policy = {
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["ec2:TerminateInstances", "ec2:DescribeInstances"],
"Resource": "*"
}
]
}
iam.put_role_policy(RoleName=role_name, PolicyName=inline_name, PolicyDocument=json.dumps(inline_policy))
# Wait for IAM propagation
time.sleep(8)
logger.info(f"Derived AWS_SCHEDULER_ROLE_ARN={derived_arn} from role name '{role_name}'")
return derived_arn
except Exception as e:
logger.warning(f"Failed to resolve Scheduler Role ARN: {e}")
return ''
def schedule_instance_termination(region: str, instance_id: str, ttl_seconds: int, role_arn: str, logger) -> None:
if not role_arn:
role_arn = _resolve_scheduler_role_arn(logger)
if not role_arn:
logger.info("Scheduler role ARN not available; skipping TTL schedule creation.")
return
scheduler_client = boto3.client('scheduler', region_name=region)
schedule_name = f"osworld-ttl-{instance_id}-{int(time.time())}"
eta_scheduler = datetime.now(timezone.utc) + timedelta(seconds=ttl_seconds)
schedule_expression = f"at({eta_scheduler.strftime('%Y-%m-%dT%H:%M:%S')})"
target_arn = "arn:aws:scheduler:::aws-sdk:ec2:terminateInstances"
input_payload = '{"InstanceIds":["' + instance_id + '"]}'
# Retry to tolerate IAM eventual consistency
last_err = None
for attempt in range(1, 7):  # up to ~60s of retries
try:
scheduler_client.create_schedule(
Name=schedule_name,
ScheduleExpression=schedule_expression,
FlexibleTimeWindow={"Mode": "OFF"},
ActionAfterCompletion='DELETE',
Target={
"Arn": target_arn,
"RoleArn": role_arn,
"Input": input_payload
},
State='ENABLED',
Description=f"OSWorld TTL terminate for {instance_id}"
)
logger.info(f"Scheduled EC2 termination via EventBridge Scheduler: name={schedule_name}, when={eta_scheduler.isoformat()} (UTC)")
last_err = None
break
except ClientError as e:
last_err = e
code = e.response.get('Error', {}).get('Code')
msg = e.response.get('Error', {}).get('Message', '')
if code == 'ValidationException' and 'must allow AWS EventBridge Scheduler to assume the role' in msg:
time.sleep(10)
continue
else:
raise
if last_err is not None:
# If we exhausted retries, re-raise to surface warning upstream
raise last_err
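A minimal usage sketch: passing an empty `role_arn` lets the helper resolve (or auto-create) the scheduler role before creating the one-shot schedule:
```python
# Minimal sketch: schedule termination of an instance two hours from now.
import logging
from desktop_env.providers.aws.scheduler_utils import schedule_instance_termination

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ttl-demo")

schedule_instance_termination(
    region="us-east-1",
    instance_id="i-0123456789abcdef0",  # placeholder instance ID
    ttl_seconds=2 * 60 * 60,
    role_arn="",  # empty: resolve via env, or derive/create from the role name
    logger=logger,
)
```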

View File

@ -67,7 +67,9 @@ class AzureVMManager(VMManager):
free_vms.append((vm_path, pid_str))
return free_vms
def get_vm_path(self, region):
def get_vm_path(self, region, screen_size=(1920, 1080), **kwargs):
# Note: screen_size parameter is ignored for Azure provider
# but kept for interface consistency with other providers
self.check_and_clean()
free_vms_paths = self.list_free_vms(region)
if len(free_vms_paths) == 0:

View File

@ -32,7 +32,9 @@ class AzureProvider(Provider):
self.compute_client = ComputeManagementClient(credential, self.subscription_id)
self.network_client = NetworkManagementClient(credential, self.subscription_id)
def start_emulator(self, path_to_vm: str, headless: bool):
def start_emulator(self, path_to_vm: str, headless: bool, os_type: str = None, *args, **kwargs):
# Note: os_type parameter is ignored for Azure provider
# but kept for interface consistency with other providers
logger.info("Starting Azure VM...")
resource_group_name, vm_name = path_to_vm.split('/')

View File

View File

@ -34,6 +34,12 @@ def _download_vm(vms_dir: str):
logger.info("Downloading the virtual machine image...")
downloaded_size = 0
# Check for HF_ENDPOINT environment variable and replace domain if set to hf-mirror.com
hf_endpoint = os.environ.get('HF_ENDPOINT')
if hf_endpoint and 'hf-mirror.com' in hf_endpoint:
URL = URL.replace('huggingface.co', 'hf-mirror.com')
logger.info(f"Using HF mirror: {URL}")
downloaded_file_name = DOWNLOADED_FILE_NAME
os.makedirs(vms_dir, exist_ok=True)
@ -93,7 +99,8 @@ class DockerVMManager(VMManager):
def check_and_clean(self):
pass
def delete_vm(self, vm_path):
def delete_vm(self, vm_path, region=None, **kwargs):
# Fixed: Added region and **kwargs parameters for interface compatibility
pass
def initialize_registry(self):
@ -102,15 +109,25 @@ class DockerVMManager(VMManager):
def list_free_vms(self):
return os.path.join(VMS_DIR, DOWNLOADED_FILE_NAME)
def occupy_vm(self, vm_path):
def occupy_vm(self, vm_path, pid, region=None, **kwargs):
# Fixed: Added pid, region and **kwargs parameters for interface compatibility
pass
def get_vm_path(self, os_type, region):
def get_vm_path(self, os_type, region, screen_size=(1920, 1080), **kwargs):
# Note: screen_size parameter is ignored for Docker provider
# but kept for interface consistency with other providers
global URL, DOWNLOADED_FILE_NAME
if os_type == "Ubuntu":
URL = UBUNTU_X86_URL
elif os_type == "Windows":
URL = WINDOWS_X86_URL
# Check for HF_ENDPOINT environment variable and replace domain if set to hf-mirror.com
hf_endpoint = os.environ.get('HF_ENDPOINT')
if hf_endpoint and 'hf-mirror.com' in hf_endpoint:
URL = URL.replace('huggingface.co', 'hf-mirror.com')
logger.info(f"Using HF mirror: {URL}")
DOWNLOADED_FILE_NAME = URL.split('/')[-1]
if DOWNLOADED_FILE_NAME.endswith(".zip"):

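Both mirror hooks above key off the same variable; a usage sketch (set it before the manager builds its download URL):
```python
# Usage sketch: route the VM image download through hf-mirror.com.
import os
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"  # checked when the URL is built

from desktop_env.providers.docker.manager import DockerVMManager  # mirror applied at download time
```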
View File

@ -97,11 +97,20 @@ class DockerProvider(Provider):
self.vlc_port = self._get_available_port(8080)
# Start container while still holding the lock
# Check if KVM is available
devices = []
if os.path.exists("/dev/kvm"):
devices.append("/dev/kvm")
logger.info("KVM device found, using hardware acceleration")
else:
self.environment["KVM"] = "N"
logger.warning("KVM device not found, running without hardware acceleration (will be slower)")
self.container = self.client.containers.run(
"happysixd/osworld-docker",
environment=self.environment,
cap_add=["NET_ADMIN"],
devices=["/dev/kvm"],
devices=devices,
volumes={
os.path.abspath(path_to_vm): {
"bind": "/System.qcow2",
@ -144,7 +153,9 @@ class DockerProvider(Provider):
def revert_to_snapshot(self, path_to_vm: str, snapshot_name: str):
self.stop_emulator(path_to_vm)
def stop_emulator(self, path_to_vm: str):
def stop_emulator(self, path_to_vm: str, region=None, *args, **kwargs):
# Note: region parameter is ignored for Docker provider
# but kept for interface consistency with other providers
if self.container:
logger.info("Stopping VM...")
try:

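The provider's KVM probe can be replicated up front to predict whether the container will get hardware acceleration; a minimal sketch:
```python
# Minimal sketch: the same /dev/kvm existence check the provider performs.
import os

if os.path.exists("/dev/kvm"):
    print("KVM available: the container will use hardware acceleration")
else:
    print("No /dev/kvm: the VM will run fully emulated (noticeably slower)")
```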
View File

@ -57,6 +57,12 @@ def _install_vm(vm_name, vms_dir, downloaded_file_name, original_vm_name="Ubuntu
else:
raise Exception("Unsupported platform or architecture.")
# Check for HF_ENDPOINT environment variable and replace domain if set to hf-mirror.com
hf_endpoint = os.environ.get('HF_ENDPOINT')
if hf_endpoint and 'hf-mirror.com' in hf_endpoint:
url = url.replace('huggingface.co', 'hf-mirror.com')
logger.info(f"Using HF mirror: {url}")
# Download the virtual machine image
logger.info("Downloading the virtual machine image...")
downloaded_size = 0
@ -428,7 +434,9 @@ class VirtualBoxVMManager(VMManager):
free_vms.append((vm_path, pid_str))
return free_vms
def get_vm_path(self, os_type, region=None):
def get_vm_path(self, os_type, region=None, screen_size=(1920, 1080), **kwargs):
# Note: screen_size parameter is ignored for VirtualBox provider
# but kept for interface consistency with other providers
if os_type != "Ubuntu":
raise ValueError("Only support Ubuntu for now.")

View File

@ -55,7 +55,9 @@ class VirtualBoxProvider(Provider):
logger.error(f"Error executing command: {e.output.decode().strip()}")
def start_emulator(self, path_to_vm: str, headless: bool):
def start_emulator(self, path_to_vm: str, headless: bool, os_type: str = None, *args, **kwargs):
# Note: os_type parameter is ignored for VirtualBox provider
# but kept for interface consistency with other providers
logger.info("Starting VirtualBox VM...")
while True:
@ -113,7 +115,9 @@ class VirtualBoxProvider(Provider):
time.sleep(WAIT_TIME) # Wait for the VM to revert
return path_to_vm
def stop_emulator(self, path_to_vm: str):
def stop_emulator(self, path_to_vm: str, region=None, *args, **kwargs):
# Note: region parameter is ignored for VirtualBox provider
# but kept for interface consistency with other providers
logger.info("Stopping VirtualBox VM...")
uuid = VirtualBoxProvider._get_vm_uuid(path_to_vm)
VirtualBoxProvider._execute_command(["VBoxManage", "controlvm", uuid, "savestate"])

View File

@ -29,13 +29,11 @@ UBUNTU_X86_URL = "https://huggingface.co/datasets/xlangai/ubuntu_osworld/resolve
WINDOWS_X86_URL = "https://huggingface.co/datasets/xlangai/windows_osworld/resolve/main/Windows-x86.zip"
# Determine the platform and CPU architecture to decide the correct VM image to download
if platform.system() == 'Darwin': # macOS
# if os.uname().machine == 'arm64': # Apple Silicon
URL = UBUNTU_ARM_URL
# else:
# url = UBUNTU_X86_URL
elif platform.machine().lower() in ['amd64', 'x86_64']:
# sometimes the system is 'Darwin' but the machine is x86-based.
if platform.machine().lower() in ['amd64', 'x86_64']:
URL = UBUNTU_X86_URL
elif platform.system() == 'Darwin': # macOS
URL = UBUNTU_ARM_URL
else:
raise Exception("Unsupported platform or architecture")
@ -125,15 +123,22 @@ def _install_vm(vm_name, vms_dir, downloaded_file_name, os_type, original_vm_nam
# Download the virtual machine image
logger.info("Downloading the virtual machine image...")
downloaded_size = 0
# sometimes the system is 'Darwin' but the machine is x86-based.
if os_type == "Ubuntu":
if platform.system() == 'Darwin':
URL = UBUNTU_ARM_URL
elif platform.machine().lower() in ['amd64', 'x86_64']:
if platform.machine().lower() in ['amd64', 'x86_64']:
URL = UBUNTU_X86_URL
elif platform.system() == 'Darwin':
URL = UBUNTU_ARM_URL
elif os_type == "Windows":
if platform.machine().lower() in ['amd64', 'x86_64']:
URL = WINDOWS_X86_URL
# Check for HF_ENDPOINT environment variable and replace domain if set to hf-mirror.com
hf_endpoint = os.environ.get('HF_ENDPOINT')
if hf_endpoint and 'hf-mirror.com' in hf_endpoint:
URL = URL.replace('huggingface.co', 'hf-mirror.com')
logger.info(f"Using HF mirror: {URL}")
DOWNLOADED_FILE_NAME = URL.split('/')[-1]
downloaded_file_name = DOWNLOADED_FILE_NAME
@ -417,7 +422,9 @@ class VMwareVMManager(VMManager):
free_vms.append((vm_path, pid_str))
return free_vms
def get_vm_path(self, os_type, region=None):
def get_vm_path(self, os_type, region=None, screen_size=(1920, 1080), **kwargs):
# Note: screen_size parameter is ignored for VMware provider
# but kept for interface consistency with other providers
with self.lock:
if not VMwareVMManager.checked_and_cleaned:
VMwareVMManager.checked_and_cleaned = True

View File

@ -97,7 +97,9 @@ class VMwareProvider(Provider):
time.sleep(WAIT_TIME) # Wait for the VM to revert
return path_to_vm
def stop_emulator(self, path_to_vm: str):
def stop_emulator(self, path_to_vm: str, region=None, *args, **kwargs):
# Note: region parameter is ignored for VMware provider
# but kept for interface consistency with other providers
logger.info("Stopping VMware VM...")
VMwareProvider._execute_command(["vmrun"] + get_vmrun_type(return_list=True) + ["stop", path_to_vm])
time.sleep(WAIT_TIME) # Wait for the VM to stop

View File

@ -0,0 +1,72 @@
# Volcengine ECS Provider Configuration Guide
This guide describes how to configure and use Volcengine ECS for the OSWorld desktop environment.
## Setup Process
1. **Volcengine account**: You need a valid Volcengine account. By default this script launches ECS instances on a pay-as-you-go basis, so keep the account balance above 100 (CNY).
2. **Access keys**: Create an AccessKey ID and SecretAccessKey in the Volcengine IAM console and grant them permission to control ECS.
3. **VPC setup**: Create a VPC, a subnet, and a security group in the target region.
4. **Custom image**: Create a custom OSWorld image.
5. We recommend completing the ECS creation flow manually once and recording the values of all the environment variables you will need.
## Environment Variables
Set the following environment variables in your `.env` file:
```bash
# Volcengine access credentials
VOLCENGINE_ACCESS_KEY_ID=your_access_key_id
VOLCENGINE_SECRET_ACCESS_KEY=your_secret_access_key
# ECS configuration
VOLCENGINE_REGION=ap-southeast-1
VOLCENGINE_IMAGE_ID=image-xxxxxxxxx
VOLCENGINE_INSTANCE_TYPE=ecs.e-c1m2.large
VOLCENGINE_SUBNET_ID=subnet-xxxxxxxxx
VOLCENGINE_SECURITY_GROUP_ID=sg-xxxxxxxxx
VOLCENGINE_ZONE_ID=zone-xxxxxxxxx
VOLCENGINE_DEFAULT_PASSWORD=your_default_password
```
## Required Volcengine Resources
### 1. VPC and Subnet
- Create a VPC in the target region
- Create a subnet inside the VPC
- Make sure the subnet has internet access so that VNC connections work
### 2. Security Group
**⚠️ Important**: Configure the ports exactly as listed below to keep OSWorld tasks from failing due to connection problems.
#### Inbound rules (8 rules required)
| Type | Protocol | Port Range | Source | Description |
|------|------|----------|--------|------|
| SSH | TCP | 22 | 0.0.0.0/0 | SSH access |
| HTTP | TCP | 80 | 172.31.0.0/16 | HTTP traffic |
| Custom TCP | TCP | 5000 | 172.31.0.0/16 | OSWorld backend service |
| Custom TCP | TCP | 5910 | 0.0.0.0/0 | NoVNC web access port |
| Custom TCP | TCP | 8006 | 172.31.0.0/16 | VNC service port |
| Custom TCP | TCP | 8080 | 172.31.0.0/16 | VLC service port |
| Custom TCP | TCP | 8081 | 172.31.0.0/16 | Additional service port |
| Custom TCP | TCP | 9222 | 172.31.0.0/16 | Chrome control port |
#### Outbound rules (1 rule required)
| Type | Protocol | Port Range | Destination | Description |
|------|------|----------|----------|------|
| All traffic | All | All | 0.0.0.0/0 | Allow all outbound traffic |
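Since most connection-related task failures trace back to these ports, it can help to verify reachability once an instance is up. Below is a minimal sketch, assuming a plain TCP connect is enough to confirm a rule is open; the address is a placeholder, and the ports opened only to 172.31.0.0/16 are reachable just from inside that private network.

```python
import socket

# Ports OSWorld relies on, taken from the inbound rules above.
PORTS = [22, 5000, 5910, 8006, 8080, 8081, 9222]

def check_ports(host: str, ports=PORTS, timeout: float = 3.0) -> None:
    """Try a TCP connect to each port and report the result."""
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                print(f"{host}:{port} reachable")
        except OSError as exc:
            print(f"{host}:{port} NOT reachable ({exc})")

# Hypothetical address; substitute the EIP printed by the manager.
check_ports("203.0.113.10")
```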
### 3. Custom Image
You need to create a custom OSWorld image for Volcengine ECS. Follow the instructions in the "Creating a Custom ECS Image for OSWorld" section.
## Creating a Custom ECS Image for OSWorld
This section explains how to create the custom ECS image required by the OSWorld desktop environment. The process involves setting up a base instance with a desktop environment and a VNC server, then creating a custom image from it.
### Image Creation Process
- Download the provided qcow2 image from the link in `desktop_env/providers/docker/manager.py`: https://huggingface.co/datasets/xlangai/ubuntu_osworld/resolve/main/Ubuntu.qcow2.zip
- Unzip the downloaded file and upload it to Volcengine Object Storage (TOS). Make sure the TOS bucket is in the same region where you plan to launch ECS instances.
- In the ECS console, open the "Images" page, click the "Import Image" button, and follow the instructions to import the qcow2 image from TOS.
- Once the import completes, the imported image will appear in the "Images" list.

View File

@ -0,0 +1,221 @@
import os
import logging
import signal
import dotenv
import time
import volcenginesdkcore
import volcenginesdkecs.models as ecs_models
from volcenginesdkecs.api import ECSApi
from desktop_env.providers.base import VMManager
# Load environment variables from .env file
dotenv.load_dotenv()
for env_name in [
"VOLCENGINE_ACCESS_KEY_ID",
"VOLCENGINE_SECRET_ACCESS_KEY",
"VOLCENGINE_REGION",
"VOLCENGINE_SUBNET_ID",
"VOLCENGINE_SECURITY_GROUP_ID",
"VOLCENGINE_INSTANCE_TYPE",
"VOLCENGINE_IMAGE_ID",
"VOLCENGINE_ZONE_ID",
"VOLCENGINE_DEFAULT_PASSWORD",
]:
if not os.getenv(env_name):
raise EnvironmentError(f"{env_name} must be set in the environment variables.")
logger = logging.getLogger("desktopenv.providers.volcengine.VolcengineVMManager")
logger.setLevel(logging.INFO)
VOLCENGINE_ACCESS_KEY_ID = os.getenv("VOLCENGINE_ACCESS_KEY_ID")
VOLCENGINE_SECRET_ACCESS_KEY = os.getenv("VOLCENGINE_SECRET_ACCESS_KEY")
VOLCENGINE_REGION = os.getenv("VOLCENGINE_REGION")
VOLCENGINE_SUBNET_ID = os.getenv("VOLCENGINE_SUBNET_ID")
VOLCENGINE_SECURITY_GROUP_ID = os.getenv("VOLCENGINE_SECURITY_GROUP_ID")
VOLCENGINE_INSTANCE_TYPE = os.getenv("VOLCENGINE_INSTANCE_TYPE")
VOLCENGINE_IMAGE_ID = os.getenv("VOLCENGINE_IMAGE_ID")
VOLCENGINE_ZONE_ID = os.getenv("VOLCENGINE_ZONE_ID")
VOLCENGINE_DEFAULT_PASSWORD = os.getenv("VOLCENGINE_DEFAULT_PASSWORD")
def _allocate_vm(screen_size=(1920, 1080)):
"""分配火山引擎虚拟机"""
# 初始化火山引擎客户端
configuration = volcenginesdkcore.Configuration()
configuration.region = VOLCENGINE_REGION
configuration.ak = VOLCENGINE_ACCESS_KEY_ID
configuration.sk = VOLCENGINE_SECRET_ACCESS_KEY
configuration.client_side_validation = True
# set default configuration
volcenginesdkcore.Configuration.set_default(configuration)
# use global default configuration
api_instance = ECSApi()
instance_id = None
original_sigint_handler = signal.getsignal(signal.SIGINT)
original_sigterm_handler = signal.getsignal(signal.SIGTERM)
def signal_handler(sig, frame):
if instance_id:
signal_name = "SIGINT" if sig == signal.SIGINT else "SIGTERM"
logger.warning(f"Received {signal_name} signal, terminating instance {instance_id}...")
try:
api_instance.delete_instance(ecs_models.DeleteInstanceRequest(
instance_id=instance_id,
))
logger.info(f"Successfully terminated instance {instance_id} after {signal_name}.")
except Exception as cleanup_error:
logger.error(f"Failed to terminate instance {instance_id} after {signal_name}: {str(cleanup_error)}")
# Restore original signal handlers
signal.signal(signal.SIGINT, original_sigint_handler)
signal.signal(signal.SIGTERM, original_sigterm_handler)
if sig == signal.SIGINT:
raise KeyboardInterrupt
else:
import sys
sys.exit(0)
try:
# Set up signal handlers
signal.signal(signal.SIGINT, signal_handler)
signal.signal(signal.SIGTERM, signal_handler)
# Build the instance creation parameters
create_instance_params = ecs_models.RunInstancesRequest(
image_id = VOLCENGINE_IMAGE_ID,
instance_type = VOLCENGINE_INSTANCE_TYPE,
network_interfaces=[ecs_models.NetworkInterfaceForRunInstancesInput(
subnet_id=VOLCENGINE_SUBNET_ID,
security_group_ids=[VOLCENGINE_SECURITY_GROUP_ID],
)],
eip_address=ecs_models.EipAddressForRunInstancesInput(
bandwidth_mbps = 5,
charge_type = "PayByTraffic",
),
instance_name = f"osworld-{os.getpid()}-{int(time.time())}",
volumes=[ecs_models.VolumeForRunInstancesInput(
volume_type="ESSD_PL0",
size=30,
)],
zone_id=VOLCENGINE_ZONE_ID,
password = VOLCENGINE_DEFAULT_PASSWORD, # default password
description = "OSWorld evaluation instance"
)
# Create the instance
response = api_instance.run_instances(create_instance_params)
instance_id = response.instance_ids[0]
logger.info(f"Waiting for instance {instance_id} to be running...")
# Wait for the instance to reach RUNNING
while True:
instance_info = api_instance.describe_instances(ecs_models.DescribeInstancesRequest(
instance_ids=[instance_id]
))
status = instance_info.instances[0].status
if status == 'RUNNING':
break
elif status in ['STOPPED', 'ERROR']:
raise Exception(f"Instance {instance_id} failed to start, status: {status}")
time.sleep(5)
logger.info(f"Instance {instance_id} is ready.")
# Fetch the instance's IP addresses
try:
instance_info = api_instance.describe_instances(ecs_models.DescribeInstancesRequest(
instance_ids=[instance_id]
))
public_ip = instance_info.instances[0].eip_address.ip_address
private_ip = instance_info.instances[0].network_interfaces[0].primary_ip_address
if public_ip:
vnc_url = f"http://{public_ip}:5910/vnc.html"
logger.info("="*80)
logger.info(f"🖥️ VNC Web Access URL: {vnc_url}")
logger.info(f"📡 Public IP: {public_ip}")
logger.info(f"🏠 Private IP: {private_ip}")
logger.info(f"🆔 Instance ID: {instance_id}")
logger.info("="*80)
print(f"\n🌐 VNC Web Access URL: {vnc_url}")
print(f"📍 Please open the above address in the browser for remote desktop access\n")
except Exception as e:
logger.warning(f"Failed to get VNC address for instance {instance_id}: {e}")
except KeyboardInterrupt:
logger.warning("VM allocation interrupted by user (SIGINT).")
if instance_id:
logger.info(f"Terminating instance {instance_id} due to interruption.")
api_instance.delete_instance(ecs_models.DeleteInstanceRequest(
instance_id=instance_id,
))
raise
except Exception as e:
logger.error(f"Failed to allocate VM: {e}", exc_info=True)
if instance_id:
logger.info(f"Terminating instance {instance_id} due to an error.")
api_instance.delete_instance(ecs_models.DeleteInstanceRequest(
instance_id=instance_id,
))
raise
finally:
# Restore original signal handlers
signal.signal(signal.SIGINT, original_sigint_handler)
signal.signal(signal.SIGTERM, original_sigterm_handler)
return instance_id
class VolcengineVMManager(VMManager):
"""
Volcengine VM Manager for managing virtual machines on Volcengine.
Volcengine does not need to maintain a registry of VMs, as it can dynamically allocate and deallocate VMs.
"""
def __init__(self, **kwargs):
self.initialize_registry()
def initialize_registry(self, **kwargs):
pass
def add_vm(self, vm_path, lock_needed=True, **kwargs):
pass
def _add_vm(self, vm_path):
pass
def delete_vm(self, vm_path, lock_needed=True, **kwargs):
pass
def _delete_vm(self, vm_path):
pass
def occupy_vm(self, vm_path, pid, lock_needed=True, **kwargs):
pass
def _occupy_vm(self, vm_path, pid):
pass
def check_and_clean(self, lock_needed=True, **kwargs):
pass
def _check_and_clean(self):
pass
def list_free_vms(self, lock_needed=True, **kwargs):
pass
def _list_free_vms(self):
pass
def get_vm_path(self, screen_size=(1920, 1080), **kwargs):
logger.info("Allocating a new VM in region: {region}".format(region=VOLCENGINE_REGION))
new_vm_path = _allocate_vm(screen_size=screen_size)
return new_vm_path
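Taken together with the provider in the next file, the instance ID returned here plays the role that a VM file path plays for the local providers. A usage sketch, with module paths assumed from the imports shown and the environment variables from the guide above already set:

```python
# Hypothetical end-to-end flow; module paths are assumptions.
from desktop_env.providers.volcengine.manager import VolcengineVMManager
from desktop_env.providers.volcengine.provider import VolcengineProvider

manager = VolcengineVMManager()
instance_id = manager.get_vm_path()        # allocates a fresh ECS instance
provider = VolcengineProvider()
provider.start_emulator(instance_id, headless=True)
ip = provider.get_ip_address(instance_id)  # private IP used by the harness
provider.stop_emulator(instance_id)        # terminates the instance
```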

View File

@ -0,0 +1,188 @@
import os
import time
import logging
import volcenginesdkcore
import volcenginesdkautoscaling
import volcenginesdkecs.models as ecs_models
from volcenginesdkcore.rest import ApiException
from volcenginesdkecs.api import ECSApi
from desktop_env.providers.base import Provider
from desktop_env.providers.volcengine.manager import _allocate_vm
logger = logging.getLogger("desktopenv.providers.volcengine.VolcengineProvider")
logger.setLevel(logging.INFO)
WAIT_DELAY = 15
MAX_ATTEMPTS = 10
class VolcengineProvider(Provider):
def __init__(self, **kwargs):
super().__init__(**kwargs)
self.region = os.getenv("VOLCENGINE_REGION", "eu-central-1")
self.client = self._create_client()
def _create_client(self) -> ECSApi:
configuration = volcenginesdkcore.Configuration()
configuration.ak = os.getenv('VOLCENGINE_ACCESS_KEY_ID')
configuration.sk = os.getenv('VOLCENGINE_SECRET_ACCESS_KEY')
configuration.region = os.getenv('VOLCENGINE_REGION')
configuration.client_side_validation = True
# set default configuration
volcenginesdkcore.Configuration.set_default(configuration)
return ECSApi()
def start_emulator(self, path_to_vm: str, headless: bool, *args, **kwargs):
logger.info("Starting Volcengine VM...")
try:
# Check the instance status
instance_info = self.client.describe_instances(ecs_models.DescribeInstancesRequest(
instance_ids=[path_to_vm]
))
status = instance_info.instances[0].status
logger.info(f"Instance {path_to_vm} current status: {status}")
if status == 'RUNNING':
logger.info(f"Instance {path_to_vm} is already running. Skipping start.")
return
if status == 'STOPPED':
# Start the instance
self.client.start_instance(ecs_models.StartInstancesRequest(instance_ids=[path_to_vm]))
logger.info(f"Instance {path_to_vm} is starting...")
# Wait for the instance to be running
for attempt in range(MAX_ATTEMPTS):
time.sleep(WAIT_DELAY)
instance_info = self.client.describe_instances(ecs_models.DescribeInstancesRequest(
instance_ids=[path_to_vm]
))
status = instance_info.instances[0].status
if status == 'RUNNING':
logger.info(f"Instance {path_to_vm} is now running.")
break
elif status == 'ERROR':
raise Exception(f"Instance {path_to_vm} failed to start")
elif attempt == MAX_ATTEMPTS - 1:
raise Exception(f"Instance {path_to_vm} failed to start within timeout")
else:
logger.warning(f"Instance {path_to_vm} is in status '{status}' and cannot be started.")
except ApiException as e:
logger.error(f"Failed to start the Volcengine VM {path_to_vm}: {str(e)}")
raise
def get_ip_address(self, path_to_vm: str) -> str:
logger.info("Getting Volcengine VM IP address...")
try:
instance_info = self.client.describe_instances(ecs_models.DescribeInstancesRequest(
instance_ids=[path_to_vm]
))
public_ip = instance_info.instances[0].eip_address.ip_address
private_ip = instance_info.instances[0].network_interfaces[0].primary_ip_address
if public_ip:
vnc_url = f"http://{public_ip}:5910/vnc.html"
logger.info("=" * 80)
logger.info(f"🖥️ VNC Web Access URL: {vnc_url}")
logger.info(f"📡 Public IP: {public_ip}")
logger.info(f"🏠 Private IP: {private_ip}")
logger.info("=" * 80)
print(f"\n🌐 VNC Web Access URL: {vnc_url}")
print(f"📍 Please open the above address in the browser for remote desktop access\n")
else:
logger.warning("No public IP address available for VNC access")
return private_ip
except ApiException as e:
logger.error(f"Failed to retrieve IP address for the instance {path_to_vm}: {str(e)}")
raise
def save_state(self, path_to_vm: str, snapshot_name: str):
logger.info("Saving Volcengine VM state...")
try:
# Create an image from the instance
response = self.client.create_image(ecs_models.CreateImageRequest(
snapshot_id=snapshot_name,
instance_id=path_to_vm,
description=f"OSWorld snapshot: {snapshot_name}"
))
image_id = response.image_id
logger.info(f"Image {image_id} created successfully from instance {path_to_vm}.")
return image_id
except ApiException as e:
logger.error(f"Failed to create image from the instance {path_to_vm}: {str(e)}")
raise
def revert_to_snapshot(self, path_to_vm: str, snapshot_name: str):
logger.info(f"Reverting Volcengine VM to snapshot: {snapshot_name}...")
try:
# Delete the original instance
self.client.delete_instance(ecs_models.DeleteInstanceRequest(
instance_id=path_to_vm,
))
logger.info(f"Old instance {path_to_vm} has been deleted.")
# Create a new instance
new_instance_id = _allocate_vm()
logger.info(f"New instance {new_instance_id} launched from image {snapshot_name}.")
logger.info(f"Waiting for instance {new_instance_id} to be running...")
# Wait for the new instance to be running
while True:
instance_info = self.client.describe_instances(ecs_models.DescribeInstancesRequest(
instance_ids=[new_instance_id]
))
status = instance_info.instances[0].status
if status == 'RUNNING':
break
elif status in ['STOPPED', 'ERROR']:
raise Exception(f"New instance {new_instance_id} failed to start, status: {status}")
time.sleep(5)
logger.info(f"Instance {new_instance_id} is ready.")
# Fetch the new instance's IP address
try:
instance_info = self.client.describe_instances(ecs_models.DescribeInstancesRequest(
instance_ids=[new_instance_id]
))
public_ip = instance_info.instances[0].eip_address.ip_address
if public_ip:
vnc_url = f"http://{public_ip}:5910/vnc.html"
logger.info("=" * 80)
logger.info(f"🖥️ New Instance VNC Web Access URL: {vnc_url}")
logger.info(f"📡 Public IP: {public_ip}")
logger.info(f"🆔 New Instance ID: {new_instance_id}")
logger.info("=" * 80)
print(f"\n🌐 New Instance VNC Web Access URL: {vnc_url}")
print(f"📍 Please open the above address in the browser for remote desktop access\n")
except Exception as e:
logger.warning(f"Failed to get VNC address for new instance {new_instance_id}: {e}")
return new_instance_id
except ApiException as e:
logger.error(f"Failed to revert to snapshot {snapshot_name} for the instance {path_to_vm}: {str(e)}")
raise
def stop_emulator(self, path_to_vm, region=None):
logger.info(f"Stopping Volcengine VM {path_to_vm}...")
try:
self.client.delete_instance(ecs_models.DeleteInstanceRequest(
instance_id=path_to_vm,
))
logger.info(f"Instance {path_to_vm} has been terminated.")
except ApiException as e:
logger.error(f"Failed to stop the Volcengine VM {path_to_vm}: {str(e)}")
raise

View File

@ -442,7 +442,7 @@ Since for some examples like change the settings of certain software, we hardcod
##### LibreOffice font installation
Some examples in LibreOffice Impress use non-default system fonts, and you need to download the corresponding **TTF files** and put them in the system fonts directory.
[Here](https://drive.usercontent.google.com/download?id=1UzmdsfUQRTnvCxkvWrKguwZM3G5eQk87&export=download&authuser=0&confirm=t&uuid=70b9fbb7-9585-4aa4-a2c0-a7d6126469a0&at=AEz70l4rdEjdxBpqkLyW9lcil6S5:1740142224052) we provide all the required fonts; just download the archive and unzip it into the system fonts directory (usually `/usr/share/fonts/`).
[Here](https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/fonts_20250608_fixed.zip) we provide all the required fonts; just download the archive and unzip it into the system fonts directory (usually `/usr/share/fonts/`).
```bash
unzip fonts.zip -d /usr/share/fonts/
```
@ -488,6 +488,20 @@ touch ~/.local/share/keyrings/login.keyring
Or you can disable the keyring service in any other way, which will prevent Chrome from requesting a password input.
3. VSCode Trust Settings:
To prevent VSCode from showing the "Do you trust this workspace?" dialog when opening projects, configure the following settings:
```bash
# Open VSCode
# Go to File -> Preferences -> Settings (or press Ctrl+,)
# In the search bar, type "workspace trust"
# Find "Security: Workspace Trust Enabled" and uncheck it
# Find "Security: Workspace Trust Banner" and set to "never"
# Find "Security: Workspace Trust Empty Window" and uncheck it
# Find "Security: Workspace Trust Startup Prompt" and set to "never"
```
This disables the workspace trust feature and prevents trust dialog prompts when opening projects.
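The same result can be scripted by writing the corresponding `security.workspace.trust.*` keys into VSCode's user settings file. A minimal sketch, assuming the default Linux settings path and that the file is plain JSON (no comments):

```python
import json
import os

# Default VSCode user settings path on Linux (assumed).
SETTINGS = os.path.expanduser("~/.config/Code/User/settings.json")

config = {}
if os.path.exists(SETTINGS):
    with open(SETTINGS) as f:
        config = json.load(f)  # will fail on JSONC-style comments

config.update({
    "security.workspace.trust.enabled": False,
    "security.workspace.trust.banner": "never",
    "security.workspace.trust.emptyWindow": False,
    "security.workspace.trust.startupPrompt": "never",
})

os.makedirs(os.path.dirname(SETTINGS), exist_ok=True)
with open(SETTINGS, "w") as f:
    json.dump(config, f, indent=4)
```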
### Network Configuration
@ -539,20 +553,30 @@ The following is the screenshot of the VLC configuration:
When VLC is open, the service will be running on port 8080.
##### Chrome Configuration
When you open Chrome through the GUI, it doesn't automatically enable remote debugging on port 1337 (the port we set for controlling the client machine's Chrome from the host, which gets forwarded to 9222).
This is because GUI startup and command-line startup use different parameters.
The consequence is that once Chrome is closed and reopened from the GUI, the host can no longer connect.
To ensure Chrome uses consistent debugging ports even after being closed and reopened, follow these steps:
1. Create or edit Chrome desktop entry:
```bash
sudo nano /usr/share/applications/google-chrome.desktop
sudo vim ~/.local/share/applications/google-chrome.desktop
```
If that doesn't work, try:
```
sudo vim /usr/share/applications/google-chrome.desktop
```
2. Modify the Exec lines to include debugging port:
```bash
# Find lines starting with "Exec=" and add the following flags:
--remote-debugging-port=1337 --remote-debugging-address=0.0.0.0
2. Modify **all** the `Exec` lines to include debugging port, for example, change:
```
Exec=/usr/bin/google-chrome-stable %U
```
into:
```
Exec=/usr/bin/google-chrome-stable --remote-debugging-port=1337 --remote-debugging-address=0.0.0.0 %U
```
In cases where need Chrome, the 1337 will be forwarded to 9222 in the virtual machine via socat.
In cases where Chrome is needed, port 1337 will be forwarded to 9222 inside the virtual machine via socat.
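To confirm the flag took effect, you can query Chrome's DevTools HTTP endpoint after relaunching it from the modified desktop entry. A minimal check, assuming it runs inside the VM against local port 1337:

```python
import json
import urllib.request

# Chrome serves /json/version on its remote-debugging port when the
# --remote-debugging-port flag is active.
with urllib.request.urlopen("http://localhost:1337/json/version", timeout=5) as resp:
    info = json.load(resp)

print(info.get("Browser"))               # e.g. "Chrome/120.0.0.0"
print(info.get("webSocketDebuggerUrl"))  # present only when debugging is on
```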
### Miscellaneous Settings

View File

@ -1370,7 +1370,7 @@ def open_file():
if window_found:
return "File opened and window activated successfully"
else:
return f"Failed to find window for {file_name} within {timeout} seconds.", 500
return f"Failed to find window for {file_name} within {TIMEOUT} seconds.", 500
except Exception as e:
return f"Failed to open {path}. Error: {e}", 500
@ -1568,5 +1568,230 @@ def end_recording():
return abort(500, description=f"Recording failed. The output file is missing or empty. ffmpeg stderr: {error_output}")
@app.route("/run_python", methods=['POST'])
def run_python():
data = request.json
code = data.get('code', None)
if not code:
return jsonify({'status': 'error', 'message': 'Code not supplied!'}), 400
# Create a temporary file to save the Python code
import tempfile
import uuid
# Generate unique filename
temp_filename = f"/tmp/python_exec_{uuid.uuid4().hex}.py"
try:
# Write code to temporary file
with open(temp_filename, 'w') as f:
f.write(code)
# Execute the file using subprocess to capture all output
result = subprocess.run(
['/usr/bin/python3', temp_filename],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
timeout=30 # 30 second timeout
)
# Clean up the temporary file
try:
os.remove(temp_filename)
except:
pass # Ignore cleanup errors
# Prepare response
output = result.stdout
error_output = result.stderr
# Combine output and errors if both exist
combined_message = output
if error_output:
combined_message += ('\n' + error_output) if output else error_output
# Determine status based on return code and errors
if result.returncode != 0:
status = 'error'
if not error_output:
# If no stderr but non-zero return code, add a generic error message
error_output = f"Process exited with code {result.returncode}"
combined_message = combined_message + '\n' + error_output if combined_message else error_output
else:
status = 'success'
return jsonify({
'status': status,
'message': combined_message,
'need_more': False, # Not applicable for file execution
'output': output, # stdout only
'error': error_output, # stderr only
'return_code': result.returncode
})
except subprocess.TimeoutExpired:
# Clean up the temporary file on timeout
try:
os.remove(temp_filename)
except:
pass
return jsonify({
'status': 'error',
'message': 'Execution timeout: Code took too long to execute',
'error': 'TimeoutExpired',
'need_more': False,
'output': None,
}), 500
except Exception as e:
# Clean up the temporary file on error
try:
os.remove(temp_filename)
except:
pass
# Capture the exception details
return jsonify({
'status': 'error',
'message': f'Execution error: {str(e)}',
'error': traceback.format_exc(),
'need_more': False,
'output': None,
}), 500
@app.route("/run_bash_script", methods=['POST'])
def run_bash_script():
data = request.json
script = data.get('script', None)
timeout = data.get('timeout', 100) # Default timeout of 100 seconds
working_dir = data.get('working_dir', None)
if not script:
return jsonify({
'status': 'error',
'output': 'Script not supplied!',
'error': "", # Always empty as requested
'returncode': -1
}), 400
# Expand user directory if provided
if working_dir:
working_dir = os.path.expanduser(working_dir)
if not os.path.exists(working_dir):
return jsonify({
'status': 'error',
'output': f'Working directory does not exist: {working_dir}',
'error': "", # Always empty as requested
'returncode': -1
}), 400
# Create a temporary script file
import tempfile
with tempfile.NamedTemporaryFile(mode='w', suffix='.sh', delete=False) as tmp_file:
if "#!/bin/bash" not in script:
script = "#!/bin/bash\n\n" + script
tmp_file.write(script)
tmp_file_path = tmp_file.name
try:
# Make the script executable
os.chmod(tmp_file_path, 0o755)
# Execute the script
if platform_name == "Windows":
# On Windows, use Git Bash or WSL if available, otherwise cmd
flags = subprocess.CREATE_NO_WINDOW
# Try to use bash if available (Git Bash, WSL, etc.)
result = subprocess.run(
['bash', tmp_file_path],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT, # Merge stderr into stdout
text=True,
timeout=timeout,
cwd=working_dir,
creationflags=flags,
shell=False
)
else:
# On Unix-like systems, use bash directly
flags = 0
result = subprocess.run(
['/bin/bash', tmp_file_path],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT, # Merge stderr into stdout
text=True,
timeout=timeout,
cwd=working_dir,
creationflags=flags,
shell=False
)
# Log the command execution for trajectory recording
_append_event("BashScript",
{"script": script, "output": result.stdout, "error": "", "returncode": result.returncode},
ts=time.time())
return jsonify({
'status': 'success' if result.returncode == 0 else 'error',
'output': result.stdout, # Contains both stdout and stderr merged
'error': "", # Always empty as requested
'returncode': result.returncode
})
except subprocess.TimeoutExpired:
return jsonify({
'status': 'error',
'output': f'Script execution timed out after {timeout} seconds',
'error': "", # Always empty as requested
'returncode': -1
}), 500
except FileNotFoundError:
# Bash not found, try with sh
try:
result = subprocess.run(
['sh', tmp_file_path],
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT, # Merge stderr into stdout
text=True,
timeout=timeout,
cwd=working_dir,
shell=False
)
_append_event("BashScript",
{"script": script, "output": result.stdout, "error": "", "returncode": result.returncode},
ts=time.time())
return jsonify({
'status': 'success' if result.returncode == 0 else 'error',
'output': result.stdout, # Contains both stdout and stderr merged
'error': "", # Always empty as requested
'returncode': result.returncode,
})
except Exception as e:
return jsonify({
'status': 'error',
'output': f'Failed to execute script: {str(e)}',
'error': "", # Always empty as requested
'returncode': -1
}), 500
except Exception as e:
return jsonify({
'status': 'error',
'output': f'Failed to execute script: {str(e)}',
'error': "", # Always empty as requested
'returncode': -1
}), 500
finally:
# Clean up the temporary file
try:
os.unlink(tmp_file_path)
except:
pass
if __name__ == '__main__':
app.run(debug=True, host="0.0.0.0")
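The two new endpoints form a small remote-execution API. A usage sketch from the host side, assuming the server listens on the VM's port 5000 (Flask's default) and that the `requests` package is installed; the address is a placeholder:

```python
import requests

BASE = "http://203.0.113.10:5000"  # hypothetical VM address

# /run_python: POST a code string; stdout, stderr, and the return code come back.
r = requests.post(f"{BASE}/run_python",
                  json={"code": "print('hello from the VM')"})
resp = r.json()
print(resp["status"], resp["return_code"], resp["output"])

# /run_bash_script: stdout and stderr are returned merged in 'output'.
r = requests.post(f"{BASE}/run_bash_script",
                  json={"script": "uname -a && whoami",
                        "timeout": 30,
                        "working_dir": "~"})
resp = r.json()
print(resp["status"], resp["returncode"], resp["output"])
```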

View File

@ -29,6 +29,32 @@
"chrome"
],
"evaluator": {
"postconfig": [
{
"type": "launch",
"parameters": {
"command": [
"pkill",
"chrome"
]
}
},
{
"type": "launch",
"parameters": {
"command": [
"google-chrome",
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "exact_match",
"result": {
"type": "enable_do_not_track"
@ -40,5 +66,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -63,5 +63,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -75,5 +75,7 @@
}
]
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -42,5 +42,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -58,5 +58,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -80,7 +80,6 @@
"dropLocationName": "Zürich",
"filterCriteria_carCategory": "large",
"filterCriteria_sortBy": "PRICE"
}
}
},
@ -98,12 +97,13 @@
"puYear": "{Year}",
"doDay": "{DayD}",
"doMonth": "{MonthD}",
"doYear":"{Year}"
"doYear": "{Year}"
}
}
}
]
},
"proxy": true,
"possibility_of_env_change": "medium"
"possibility_of_env_change": "medium",
"fixed_ip": false
}

View File

@ -69,5 +69,6 @@
}
},
"proxy": false,
"possibility_of_env_change": "medium"
"possibility_of_env_change": "medium",
"fixed_ip": false
}

View File

@ -29,6 +29,32 @@
"chrome"
],
"evaluator": {
"postconfig": [
{
"type": "launch",
"parameters": {
"command": [
"pkill",
"chrome"
]
}
},
{
"type": "launch",
"parameters": {
"command": [
"google-chrome",
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "is_expected_bookmarks",
"result": {
"type": "bookmarks"
@ -43,5 +69,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -47,6 +47,12 @@
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "exact_match",
@ -60,5 +66,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -7,7 +7,7 @@
{
"type": "execute",
"parameters": {
"command": "echo password | sudo -S apt update -y && echo password | sudo -S apt install jq -y",
"command": "echo {CLIENT_PASSWORD} | sudo -S apt update -y && echo {CLIENT_PASSWORD} | sudo -S apt install jq -y",
"shell": true
}
},
@ -54,5 +54,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -1,7 +1,7 @@
{
"id": "35253b65-1c19-4304-8aa4-6884b8218fc0",
"snapshot": "chrome",
"instruction": "Hey, I need a quick way back to this site. Could you whip up a shortcut on my desktop for me?",
"instruction": "Hey, I need a quick way back to this site. Could you whip up a shortcut on my desktop for me using Chrome's built-in feature?",
"source": "https://www.laptopmag.com/articles/how-to-create-desktop-shortcuts-for-web-pages-using-chrome",
"config": [
{
@ -49,5 +49,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -83,5 +83,7 @@
}
]
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -11,5 +11,7 @@
"evaluator": {
"func": "infeasible"
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -254,6 +254,12 @@
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "check_history_deleted",
@ -271,5 +277,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -1,7 +1,7 @@
{
"id": "47543840-672a-467d-80df-8f7c3b9788c9",
"snapshot": "chrome",
"instruction": "Show me the cars available for pickup at Boston Logan Intl Airport from the 10th to the 11th of next month, sorted by the number of seats to find the largest capacity.",
"instruction": "On the current website, show me the cars available for pickup at Boston Logan Intl Airport from the 10th to the 11th of next month, sorted by the number of seats to find the largest capacity.",
"source": "test_task_1",
"config": [
{
@ -113,5 +113,7 @@
}
]
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -47,6 +47,12 @@
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "check_enabled_experiments",
@ -64,5 +70,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -56,5 +56,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -18,7 +18,7 @@
{
"type": "execute",
"parameters": {
"command": "echo password | sudo -S apt-get update -y && echo password | sudo -S apt-get install unzip -y && unzip /home/user/Desktop/helloExtension.zip -d /home/user/Desktop/ && rm /home/user/Desktop/helloExtension.zip",
"command": "echo {CLIENT_PASSWORD} | sudo -S apt-get update -y && echo {CLIENT_PASSWORD} | sudo -S apt-get install unzip -y && unzip /home/user/Desktop/helloExtension.zip -d /home/user/Desktop/ && rm /home/user/Desktop/helloExtension.zip",
"shell": true
}
},
@ -58,5 +58,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -74,5 +74,7 @@
}
}
},
"proxy": true
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -38,6 +38,32 @@
"chrome"
],
"evaluator": {
"postconfig": [
{
"type": "launch",
"parameters": {
"command": [
"pkill",
"chrome"
]
}
},
{
"type": "launch",
"parameters": {
"command": [
"google-chrome",
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "is_expected_bookmarks",
"result": {
"type": "bookmarks"
@ -52,5 +78,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -38,6 +38,32 @@
"chrome"
],
"evaluator": {
"postconfig": [
{
"type": "launch",
"parameters": {
"command": [
"pkill",
"chrome"
]
}
},
{
"type": "launch",
"parameters": {
"command": [
"google-chrome",
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "is_cookie_deleted",
"result": {
"type": "cookie_data",
@ -53,5 +79,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -60,7 +60,7 @@
"goto_prefix": "https://www.",
"category": "class",
"class_multiObject_search_exist": {
"fT28tf":[
"fT28tf": [
"Black",
"$25 - $60",
"On sale",
@ -92,5 +92,7 @@
}
]
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -66,5 +66,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -70,5 +70,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -11,5 +11,7 @@
"evaluator": {
"func": "infeasible"
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -4,6 +4,20 @@
"instruction": "I want Chrome to warn me whenever I visit a potentially harmful or unsafe website. Can you enable this safety feature?",
"source": "https://www.quora.com/How-do-I-set-the-security-settings-for-the-Google-Chrome-browser-for-the-best-security#:~:text=Enable%20Safe%20Browsing:%20Chrome%20has%20a%20built%2Din,Security%20%3E%20Security%20%3E%20Enable%20Safe%20Browsing.",
"config": [
{
"type": "execute",
"parameters": {
"command": "echo {CLIENT_PASSWORD} | sudo -S apt update -y && echo {CLIENT_PASSWORD} | sudo -S apt install jq -y",
"shell": true
}
},
{
"type": "execute",
"parameters": {
"command": "mkdir -p /home/user/.config/google-chrome/Default && if [ ! -f /home/user/.config/google-chrome/Default/Preferences ]; then echo '{}' > /home/user/.config/google-chrome/Default/Preferences; fi && cd /home/user/.config/google-chrome/Default && jq '. + {\"safebrowsing\":{\"enabled\":false,\"enhanced\":false}}' Preferences > temp && mv temp Preferences",
"shell": true
}
},
{
"type": "launch",
"parameters": {
@ -31,16 +45,33 @@
"evaluator": {
"postconfig": [
{
"type": "execute",
"type": "launch",
"parameters": {
"command": "pkill chrome",
"shell": "true"
"command": [
"pkill",
"chrome"
]
}
},
{
"type": "launch",
"parameters": {
"command": [
"google-chrome",
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "exact_match",
"result": {
"type": "enable_enhanced_safety_browsing"
"type": "enable_safe_browsing"
},
"expected": {
"type": "rule",
@ -49,5 +80,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -31,10 +31,27 @@
"evaluator": {
"postconfig": [
{
"type": "execute",
"type": "launch",
"parameters": {
"command": "pkill chrome",
"shell": "true"
"command": [
"pkill",
"chrome"
]
}
},
{
"type": "launch",
"parameters": {
"command": [
"google-chrome",
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
@ -49,5 +66,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -75,5 +75,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -57,5 +57,7 @@
}
}
},
"proxy": true
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -43,7 +43,7 @@
"chrome"
],
"evaluator": {
"func": "is_expected_active_tab",
"func": "is_expected_url_pattern_match",
"result": {
"type": "active_url_from_accessTree",
"goto_prefix": "https://www."
@ -51,10 +51,13 @@
"expected": {
"type": "rule",
"rules": {
"type": "url",
"url": "https://www.dmv.virginia.gov/licenses-ids/license/applying/eligibility"
"expected": [
"^https://(www\\.)?dmv\\.virginia\\.gov/licenses-ids/license/applying/eligibility"
]
}
}
},
"proxy": true
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -56,5 +56,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -11,5 +11,7 @@
"evaluator": {
"func": "infeasible"
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -47,6 +47,12 @@
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "check_font_size",
@ -62,5 +68,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -44,40 +44,40 @@
],
"evaluator": {
"func": [
"exact_match",
"exact_match"
"is_expected_url_pattern_match",
"is_expected_url_pattern_match"
],
"conj": "or",
"result": [
{
"type": "url_dashPart",
"goto_prefix": "https://www.",
"partIndex": -1,
"needDeleteId": false,
"returnType": "string"
"type": "active_url_from_accessTree",
"goto_prefix": "https://www."
},
{
"type": "url_dashPart",
"goto_prefix": "https://www.",
"partIndex": -1,
"needDeleteId": false,
"returnType": "string"
"type": "active_url_from_accessTree",
"goto_prefix": "https://www."
}
],
"expected": [
{
"type": "rule",
"rules": {
"expected": "tamiflu.html#side-effects"
"expected": [
"^https://(www\\.)?drugs\\.com/tamiflu\\.html#side-effects"
]
}
},
{
"type": "rule",
"rules": {
"expected": "tamiflu-side-effects.html"
"expected": [
"^https://(www\\.)?drugs\\.com/sfx/tamiflu-side-effects\\.html"
]
}
}
]
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -1,7 +1,7 @@
{
"id": "b4f95342-463e-4179-8c3f-193cd7241fb2",
"snapshot": "chrome",
"instruction": "Find the next available date for Diamond.",
"instruction": "Find the Next Available dates for Diamond.",
"source": "test_task_1",
"config": [
{
@ -63,5 +63,6 @@
}
},
"proxy": false,
"possibility_of_env_change": "high"
"possibility_of_env_change": "high",
"fixed_ip": false
}

View File

@ -66,10 +66,10 @@
"goto_prefix": "https://www.",
"category": "xpath",
"xpathObject": {
"/html/body/div[1]/main/div[3]/div[5]/div[2]/div/div[1]/div/div/div/div[1]/div/button/div[3]": "from",
"/html/body/div[1]/main/div[3]/div[5]/div[2]/div/div[1]/div/div/div/div[2]/button/div[3]": "to",
"/html/body/div[1]/main/div[3]/div[2]/div/div[1]/div/h2": "city",
"/html/body/div[1]/main/div[3]/div[5]/div[2]/div/div[1]/div/div/div/div[3]/button/div[3]/span/span[2]": "adult",
"/html/body/div[1]/main/div[3]/div[5]/div[2]/div/div[1]/div/div/div/div[1]/button/span/div/div": "from",
"/html/body/div[1]/main/div[3]/div[5]/div[2]/div/div[1]/div/div/div/div[2]/button/span/div/div": "to",
"/html/body/div[1]/main/div[3]/div[2]/div/div/div/h2": "city",
"/html/body/div[1]/main/div[3]/div[5]/div[2]/div/div[1]/div/div/div/div[3]/button/span/div/div": "adult",
"/html/body/div[1]/main/div[3]/div[5]/div[2]/div/div[3]/div/div[2]/div/div/div[2]/div/button/div/div": "rank"
}
}
@ -101,10 +101,10 @@
},
"timezone": "America/New_York",
"expected": {
"from": "{DoW}, {Month} {Day0D}",
"to": "{DoW}, {Month} {Day0D}",
"from": "Check In{DoW}, {Month} {Day0D}",
"to": "Check Out{DoW}, {Month} {Day0D}",
"city": "New York City Hotels",
"adult": "2 guests",
"adult": "Rooms/Guests1 Room, 2 Guests",
"rank": "Price (low to high)"
}
}
@ -112,5 +112,6 @@
]
},
"proxy": true,
"possibility_of_env_change": "medium"
"possibility_of_env_change": "high",
"fixed_ip": false
}

View File

@ -29,6 +29,32 @@
"chrome"
],
"evaluator": {
"postconfig": [
{
"type": "launch",
"parameters": {
"command": [
"pkill",
"chrome"
]
}
},
{
"type": "launch",
"parameters": {
"command": [
"google-chrome",
"--remote-debugging-port=1337"
]
}
},
{
"type": "sleep",
"parameters": {
"seconds": 3
}
}
],
"func": "match_in_list",
"result": {
"type": "default_search_engine"
@ -43,5 +69,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -52,11 +52,12 @@
"type": "rule",
"rules": {
"expected": [
"united.com/en/us/checked-bag-fee-calculator"
"united\\.com/en/us/checked-bag-fee-calculator(/.*)?"
]
}
}
},
"proxy": true,
"possibility_of_env_change": "medium"
"possibility_of_env_change": "medium",
"fixed_ip": false
}

View File

@ -79,5 +79,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -104,5 +104,7 @@
}
]
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -49,5 +49,7 @@
"dest": "LLM Powered Autonomous Agents _ Lil'Log_gold.pdf"
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -56,5 +56,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -56,5 +56,7 @@
}
}
},
"proxy": false
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -67,5 +67,6 @@
}
},
"proxy": true,
"possibility_of_env_change": "medium"
"possibility_of_env_change": "medium",
"fixed_ip": false
}

View File

@ -78,5 +78,7 @@
}
}
},
"proxy": true
"proxy": true,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

View File

@ -1,7 +1,7 @@
{
"id": "fc6d8143-9452-4171-9459-7f515143419a",
"snapshot": "chrome",
"instruction": "Find the status of tomorrow flights from New York airports to Columbus in Ohio.",
"instruction": "Find flights from New YorkKennedy Airport to Chicago O'Hare Airport for tomorrow.",
"source": "test_task_0",
"config": [
{
@ -65,12 +65,14 @@
"from": "tomorrow"
},
"expected": {
"start": "NYC",
"end": "CMH",
"start": "JFK",
"end": "ORD",
"time": "{DoW}, {Month} {Day0D}, {Year}"
}
}
}
},
"proxy": true
"proxy": false,
"fixed_ip": false,
"possibility_of_env_change": "low"
}

Some files were not shown because too many files have changed in this diff.