QiushiSun committed 49f8c80 (verified) · Parent: 2c2ea5a

Update README.md

Files changed (1): README.md (+107 -3)
README.md CHANGED: the previous three-line front matter (`license: mit`) is replaced by the full model card below.

---
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-text-to-text
---

# OS-Atlas: A Foundation Action Model for Generalist GUI Agents

<div align="center">

[\[🏠Homepage\]](https://osatlas.github.io) [\[💻Code\]](https://github.com/OS-Copilot/OS-Atlas) [\[🚀Quick Start\]](#quick-start) [\[📝Paper\]](https://arxiv.org/abs/2410.23218) [\[🤗Models\]](https://huggingface.co/collections/OS-Copilot/os-atlas-67246e44003a1dfcc5d0d045) [\[🤗Data\]](https://huggingface.co/datasets/OS-Copilot/OS-Atlas-data) [\[🤗ScreenSpot-v2\]](https://huggingface.co/datasets/OS-Copilot/ScreenSpot-v2)

</div>

## Overview
![os-atlas](https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494)

OS-Atlas provides a series of models specifically designed for GUI agents.

For GUI grounding tasks, you can use:
- [OS-Atlas-Base-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-7B)
- [OS-Atlas-Base-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Base-4B)

For generating single-step actions in GUI agent tasks, you can use:
- [OS-Atlas-Pro-7B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-7B)
- [OS-Atlas-Pro-4B](https://huggingface.co/OS-Copilot/OS-Atlas-Pro-4B)

## Quick Start
OS-Atlas-Base-7B is a GUI grounding model fine-tuned from [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct).

**Note:** Our models accept images of any size as input. Model outputs are normalized to relative coordinates in the 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, remember to convert these relative coordinates back to the original image dimensions.
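
As a minimal sketch of that conversion (`to_pixels` is our hypothetical helper, not part of the released toolkit; it assumes a bounding box given as relative `(x1, y1), (x2, y2)` pairs and a known original image size):

```python
# Minimal sketch: map OS-Atlas relative coordinates (0-1000 range) back to pixels.
# `to_pixels` is a hypothetical helper, not part of the released code.
def to_pixels(box, width, height):
    (x1, y1), (x2, y2) = box
    return (
        (x1 / 1000 * width, y1 / 1000 * height),
        (x2 / 1000 * width, y2 / 1000 * height),
    )

# Example: the box from the sample output below on a 1920x1080 screenshot.
print(to_pixels(((576, 12), (592, 42)), 1920, 1080))
# ((1105.92, 12.96), (1136.64, 45.36))
```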

### Inference Example
First, ensure that the necessary dependencies are installed:
```
pip install transformers
pip install qwen-vl-utils
```
Then download the [example image](https://github.com/OS-Copilot/OS-Atlas/blob/main/examples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png) and save it to the current directory.
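
If you would rather fetch the image from a script, a small download helper works too; the `raw.githubusercontent.com` URL below is inferred from the blob link above, so double-check it if the download fails:

```python
# Download the example screenshot to the current directory.
# The raw URL is derived from the GitHub blob link and may need adjusting.
import urllib.request

url = (
    "https://raw.githubusercontent.com/OS-Copilot/OS-Atlas/main/"
    "examples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png"
)
urllib.request.urlretrieve(url, "./web_6f93090a-81f6-489e-bb35-1a2838b18c01.png")
```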

Inference code example:
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "OS-Copilot/OS-Atlas-Base-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("OS-Copilot/OS-Atlas-Base-7B")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
            },
            {"type": "text", "text": "In this UI screenshot, what is the position of the element corresponding to the command \"switch language of current page\" (with bbox)?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens so only the newly generated tokens remain
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

# Keep special tokens: the prediction is wrapped in <|box_start|>/<|box_end|> markers
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(output_text)
# <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
```
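
To act on that prediction, you still need to parse the tagged string and rescale the box. One possible approach is a small regex parser; this is our sketch, not part of the official pipeline. It assumes the single-box `<|box_start|>(x1,y1),(x2,y2)<|box_end|>` format shown above and reuses `output_text` and the downloaded screenshot from the example:

```python
# Sketch: extract the predicted box from the tagged output and rescale it to
# pixel coordinates on the original screenshot. Assumes the single-box format
# shown in the sample output above; parse_box is a hypothetical helper.
import re

from PIL import Image

def parse_box(output: str):
    m = re.search(r"<\|box_start\|>\((\d+),(\d+)\),\((\d+),(\d+)\)<\|box_end\|>", output)
    if m is None:
        raise ValueError(f"no box found in: {output!r}")
    return tuple(int(g) for g in m.groups())

image = Image.open("./web_6f93090a-81f6-489e-bb35-1a2838b18c01.png")
w, h = image.size
x1, y1, x2, y2 = parse_box(output_text[0])
pixel_box = (x1 / 1000 * w, y1 / 1000 * h, x2 / 1000 * w, y2 / 1000 * h)
print(pixel_box)  # top-left and bottom-right corners in pixels
```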

## Citation
If you find this repository helpful, feel free to cite our paper:
```bibtex
@article{wu2024atlas,
  title={OS-ATLAS: A Foundation Action Model for Generalist GUI Agents},
  author={Wu, Zhiyong and Wu, Zhenyu and Xu, Fangzhi and Wang, Yian and Sun, Qiushi and Jia, Chengyou and Cheng, Kanzhi and Ding, Zichen and Chen, Liheng and Liang, Paul Pu and others},
  journal={arXiv preprint arXiv:2410.23218},
  year={2024}
}
```