---
license: llama2
pipeline_tag: image-text-to-text
---
# UGround
UGround is a strong GUI visual grounding model trained with a simple recipe. Check our homepage and paper for more details.
- Homepage: https://osu-nlp-group.github.io/UGround/
- Repository: https://github.com/OSU-NLP-Group/UGround
- Point of Contact: Boyu Gou
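As a quick orientation, here is a minimal inference sketch. It assumes the model id `osunlp/UGround` matches this repository, that the checkpoint can be driven through the generic `image-text-to-text` pipeline in a recent Transformers release, and that a coordinate-style prompt like the one shown is acceptable; the exact prompt template may differ, so treat the inference code in the repository as the authoritative reference.

```python
# Minimal sketch (assumptions: the model id "osunlp/UGround" matches this
# repository, the checkpoint is loadable by the generic image-text-to-text
# pipeline in a recent transformers release, and the prompt below is only
# illustrative; see the official inference code for the real template).
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="osunlp/UGround")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/screenshot.png"},
            {
                "type": "text",
                "text": "Where is the 'Sign in' button? Answer with (x, y) pixel coordinates.",
            },
        ],
    }
]

result = pipe(text=messages, max_new_tokens=32)
print(result[0]["generated_text"])
```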
## Model Weights

## Code
- Inference Code of UGround
- Offline Experiments
  - ScreenSpot (along with referring expressions generated by GPT-4/4o)
  - Multimodal-Mind2Web
  - OmniACT
- Online Experiments
  - Mind2Web-Live
  - AndroidWorld
## Data
- Data Examples
- Data Construction Scripts
- Guidance of Open-source Data
## Online Demo (HF Spaces)
## Citation Information
If you find this work useful, please consider citing our papers:
@article{gou2024uground,
  title={Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents},
  author={Boyu Gou and Ruohan Wang and Boyuan Zheng and Yanan Xie and Cheng Chang and Yiheng Shu and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2410.05243},
  year={2024},
  url={https://arxiv.org/abs/2410.05243},
}

@article{zheng2023seeact,
  title={GPT-4V(ision) is a Generalist Web Agent, if Grounded},
  author={Boyuan Zheng and Boyu Gou and Jihyung Kil and Huan Sun and Yu Su},
  journal={arXiv preprint arXiv:2401.01614},
  year={2024},
}