AskUI/PTA-1


maxiw committed 9 days ago

Commit 8eb6285 (1 parent: dcbe50f)

Update README.md

Files changed (1)

README.md +7 -7


@@ -12,9 +12,9 @@ base_model:
 # PTA-1: Controlling Computers with Small Models
-PTA (Prompt-to-Automation) is a vision language model for computer use applications based on Florence-2.
-With less than 300M parameters it beats larger models in GUI text and element localization.
-This allows low latency computer automations with local execution.
 **Model Input:** Screenshot + description_of_target_element
@@ -62,8 +62,8 @@ print(parsed_answer)
 ## Evaluation
-**Note:** This is a first version of our evaluation with 999 samples (333 samples from each dataset).
-We are still running all models on the full test sets. We are seeing +-5% deviations for a subset of the models we have already evaluated.
 | Model                                      | Parameters | Mean   | agentsea/wave-ui | AskUI/pta-text | ivelin/rico_refexp_combined |
 |--------------------------------------------|------------|--------|------------------|----------------|-----------------------------|
@@ -83,10 +83,10 @@ We are still running all models on the full test sets. We are seeing +-5% deviat
 \* Models is known to be trained on the train split of that dataset.
 The high benchmark scores for our model are partially due to data bias.
-Therefore we expect users of the model to fine-tune it according to the data distributions of their use case.
 #### Metrics
-Click success rate is calculated as the number of clicks inside the target bounding box.
 If a model predicts a target bounding box instead of a click coordinate, its center is used as its click prediction.

 # PTA-1: Controlling Computers with Small Models
+PTA (Prompt-to-Automation) is a vision language model for computer & phone automation, based on Florence-2.
+With only 270M parameters it outperforms much larger models in GUI text and element localization.
+This enables low-latency computer automation with local execution.
 **Model Input:** Screenshot + description_of_target_element
 ## Evaluation
+**Note:** This is a first version of our evaluation, based on 999 samples (333 samples from each dataset).
+We are still running all models on the full test sets, and we are seeing ±5% deviations for a subset of the models we have already evaluated.
 | Model                                      | Parameters | Mean   | agentsea/wave-ui | AskUI/pta-text | ivelin/rico_refexp_combined |
 |--------------------------------------------|------------|--------|------------------|----------------|-----------------------------|
 \* Models is known to be trained on the train split of that dataset.
 The high benchmark scores for our model are partially due to data bias.
+Therefore, we expect users of the model to fine-tune it according to the data distributions of their use case.
 #### Metrics
+Click success rate is calculated as the number of clicks inside the target bounding box relative to all clicks.
 If a model predicts a target bounding box instead of a click coordinate, its center is used as its click prediction.
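
The metric described in the updated README can be sketched as follows. This is a minimal illustration, not the evaluation code used for the table: the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the inclusive boundary check are all assumptions made for the example.

```python
from typing import Sequence, Tuple, Union

Point = Tuple[float, float]
# Assumed box layout: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def to_click(prediction: Union[Point, Box]) -> Point:
    """Return a click point; a predicted box is reduced to its center."""
    if len(prediction) == 4:
        x_min, y_min, x_max, y_max = prediction
        return ((x_min + x_max) / 2, (y_min + y_max) / 2)
    x, y = prediction
    return (x, y)


def click_success_rate(
    predictions: Sequence[Union[Point, Box]],
    targets: Sequence[Box],
) -> float:
    """Fraction of clicks that land inside their target bounding box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = 0
    for prediction, (x_min, y_min, x_max, y_max) in zip(predictions, targets):
        x, y = to_click(prediction)
        # Assumption: points on the box border count as inside.
        if x_min <= x <= x_max and y_min <= y <= y_max:
            hits += 1
    return hits / len(predictions)


if __name__ == "__main__":
    # Hypothetical data: one direct click, one predicted box, one miss.
    preds = [(50, 50), (0, 0, 20, 20), (500, 500)]
    boxes = [(40, 40, 60, 60), (5, 5, 15, 15), (0, 0, 10, 10)]
    print(f"click success rate: {click_success_rate(preds, boxes):.3f}")
```

Dividing by the total number of predictions matches the "relative to all clicks" wording added in this commit, so a model that outputs boxes and one that outputs points are scored on the same scale.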