Maxi PRO

maxiw

AI & ML interests

Computer Agents | VLMs

maxiw's activity

Reacted to merve's post with 👀 1 day ago
Small yet mighty! 💫

We are releasing SmolVLM: a new 2B small vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient 🤠

We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base (collection: HuggingFaceTB/smolvlm-6740bd584b2dcbf51ecb1f39)

Learn more from our blog here: huggingface.co/blog/smolvlm
This release comes with a demo, fine-tuning code, MLX integration and TRL integration for DPO
Try the demo: HuggingFaceTB/SmolVLM
Fine-tuning Recipe: https://github.com/huggingface/smollm/blob/main/finetuning/Smol_VLM_FT.ipynb
Also TRL integration for DPO 💗
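
A minimal inference sketch with transformers, assuming the usual Idefics3-style chat-template API that the Instruct checkpoint uses (check the blog for the exact recipe; the image path is a placeholder):

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-Instruct", torch_dtype=torch.bfloat16)

image = load_image("example.jpg")  # placeholder image
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

generated_ids = model.generate(**inputs, max_new_tokens=200)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])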
Reacted to luigi12345's post with 👀 1 day ago
MinimalScrap
Only free dependencies. Save it, it is quite useful.


!pip install googlesearch-python requests
from googlesearch import search
import requests

query = "Glaucoma"
# Search NIH for PDFs and save each hit next to the notebook.
for url in search(f"{query} site:nih.gov filetype:pdf", num_results=20):
    if url.endswith(".pdf"):
        filename = url.split("/")[-1]
        with open(filename, "wb") as f:
            f.write(requests.get(url).content)
        print("✅ " + filename)
print("Done!")

posted an update 1 day ago
You can now try out computer use models from the Hub to automate your local machine with https://github.com/askui/vision-agent. 💻

import time
from askui import VisionAgent

with VisionAgent() as agent:
    # Open a browser window, then drive it with natural-language element descriptions.
    agent.tools.webbrowser.open_new("http://www.google.com")
    time.sleep(0.5)
    # Each click can be grounded by a different computer-use model from the Hub.
    agent.click("search field in the center of the screen", model_name="Qwen/Qwen2-VL-7B-Instruct")
    agent.type("cats")
    agent.keyboard("enter")
    time.sleep(0.5)
    agent.click("text 'Images'", model_name="AskUI/PTA-1")
    time.sleep(0.5)
    agent.click("second cat image", model_name="OS-Copilot/OS-Atlas-Base-7B")


Currently these models are integrated via the Gradio Spaces API. Local inference support is planned as well!

Currently supported:
- Qwen/Qwen2-VL-7B-Instruct
- Qwen/Qwen2-VL-2B-Instruct
- AskUI/PTA-1
- OS-Copilot/OS-Atlas-Base-7B
Reacted to prithivMLmods's post with 🔥 2 days ago
HF Posts Receipts 🏆🚀

[ HF POSTS RECEIPT ] : prithivMLmods/HF-POSTS-RECEIPT

🥠 The one thing you need to remember is the 'username'.

🥠 And yeah, thank you, @maxiw , for creating the awesome dataset and sharing it here! 🙌

🥠 [ Dataset ] : maxiw/hf-posts

posted an update 8 days ago
🤖 Controlling Computers with Small Models 🤖

We just released PTA-1, a fine-tuned Florence-2 for localization of GUI text and elements. It runs with ~150ms inference time on an RTX 4080. This means you can now start building fast on-device computer use agents!

Model: AskUI/PTA-1
Demo: AskUI/PTA-1
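
A minimal usage sketch, assuming PTA-1 keeps the stock Florence-2 remote-code interface; the task token and post-processing call mirror Florence-2's open-vocabulary detection API and are assumptions here:

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained("AskUI/PTA-1", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("AskUI/PTA-1", trust_remote_code=True)

image = Image.open("screenshot.png")  # placeholder screenshot
task = "<OPEN_VOCABULARY_DETECTION>"  # Florence-2 task token (assumption for this fine-tune)
prompt = task + "search field at the top of the page"

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(input_ids=inputs["input_ids"], pixel_values=inputs["pixel_values"], max_new_tokens=256)
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
print(processor.post_process_generation(raw, task=task, image_size=image.size))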
Reacted to rwightman's post with 🚀 9 days ago
New MobileNetV4 weights were uploaded a few days ago -- more ImageNet-12k training at 384x384 for the speedy 'Conv Medium' models.

There are 3 weight variants here for those who like to tinker. On my hold-out eval they are ordered as below and not that different, though the Adopt 180-epoch run is closer to AdamW 250 than to AdamW 180.
* AdamW for 250 epochs - timm/mobilenetv4_conv_medium.e250_r384_in12k
* Adopt for 180 epochs - timm/mobilenetv4_conv_medium.e180_ad_r384_in12k
* AdamW for 180 epochs - timm/mobilenetv4_conv_medium.e180_r384_in12k

This was by request, as a user reported impressive results using the 'Conv Large' ImageNet-12k pretrains as object detection backbones. ImageNet-1k fine-tunes are pending; the weights do behave differently with 180 vs 250 epochs and with the Adopt vs AdamW optimizer.
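
A quick way to pull one of these checkpoints with timm and get the matching eval transform (a minimal sketch; the dummy input just checks shapes):

import timm
import torch

model = timm.create_model("mobilenetv4_conv_medium.e250_r384_in12k", pretrained=True).eval()

# Build the eval transform matching the pretrained config (384x384 here).
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

with torch.inference_mode():
    logits = model(torch.randn(1, *cfg["input_size"]))  # dummy batch
print(logits.shape)  # logits over the ImageNet-12k label space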

Reacted to m-ric's post with 👀 9 days ago
Great feature alert: You can now use any Space as a tool for your transformers.agents! 🛠️🔥🔥

This lets you take the coolest Spaces, like FLUX.1-dev, and use them in agentic workflows with a few lines of code! 🧑‍💻

In the video below, I set up fake vacation pictures in which I'm awesome at surfing (I'm really not) 🏄

Head to the docs to learn this magic 👉 https://huggingface.co/docs/transformers/main/en/agents_advanced#import-a-space-as-a-tool-
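
A minimal sketch following the linked doc's pattern (requires a recent transformers release with the agents extras; the tool name and description below are just examples):

from transformers import Tool

# Wrap a Space as a callable tool (it can also be handed to an agent via tools=[...]).
image_generator = Tool.from_space(
    "black-forest-labs/FLUX.1-dev",
    name="image_generator",
    description="Generate an image from a text prompt",
)

image = image_generator("A surfer riding a huge wave at sunset")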
Reacted to merve's post with 👀 11 days ago
OmniVision-968M: a new local VLM for edge devices, fast & small but performant
💨 a new vision language model with 9x fewer image tokens, super efficient
📖 aligned with DPO to reduce hallucinations
⚡️ Apache 2.0 license 🔥

Demo: hf.co/spaces/NexaAIDev/omnivlm-dpo-demo
Model: NexaAIDev/omnivision-968M
Reacted to m-ric's post with ❤️ 13 days ago
๐—ง๐—ต๐—ฒ ๐—ป๐—ฒ๐˜…๐˜ ๐—ฏ๐—ถ๐—ด ๐˜€๐—ผ๐—ฐ๐—ถ๐—ฎ๐—น ๐—ป๐—ฒ๐˜๐˜„๐—ผ๐—ฟ๐—ธ ๐—ถ๐˜€ ๐—ป๐—ผ๐˜ ๐Ÿฆ‹, ๐—ถ๐˜'๐˜€ ๐—›๐˜‚๐—ฏ ๐—ฃ๐—ผ๐˜€๐˜๐˜€! [INSERT STONKS MEME WITH LASER EYES]

See below: I've gotten 105k impressions since I started regularly posting on Hub Posts, coming close to my 275k on Twitter!

โš™๏ธ Computed with the great dataset maxiw/hf-posts
โš™๏ธ Thanks to Qwen2.5-Coder-32B for showing me how to access dict attributes in a SQL request!

cc @merve who's far ahead of me
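
For reference, the kind of query this involves: a sketch with DuckDB over the dataset files (recent DuckDB versions can read hf:// paths via httpfs; the Parquet layout and the column names, including the struct-style author field, are assumptions about the schema):

import duckdb

con = duckdb.connect()
con.sql("""
    SELECT author.name AS author,                       -- dict/struct attribute access in SQL
           SUM(totalUniqueImpressions) AS impressions,  -- assumed impressions column
           COUNT(*) AS posts
    FROM 'hf://datasets/maxiw/hf-posts/**/*.parquet'    -- assumes the data is stored as Parquet
    GROUP BY author.name
    ORDER BY impressions DESC
    LIMIT 5
""").show()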
Reacted to merve's post with 😎 13 days ago
Microsoft released LLM2CLIP: a CLIP model with a longer context window for complex text inputs 🤯
All models with Apache 2.0 license here microsoft/llm2clip-672323a266173cfa40b32d4c

TL;DR: they replaced CLIP's text encoder with various LLMs fine-tuned on captioning, giving better top-k accuracy on retrieval.
This enables better image-text retrieval, better zero-shot image classification, and better vision language models 🔥
Read the paper to learn more: LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation (2411.04997)
replied to their post 14 days ago

I didn't know about the hub-stats dataset. Really cool, thx for sharing! 🤗

posted an update 14 days ago
I was curious to see what people post here on HF so I created a dataset with all HF Posts: maxiw/hf-posts

Some interesting stats:

Top 5 Authors by Total Impressions:
-----------------------------------
@merve : 171,783 impressions (68 posts)
@fdaudens : 135,253 impressions (81 posts)
@singhsidhukuldeep : 122,591 impressions (81 posts)
@akhaliq : 119,526 impressions (78 posts)
@MonsterMMORPG : 112,500 impressions (45 posts)

Top 5 Users by Number of Reactions Given:
----------------------------------------
@osanseviero : 1278 reactions
@clem : 910 reactions
@John6666 : 899 reactions
@victor : 674 reactions
@samusenps : 655 reactions

Top 5 Most Used Reactions:
-------------------------
โค๏ธ: 7048 times
๐Ÿ”ฅ: 5921 times
๐Ÿ‘: 4856 times
๐Ÿš€: 2549 times
๐Ÿค—: 2065 times
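
To reproduce numbers like these yourself, a small sketch with the datasets library (field names such as the nested author name and the impressions counter are assumptions about the schema):

from collections import Counter
from datasets import load_dataset

posts = load_dataset("maxiw/hf-posts", split="train")

impressions, post_counts = Counter(), Counter()
for post in posts:
    name = post["author"]["name"]                          # assumed nested author field
    impressions[name] += post.get("totalUniqueImpressions") or 0
    post_counts[name] += 1

for name, total in impressions.most_common(5):
    print(f"@{name}: {total:,} impressions ({post_counts[name]} posts)")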
posted an update 19 days ago
Exciting to see open-source models thriving in the computer agent space! 🔥
I just built a demo for OS-ATLAS: A Foundation Action Model For Generalist GUI Agents. Check it out here: maxiw/OS-ATLAS

The demo takes a screenshot plus an instruction as input and predicts bounding boxes.
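
You can also call the Space programmatically with gradio_client; the endpoint and parameter order below are assumptions, so check client.view_api() for the real signature:

from gradio_client import Client, handle_file

client = Client("maxiw/OS-ATLAS")
result = client.predict(
    handle_file("screenshot.png"),   # local screenshot (placeholder path)
    "click the search bar",          # the instruction to ground
    api_name="/predict",             # assumed endpoint name
)
print(result)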
Reacted to cbensimon's post with ❤️ 3 months ago
Hello everybody,

We've rolled out a major update to ZeroGPU! All the Spaces are now running on it.

Major improvements:

1. GPU cold starts are about twice as fast!
2. RAM usage reduced by two-thirds, allowing more effective resource usage, meaning more GPUs for the community!
3. ZeroGPU initializations (cold starts) can now be tracked and displayed (use progress=gr.Progress(track_tqdm=True), see the sketch after this list)
4. Improved compatibility and PyTorch integration, increasing the number of ZeroGPU-compatible Spaces without requiring any modifications!
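
A minimal ZeroGPU sketch for item 3; the model and pipeline here are arbitrary examples, not part of the announcement:

import gradio as gr
import spaces  # ZeroGPU decorator, available inside Spaces
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16)
pipe.to("cuda")  # under ZeroGPU the GPU is only attached while the decorated function runs

@spaces.GPU
def generate(prompt, progress=gr.Progress(track_tqdm=True)):
    # track_tqdm=True surfaces the pipeline's tqdm bars (and the ZeroGPU cold start) in the UI.
    return pipe(prompt, num_inference_steps=2, guidance_scale=0.0).images[0]

gr.Interface(generate, gr.Textbox(label="Prompt"), gr.Image()).launch()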

Feel free to reply in the post if you have any questions.

🤗 Best regards,
Charles
replied to m-ric's post 3 months ago
replied to their post 3 months ago

@fridayfairy this is not fine-tuned. It's the base model just prompted to return bounding boxes in a specific format. The Qwen2-VL models must have been pre-trained on detection data.

Reacted to rwightman's post with 👍 3 months ago
The timm leaderboard timm/leaderboard has been updated with the ability to select different hardware benchmark sets: RTX4090, RTX3090, two different CPUs along with some NCHW / NHWC layout and torch.compile (dynamo) variations.

Also worth pointing out, there are three rather newish 'test' models that you'll see at the top of any samples/sec comparison:
* test_vit (timm/test_vit.r160_in1k)
* test_efficientnet (timm/test_efficientnet.r160_in1k)
* test_byobnet (timm/test_byobnet.r160_in1k, a mix of resnet, darknet, effnet/regnet like blocks)

They are < 0.5M params, insanely fast, and originally intended for unit testing with real weights. They have awful ImageNet top-1; it's rare for anyone to bother training a model this small on ImageNet (the classifier is roughly 30-70% of the param count!). However, they are FAST on very limited hardware and you can fine-tune them well on small data. Could be the model you're looking for?
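
A minimal sketch of grabbing one of the test models for a small fine-tune (the class count, batch and hyperparameters are placeholders):

import timm
import torch
import torch.nn.functional as F

# <0.5M params: swap the (proportionally huge) ImageNet head for a fresh 10-class one and fine-tune everything.
model = timm.create_model("test_vit.r160_in1k", pretrained=True, num_classes=10)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Dummy batch at the model's native 160x160 resolution, just to show one training step.
x, y = torch.randn(8, 3, 160, 160), torch.randint(0, 10, (8,))
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()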
replied to their post 3 months ago
posted an update 3 months ago
The new Qwen2-VL models seem to perform quite well at object detection. You can prompt them to respond with bounding boxes in a reference frame of 1000 x 1000 pixels and scale those boxes to the original image size.

You can try it out with my space maxiw/Qwen2-VL-Detection
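
The rescaling step is simple arithmetic; a small sketch (the prompt format itself is handled by the Space and not shown here):

def scale_box(box, image_width, image_height, ref=1000):
    # Map a box predicted in the 1000x1000 reference frame back to pixel coordinates.
    x1, y1, x2, y2 = box
    return (x1 / ref * image_width, y1 / ref * image_height,
            x2 / ref * image_width, y2 / ref * image_height)

# e.g. a predicted box on a 1920x1080 screenshot
print(scale_box((100, 250, 400, 500), 1920, 1080))  # (192.0, 270.0, 768.0, 540.0)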

posted an update 3 months ago
Just added the newly released xGen-MM v1.5 foundational Large Multimodal Models (LMMs), developed by Salesforce AI Research, to my xGen-MM HF Space: maxiw/XGen-MM