RTX 3090 24GB working with extra env var

#4 by cktlco

Thanks for the great work!

FYI, I was able to get the README demo script to run on Linux with an RTX 3090 24GB only after setting the following env var to avoid a CUDA OOM:

 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python readme_demo.py
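
In case it helps, the same setting can be applied from inside the script instead of the shell (a minimal sketch; the env var should be set before torch initializes CUDA):

    # Equivalent to the shell prefix above: set the allocator config in-process.
    # This must happen before torch touches CUDA, or the setting is ignored.
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the env var so the CUDA allocator picks it up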

It's running with this change, but it's extremely slow for me on an L4.

Agreed, it's too slow to be usable for anything other than ad hoc tests.

As an alternative for enthusiasts, here is a 4-bit quantized version of Molmo-7B that fits in ~12 GB of VRAM and is much more responsive:
https://huggingface.co/cyan2k/molmo-7B-O-bnb-4bit
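
For anyone who wants to try it, loading should be the usual trust_remote_code route; a minimal sketch (assumes bitsandbytes is installed; arguments other than the model id are illustrative):

    # Sketch: load the prequantized 4-bit checkpoint linked above.
    # Requires the bitsandbytes package; the quantization config ships with the repo.
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "cyan2k/molmo-7B-O-bnb-4bit"
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        device_map="auto",  # keeps the ~12 GB of quantized weights on the GPU
    )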

Nice! Yeah, the transformers integration is unfortunately extremely slow, since it for-loops through the experts in Python; we need to integrate it into vLLM/SGLang/llama.cpp like the other OLMoE models.
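
To illustrate why that is slow, here is a hypothetical simplification of a per-expert Python loop (not the actual modeling_molmoe.py code): each iteration launches its own small kernels on a token subset instead of one batched expert computation.

    # Naive MoE forward: one Python iteration per expert (illustrative only).
    import torch

    def moe_forward_naive(hidden, experts, router_logits, top_k=8):
        # hidden: (tokens, dim); experts: list of per-expert MLP modules
        weights, idx = torch.topk(router_logits.softmax(dim=-1), top_k, dim=-1)
        out = torch.zeros_like(hidden)
        for e, expert in enumerate(experts):  # Python-level loop over all experts
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            # small, per-expert kernel launches dominate runtime
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(hidden[token_ids])
        return out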

What hardware / variable settings are needed to run this model properly? I was able to run it on CPU only (64 GB RAM, i9 processor), but inference takes 165 seconds per image, and I get all kinds of errors if I use the CUDA GPU. These are my settings:
CUDA version: 12.4
CUDA device count: 1
Current CUDA device: 0
Current CUDA device name: NVIDIA GeForce RTX 4090 Laptop GPU

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'GPTNeoXTokenizer'.
The class this function is called from is 'Qwen2TokenizerFast'.
The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
Some parameters are on the meta device device because they were offloaded to the cpu.
Running on local URL: http://127.

To create a public link, set share=True in launch().
C:\Users\15023.venv\Lib\site-packages\transformers\generation\utils.py:1885: UserWarning: You are calling .generate() with the input_ids being on a device type different than your model's device. input_ids is on cpu, whereas the model is on cuda. You may experience unexpected behaviors or slower generation. Please make sure that you have put input_ids to the correct device by calling for example input_ids = input_ids.to('cuda') before running .generate().
warnings.warn(
C:\Users\15023.cache\huggingface\modules\transformers_modules\allenai\MolmoE-1B-0924\d33e4c2b8f093f5262875cad2c77fbf52e0c86ed\modeling_molmoe.py:1052: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:566.)
attn_output = F.scaled_dot_product_attention(
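
The first warning above is usually fixed by moving the processor outputs onto the model's device before generating; a minimal sketch, assuming a standard transformers generate() call:

    # Move every tensor the processor produced onto the model's device
    # before generation; fixes the "input_ids is on cpu" warning above.
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    output = model.generate(**inputs, max_new_tokens=200)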

////////////////////////////////////////

File "c:\Users\15023\Documents\Models\molmo_test.py", line 71, in describe_image_async
print(f"Input device: {inputs['pixel_values'].device}")
~~~~~~^^^^^^^^^^^^^^^^
KeyError: 'pixel_values'
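
That KeyError suggests Molmo's processor does not return a 'pixel_values' key the way many HF image processors do, so it is worth printing what it actually returned instead of guessing key names; a sketch (the unsqueeze(0) batching follows the model card's example):

    # Inspect the real processor output keys rather than assuming 'pixel_values'.
    print("Processor output keys:", list(inputs.keys()))
    # Then batch and move everything to the model's device in one pass:
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}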

I used torch.float16 (half precision) when loading the model to reduce memory usage.
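
For reference, half-precision loading looks roughly like this (checkpoint id from this repo; the other arguments are illustrative):

    # Load weights in float16, roughly halving memory vs the float32 default.
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "allenai/MolmoE-1B-0924",
        trust_remote_code=True,
        torch_dtype=torch.float16,
        device_map="auto",
    )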
