SeanScripts committed
Commit 55ae3f5
1 parent: 320db5c

Add inference example

Files changed (1): README.md (+61, -0)
README.md CHANGED
This model just *barely* fits in 48 GB (tested on 2 x 3090, and gets about 6 tok/s).

For 2 cards with 24 GB VRAM, this requires a very specific device map to work. For single cards with 48 GB VRAM, I imagine it works much more smoothly.

Example usage for image captioning with 2 x 24 GB VRAM GPUs:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig, StopStringCriteria
from PIL import Image
import time

# For 2 x 24 GB. If using 1 x 48 GB or more (lucky you), you can just use device_map="auto"
device_map = {
    "model.vision_backbone": "cpu",  # Seems to be required to not run out of memory at 48 GB
    "model.transformer.wte": 0,
    "model.transformer.ln_f": 0,
    "model.transformer.ff_out": 1,
}
# For 2 x 24 GB, this works with *only* 38 or 39. Any higher or lower and it'll only work for <= 1 token of output.
switch_point = 38  # layer index at which to switch to the second GPU
device_map |= {f"model.transformer.blocks.{i}": 0 for i in range(0, switch_point)}
device_map |= {f"model.transformer.blocks.{i}": 1 for i in range(switch_point, 80)}

model_name = "SeanScripts/Molmo-72B-0924-nf4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    use_safetensors=True,
    device_map=device_map,
    trust_remote_code=True,  # Required for Molmo at the moment.
)
model.model.vision_backbone.float()  # The vision backbone needs to be in FP32 for this.

processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,  # Required for Molmo at the moment.
)

torch.cuda.empty_cache()

image = Image.open("test.png")
inputs = processor.process(images=[image], text="Caption this image.")
inputs = {k: v.to("cuda:0").unsqueeze(0) for k, v in inputs.items()}
prompt_tokens = inputs["input_ids"].size(1)
print("Prompt tokens:", prompt_tokens)

t0 = time.time()
output = model.generate_from_batch(
    inputs,
    generation_config=GenerationConfig(
        max_new_tokens=256,
    ),
    stopping_criteria=[StopStringCriteria(tokenizer=processor.tokenizer, stop_strings=["<|endoftext|>"])],
    tokenizer=processor.tokenizer,
)
t1 = time.time()
total_time = t1 - t0
generated_tokens = output.size(1) - prompt_tokens
tokens_per_second = generated_tokens / total_time
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({tokens_per_second:.3f} tok/s)")

response = processor.tokenizer.decode(output[0, prompt_tokens:])
print(response)

torch.cuda.empty_cache()
```
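
If you're tuning `switch_point` for different hardware, it can help to confirm how much memory each card actually ends up using. Here's a minimal sketch using standard `torch.cuda` memory queries; this is an addition for debugging, not part of the original example:

```python
# Optional check (not in the original example): report per-GPU memory usage,
# e.g. right after loading the model or after generation, to help tune switch_point.
import torch

for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3    # currently allocated, in GiB
    peak = torch.cuda.max_memory_allocated(i) / 1024**3     # peak since process start, in GiB
    print(f"cuda:{i}: {allocated:.2f} GiB allocated, {peak:.2f} GiB peak")
```

If one card sits well below 24 GB while the other is nearly full, shifting `switch_point` by a layer in that direction is the first thing to try.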