Why is the same input affected by batch size?
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained("HuggingFaceM4/siglip-so400m-14-384-flash-attn2", trust_remote_code=True)
model.eval().cuda().half()
pixel_values = torch.randn(1, 3, 384, 384).cuda().half()
with torch.inference_mode():
    # Run the same image at batch size 1 and batch size 2
    x = model.vision_model(pixel_values)
    y = model.vision_model(torch.vstack([pixel_values, pixel_values]))
# Compare the batch-size-1 output against the first row of the batched output
print(torch.sum(x.last_hidden_state - y.last_hidden_state[:1]))
The output is:
tensor(-5.1641, device='cuda:0', dtype=torch.float16)
The same phenomenon is observed with HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit.
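Side note for anyone debugging this: torch.sum can hide cancellation between positive and negative deviations, so a per-element comparison is more informative. A minimal sketch reusing the snippet above; the atol value is an arbitrary choice, not a reference threshold:

with torch.inference_mode():
    x = model.vision_model(pixel_values)
    y = model.vision_model(torch.vstack([pixel_values, pixel_values]))
diff = (x.last_hidden_state - y.last_hidden_state[:1]).abs()
# Max and mean absolute deviation give a clearer picture than the signed sum
print(diff.max().item(), diff.mean().item())
# allclose with a small tolerance separates fp16 rounding noise from a real mismatch
print(torch.allclose(x.last_hidden_state, y.last_hidden_state[:1], atol=1e-3))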
Hi @yuzaa,
I am unfortunately unable to reproduce your discrepancy with either HuggingFaceM4/siglip-so400m-14-384-flash-attn2 or HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit.
I simply copy-pasted your snippet, but the output is tensor(0., device='cuda:0', dtype=torch.float16) for me.
Can you say more about your setup?
@VictorSanh Thanks for the reply. Can you provide your versions of PyTorch and flash-attn? I tested the above code with:
torch==1.13.1+cu117
transformers==4.37
flash-attn==2.3.3
Also, I found that when I use HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit with the input pixel_values = torch.randn(1, 3, 980, 980).cuda().half(), I get the output tensor(0., device='cuda:0', dtype=torch.float16) (a quick resolution sweep is sketched below).
My setup:
torch==2.0.1+cu118
transformers==4.37.1
flash-attn==2.3.6
Could you try upgrading your environment and testing again?
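In case it helps narrow things down, here is a small sweep over input resolutions on the navit checkpoint. This is a sketch assuming the model accepts each of these sizes; the list of sizes is an arbitrary choice:

for size in (384, 448, 980):
    pv = torch.randn(1, 3, size, size).cuda().half()
    with torch.inference_mode():
        a = model.vision_model(pv).last_hidden_state
        b = model.vision_model(torch.vstack([pv, pv])).last_hidden_state[:1]
    # A non-zero max difference flags the batch-size dependency at this resolution
    print(size, (a - b).abs().max().item())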
Trying to narrow down the problem first.
Env:
torch==2.2.1+cu118
transformers==4.39.0
flash-attn==2.5.6
Output:
tensor(7.1602, device='cuda:0', dtype=torch.float16)
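One more data point that may help: forcing deterministic kernels rules out algorithm selection on the PyTorch side. A sketch; note that flash-attn itself is not covered by these flags, and fp16 kernels may still pick different reduction orders per batch shape:

# Disable TF32 and request deterministic algorithms before re-running the comparison
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
torch.use_deterministic_algorithms(True, warn_only=True)
with torch.inference_mode():
    a = model.vision_model(pixel_values).last_hidden_state
    b = model.vision_model(torch.vstack([pixel_values, pixel_values])).last_hidden_state[:1]
print((a - b).abs().max().item())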
Thanks, it's on my to-do list; I will try to reproduce with your configs!
Input size 980 -> OK, but 384 -> same bug. My env setup:
torch==2.1.1+cu118
transformers==4.37.0
I don't use flash-attn. I also find that the outputs of the visual embedding layer differ.
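To localize where the divergence first appears, one can compare intermediate activations directly. A sketch assuming the remote code exposes an embeddings submodule like the upstream SiglipVisionTransformer (that attribute name is an assumption):

with torch.inference_mode():
    # Compare the patch/positional embeddings before any attention layer runs
    e1 = model.vision_model.embeddings(pixel_values)
    e2 = model.vision_model.embeddings(torch.vstack([pixel_values, pixel_values]))[:1]
# If these already differ, the batch dependency starts before the attention stack
print((e1 - e2).abs().max().item())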