Why is the output for the same input affected by batch size?

Opened by @yuzaa:
import torch
from transformers import AutoModel

# load the model in fp16 on GPU (custom remote code with FlashAttention 2)
model = AutoModel.from_pretrained("HuggingFaceM4/siglip-so400m-14-384-flash-attn2", trust_remote_code=True)
model.eval().cuda().half()

# a single random image
pixel_values = torch.randn(1, 3, 384, 384).cuda().half()

with torch.inference_mode():
    x = model.vision_model(pixel_values)                                # batch size 1
    y = model.vision_model(torch.vstack([pixel_values, pixel_values]))  # same image stacked into batch size 2

# compare the batch-1 output with the first row of the batch-2 output
print(torch.sum(x.last_hidden_state - y.last_hidden_state[:1]))

The output is:

tensor(-5.1641, device='cuda:0', dtype=torch.float16)

The same phenomenon is observed with HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit.
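As a side note, a signed sum can both hide and exaggerate per-element differences through cancellation. A minimal sketch of a more informative check, reusing model and pixel_values from the snippet above and using a tolerance appropriate for fp16:

with torch.inference_mode():
    x = model.vision_model(pixel_values)
    y = model.vision_model(torch.vstack([pixel_values, pixel_values]))

diff = (x.last_hidden_state - y.last_hidden_state[:1]).abs()
print("max abs diff: ", diff.max().item())
print("mean abs diff:", diff.mean().item())
# fp16 carries roughly 3 significant decimal digits, so use a loose tolerance
print("allclose (atol=1e-2):",
      torch.allclose(x.last_hidden_state, y.last_hidden_state[:1], atol=1e-2))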

Hi @yuzaa,
I am unfortunately unable to reproduce the discrepancy with either HuggingFaceM4/siglip-so400m-14-384-flash-attn2 or HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit.
I simply copy-pasted your snippet, but the output is tensor(0., device='cuda:0', dtype=torch.float16) for me.
Can you say more about your setup?
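For example, the output of a short snippet along these lines would capture the relevant bits (a sketch; it assumes flash-attn, when installed, exposes __version__ like most packages):

import torch
import transformers

print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("gpu:", torch.cuda.get_device_name(0))
print("transformers:", transformers.__version__)
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn: not installed")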

@VictorSanh Thanks for the reply. Could you share your versions of PyTorch and flash-attn? I tested the above code with:

torch==1.13.1+cu117
transformers==4.37
flash-attn==2.3.3

Also, I found that when I use HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit with the input pixel_values = torch.randn(1, 3, 980, 980).cuda().half(), I do get tensor(0., device='cuda:0', dtype=torch.float16).

My setup:

torch==2.0.1+cu118
transformers==4.37.1
flash-attn==2.3.6

Could you try upgrading your environment and testing again?
I'm trying to narrow down the problem first.
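Another data point that might help narrow it down: check whether two identical batch-1 runs already disagree, which would point to general run-to-run non-determinism rather than a batch-size effect. A minimal sketch, reusing model and pixel_values from the snippet above:

with torch.inference_mode():
    a = model.vision_model(pixel_values).last_hidden_state  # batch size 1, first run
    b = model.vision_model(pixel_values).last_hidden_state  # batch size 1, second run
    c = model.vision_model(
        torch.vstack([pixel_values, pixel_values])
    ).last_hidden_state[:1]                                  # first row of a batch-2 run

print("run-to-run, same batch size: ", (a - b).abs().max().item())
print("batch size 1 vs batch size 2:", (a - c).abs().max().item())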

@VictorSanh

env

torch==2.2.1+cu118
transformers==4.39.0
flash-attn==2.5.6

output

tensor(7.1602, device='cuda:0', dtype=torch.float16)

Thanks, it's on my to-do list; I will try to reproduce with your configs!

980 works, but 384 shows the same bug. My env setup:

torch==2.1.1+cu118
transformers==4.37.0

I don't use flash-attn. I also find that the outputs of the visual embedding are different.
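Since this run doesn't go through flash-attn, one way to tell reduced-precision kernel variation apart from a real bug would be to repeat the comparison in float32. A rough sketch, assuming the checkpoint from the original post loads and runs in full precision in this setup:

import torch
from transformers import AutoModel

model_fp32 = AutoModel.from_pretrained(
    "HuggingFaceM4/siglip-so400m-14-384-flash-attn2", trust_remote_code=True
).eval().cuda().float()

pixel_values = torch.randn(1, 3, 384, 384).cuda()

with torch.inference_mode():
    x = model_fp32.vision_model(pixel_values).last_hidden_state
    y = model_fp32.vision_model(
        torch.vstack([pixel_values, pixel_values])
    ).last_hidden_state[:1]

# If the fp16 discrepancy is only batch-size-dependent kernel/reduction-order
# noise, the fp32 difference should shrink to roughly 1e-6 or below.
print("max abs diff (fp32):", (x - y).abs().max().item())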
