Why is the same input affected by batch size?
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained("HuggingFaceM4/siglip-so400m-14-384-flash-attn2", trust_remote_code=True)
model.eval().cuda().half()
pixel_values = torch.randn(1, 3, 384, 384).cuda().half()
with torch.inference_mode():
    # Run the same image at batch size 1 and batch size 2
    x = model.vision_model(pixel_values)
    y = model.vision_model(torch.vstack([pixel_values, pixel_values]))
# Compare the batch-size-1 output against the first row of the batched output
print(torch.sum(x.last_hidden_state - y.last_hidden_state[:1]))
The output is:
tensor(-5.1641, device='cuda:0', dtype=torch.float16)
The same phenomenon is observed with HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit.
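Side note for anyone debugging this: torch.sum can hide cancellation between positive and negative deviations, so a per-element comparison is more informative. A minimal sketch reusing the snippet above; the atol value is an arbitrary choice, not a reference threshold:

with torch.inference_mode():
    x = model.vision_model(pixel_values)
    y = model.vision_model(torch.vstack([pixel_values, pixel_values]))
diff = (x.last_hidden_state - y.last_hidden_state[:1]).abs()
# Max and mean absolute deviation give a clearer picture than the signed sum
print(diff.max().item(), diff.mean().item())
# allclose with a small tolerance separates fp16 rounding noise from a real mismatch
print(torch.allclose(x.last_hidden_state, y.last_hidden_state[:1], atol=1e-3))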
Hi @yuzaa,
I am unfortunately unable to reproduce your discrepancy with either HuggingFaceM4/siglip-so400m-14-384-flash-attn2 or HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit.
I simply copy-pasted your snippet, but the output is tensor(0., device='cuda:0', dtype=torch.float16) for me.
Can you say more about your setup?
@VictorSanh Thanks for the reply. Can you provide your versions of PyTorch and flash-attn? I tested the above code with:
torch==1.13.1+cu117
transformers==4.37
flash-attn==2.3.3
Also, I found that when I use HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit with the input pixel_values = torch.randn(1, 3, 980, 980).cuda().half(), I get the output tensor(0., device='cuda:0', dtype=torch.float16) (a quick resolution sweep is sketched below).
My setup:
torch==2.0.1+cu118
transformers==4.37.1
flash-attn==2.3.6
Could you try upgrading your environment and testing again?
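In case it helps narrow things down, here is a small sweep over input resolutions on the navit checkpoint. This is a sketch assuming the model accepts each of these sizes; the list of sizes is an arbitrary choice:

for size in (384, 448, 980):
    pv = torch.randn(1, 3, size, size).cuda().half()
    with torch.inference_mode():
        a = model.vision_model(pv).last_hidden_state
        b = model.vision_model(torch.vstack([pv, pv])).last_hidden_state[:1]
    # A non-zero max difference flags the batch-size dependency at this resolution
    print(size, (a - b).abs().max().item())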
Trying to narrow down the problem first.
Env:
torch==2.2.1+cu118
transformers==4.39.0
flash-attn==2.5.6
Output:
tensor(7.1602, device='cuda:0', dtype=torch.float16)
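One more data point that may help: forcing deterministic kernels rules out algorithm selection on the PyTorch side. A sketch; note that flash-attn itself is not covered by these flags, and fp16 kernels may still pick different reduction orders per batch shape:

# Disable TF32 and request deterministic algorithms before re-running the comparison
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
torch.use_deterministic_algorithms(True, warn_only=True)
with torch.inference_mode():
    a = model.vision_model(pixel_values).last_hidden_state
    b = model.vision_model(torch.vstack([pixel_values, pixel_values])).last_hidden_state[:1]
print((a - b).abs().max().item())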
Thanks, it's on my to-do list; I will try to reproduce with your configs!
Input size 980 -> OK, but 384 -> same bug. My env setup:
torch==2.1.1+cu118
transformers==4.37.0
I don't use flash-attn. I also find that the outputs of the visual embedding layer differ.
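To localize where the divergence first appears, one can compare intermediate activations directly. A sketch assuming the remote code exposes an embeddings submodule like the upstream SiglipVisionTransformer (that attribute name is an assumption):

with torch.inference_mode():
    # Compare the patch/positional embeddings before any attention layer runs
    e1 = model.vision_model.embeddings(pixel_values)
    e2 = model.vision_model.embeddings(torch.vstack([pixel_values, pixel_values]))[:1]
# If these already differ, the batch dependency starts before the attention stack
print((e1 - e2).abs().max().item())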