[BUG] Unable to run inference with batchsize=4
Hi, I tested your open-sourced code and model a few days ago. It works quite well with batchsize = 1, but fails to produce normal output when batchsize > 1. More precisely, for all samples it generates content that looks very similar to the output for sample[0].
The problem with the current code probably lies in Kosmos2_5VisionLayer
https://huggingface.co/kirp/kosmos2_5/blob/bef6ac6ae6e461316affd896206a106abf8cdb3e/modeling_kosmos2_5.py#L867-L874
self_attention_outputs, _ = self.attention(
    hidden_states,
    attention_mask=attention_mask,
    layer_head_mask=head_mask,
    output_attentions=output_attentions,
)
attention_output = self_attention_outputs[0]
outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights
The attention module, for instance Kosmos2_5VisionAttention, returns a tuple:
class Kosmos2_5VisionAttention(nn.Module):
    # ...
    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_bias=None,
        layer_head_mask=None,
        output_attentions=False,
    ):
        # ...
        return attn_output, attn_weights
but the logic in Kosmos2_5VisionLayer unpacks that tuple, discards attn_weights, and then indexes attn_output as if it were still the tuple (attn_output, attn_weights). As a result, self_attention_outputs[0] selects only the first sample in the batch, with shape (4096, 1536). Broadcasting in the residual connection then hides the mistake, so no error is reported, but the result does seem incorrect.
To fix it, change the call like this:
self_attention_outputs, _ = self.attention(
👇
self_attention_outputs = self.attention(
If a PR is needed, I am happy to open one.
Please let me know whether I got this right. Thanks!
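The failure mode described above can be reproduced in a small sketch. This is not the model code: NumPy arrays stand in for the PyTorch tensors, the shapes are tiny instead of (4096, 1536), and `attention`, `buggy_layer`, and `fixed_layer` are hypothetical names whose only job is to mirror the unpack-then-index pattern and the residual addition.

```python
import numpy as np


def attention(hidden_states):
    """Hypothetical stand-in for the attention module: returns a 2-tuple."""
    attn_output = hidden_states * 2.0  # placeholder for the real attention math
    attn_weights = None
    return attn_output, attn_weights


def buggy_layer(hidden_states):
    # Unpacking already yields the output array, not the tuple.
    self_attention_outputs, _ = attention(hidden_states)
    # Indexing with [0] now selects sample 0 along the batch dimension.
    attention_output = self_attention_outputs[0]
    # (seq, dim) + (batch, seq, dim) broadcasts silently, so no error is raised.
    return attention_output + hidden_states


def fixed_layer(hidden_states):
    # Keep the tuple, then take its first element (the full batched output).
    self_attention_outputs = attention(hidden_states)
    attention_output = self_attention_outputs[0]
    return attention_output + hidden_states


# batch=4, seq=3, dim=2
batch = np.arange(4 * 3 * 2, dtype=np.float32).reshape(4, 3, 2)
buggy = buggy_layer(batch)
fixed = fixed_layer(batch)

print(buggy.shape == fixed.shape)         # shapes match, which is why nothing crashes
print(np.allclose(buggy[0], fixed[0]))    # sample 0 is unaffected
print(np.allclose(buggy[1], fixed[1]))    # other samples receive sample 0's attention output
```

With the buggy version every sample gets the attention output of sample 0 added to its own residual, which is consistent with all generations resembling sample[0].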
Did you run the code?! I can't even get the code to run. Have you resolved this issue by any chance? I would appreciate it if you could let me know.
Traceback (most recent call last):
  File "/shared/workspace/koshug/hug_me.py", line 17, in <module>
    inputs = processor(text=prompt, images=image, return_tensors="pt")
  File "/root/anaconda3/envs/kos/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2945, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/root/anaconda3/envs/kos/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3053, in _call_one
    return self.encode_plus(
  File "/root/anaconda3/envs/kos/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3127, in encode_plus
    return self._encode_plus(
  File "/root/anaconda3/envs/kos/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 601, in _encode_plus
    batched_output = self._batch_encode_plus(
TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'images'
This repo is just for testing. I haven't finished batch generation yet.
As for Kosmos2_5VisionLayer, I will check it later. Thank you for the reminder.
It seems you haven't successfully loaded Kosmos2_5Processor. I ran the code in this repo and it works well (except for batchsize > 1). Maybe @kirp can help.
BR.
Thank you. Following the hint you gave me, I solved the issue by importing the processor like this: from kosmos2_5.processing_kosmos2_5 import Kosmos2_5Processor
@alexyywwdd Batch inference is now supported. You need to run: pip install git+https://github.com/tic-top/transformers.git --upgrade
# Batch generation
inputs = processor(text=[prompt, prompt], images=[image, image], return_tensors="pt")
# Get the original width and height
raw_width, raw_height = image.size
# NOTE: If the processor receives a single image, it returns an int; if it receives a batch of images, it returns a List[int].
height, width = inputs.pop("height"), inputs.pop("width")
# Here we use height[0] and width[0] to get the resized height and width of the first image
scale_height = raw_height / height[0]
scale_width = raw_width / width[0]
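Since the snippet above only scales the first image, here is a minimal sketch of handling the whole batch. It assumes, as the NOTE says, that the popped height and width are either an int or a List[int]. `compute_scales` is a hypothetical helper, not part of the processor, and `raw_sizes` is assumed to hold `(width, height)` pairs as returned by `image.size`.

```python
from typing import List, Sequence, Tuple, Union

IntOrList = Union[int, Sequence[int]]


def compute_scales(
    raw_sizes: Sequence[Tuple[int, int]],
    height: IntOrList,
    width: IntOrList,
) -> List[Tuple[float, float]]:
    """Return (scale_height, scale_width) for each image in the batch."""
    # Normalize the single-image case (int) to the batch case (list).
    heights = [height] if isinstance(height, int) else list(height)
    widths = [width] if isinstance(width, int) else list(width)
    if not (len(raw_sizes) == len(heights) == len(widths)):
        raise ValueError("raw_sizes, height and width must have the same length")
    # raw_sizes entries are (raw_width, raw_height), matching PIL's image.size.
    return [
        (raw_h / h, raw_w / w)
        for (raw_w, raw_h), h, w in zip(raw_sizes, heights, widths)
    ]


# Single image: height and width are plain ints.
single = compute_scales([(1000, 500)], 250, 500)
# Batch of two images: height and width are lists.
batched = compute_scales([(1000, 500), (800, 400)], [250, 100], [500, 400])
```

Normalizing to lists up front keeps the rest of the post-processing identical for single images and batches.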
Thanks, I'm closing this issue now.