Multimodal Tokenizer Question
#33 opened by Nano1337
In examining the processor output and the input_ids, it appears that the image tokens are placed after the text tokens. Is this the conventional approach? I'm concerned this could be a problem for long texts: since the model's maximum context length is 1024 tokens, any excess gets truncated, which could drop the image tokens entirely.
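For reference, here's roughly how I checked. This is just a sketch: the checkpoint path and image file are placeholders, and I'm assuming the checkpoint loads through `AutoProcessor` and exposes its tokenizer as `processor.tokenizer`:

```python
from PIL import Image
from transformers import AutoProcessor

# Placeholder checkpoint path; substitute the actual repo id.
processor = AutoProcessor.from_pretrained("path/to/checkpoint")
image = Image.open("example.jpg")  # any test image

# Process a deliberately long text together with the image,
# then decode the ids to see where the image tokens land.
inputs = processor(text="a long caption " * 200, images=image, return_tensors="pt")
print(processor.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
# In my run, the image placeholder tokens show up after the text tokens.
```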
Is there any option to move the image tokens before the text tokens?
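I experimented with a workaround along these lines (continuing from the snippet above), though I haven't found an official option for it, and `"<image>"` is an assumed placeholder token name; please correct me if this tokenizer uses something else:

```python
import torch

MAX_LEN = 1024  # the model's max context length

# Assumption: the image tokens are a distinct placeholder id in input_ids.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")

ids = inputs["input_ids"][0]
is_image = ids == image_token_id

# Move the image tokens to the front, then truncate, so that any
# overflow drops the tail of the text rather than the image tokens.
reordered = torch.cat([ids[is_image], ids[~is_image]])[:MAX_LEN]
inputs["input_ids"] = reordered.unsqueeze(0)
inputs["attention_mask"] = torch.ones_like(inputs["input_ids"])
```

That said, I'm not sure reordering at inference time is valid if the model was trained with the image tokens at the end (e.g., position embeddings may assume that layout), so I'd appreciate guidance on whether this is safe or whether there's a supported option.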