Does MiniCPM support multi-image input?

by huanghui1997 - opened May 21

Discussion

huanghui1997

May 21

I want to process 4-6 images at a time. What is the best practice?

Cuiunbo

OpenBMB org May 21

A common practice is to order the input as 'system prompt, image 1 ... image n, question'.

That said, any sequence of text and images can be passed to the model in the following way, so you can experiment to find the ordering that works best for you.

    from PIL import Image

    # Assumes `model`, `tokenizer`, and `default_kwargs` are already defined.
    msgs = []
    system_prompt = 'Answer in detail.'
    prompt = 'Caption these two images'
    tgt_path = ['path/to/image1', 'path/to/image/2']
    if system_prompt:
        msgs.append(dict(type='text', value=system_prompt))
    if isinstance(tgt_path, list):
        msgs.extend([dict(type='image', value=p) for p in tgt_path])
    else:
        # Single path: append so the system prompt is not discarded
        msgs.append(dict(type='image', value=tgt_path))
    msgs.append(dict(type='text', value=prompt))

    # Flatten into one user turn: strings for text, PIL images for images
    content = []
    for x in msgs:
        if x['type'] == 'text':
            content.append(x['value'])
        elif x['type'] == 'image':
            image = Image.open(x['value']).convert('RGB')
            content.append(image)
    msgs = [{'role': 'user', 'content': content}]

    # The images travel inside msgs, so the separate image argument is None
    res = model.chat(
        msgs=msgs,
        context=None,
        image=None,
        tokenizer=tokenizer,
        **default_kwargs
    )

If you have more questions, feel free to continue the discussion.

vjunyang

May 22

@Cuiunbo Hi, I followed the official script, but when I passed in the image I got this error: "Segmentation fault (core dumped)"

Cuiunbo

OpenBMB org May 22

Hi @vjunyang, could you help us reproduce the error by sharing more details: the full error output, your environment, and the code you ran?

vjunyang

May 22

edited May 22

@Cuiunbo

My environment: python==3.8, sentencepiece==0.1.99, torch==2.2.0, Pillow==10.1.0, torchvision==0.16.2, transformers==4.40.2, CUDA Version: 12.2

Operation information:

vjunyang

May 22

Solved. I added torch.backends.cudnn.enabled = False to my code.

Cuiunbo

OpenBMB org May 22

Nice! Glad you got it working. Feel free to ask if you have more questions, and we'll get back to you as soon as we can.

Cuiunbo changed discussion status to closed May 22

ma-korotkov

May 24

@Cuiunbo It looks like the model works with at most 2 images. I tried it with 2 images and it worked perfectly fine, but with more than 2 images it fails with:
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 21 but got size 20 for tensor number 1 in the list.

Cuiunbo

OpenBMB org May 24

@ma-korotkov Thanks for providing the details. Unless you modify the model file, the maximum context length used with llama3 is 2048 tokens. As far as I remember, llama3 supports 4096, so you can try increasing it.
Image resolution also affects how many images fit into the model's input, so you can try resizing the images first.
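
To illustrate the resizing suggestion, here is a minimal sketch, assuming Pillow is installed. The helper name `downscale` and the 1024-pixel limit are hypothetical choices for illustration, not values taken from the model's documentation; you would likely need to tune the limit until your images fit into the context.

```python
from PIL import Image


def downscale(image, max_side=1024):
    """Return a copy whose longer side is at most max_side pixels.

    max_side=1024 is an assumed starting point, not a documented limit.
    The aspect ratio is preserved and small images are never upscaled.
    """
    width, height = image.size
    scale = max_side / max(width, height)
    if scale >= 1:
        # Already small enough: return an unmodified copy
        return image.copy()
    new_size = (max(1, round(width * scale)), max(1, round(height * scale)))
    return image.resize(new_size, Image.LANCZOS)
```

Each image opened with Image.open(path).convert('RGB') could be passed through downscale before it is appended to content.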

ma-korotkov

May 24

@Cuiunbo Thanks a lot! You are right, I was using quite large images, so resizing them helped them fit into the context length.

Cuiunbo

OpenBMB org May 24

@ma-korotkov Nice! I'd love to hear your feedback on our video capabilities. Since we didn't train on multi-image data, it would be amazing if the model can already handle some simple video tasks!

Cuiunbo changed discussion status to open May 24

maniache

Jun 6

May I ask whether you included interleaved image-text data in training? I found that the model did not perform well when given multiple image-text pairs as context.

Rasi1610

Jun 13

@Cuiunbo Can you explain how to increase the context length of this model? It would be really helpful.

Cuiunbo

OpenBMB org Jun 14

@maniache
Hello, we have not added interleaved image-text data.

Cuiunbo

OpenBMB org Jun 14

@Rasi1610 Hi, you may need to edit the tokenizer config to change the context length.

jlsvane

Jun 15

What is the preferred way to continue the "chat" about a previously loaded image without reloading it?

Is it possible to use the model purely as a language model, i.e. without any image?

chuangzhidian

Jul 3

image=None? Is the image None?

oxiaoshao

Jul 17

The images are contained in msgs, which is why image is None.
