feat: Add a check to support training on text-only data.

#42

The current logic assumes that every input sample includes images, so data['pixel_values'] must be present and must line up with the training samples. With text-only inputs, however, the 'pixel_values' key does not exist.

The backend code below can already handle samples without image content, but the missing key raises an error before that code is ever reached.

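# Slice the flat vision_embedding into one chunk per sample.
start = 0
for pixel_values in pixel_values_list:
    img_cnt = len(pixel_values)
    if img_cnt > 0:
        vision_hidden_states.append(vision_embedding[start:start + img_cnt])
        start += img_cnt
    else:
        # Sample has no images: keep an empty placeholder.
        vision_hidden_states.append([])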
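To illustrate the kind of check this PR describes, here is a minimal sketch, not the actual patch. It assumes a text-only batch simply has no 'pixel_values' key and falls back to one empty image list per sample, so the loop above runs unchanged. The helper name `split_vision_hidden_states` and the `batch_size` parameter are hypothetical, and plain lists stand in for tensors to keep the example self-contained.

```python
from typing import Any, Dict, List, Sequence


def split_vision_hidden_states(
    data: Dict[str, Any],
    vision_embedding: Sequence[Any],
    batch_size: int,
) -> List[Any]:
    """Hypothetical helper: slice per-sample image embeddings,
    tolerating batches that carry no 'pixel_values' key."""
    # Assumption: text-only data omits 'pixel_values' entirely, so
    # substitute one empty image list per sample in the batch.
    pixel_values_list = data.get("pixel_values")
    if pixel_values_list is None:
        pixel_values_list = [[] for _ in range(batch_size)]

    vision_hidden_states: List[Any] = []
    start = 0
    for pixel_values in pixel_values_list:
        img_cnt = len(pixel_values)
        if img_cnt > 0:
            # Take this sample's slice of the flat embedding sequence.
            vision_hidden_states.append(vision_embedding[start:start + img_cnt])
            start += img_cnt
        else:
            # No images for this sample: keep an empty placeholder.
            vision_hidden_states.append([])
    return vision_hidden_states
```

With this guard, a text-only batch yields one empty entry per sample, while mixed batches are sliced exactly as before.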