"Finally working: Redundant TEXT model for HF inference". Could you do the same thing for this LongClip?

by kk3dmax - opened Sep 21

Discussion

kk3dmax

Sep 21

I want to use this model in diffusers.
Many thanks in advance.

zer0int

Owner Sep 21

Done. 👍
I'll just copy-paste what I just added to the readme.md:

🚨 IMPORTANT NOTE for loading with HuggingFace Transformers: 👀

model_id = "zer0int/LongCLIP-GmP-ViT-L-14"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

❌ Error due to mismatch with defined 77 tokens in Transformers library

👇

Option 1 (simple & worse):

Truncate to 77 tokens
CLIPModel.from_pretrained(model_id, ignore_mismatched_sizes=True)

# Cosine similarities for 77 tokens are WORSE:
# tensor[photo of a cat, picture of a dog, cat, dog] # image ground truth: cat photo
tensor([[0.16484, 0.0749, 0.1618, 0.0774]], device='cuda:0') 📉

👇

Option 2 (edit Transformers) 💖 RECOMMENDED 💖:

👉 Find the line that says max_position_embeddings=77, in [System Python]/site-packages/transformers/models/clip/configuration_clip.py
👉 Change to: max_position_embeddings=248,

Now, in your inference code, for text:

text_input = processor([your-prompt-or-prompts-as-usual], padding="max_length", max_length=248)
or:
text_input = processor([your-prompt-or-prompts-as-usual], padding=True)

# Resulting Cosine Similarities for 248 tokens padded:
# tensor[photo of a cat, picture of a dog, cat, dog] -- image ground truth: cat photo
tensor([[0.2128, 0.0978, 0.1957, 0.1133]], device='cuda:0') ✅
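The two padding settings above differ only in the target length. A minimal pure-Python sketch of that difference, assuming a hypothetical pad id of 0 and a hypothetical helper `pad_batch` (this is not the Transformers implementation, and it does no truncation):

```python
PAD_ID = 0  # hypothetical padding id, for illustration only


def pad_batch(batch, padding, max_length=None):
    """Pad lists of token ids either to a fixed length or to the longest item."""
    if padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required when padding='max_length'")
        target = max_length
    elif padding is True or padding == "longest":
        # Boolean True means "pad to the longest sequence in the batch"
        target = max(len(ids) for ids in batch)
    else:
        # A string such as "True" is not a recognized strategy in this sketch
        raise ValueError(f"unknown padding strategy: {padding!r}")
    # Sequences already at or beyond the target are left as they are
    return [ids + [PAD_ID] * max(0, target - len(ids)) for ids in batch]


batch = [[11, 12, 13], [21]]
fixed = pad_batch(batch, "max_length", max_length=248)  # every row has 248 ids
dynamic = pad_batch(batch, True)                        # every row has 3 ids
```

Note that the boolean `True` and the string `"True"` are different values. This sketch rejects the string, which is why the second call should pass the boolean.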

kk3dmax

Sep 22

thank you so much!

kk3dmax

Sep 22

I still get the error message below with FluxPipeline in diffusers.
Token indices sequence length is longer than the specified maximum sequence length for this model (327 > 248). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens:

It may be related to this diffusers code:
removed_text = tokenizer.batch_decode(untruncated_ids[:, tokenizer.model_max_length - 1 : -1])
logger.warning(
"The following part of your input was truncated because CLIP can only handle sequences up to"
f" {tokenizer.model_max_length} tokens: {removed_text}"
)

However, even after forcing clip_processor.tokenizer.model_max_length = 248, I still get the error message above.
model_id = ("zer0int/LongCLIP-GmP-ViT-L-14")

clip_model = CLIPModel.from_pretrained(model_id)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)
clip_processor.tokenizer.model_max_length = 248
pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
pipe.text_encoder = clip_model.text_model  # Replace with the CLIP text encoder

Posting this here in case someone can find a workaround.
I'll try to figure it out myself.
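The slice in the diffusers snippet quoted above, `untruncated_ids[:, tokenizer.model_max_length - 1 : -1]`, selects the token ids that the warning reports as removed. A small pure-Python sketch of that slice for a single prompt, using made-up token ids and a hypothetical helper name `removed_token_ids`:

```python
def removed_token_ids(untruncated_row, model_max_length):
    """Apply the same slice as the quoted snippet to one row of token ids.

    The slice starts at model_max_length - 1 because the last kept slot is
    reserved for the end-of-text token, and it stops before the final id,
    which is the end-of-text token of the untruncated sequence.
    """
    return untruncated_row[model_max_length - 1 : -1]


# Toy ids: a start marker, 325 content ids, an end marker = 327 ids in total,
# matching the sequence length reported in the warning above.
START, END = -1, -2  # placeholder markers, not real CLIP token ids
row = [START] + list(range(1, 326)) + [END]

removed_at_77 = removed_token_ids(row, 77)
removed_at_248 = removed_token_ids(row, 248)
```

With a 327-token prompt, raising the limit from 77 to 248 shrinks the removed span but does not empty it, so a truncation warning is still expected for prompts longer than 248 tokens.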

zer0int

Owner Sep 22

...And while I unfortunately don't have the time to do all the implementations for the Forge, Diffusers and GGUF pipelines that I've received questions about, I'm just gonna add this link to ComfyUI nodes for Flux.1.
You could reverse engineer the implementation and apply it to your code: https://github.com/SeaArtLab/ComfyUI-Long-CLIP

kk3dmax

Sep 23

I have succeeded in making this LongCLIP work with Diffusers.

See the message:
Token indices sequence length is longer than the specified maximum sequence length for this model (307 > 248). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 248 tokens:

model_id = ("zer0int/LongCLIP-GmP-ViT-L-14")
config = CLIPConfig.from_pretrained(model_id)
config.text_config.max_position_embeddings = 248
clip_model = CLIPModel.from_pretrained(model_id, torch_dtype=dtype, config=config)
clip_processor = CLIPProcessor.from_pretrained(model_id, padding="max_length", max_length=248)

pipe.tokenizer = clip_processor.tokenizer  # Replace with the CLIP tokenizer
pipe.text_encoder = clip_model.text_model  # Replace with the CLIP text encoder
pipe.tokenizer_max_length = 248
pipe.text_encoder.dtype = torch.bfloat16

With the code above, you don't need to hack Transformers --> Option 3 (the code above, for Diffusers) 💖 RECOMMENDED 💖
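One plausible reading of why the earlier attempt kept reporting 77 while this one reports 248 is that the pipeline stores its own copy of the limit (`pipe.tokenizer_max_length`) when it is constructed, so swapping the tokenizer afterwards does not update it. That is an assumption drawn from this thread, not a statement about the diffusers source. The toy classes below are hypothetical and only illustrate the caching pattern:

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer with a length limit."""

    def __init__(self, model_max_length):
        self.model_max_length = model_max_length


class ToyPipeline:
    """Hypothetical pipeline that copies the tokenizer limit once, at init."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # The limit is cached here and never re-read from the tokenizer
        self.tokenizer_max_length = tokenizer.model_max_length

    def effective_limit(self):
        return self.tokenizer_max_length


pipe = ToyPipeline(ToyTokenizer(77))

# Swapping in a longer-context tokenizer leaves the cached limit untouched
pipe.tokenizer = ToyTokenizer(248)
stale_limit = pipe.effective_limit()

# Setting the cached attribute explicitly, as in the code above, fixes it
pipe.tokenizer_max_length = 248
fixed_limit = pipe.effective_limit()
```

Under that assumption, both steps are needed: replace the tokenizer and set `pipe.tokenizer_max_length` explicitly.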

zer0int

Owner Sep 23

Thank you for sharing this! I just updated the README.MD with this information. 💖 RECOMMENDED 💖
😁👍

zer0int

Owner Sep 23

PS: And I just added the original author's LongCLIP for Diffusers today, too, if you're interested: https://huggingface.co/zer0int/LongCLIP-L-Diffusers
