---
tags:
- mteb
- sentence-transformers
- transformers
- multilingual
- sentence-similarity
license: apache-2.0
---

## gte-multilingual-base

The **gte-multilingual-base** model is the latest in the [GTE](https://huggingface.co/collections/Alibaba-NLP/gte-models-6680f0b13f885cb431e6d469) (General Text Embedding) family of models, featuring several key attributes:

- **High Performance**: Achieves state-of-the-art (SOTA) results in multilingual retrieval tasks and multi-task representation model evaluations when compared to models of similar size.
- **Training Architecture**: Trained using an encoder-only transformer architecture, resulting in a smaller model size. Unlike previous models based on a decoder-only LLM architecture (e.g., gte-qwen2-1.5b-instruct), this model has lower hardware requirements for inference, offering a 10x increase in inference speed.
- **Long Context**: Supports text lengths up to **8192** tokens.
- **Multilingual Capability**: Supports over **70** languages.
- **Elastic Dense Embedding**: Supports elastic output dimensions for the dense representation while maintaining the effectiveness of downstream tasks, which significantly reduces storage costs and improves execution efficiency.
- **Sparse Vectors**: In addition to dense representations, it can also generate sparse vectors.

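
The elastic dense embedding attribute above can be illustrated with a small sketch. It assumes, in line with the usage code in this card, that a shorter embedding is obtained by keeping the leading components of the full vector and rescaling the result to unit L2 norm. The helper names `truncate_and_normalize` and `cosine` and the toy 8-dimensional vectors are hypothetical stand-ins, not part of the model's API.

```python
import math


def truncate_and_normalize(embedding, dimension):
    """Keep the first `dimension` components and rescale to unit L2 norm."""
    if not 1 <= dimension <= len(embedding):
        raise ValueError("dimension must be between 1 and len(embedding)")
    head = embedding[:dimension]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # zero vector: nothing to normalize
    return [x / norm for x in head]


def cosine(a, b):
    """Dot product of two unit-length vectors of equal size."""
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dimensional vectors standing in for real 768-dimensional embeddings.
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.0, 0.1, 0.02]
doc = [0.8, 0.2, 0.4, 0.1, 0.0, 0.05, 0.0, 0.1]

# Compare the similarity at full size and at half size.
for dim in (8, 4):
    q = truncate_and_normalize(query, dim)
    d = truncate_and_normalize(doc, dim)
    print(dim, round(cosine(q, d), 4))
```

Storing the shorter vectors shrinks the index proportionally, and renormalizing after truncation keeps the dot product equal to cosine similarity.
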
## Model Information

- Model Size: 305M
- Embedding Dimension: 768
- Max Input Tokens: 8192

## Usage

Get Dense Embeddings with Transformers

```
# Requires transformers>=4.36.0

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",  # "Beijing"
    "快排算法介绍"  # "Introduction to the quicksort algorithm"
]

model_path = 'Alibaba-NLP/gte-multilingual-base'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**batch_dict)

dimension = 768  # Output embedding dimension; should be in [128, 768]
# Take the first token's hidden state and keep the first `dimension` features
embeddings = outputs.last_hidden_state[:, 0, :dimension]
embeddings = F.normalize(embeddings, p=2, dim=1)
# Score the first text (the query) against the remaining texts
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```

Use with sentence-transformers

```
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",  # "Beijing"
    "快排算法介绍"  # "Introduction to the quicksort algorithm"
]

model = SentenceTransformer('Alibaba-NLP/gte-multilingual-base', trust_remote_code=True)
embeddings = model.encode(input_texts)

# Cosine similarity between the first text and the remaining texts
scores = cos_sim(embeddings[:1], embeddings[1:])
print(scores.tolist())
```

Use with custom code to get dense embeddings and sparse token weights

```
# The script gte_embedding.py is available at
# https://huggingface.co/Alibaba-NLP/gte-multilingual-base/blob/main/scripts/gte_embedding.py
# Place it next to this code so that the import below resolves.
from gte_embedding import GTEEmbeddidng

model_path = 'Alibaba-NLP/gte-multilingual-base'
model = GTEEmbeddidng(model_path)

query = "中国的首都在哪儿"  # "Where is the capital of China?"
docs = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "北京",  # "Beijing"
    "快排算法介绍"  # "Introduction to the quicksort algorithm"
]

# Encode the documents into dense vectors and sparse token weights
embs = model.encode(docs, return_dense=True, return_sparse=True)
print('dense_embeddings vecs', embs['dense_embeddings'])
print('token_weights', embs['token_weights'])

# Score each (query, doc) pair with dense-only, sparse-only, and hybrid weighting
pairs = [(query, doc) for doc in docs]
dense_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.0)
sparse_scores = model.compute_scores(pairs, dense_weight=0.0, sparse_weight=1.0)
hybrid_scores = model.compute_scores(pairs, dense_weight=1.0, sparse_weight=0.3)

print('dense_scores', dense_scores)
print('sparse_scores', sparse_scores)
print('hybrid_scores', hybrid_scores)
```

## Citation

```
@misc{zhang2024mgtegeneralizedlongcontexttext,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669},
}
```