---
language:
- ja
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
metrics:
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
datasets:
- hpprc/emb
- hpprc/mqa-ja
- google-research-datasets/paws-x
---

## Model Details

This is a text embedding model based on RoFormer with a maximum input sequence length of 1024. The model was pre-trained on Wikipedia and cc100 and then fine-tuned as a sentence embedding model. Fine-tuning begins with weakly supervised learning on mC4 and MQA, followed by the same three-stage training process as [GLuCoSE v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2).

### Model Description
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 1024 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: RetrievaBertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Usage

### Direct Usage (Sentence Transformers)

You can perform inference using Sentence Transformers with the following code:

```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# Download from the 🤗 Hub
# The argument "trust_remote_code=True" is required to load the model
model = SentenceTransformer("pkshatech/RoSEtta-base-ja", trust_remote_code=True)

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# [4, 768]

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.5910, 0.4332, 0.5421],
#  [0.5910, 1.0000, 0.4977, 0.6969],
#  [0.4332, 0.4977, 1.0000, 0.7475],
#  [0.5421, 0.6969, 0.7475, 1.0000]]
```
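For retrieval, it is often more convenient to rank passages per query than to inspect the full similarity matrix. The snippet below is a minimal sketch of such a ranking using the `util.semantic_search` helper shipped with Sentence Transformers; the query/passage split and the `top_k=1` setting are illustrative, not part of this model card.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("pkshatech/RoSEtta-base-ja", trust_remote_code=True)

# Reuse the example texts above, split into queries and passages by prefix.
queries = [
    'query: PKSHAはどんな会社ですか?',
    'query: 日本で一番高い山は?',
]
passages = [
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

query_emb = model.encode(queries, convert_to_tensor=True)
passage_emb = model.encode(passages, convert_to_tensor=True)

# For each query, return the best-matching passage by cosine similarity.
hits = util.semantic_search(query_emb, passage_emb, top_k=1)
for query, query_hits in zip(queries, hits):
    best = query_hits[0]
    print(query, '->', passages[best['corpus_id']], f"(score: {best['score']:.4f})")
```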
### Direct Usage (Transformers)

You can perform inference using Transformers with the following code:

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb


# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/RoSEtta-base-ja")
# The argument "trust_remote_code=True" is required to load the model
model = AutoModel.from_pretrained("pkshatech/RoSEtta-base-ja", trust_remote_code=True)

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=1024, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# [4, 768]

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.5910, 0.4332, 0.5421],
#  [0.5910, 1.0000, 0.4977, 0.6969],
#  [0.4332, 0.4977, 1.0000, 0.7475],
#  [0.5421, 0.6969, 0.7475, 1.0000]]
```
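When embedding a larger corpus with the plain Transformers API, it usually helps to disable gradient tracking and process texts in batches. The sketch below assumes that setup; the `encode_batch` helper and the batch size of 32 are illustrative and not part of this repository.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("pkshatech/RoSEtta-base-ja")
model = AutoModel.from_pretrained("pkshatech/RoSEtta-base-ja", trust_remote_code=True)
model.eval()  # disable dropout so embeddings are deterministic


@torch.no_grad()
def encode_batch(texts: list[str]) -> torch.Tensor:
    # Illustrative helper: tokenize, run the model, and mean-pool over
    # non-padding tokens, exactly as in the example above.
    batch = tokenizer(texts, max_length=1024, padding=True, truncation=True, return_tensors='pt')
    out = model(**batch)
    emb = out.last_hidden_state * batch['attention_mask'].unsqueeze(-1)
    return emb.sum(dim=1) / batch['attention_mask'].sum(dim=1).unsqueeze(-1)


# Placeholder corpus; each text keeps the "passage: " prefix as described above.
corpus = ['passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。'] * 100
embeddings = torch.cat([encode_batch(corpus[i:i + 32]) for i in range(0, len(corpus), 32)])
print(embeddings.shape)  # torch.Size([100, 768])
```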
## Benchmarks

### Retrieval
Evaluated with [MIRACL-ja](https://huggingface.co/datasets/miracl/miracl), [JQaRA](https://huggingface.co/datasets/hotchpotch/JQaRA), [JaCWIR](https://huggingface.co/datasets/hotchpotch/JaCWIR), and [MLDR-ja](https://huggingface.co/datasets/Shitao/MLDR).

| model | size | MIRACL<br>Recall@5 | JQaRA<br>nDCG@10 | JaCWIR<br>MAP@10 | MLDR<br>nDCG@10 |
|:--:|:--:|:--:|:--:|:--:|:--:|
| [mE5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 0.3B | **84.2** | 47.2 | **85.3** | 25.4 |
| [GLuCoSE](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
| [ruri-base](https://huggingface.co/cl-nagoya/ruri-base) | 0.1B | 74.3 | **58.1** | 84.6 | **35.3** |
| RoSEtta | 0.2B | 79.3 | 57.7 | 83.8 | 32.3 |

### JMTEB
Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
* The time-consuming datasets ['amazon_review_classification', 'mrtydi', 'jaqket', 'esci'] were excluded, and the evaluation was conducted on the remaining 12 datasets.
* The average is a macro-average per task.

| model | size | Class. | Ret. | STS. | Clus. | Pair. | Avg. |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| [mE5-base](https://huggingface.co/intfloat/multilingual-e5-base) | 0.3B | 75.1 | 80.6 | 80.5 | **52.6** | 62.4 | 70.2 |
| [GLuCoSE](https://huggingface.co/pkshatech/GLuCoSE-base-ja) | 0.1B | **82.6** | 69.8 | 78.2 | 51.5 | **66.2** | 69.7 |
| RoSEtta | 0.2B | 79.0 | **84.3** | **81.4** | **53.2** | 61.7 | **71.9** |

## Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe

## License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).