RoSEtta-base-ja / README.md
yano0's picture
Update README.md
b5a5752 verified
|
raw
history blame
No virus
6.98 kB
---
language:
- ja
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
metrics:
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
datasets:
- hpprc/emb
- hpprc/mqa-ja
- google-research-datasets/paws-x
---
## Model Details
This is a text embedding model based on RoFormer with a maximum input sequence length of 1024.
The model is pre-trained with Wikipedia and cc100 and fine-tuned as a sentence embedding model.
Fine-tuning begins with weakly supervised learning using mc4 and MQA.
After that, we perform the same 3-stage learning process as [GLuCoSE v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2).
### Model Description
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 1024 tokens
- **Output Dimensionality:** 768 tokens
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: RetrievaBertModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```
## Usage
### Direct Usage (Sentence Transformers)
You can perform inference using SentenceTransformers with the following code:
```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F
# Download from the 🤗 Hub
# The argument "trust_remote_code=True" is required to load the model
model = SentenceTransformer("pkshatech/RoSEtta-base-ja",trust_remote_code=True)
# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
'query: PKSHAはどんな会社ですか?',
'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
'query: 日本で一番高い山は?',
'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences,convert_to_tensor=True)
print(embeddings.shape)
# [4, 768]
# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.5910, 0.4332, 0.5421],
# [0.5910, 1.0000, 0.4977, 0.6969],
# [0.4332, 0.4977, 1.0000, 0.7475],
# [0.5421, 0.6969, 0.7475, 1.0000]]
```
### Direct Usage (Transformers)
You can perform inference using Transformers with the following code:
```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def mean_pooling(last_hidden_states: Tensor,attention_mask: Tensor) -> Tensor:
emb = last_hidden_states * attention_mask.unsqueeze(-1)
emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
return emb
# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/RoSEtta-base-ja")
# The argument "trust_remote_code=True" is required to load the model
model = AutoModel.from_pretrained("pkshatech/RoSEtta-base-ja",trust_remote_code=True)
# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
'query: PKSHAはどんな会社ですか?',
'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
'query: 日本で一番高い山は?',
'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=1024, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# [4, 768]
# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.5910, 0.4332, 0.5421],
# [0.5910, 1.0000, 0.4977, 0.6969],
# [0.4332, 0.4977, 1.0000, 0.7475],
# [0.5421, 0.6969, 0.7475, 1.0000]]
```
<!--
### Downstream Usage (Sentence Transformers)
You can finetune this model on your own dataset.
<details><summary>Click to expand</summary>
</details>
-->
<!--
### Out-of-Scope Use
*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->
<!--
## Bias, Risks and Limitations
*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->
<!--
### Recommendations
*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->
## Benchmarks
### Retieval
Evaluated with [MIRACL-ja](https://huggingface.co/datasets/miracl/miracl), [JQARA](https://huggingface.co/datasets/hotchpotch/JQaRA) and [MLDR-ja](https://huggingface.co/datasets/Shitao/MLDR).
| model | size | MIRACL<br>Recall@5 | JQaRA<br>nDCG@10 | MLDR<br>nDCG@10 |
|:--:|:--:|:--:|:--:|:----:|
| me5-base | 0.3B | **84.2** | 47.2 | 25.4 |
| GLuCoSE | 0.1B | 53.3 | 30.8 | 25.2 |
| RoSEtta | 0.2B | 79.3 | **57.7** | **32.3** |
### JMTEB
Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
* The time-consuming datasets ['amazon_review_classification', 'mrtydi', 'jaqket', 'esci'] were excluded, and the evaluation was conducted on the other 12 datasets.
* The average is a macro-average per task.
| model | size | Class. | Ret. | STS. | Clus. | Pair. | Avg. |
|:--:|:--:|:--:|:--:|:----:|:-------:|:-------:|:------:|
| me5-base | 0.3B | 75.1 | 80.6 | 80.5 | 52.6 | 62.4 | 70.2 |
| GLuCoSE | 0.1B | **82.6** | 69.8 | 78.2 | 51.5 | **66.2** | 69.7 |
| RoSEtta | 0.2B | 79.0 | **84.3** | **81.4** | **53.2** | 61.7 | **71.9** |
## Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe
## License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).