File size: 6,980 Bytes
9221208
d372cfb
 
9221208
 
 
 
 
 
 
 
d372cfb
 
 
 
 
9221208
 
 
6c122a4
d372cfb
 
 
9221208
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b5a5752
9221208
a66a603
9221208
 
 
eef3369
9221208
 
eef3369
 
0a05607
a66a603
 
9221208
eef3369
 
 
 
9221208
eef3369
9221208
eef3369
9221208
 
0a05607
eef3369
a66a603
 
 
 
9221208
 
 
 
a66a603
9221208
a66a603
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
9221208
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d372cfb
9221208
d372cfb
f9c4a7f
9221208
d372cfb
6c122a4
 
d372cfb
6c122a4
9221208
 
d372cfb
 
6c122a4
d372cfb
9221208
d372cfb
6c122a4
d372cfb
6c122a4
 
9221208
d372cfb
6c122a4
9221208
d372cfb
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
---
language:
- ja
library_name: sentence-transformers
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
metrics:
widget: []
pipeline_tag: sentence-similarity
license: apache-2.0
datasets:
- hpprc/emb
- hpprc/mqa-ja
- google-research-datasets/paws-x
---

## Model Details
This is a text embedding model based on RoFormer with a maximum input sequence length of 1024.
The model is pre-trained with Wikipedia and cc100 and fine-tuned as a sentence embedding model.
Fine-tuning begins with weakly supervised learning using mc4 and MQA.
After that, we perform the same 3-stage learning process as [GLuCoSE v2](https://huggingface.co/pkshatech/GLuCoSE-base-ja-v2).

### Model Description
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 1024 tokens
- **Output Dimensionality:** 768 tokens
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: RetrievaBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Usage

### Direct Usage (Sentence Transformers)

You can perform inference using SentenceTransformers with the following code:

```python
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F

# Download from the 🤗 Hub
# The argument "trust_remote_code=True" is required to load the model
model = SentenceTransformer("pkshatech/RoSEtta-base-ja",trust_remote_code=True)

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences,convert_to_tensor=True)
print(embeddings.shape)
# [4, 768]

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.5910, 0.4332, 0.5421],
# [0.5910, 1.0000, 0.4977, 0.6969],
# [0.4332, 0.4977, 1.0000, 0.7475],
# [0.5421, 0.6969, 0.7475, 1.0000]]
```

### Direct Usage (Transformers)

You can perform inference using Transformers with the following code:

```python
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def mean_pooling(last_hidden_states: Tensor,attention_mask: Tensor) -> Tensor:
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb

# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/RoSEtta-base-ja")
# The argument "trust_remote_code=True" is required to load the model
model = AutoModel.from_pretrained("pkshatech/RoSEtta-base-ja",trust_remote_code=True)

# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
    'query: PKSHAはどんな会社ですか?',
    'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
    'query: 日本で一番高い山は?',
    'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]

# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=1024, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# [4, 768]

# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.5910, 0.4332, 0.5421],
# [0.5910, 1.0000, 0.4977, 0.6969],
# [0.4332, 0.4977, 1.0000, 0.7475],
# [0.5421, 0.6969, 0.7475, 1.0000]]
```

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Benchmarks

### Retieval
Evaluated with [MIRACL-ja](https://huggingface.co/datasets/miracl/miracl), [JQARA](https://huggingface.co/datasets/hotchpotch/JQaRA) and [MLDR-ja](https://huggingface.co/datasets/Shitao/MLDR).

| model | size | MIRACL<br>Recall@5 | JQaRA<br>nDCG@10 | MLDR<br>nDCG@10 |
|:--:|:--:|:--:|:--:|:----:|
| me5-base | 0.3B | **84.2** | 47.2 | 25.4 |
| GLuCoSE | 0.1B | 53.3 | 30.8 | 25.2 |
| RoSEtta | 0.2B | 79.3 | **57.7** | **32.3** |


### JMTEB
Evaluated with [JMTEB](https://github.com/sbintuitions/JMTEB).
* The time-consuming datasets ['amazon_review_classification', 'mrtydi', 'jaqket', 'esci'] were excluded, and the evaluation was conducted on the other 12 datasets.
* The average is a macro-average per task.

| model | size | Class. | Ret. | STS. | Clus. | Pair. | Avg. |
|:--:|:--:|:--:|:--:|:----:|:-------:|:-------:|:------:|
| me5-base | 0.3B | 75.1 | 80.6 | 80.5 | 52.6 | 62.4 | 70.2 |
| GLuCoSE | 0.1B | **82.6** | 69.8 | 78.2 | 51.5 | **66.2** | 69.7 |
| RoSEtta | 0.2B | 79.0 | **84.3** | **81.4** | **53.2** | 61.7 | **71.9** |

## Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe

## License
This model is published under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).