# Korean-Sentence-Embedding

🍭 Korean sentence embedding repository. You can download the pre-trained models and run inference right away, and it also provides an environment where individuals can train their own models.

## Quick tour

```python
import torch
from transformers import AutoModel, AutoTokenizer

def cal_score(a, b):
    # Cosine similarity between two batches of embeddings, scaled to 0-100.
    if len(a.shape) == 1: a = a.unsqueeze(0)
    if len(b.shape) == 1: b = b.unsqueeze(0)

    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100

model = AutoModel.from_pretrained('BM-K/KoSimCSE-roberta-multitask')
tokenizer = AutoTokenizer.from_pretrained('BM-K/KoSimCSE-roberta-multitask')

sentences = ['치타가 들판을 가로 질러 먹이를 쫓는다.',    # A cheetah chases prey across a field.
             '치타 한 마리가 먹이 뒤에서 달리고 있다.',   # A cheetah is running behind its prey.
             '원숭이 한 마리가 드럼을 연주한다.']         # A monkey is playing drums.

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings, _ = model(**inputs, return_dict=False)

# Compare the [CLS] token embeddings of the sentence pairs.
score01 = cal_score(embeddings[0][0], embeddings[1][0])  # cheetah vs. cheetah
score02 = cal_score(embeddings[0][0], embeddings[2][0])  # cheetah vs. monkey
```
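
The quick tour scores each sentence by its `[CLS]` token vector (`embeddings[i][0]`). Mean pooling over the non-padding tokens is a common alternative sentence representation; below is a minimal sketch reusing `inputs`, `embeddings`, and `cal_score` from above (an illustration of the pooling technique, not necessarily how this checkpoint was trained):

```python
# Mean pooling: average the token embeddings, ignoring padding positions.
# Assumes `inputs` and `embeddings` from the quick tour above.
mask = inputs['attention_mask'].unsqueeze(-1).float()  # (batch, seq_len, 1)
summed = (embeddings * mask).sum(dim=1)                # sum over real tokens only
pooled = summed / mask.sum(dim=1).clamp(min=1e-9)      # divide by token counts
score01_mean = cal_score(pooled[0], pooled[1])         # cheetah vs. cheetah
score02_mean = cal_score(pooled[0], pooled[2])         # cheetah vs. monkey
```

Since the first two sentences describe the same cheetah scene while the third is about a monkey, `score01` should come out higher than `score02` under either pooling.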