---
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- semantic-search
- character-transformer
- hierarchical-transformer
language:
- en
- grc
---
# shlm-grc-en
## Sentence embeddings for English and Ancient Greek
This model creates sentence embeddings in a shared vector space for Ancient Greek and English text.
The base model uses a modified version of the HLM architecture described in [Heidelberg-Boston @ SIGTYP 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers](https://aclanthology.org/2024.sigtyp-1.16/) ([arXiv](https://arxiv.org/abs/2405.20145)).
It is trained to produce sentence embeddings using the multilingual knowledge distillation method and datasets described in [Sentence Embedding Models for Ancient Greek Using Multilingual Knowledge Distillation](https://aclanthology.org/2023.alp-1.2/) ([arXiv](https://arxiv.org/abs/2308.13116)).
It was distilled from `BAAI/bge-base-en-v1.5`, which serves as the teacher model, for embedding English and Ancient Greek text.
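For intuition, the distillation objective can be sketched as follows (an illustrative sketch only, not this model's actual training code): the student is trained so that its embeddings of an English sentence and of its Ancient Greek translation both match the frozen teacher's embedding of the English sentence.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, english_batch, greek_batch):
    """Sketch of the multilingual knowledge distillation objective.

    `student` and `teacher` are assumed to be callables mapping a list of
    sentences to a (batch, dim) tensor of sentence embeddings.
    """
    # The teacher embeds the English sentences; its weights stay frozen.
    with torch.no_grad():
        target = teacher(english_batch)
    # The student embeds both the English sentences and their Greek translations.
    loss_en = F.mse_loss(student(english_batch), target)
    loss_grc = F.mse_loss(student(greek_batch), target)
    # Pulling both languages toward the same teacher embedding places
    # English and Ancient Greek in a shared vector space.
    return loss_en + loss_grc
```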
## Usage (Sentence-Transformers)
**This model is currently incompatible with the latest version of the sentence-transformers library.**
For now, either use HuggingFace Transformers directly (see below) or the following fork of sentence-transformers:
https://github.com/kevinkrahn/sentence-transformers
You can use the model with sentence-transformers like this:
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Load the model from the Hugging Face Hub and encode the sentences
model = SentenceTransformer('kevinkrahn/shlm-grc-en')
embeddings = model.encode(sentences)
print(embeddings)
```
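Because both languages share one embedding space, English and Ancient Greek sentences can be compared directly. A small example (assuming the fork above is installed; `util.cos_sim` follows the standard sentence-transformers API):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('kevinkrahn/shlm-grc-en')

english = "The Parthenon is a beautiful temple of Athena."
greek = "Ὁ Παρθενών ἐστιν ἱερὸν καλὸν τῆς Ἀθήνης."

# Encode both sentences and compute their cross-lingual similarity
embeddings = model.encode([english, greek], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]))
```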
## Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized token embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch
def cls_pooling(model_output):
    return model_output[0][:, 0]
# Sentences we want sentence embeddings for
sentences = ['This is an English sentence', 'Ὁ Παρθενών ἐστιν ἱερὸν καλὸν τῆς Ἀθήνης.']
# Load model from HuggingFace Hub
model = AutoModel.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('kevinkrahn/shlm-grc-en', trust_remote_code=True)
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling (CLS pooling for this model)
sentence_embeddings = cls_pooling(model_output)
print("Sentence embeddings:")
print(sentence_embeddings)
```
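To compare the embeddings produced above, e.g. the English sentence and its Ancient Greek counterpart, you can normalize them and take a dot product (a minimal continuation of the previous example):

```python
import torch.nn.functional as F

# Continuing from the example above: compare the English and Greek sentences
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity = normalized[0] @ normalized[1]  # cosine similarity
print(f"Cross-lingual similarity: {similarity.item():.4f}")
```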
## Citing & Authors
If you use this model, please cite the following papers:
```
@inproceedings{riemenschneider-krahn-2024-heidelberg,
title = "Heidelberg-Boston @ {SIGTYP} 2024 Shared Task: Enhancing Low-Resource Language Analysis With Character-Aware Hierarchical Transformers",
author = "Riemenschneider, Frederick and
Krahn, Kevin",
editor = "Hahn, Michael and
Sorokin, Alexey and
Kumar, Ritesh and
Shcherbakov, Andreas and
Otmakhova, Yulia and
Yang, Jinrui and
Serikov, Oleg and
Rani, Priya and
Ponti, Edoardo M. and
Murado{\u{g}}lu, Saliha and
Gao, Rena and
Cotterell, Ryan and
Vylomova, Ekaterina",
booktitle = "Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP",
month = mar,
year = "2024",
address = "St. Julian's, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.sigtyp-1.16",
pages = "131--141",
}
```
```
@inproceedings{krahn-etal-2023-sentence,
title = "Sentence Embedding Models for {A}ncient {G}reek Using Multilingual Knowledge Distillation",
author = "Krahn, Kevin and
Tate, Derrick and
Lamicela, Andrew C.",
editor = "Anderson, Adam and
Gordin, Shai and
Li, Bin and
Liu, Yudong and
Passarotti, Marco C.",
booktitle = "Proceedings of the Ancient Language Processing Workshop",
month = sep,
year = "2023",
address = "Varna, Bulgaria",
publisher = "INCOMA Ltd., Shoumen, Bulgaria",
url = "https://aclanthology.org/2023.alp-1.2",
pages = "13--22",
}
``` |