ilsp
/

droussis committed on
Commit
e21ce79
1 Parent(s): 2fc8c2f

Update README.md

  1. README.md +84 -3
README.md CHANGED
@@ -1,3 +1,84 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - pt
+ - en
+ pipeline_tag: translation
+ ---
+
+ # Portuguese-English Translation Model for the Scientific Domain
+
+ ## Description
+
+ This is a CTranslate2 Portuguese-English translation model for the scientific domain. Its base model is the PT-EN OPUS-MT Transformer-Align model [(link)](https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/por-eng).
+ It has been fine-tuned on a large parallel corpus of scientific texts, with a special focus on the four pilot domains of the [SciLake](https://scilake.eu/) project:
+ - Neuroscience
+ - Cancer
+ - Transportation
+ - Energy
+
+ ## Dataset
+
+ The fine-tuning dataset consists of 5,705,469 EN-PT parallel sentences extracted from parallel theses and abstracts acquired from multiple academic repositories.
+
+ ## Evaluation
+
+ We have evaluated the base and the fine-tuned models on five test sets:
+ - Four domain-specific test sets, one per pilot domain (Neuroscience, Cancer, Transportation, Energy), each containing 1,000 parallel sentences.
+ - A general scientific test set containing 3,000 parallel sentences from a wide range of scientific texts in other domains.
+
+ | Model | Average of 4 domains | | | General Scientific | | |
+ |-------------|----------------------|---------------|---------------|-------------------|---------------|---------------|
+ | | SacreBLEU | chrF2++ | COMET | SacreBLEU | chrF2++ | COMET |
+ | Base | 46.0 | 68.3 | 66.7 | 44.9 | 67.7 | 66.3 |
+ | Fine-Tuned | 48.4 | 69.9 | 67.3 | 47.3 | 69.1 | 67.8 |
+ | Improvement | +2.4 | +1.6 | +0.9 | +2.4 | +1.4 | +1.5 |
+
+
+ ## Usage
+
+ ```
+ pip install ctranslate2 sentencepiece huggingface_hub
+ ```
+
+ ```python
+ import ctranslate2
+ import sentencepiece as spm
+ from huggingface_hub import snapshot_download
+
+ repo_id = "ilsp/opus-mt-pt-en_ct2_ft-SciLake"
+
+ # REPLACE WITH ACTUAL LOCAL DIRECTORY WHERE THE MODEL WILL BE DOWNLOADED
+ local_dir = ""
+
+ model_path = snapshot_download(repo_id=repo_id, local_dir=local_dir)
+
+ translator = ctranslate2.Translator(model_path, compute_type="auto")
+
+ sp_enc = spm.SentencePieceProcessor()
+ sp_enc.load(f"{model_path}/source.spm")
+
+ sp_dec = spm.SentencePieceProcessor()
+ sp_dec.load(f"{model_path}/target.spm")
+
+ def translate_text(input_text, sp_enc=sp_enc, sp_dec=sp_dec, translator=translator, beam_size=6):
+     input_tokens = sp_enc.encode(input_text, out_type=str)
+     results = translator.translate_batch([input_tokens],
+         beam_size=beam_size,
+         length_penalty=0,
+         max_decoding_length=512,
+         replace_unknowns=True)
+     output_tokens = results[0].hypotheses[0]
+     output_text = sp_dec.decode(output_tokens)
+     return output_text
+
+ input_text = "Na osteoartríte (OA) a degeneração progressiva das estruturas articulares activa continuamente nociceptores levando ao desenvolvimento de dor crónica e a déficits emocionais e cognitivos."
+ translate_text(input_text)
+
+ # OUTPUT
+ # In osteoarthritis (OA), progressive degeneration of articular structures continuously activates nociceptors leading to the development of chronic pain and emotional and cognitive deficits.
+ ```
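
The `translate_text` function above handles one sentence per call. When many sentences need translating, a single `translate_batch` call over all of them is usually more convenient. The following is a minimal sketch, not part of the README change itself: the helper name `translate_texts` and the default `max_batch_size=32` are assumptions chosen for illustration, and the decoding options simply mirror those used in `translate_text`.

```python
def translate_texts(texts, translator, sp_enc, sp_dec, beam_size=6, max_batch_size=32):
    """Translate a list of source strings and return the translations in order.

    Hypothetical helper: `translator`, `sp_enc` and `sp_dec` are expected to be
    the objects created in the usage example; max_batch_size=32 is an arbitrary
    illustrative default, not a tuned value.
    """
    # Tokenize each input into SentencePiece pieces, as translate_text does.
    batch = [sp_enc.encode(text, out_type=str) for text in texts]
    # One translate_batch call for all inputs; max_batch_size caps how many
    # examples CTranslate2 processes at once.
    results = translator.translate_batch(
        batch,
        beam_size=beam_size,
        length_penalty=0,
        max_decoding_length=512,
        replace_unknowns=True,
        max_batch_size=max_batch_size,
    )
    # Keep the best hypothesis for each input and detokenize it.
    return [sp_dec.decode(result.hypotheses[0]) for result in results]
```

With the objects from the usage example in scope, a call would look like `translate_texts([input_text, another_text], translator, sp_enc, sp_dec)`, where `another_text` stands for any second Portuguese sentence. Passing the translator and tokenizers explicitly, rather than as default arguments, keeps the helper independent of module-level state.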
+
+ ## Acknowledgements
+
+ This work was created within the [SciLake](https://scilake.eu/) project. We are grateful to the SciLake project for providing the resources and support that made this work possible. This project has received funding from the European Union’s Horizon Europe framework programme under grant agreement No. 101058573.