---
license: apache-2.0
language:
- en
tags:
- g2p
- cisco
- Grapheme-to-Phoneme
pipeline_tag: text2text-generation
---

## Model Summary

`mini-bart-g2p` is a seq2seq model based on the [BART architecture](https://arxiv.org/abs/1910.13461).
We reduce the number of layers and attention heads relative to the original BART architecture so that the model can be trained reliably for the grapheme-to-phoneme conversion task.


## Intended Uses

The input is expected to be English words consisting of Latin letters and certain punctuation symbols.
The model is trained to take a single word at a time as input and _will return unexpected results when fed multiple words as a single input_.
The provided [HuggingFace tokenizer](https://huggingface.co/cisco-ai/mini-bart-g2p/blob/main/tokenizer.json) lowercases each word and splits it into space-separated letters under the hood, so you can pass words normally, without inserting separators between the letters.
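
For example, the snippet below (a minimal sketch; the exact token strings it prints are not guaranteed and depend on the tokenizer configuration) shows that mixed-case input needs no manual preprocessing:

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with the model.
tokenizer = AutoTokenizer.from_pretrained("cisco-ai/mini-bart-g2p")

# The tokenizer lowercases the word and splits it into letters internally,
# so "Hello" can be passed as-is.
print(tokenizer.tokenize("Hello"))
```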

The model outputs phonemes along with their corresponding stress numbers, and it can also generate phonemes for words that contain hyphens or apostrophes.


## How to Use
```python
from transformers import pipeline

pipe = pipeline(task="text2text-generation", model="cisco-ai/mini-bart-g2p")

text = "hello world"
# Do NOT call pipe(text) directly; passing multiple words as one input produces unexpected results.

pipe(text.split())
# [{'translation_text': 'HH EH1 L OW0'}, {'translation_text': 'W ER1 L D'}]

text = "co-workers coworkers hunter's hunter"
pipe(text.split())

# [{'translation_text': 'K OW1 W ER1 K ER0 Z'}, {'translation_text': 'K OW1 W ER1 K ER0 Z'}, {'translation_text': 'HH AH1 N T ER0 Z'}, {'translation_text': 'HH AH1 N T ER0'}]
```
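
Since the model expects one word per input, a whole sentence can be transcribed by splitting it into words and joining the per-word transcriptions afterwards. The helper below is an illustrative sketch (not part of the model release), and the separator choice is arbitrary:

```python
# Illustrative helper (not part of the model release): transcribe a sentence
# word by word, since the model expects single-word inputs.
def sentence_to_phonemes(sentence: str, sep: str = " | ") -> str:
    results = pipe(sentence.split())
    return sep.join(r["translation_text"] for r in results)

print(sentence_to_phonemes("hello world"))
# e.g. 'HH EH1 L OW0 | W ER1 L D'
```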



## Training
The `mini-bart-g2p` model was trained on a combination of both the [Librispeech Alignments dataset](https://zenodo.org/records/2619474#.YuCdaC8r1ZF) and the [CMUDict dataset](https://github.com/cmusphinx/cmudict).
The model was trained using the [translation training script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/translation/run_translation.py) provided in the HuggingFace Transformers repo.
The following parameters were passed to the training script to produce the model.
<details>
<summary>Training script parameters</summary>

  ```bash
  python run_translation.py \
  --model_name_or_path <MODEL DIR> \
  --source_lang wrd \
  --target_lang phon \
  --num_train_epochs 500 \
  --train_file <TRAIN SPLIT> \
  --validation_file <VAL SPLIT> \
  --test_file <TEST SPLIT> \
  --num_beams 5 \
  --generation_num_beams 5 \
  --max_source_length 128 \
  --max_target_length 128 \
  --overwrite_cache \
  --overwrite_output_dir \
  --do_train \
  --do_eval \
  --do_predict \
  --evaluation_strategy epoch \
  --eval_delay 3 \
  --save_strategy epoch \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16 \
  --learning_rate 5e-4 \
  --label_smoothing_factor 0.1 \
  --weight_decay 0.00001 \
  --adam_beta1 0.9 \
  --adam_beta2 0.98 \
  --load_best_model_at_end True \
  --predict_with_generate True \
  --generation_max_length 20 \
  --output_dir <OUTPUT DIR> \
  --seed 4664427 \
  --lr_scheduler_type cosine_with_restarts \
  --warmup_steps 120000 \
  --optim adafactor \
  --group_by_length \
  --metric_for_best_model bleu \
  --greater_is_better True \
  --save_total_limit 10 \
  --log_level info \
  --logging_steps 500
  ```
</details>


## Limitations
The model has some limitations in its current form, which we list here for full transparency.

- The `mini-bart-g2p` model is trained to only work on the English language.
- The model does not produce consistent behavior when non-apostrophe punctuation symbols are part of the input word. **We recommend stripping the words of all non-essential punctuation symbols before running it through the pipeline.**
```python
text = "world world!"
pipe(text.split())
# [{'translation_text': 'W ER1 L D'}, {'translation_text': 'W ER1 L D F'}]

```
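
One way to follow that recommendation (an illustrative sketch, not an official preprocessing step) is to strip everything except letters, apostrophes, and hyphens before calling the pipeline:

```python
import re

# Keep only letters, apostrophes, and hyphens; drop other punctuation.
def clean_word(word: str) -> str:
    return re.sub(r"[^A-Za-z'\-]", "", word)

text = "world world!"
pipe([clean_word(w) for w in text.split()])
# both entries should now come back as 'W ER1 L D'
```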


## License

The model is licensed under the [Apache 2.0 License](https://huggingface.co/cisco-ai/mini-bart-g2p/blob/main/LICENSE).