Russian G2P token classification model

This is a non-autoregressive model for Russian grapheme-to-phoneme (G2P) conversion based on the BERT architecture. It predicts phonemes in IPA format. The initial data was built from the Wiktionary JSON available at https://kaikki.org/dictionary/Russian/index.html

Intended uses & limitations

The input is expected to consist of Cyrillic letters separated by spaces. Real spaces should be replaced with an underscore (_). Note that the model was trained on single words and some short phrases. It can accept longer phrases, but its accuracy may degrade on them.
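The input format described above can be produced with a small helper. This is a minimal sketch, not part of the model's tooling: the names `to_model_input` and `write_input_file` are hypothetical, and the NFC normalization step is an assumption added so that letters such as й and ё stay single characters. No lowercasing is applied, because the format description does not say whether case matters.

```python
import unicodedata
from pathlib import Path
from typing import Iterable


def to_model_input(text: str) -> str:
    """Turn a word or short phrase into space-separated letters.

    Hypothetical helper: real spaces become "_", then every character
    is separated by a single space.
    """
    # Assumption: compose characters so that "й" and "ё" are one code point each.
    text = unicodedata.normalize("NFC", text)
    # Collapse each run of whitespace into one underscore.
    joined = "_".join(text.split())
    # Put a space between every character, including "_" and "-".
    return " ".join(joined)


def write_input_file(items: Iterable[str], path: str) -> None:
    """Write one converted item per line, UTF-8 encoded."""
    lines = [to_model_input(item) for item in items]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```

For example, `to_model_input("ганс-юрген")` gives `г а н с - ю р г е н`, and a two-word phrase such as `на дом` becomes `н а _ д о м`.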

How to use

Install NeMo.

Download ru_g2p.nemo (this model):

git lfs install
git clone https://huggingface.co/bene-ges/ru_g2p_ipa_bert_large

Run

python ${NEMO_ROOT}/examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py \
  pretrained_model=ru_g2p_ipa_bert_large/ru_g2p.nemo \
  inference.from_file=input.txt \
  inference.out_file=output.txt \
  model.max_sequence_len=512 \
  inference.batch_size=128 \
  lang=ru

Example of input file:

и с х о д
т р а н с н е п т у н о в ы х
т е л я т н и к о в с к о е
ц а р с к о г о
к р о с х о ф
г а н с - ю р г е н
д а р д а н е л л

Example of output file:

ɪ s x 'o t                          и с х о д                       ɪ s x 'o t                         ɪ s x 'o t                           PLAIN PLAIN PLAIN PLAIN PLAIN
t r a nʲ sʲ nʲ ɪ p t 'u n ə v ɨ x   т р а н с н е п т у н о в ы х   t r a nʲ sʲ nʲ ɪ p t 'u n ə v ɨ x   t r a nʲ sʲ nʲ ɪ p t 'u n ə v ɨ x    PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
tʲ ɪ lʲ 'æ tʲ nʲ ɪ k ə f s k ə jə   т е л я т н и к о в с к о е     tʲ ɪ lʲ 'æ tʲ nʲ ɪ k ə f s k ə jə   tʲ ɪ lʲ 'æ tʲ nʲ ɪ k ə f s k ə jə    PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
t~s 'a r s k ə v ə                  ц а р с к о г о                 t~s 'a r s k ə v ə                 t~s 'a r s k ə v ə                  PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
k r ɐ s x 'o f                      к р о с х о ф                   k r ɐ s x 'o f                     k r ɐ s x 'o f                      PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
ɡ a n s 'ju r ɡʲ ɪ n                г а н с - ю р г е н             ɡ a n s _ 'ju r ɡʲ ɪ n              ɡ a n s _ 'ju r ɡʲ ɪ n              PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
d ə r d ɐ n 'ɛ ɫ                    д а р д а н е л л               d ə r d ɐ n 'ɛ ɫ <DELETE>          d ə r d ɐ n 'ɛ ɫ <DELETE>            PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN

Note that the correct output tags are in the third column and the input is in the second column. Tags correspond to input letters one-to-one. If you remove the <DELETE> tags, +, ~, and spaces, you get an IPA-like transcription. The model does not predict secondary stress. The primary stress mark (') is placed directly before the stressed vowel. In some cases the stress mark can be missing.
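The clean-up rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official post-processing: the function names are hypothetical, the output file is assumed to be tab-separated, and the `_` tag is assumed to stand for a word boundary (mirroring the input convention), so it is turned into a real space by default.

```python
def tags_to_ipa(tags: str, word_sep: str = " ") -> str:
    """Join the space-separated tags of one output line into a transcription.

    Hypothetical helper: drops <DELETE> tags, strips "+" and "~",
    and removes the spaces between tags.
    """
    pieces = []
    for tag in tags.split():
        if tag == "<DELETE>":
            # The corresponding input letter produces no phoneme.
            continue
        if tag == "_":
            # Assumption: "_" marks a word boundary.
            pieces.append(word_sep)
            continue
        pieces.append(tag.replace("+", "").replace("~", ""))
    return "".join(pieces)


def ipa_from_output_line(line: str) -> str:
    """Extract the third column (the tags) from one output line.

    Assumption: columns are separated by tab characters.
    """
    columns = line.rstrip("\n").split("\t")
    return tags_to_ipa(columns[2])
```

With the examples above, the tags `t~s 'a r s k ə v ə` become `ts'arskəvə`, and `d ə r d ɐ n 'ɛ ɫ <DELETE>` becomes `dərdɐn'ɛɫ`. The stress mark is left as the ASCII apostrophe that the model emits.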

How to use for TTS

See an example of an inference pipeline for G2P + FastPitch + HifiGAN in this notebook.