Bengali BPE Tokenizer
Dataset
Several sources of raw Bengali text are listed at this GitHub link. The following sources from that list were used to gather raw Bengali text for training the tokenizer:
- Tab-delimited Bilingual Sentence Pairs - selected sentence pairs from the Tatoeba Project, containing approximately 6,500 English-to-Bengali sentence pairs. Only the Bengali sentences were extracted for training the tokenizer.
- IndicParaphrase - only the input field of the validation split of the Bengali paraphrase dataset was used. It contains 10,000 Bengali sentences.
Steps
- Followed the instructions from Andrej Karpathy's video and created a notebook for the experiment.
- Experimented with regular expressions suited to the Bengali language. The intention was to use a regular expression that splits Bengali text into words rather than individual characters.
  - The GPT-2 regex `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+` resulted in splitting into individual characters instead of words.
  - The regex ` ?\p{Bengali}+| ?[^\s\p{Bengali}]+|\s+(?!\S)|\s+` splits the sentence "সবাই যা করতে চায় তা করতে চায়নি।" into the following words: 'সবাই', ' যা', ' করতে', ' চায়', ' তা', ' করতে', ' চায়নি', '।'
- Updated the BPE training process to operate on the text chunks produced by the regular-expression split instead of on complete sentences. This avoids merging tokens across different words. Ref
- Updated the `encode` and `decode` functions to work on text chunks instead of complete sentences. Ref
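The splitting and chunk-wise merging steps above can be sketched as follows. This is a minimal illustration, not the repo's actual code; since the stdlib `re` module does not support `\p{Bengali}`, the explicit Bengali Unicode block range `\u0980-\u09FF` is used here as a stand-in (the third-party `regex` module would accept `\p{Bengali}` directly).

```python
import re
from collections import Counter

# Stand-in for ` ?\p{Bengali}+| ?[^\s\p{Bengali}]+|\s+(?!\S)|\s+` using
# the Bengali Unicode block range, since stdlib `re` lacks \p{Script}.
SPLIT_PATTERN = re.compile(r" ?[\u0980-\u09FF]+| ?[^\s\u0980-\u09FF]+|\s+(?!\S)|\s+")

def split_chunks(text):
    """Split a sentence into word-level chunks before BPE."""
    return SPLIT_PATTERN.findall(text)

def count_pairs(chunks):
    """Count adjacent byte pairs within each chunk only, so a later
    merge step can never fuse bytes across a word boundary."""
    counts = Counter()
    for chunk in chunks:
        ids = list(chunk.encode("utf-8"))
        for pair in zip(ids, ids[1:]):
            counts[pair] += 1
    return counts

sentence = "সবাই যা করতে চায় তা করতে চায়নি।"
chunks = split_chunks(sentence)
assert "".join(chunks) == sentence  # the split loses no characters
```

Because `count_pairs` restarts at each chunk boundary, the byte pair formed by the last byte of one word and the first byte of the next is never counted, which is what prevents merges across words.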
Tokenizer training
- Trained the tokenizer, using the regex-based splitting and chunk-wise merging described above, to a vocab size of 5001, achieving roughly 11X compression.
- Saved the vocab file (the mapping from tokens to Bengali text), the merges file (the mapping from each pair of tokens to the token produced by merging them), and the regular expression used for splitting Bengali sentences. All of these artifacts are required to perform BPE tokenization on new text.
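One way to persist those three artifacts together is a single JSON file. The sketch below is hypothetical (the file name and serialization format are assumptions, not the repo's actual layout); it only shows that the vocab, merges, and split regex round-trip cleanly.

```python
import json

def save_tokenizer(path_prefix, vocab, merges, pattern):
    """Persist vocab (id -> bytes), merges (pair -> id) and the split regex.
    JSON keys must be strings, so ids and pairs are serialized as text."""
    artifacts = {
        "vocab": {str(idx): list(b) for idx, b in vocab.items()},
        "merges": {f"{a},{b}": idx for (a, b), idx in merges.items()},
        "pattern": pattern,
    }
    with open(path_prefix + ".json", "w", encoding="utf-8") as f:
        json.dump(artifacts, f, ensure_ascii=False)

def load_tokenizer(path_prefix):
    """Rebuild the int/bytes/tuple types that JSON flattened to strings."""
    with open(path_prefix + ".json", encoding="utf-8") as f:
        artifacts = json.load(f)
    vocab = {int(idx): bytes(b) for idx, b in artifacts["vocab"].items()}
    merges = {tuple(map(int, k.split(","))): idx
              for k, idx in artifacts["merges"].items()}
    return vocab, merges, artifacts["pattern"]
```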
The HuggingFace Spaces Gradio App
The app is available here
The app takes one or more Bengali sentences as input and provides the following outputs:
- the numeric tokens that represent the sentence (via the `encode` function)
- the sentence regenerated from the tokens (via the `decode` function)
- a visualization of each token-to-Bengali-text mapping, as an explanation of the tokenization
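The per-token explanation view boils down to decoding each token id on its own. A minimal sketch (the function name and the sample vocab are made up for illustration; the real app's `decode` works the same way but over the full trained vocab):

```python
def explain_tokens(ids, vocab):
    """Return (token id, decoded text) pairs for display.
    errors="replace" keeps the view readable even when a token holds
    only part of a multi-byte UTF-8 character."""
    return [(i, vocab[i].decode("utf-8", errors="replace")) for i in ids]

# Toy two-token vocab purely for demonstration.
toy_vocab = {0: "সবাই".encode("utf-8"), 1: " যা".encode("utf-8")}
mapping = explain_tokens([0, 1], toy_vocab)  # [(0, 'সবাই'), (1, ' যা')]
```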