Spaces:

sayanbanerjee32
/

bengali_bpe_tokenizer

sayanbanerjee32 committed on Jun 21

Commit

65617ff

•

1 Parent(s): 10861ba

Update README.md

Files changed (1)

README.md CHANGED

@@ -10,4 +10,22 @@ pinned: false
 license: mit
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Bengali BPE Tokenizer
+## Dataset
+Several raw Bengali corpora are listed in this [GitHub repository](https://github.com/sagorbrur/bangla-corpus). The following sources from that list were used to gather raw Bengali text for training the tokenizer:
+- [Tab-delimited Bilingual Sentence Pairs](https://www.manythings.org/anki/): selected sentence pairs from the [Tatoeba Project](http://tatoeba.org/home). The set has approximately 6,500 English-to-Bengali sentence pairs; only the Bengali sentences are extracted for training the tokenizer.
+- [IndicParaphrase](https://huggingface.co/datasets/ai4bharat/IndicParaphrase): only the input data from the validation dataset of [Bengali paraphrases](https://huggingface.co/datasets/ai4bharat/IndicParaphrase/blob/main/data/bn_IndicParaphrase_v1.0.zip) is used for training the tokenizer. That dataset contains 10,000 Bengali sentences.
+## Tokenizer
+The tokenizer artifacts are available at https://huggingface.co/sayanbanerjee32/bengali_tokenizer
+## The HuggingFace Spaces Gradio App
+The app takes one or more Bengali sentences as input and provides the following outputs:
+1. Numeric token IDs that represent the sentence (using the `encode` function)
+2. The sentence regenerated from the tokens (using the `decode` function)
+3. A visualization that maps each token to its Bengali text, as an explanation of the tokenization.
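
The three outputs listed above can be illustrated with a minimal sketch. The README does not show the tokenizer's implementation, so this assumes a plain byte-level BPE over UTF-8 bytes; `train_bpe`, `merge`, `build_vocab`, `encode`, `decode`, and `token_pieces` are hypothetical helper names for illustration, not the Space's actual API.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` byte-pair merges from `text` (assumed byte-level BPE)."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, kept in the order learned
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        pair = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step  # IDs 0..255 are reserved for raw bytes
        merges[pair] = new_id
        ids = merge(ids, pair, new_id)
    return merges


def build_vocab(merges):
    """Map every token ID to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def encode(text, merges):
    """Output 1: numeric token IDs for the text."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():  # apply merges in the order learned
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, vocab):
    """Output 2: text regenerated from the token IDs."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


def token_pieces(ids, vocab):
    """Output 3: (token ID, text piece) pairs for a token-to-text view."""
    return [(i, vocab[i].decode("utf-8", errors="replace")) for i in ids]


# Toy corpus for illustration only; the real tokenizer was trained on the datasets above.
corpus = "আমি বাংলায় গান গাই। আমি বাংলায় কথা বলি।"
merges = train_bpe(corpus, num_merges=20)
vocab = build_vocab(merges)
ids = encode("আমি বাংলায় গান গাই।", merges)
print(ids)
print(decode(ids, vocab))
print(token_pieces(ids, vocab))
```

Because the merges operate on bytes, a single token can cover only part of a multi-byte Bengali character; in that case its text piece in `token_pieces` shows the Unicode replacement character, while decoding the full ID sequence still reproduces the original sentence.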