sayanbanerjee32 committed on
Commit
65617ff
1 Parent(s): 10861ba

Update README.md

Files changed (1)
  1. README.md +19 -1
README.md CHANGED
@@ -10,4 +10,22 @@ pinned: false
10
  license: mit
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
13
+ # Bengali BPE Tokenizer
14
+
15
+ ## Dataset
16
+
17
+ Several sources of raw Bengali text are listed in this [GitHub repository](https://github.com/sagorbrur/bangla-corpus). The following sources from that list were used to gather raw Bengali text for training the tokenizer:
18
+ - [Tab-delimited Bilingual Sentence Pairs](https://www.manythings.org/anki/) - These are selected sentence pairs from the [Tatoeba Project](http://tatoeba.org/home). The set has approximately 6,500 English-to-Bengali sentence pairs. Only the Bengali sentences are extracted for training the tokenizer.
19
+ - [IndicParaphrase](https://huggingface.co/datasets/ai4bharat/IndicParaphrase) - Only the input data from the validation set of the [Bengali paraphrases](https://huggingface.co/datasets/ai4bharat/IndicParaphrase/blob/main/data/bn_IndicParaphrase_v1.0.zip) is used for training the tokenizer. That set contains 10,000 Bengali sentences.
20
+
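The extraction step for the tab-delimited sentence pairs can be sketched as follows. This is only an illustration, not the code used for this tokenizer: it assumes each line has the form `English<TAB>Bengali<TAB>attribution`, and the helper name `extract_bengali`, the `column` parameter, and the sample lines are hypothetical.

```python
def extract_bengali(lines, column=1):
    """Return the non-empty text found in `column` of each tab-delimited line.

    Assumption: the Bengali sentence sits in the second field (index 1).
    Lines that have too few fields are skipped.
    """
    sentences = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and fields[column].strip():
            sentences.append(fields[column].strip())
    return sentences


# Hypothetical sample lines standing in for the downloaded file.
sample = [
    "Hello!\tনমস্কার!\t(hypothetical attribution)",
    "Thank you.\tধন্যবাদ।\t(hypothetical attribution)",
    "malformed line without tabs",
]

bengali = extract_bengali(sample)
print(bengali)
```

With a real file, the same function could be applied to the lines read from an open file handle (opened with `encoding="utf-8"`) instead of the `sample` list.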
21
+ ## Tokenizer
22
+
23
+ The tokenizer artifacts are available at https://huggingface.co/sayanbanerjee32/bengali_tokenizer
24
+
25
+ ## The HuggingFace Spaces Gradio App
26
+
27
+ The app takes one or more Bengali sentences as input and provides the following outputs:
28
+ 1. Numeric tokens that represent the sentence (using the `encode` function)
29
+ 2. The sentence regenerated from the tokens (using the `decode` function)
30
+ 3. A visualization of the mapping from each token to its Bengali text, as an explanation of the tokenization.
31
+
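The three outputs listed above can be illustrated with a minimal byte-level BPE sketch. This is not the implementation behind the published tokenizer or the app. It assumes a byte-level BPE trained by repeatedly merging the most frequent adjacent pair, and all function names (`train`, `build_vocab`, `encode`, `decode`), the toy corpus, and the merge count are hypothetical.

```python
from collections import Counter


def get_pair_counts(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)
        new_id = 256 + step  # ids 0..255 are reserved for single bytes
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges


def build_vocab(merges):
    """Map each token id to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges.items():  # dict keeps training order
        vocab[new_id] = vocab[a] + vocab[b]
    return vocab


def encode(text, merges):
    """Turn text into token ids by applying learned merges, earliest first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        counts = get_pair_counts(ids)
        pair = min(counts, key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break  # no learned merge applies any more
        ids = merge(ids, pair, merges[pair])
    return ids


def decode(ids, vocab):
    """Turn token ids back into text."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


# Toy corpus and merge count, chosen only for illustration.
corpus = "আমি বাংলায় গান গাই। আমি বাংলায় কথা বলি।"
merges = train(corpus, num_merges=20)
vocab = build_vocab(merges)

text = "আমি বাংলায় গাই"
ids = encode(text, merges)      # output 1: numeric tokens
restored = decode(ids, vocab)   # output 2: regenerated sentence
# Output 3: token-to-text mapping; a token that covers only part of a
# character is shown with the Unicode replacement character.
mapping = [(i, vocab[i].decode("utf-8", errors="replace")) for i in ids]

print(ids)
print(restored)
print(mapping)
```

Because every Bengali character occupies three bytes in UTF-8, some tokens in a byte-level scheme cover only a fragment of a character, which is why the mapping decodes with `errors="replace"` rather than failing.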