sayanbanerjee32 committed commit 65617ff (1 parent: 10861ba)
Update README.md
README.md CHANGED
@@ -10,4 +10,22 @@ pinned: false
license: mit
---

# Bengali BPE Tokenizer

## Dataset

Several raw Bengali corpora are listed in this [GitHub repository](https://github.com/sagorbrur/bangla-corpus). The following sources from that list were used to gather raw Bengali text for training the tokenizer:
- [Tab-delimited Bilingual Sentence Pairs](https://www.manythings.org/anki/) - Selected sentence pairs from the [Tatoeba Project](http://tatoeba.org/home). The file contains approximately 6,500 English-Bengali sentence pairs; only the Bengali sentences are extracted for training the tokenizer.
- [IndicParaphrase](https://huggingface.co/datasets/ai4bharat/IndicParaphrase) - Only the input sentences from the validation set of the [Bengali paraphrases](https://huggingface.co/datasets/ai4bharat/IndicParaphrase/blob/main/data/bn_IndicParaphrase_v1.0.zip) are used for training the tokenizer. That validation set contains 10,000 Bengali sentences.
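
The extraction step for the sentence-pair file can be sketched as follows. This is a minimal illustration, not the code used for this project: it assumes each line has the layout `English<TAB>Bengali<TAB>attribution`, and the function name `extract_bengali`, the `column` parameter, and the sample lines are all hypothetical.

```python
from typing import Iterable, List


def extract_bengali(lines: Iterable[str], column: int = 1) -> List[str]:
    """Return the Bengali side of tab-delimited sentence pairs.

    Assumption: each line is "English<TAB>Bengali<TAB>attribution",
    so the Bengali sentence sits in column 1 (0-based).
    """
    sentences = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue  # skip blank or malformed lines
        text = fields[column].strip()
        if text:
            sentences.append(text)
    return sentences


# Made-up sample lines standing in for the downloaded file.
sample = [
    "Hello.\tনমস্কার।\tattribution text\n",
    "\n",
    "Thank you.\tধন্যবাদ।\tattribution text\n",
]
corpus = extract_bengali(sample)
```

With a real file, the same function could be fed an open file handle (opened with `encoding="utf-8"`) instead of the sample list.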

## Tokenizer

The tokenizer artifacts are available at https://huggingface.co/sayanbanerjee32/bengali_tokenizer

## The HuggingFace Spaces Gradio App

The app takes one or more Bengali sentences as input and provides the following outputs:
1. Numeric tokens that represent the sentence (using the encode function)
2. The sentence regenerated from the tokens (using the decode function)
3. A visualization that maps each token to its Bengali text, as an explanation of the tokenization
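
The three outputs above can be illustrated with a toy byte-level BPE. This is only a sketch under stated assumptions, not the tokenizer published with this project: the functions `train`, `encode`, `decode`, and `explain` are hypothetical names, the merges are learned from a two-sentence corpus, and new token ids simply start at 256.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[int, int]


def merge(ids: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train(text: str, num_merges: int) -> Dict[Pair, int]:
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges: Dict[Pair, int] = {}
    for step in range(num_merges):
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break  # fewer than two tokens left, nothing to merge
        pair = counts.most_common(1)[0][0]
        merges[pair] = 256 + step
        ids = merge(ids, pair, 256 + step)
    return merges


def build_vocab(merges: Dict[Pair, int]) -> Dict[int, bytes]:
    """Map every token id to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), idx in merges.items():  # dict order is training order
        vocab[idx] = vocab[a] + vocab[b]
    return vocab


def encode(text: str, merges: Dict[Pair, int]) -> List[int]:
    """Output 1: numeric tokens for the sentence."""
    ids = list(text.encode("utf-8"))
    for pair, idx in merges.items():  # apply merges in learned order
        ids = merge(ids, pair, idx)
    return ids


def decode(ids: List[int], vocab: Dict[int, bytes]) -> str:
    """Output 2: the sentence regenerated from the tokens."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


def explain(ids: List[int], vocab: Dict[int, bytes]) -> List[Tuple[int, str]]:
    """Output 3: each token paired with the text it covers."""
    return [(i, vocab[i].decode("utf-8", errors="replace")) for i in ids]


sentence = "আমি বাংলায় গান গাই।"
merges = train(sentence + " আমি বাংলায় কথা বলি।", num_merges=20)
vocab = build_vocab(merges)
ids = encode(sentence, merges)
roundtrip = decode(ids, vocab)
mapping = explain(ids, vocab)
```

Because a token may cover only part of a multi-byte Bengali character, `explain` decodes with `errors="replace"`; such tokens show a replacement character rather than raising an error.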