---
title: Bengali Bpe Tokenizer
emoji: 📈
colorFrom: purple
colorTo: gray
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: mit
---

# Bengali BPE Tokenizer

## Dataset

Several sources of raw Bengali text are listed in this [GitHub repository](https://github.com/sagorbrur/bangla-corpus). The following sources from that list were used to gather raw Bengali text for training the tokenizer:

- [Tab-delimited Bilingual Sentence Pairs](https://www.manythings.org/anki/) - Selected sentence pairs from the [Tatoeba Project](http://tatoeba.org/home). The file contains approximately 6,500 English-to-Bengali sentence pairs. Only the Bengali sentences are extracted for training the tokenizer.
- [IndicParaphrase](https://huggingface.co/datasets/ai4bharat/IndicParaphrase) - Only the input sentences from the validation set of the [Bengali paraphrases](https://huggingface.co/datasets/ai4bharat/IndicParaphrase/blob/main/data/bn_IndicParaphrase_v1.0.zip) are used for training the tokenizer. That validation set contains 10,000 Bengali sentences.
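
Extracting only the Bengali side of the sentence pairs could look roughly like the sketch below. It assumes each line has the form `English<TAB>Bengali<TAB>attribution`; the function name `extract_bengali` and the column index are illustrative and not taken from this repository.

```python
def extract_bengali(lines, column=1):
    """Return the Bengali field of each tab-delimited line.

    Assumption: the Bengali sentence sits in the second column
    (index 1). Lines with too few fields are skipped.
    """
    sentences = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and fields[column].strip():
            sentences.append(fields[column].strip())
    return sentences


# In-memory sample standing in for the downloaded file.
sample_lines = [
    "Hello.\tনমস্কার।\tattribution text\n",
    "Thanks.\tধন্যবাদ।\tattribution text\n",
    "a line without any tab\n",
]
bengali_sentences = extract_bengali(sample_lines)
```

For a real file, the same function can be applied to an open file handle, for example `extract_bengali(open(path, encoding="utf-8"))`, where `path` points to the downloaded text file.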
 
## Tokenizer

The tokenizer artifacts are available at https://huggingface.co/sayanbanerjee32/bengali_tokenizer

## The HuggingFace Spaces Gradio App

The app takes one or more Bengali sentences as input and provides the following outputs:

1. Numeric token IDs that represent the sentence (using the `encode` function)
2. The sentence regenerated from those token IDs (using the `decode` function)
3. A visualization that maps each token to the Bengali text it covers, as an explanation of the tokenization
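
The three outputs above can be illustrated with a minimal byte-level BPE sketch. This is not the code behind the app or the published tokenizer: it assumes a plain byte-level BPE without any pre-tokenization, and the names `train_bpe`, `build_vocab`, `encode`, `decode`, and `token_mapping` are hypothetical.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in learned order
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        new_id = 256 + len(merges)
        ids = merge(ids, best, new_id)
        merges[best] = new_id
    return merges


def build_vocab(merges):
    """Map every token ID to the bytes it stands for."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return vocab


def encode(text, merges):
    """Turn text into token IDs by replaying merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge(ids, pair, new_id)
    return ids


def decode(ids, vocab):
    """Turn token IDs back into text."""
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


def token_mapping(ids, vocab):
    """Pair each token ID with the text fragment it covers."""
    return [(i, vocab[i].decode("utf-8", errors="replace")) for i in ids]


corpus = "আমি বাংলায় গান গাই। আমি বাংলায় কথা বলি।"
merges = train_bpe(corpus, num_merges=20)
vocab = build_vocab(merges)

sentence = "আমি বাংলায়"
token_ids = encode(sentence, merges)      # output 1: numeric tokens
regenerated = decode(token_ids, vocab)    # output 2: regenerated sentence
mapping = token_mapping(token_ids, vocab) # output 3: token-to-text mapping
```

Because Bengali characters occupy three bytes each in UTF-8, a token with few merges may cover only part of a character. In that case `token_mapping` shows a replacement character for that token, while decoding the full ID sequence still reproduces the original sentence.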