
Text Recognition

Overview

The structure of the text recognition dataset directory is organized as follows.

├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── III5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── alphanumeric_labels.txt
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
│   ├── TextOCR
│   │   ├── image
│   │   ├── train_label.txt
│   │   ├── val_label.txt
│   ├── Totaltext
│   │   ├── imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   ├── OpenVINO
│   │   ├── image_1
│   │   ├── image_2
│   │   ├── image_5
│   │   ├── image_f
│   │   ├── image_val
│   │   ├── train_1_label.txt
│   │   ├── train_2_label.txt
│   │   ├── train_5_label.txt
│   │   ├── train_f_label.txt
│   │   ├── val_label.txt
│   ├── funsd
│   │   ├── imgs
│   │   ├── dst_imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
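
Once a few datasets are in place, the layout can be checked against the tree above. The following is a minimal sketch, not part of MMOCR: the `EXPECTED` mapping copies only a handful of entries from the tree, `find_missing` is a hypothetical helper, and `data/mixture` is assumed to be the location of the `mixture` directory inside the MMOCR checkout.

```python
from pathlib import Path

# A subset of the tree above: dataset directory -> entries expected inside it.
# Extend this mapping with the datasets you actually use.
EXPECTED = {
    "icdar_2015": ["train_label.txt", "test_label.txt",
                   "ch4_training_word_images_gt", "ch4_test_word_images_gt"],
    "III5K": ["train_label.txt", "test_label.txt", "train", "test"],
    "svt": ["test_label.txt", "image"],
    "Syn90k": ["shuffle_labels.txt", "label.txt", "label.lmdb", "mnt"],
}


def find_missing(mixture_root, expected=EXPECTED):
    """Return the relative paths listed in `expected` that do not exist."""
    root = Path(mixture_root)
    missing = []
    for dataset, entries in expected.items():
        for entry in entries:
            # Path.exists() follows soft links, so linked datasets are accepted.
            if not (root / dataset / entry).exists():
                missing.append(f"{dataset}/{entry}")
    return missing


if __name__ == "__main__":
    # Assumed location of the mixture directory; adjust to your setup.
    for path in find_missing("data/mixture"):
        print("missing:", path)
```

An empty result only means that the listed entries exist; it says nothing about whether the label files match the images.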

(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.

Preparation Steps

ICDAR 2013

ICDAR 2015

IIIT5K

svt

python tools/data/textrecog/svt_converter.py <download_svt_dir_path>

ct80

svtp

coco_text

MJSynth (Syn90k)

  • Step1: Download mjsynth.tar.gz from the homepage
  • Step2: Download label.txt (8,919,273 annotations) and shuffle_labels.txt (2,400,000 randomly sampled annotations). Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
  • Step3:
mkdir Syn90k && cd Syn90k

mv /path/to/mjsynth.tar.gz .

tar -xzf mjsynth.tar.gz

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/Syn90k Syn90k
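
The soft-link step recurs for several datasets in this document, so it may be convenient to script it. The sketch below is one possible way to do that in Python; `link_dataset` is a hypothetical helper, not an MMOCR tool, and it assumes the link should carry the same name as the dataset directory, as in the `ln -s` command above.

```python
import os
from pathlib import Path


def link_dataset(dataset_dir, mixture_root):
    """Create <mixture_root>/<name> as a soft link to dataset_dir.

    Returns the link path. A link that already points at dataset_dir is
    left alone; anything else found at that path raises FileExistsError.
    """
    target = Path(dataset_dir).resolve()
    link = Path(mixture_root) / target.name
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink():
        if link.resolve() == target:
            return link  # already linked, nothing to do
        raise FileExistsError(f"{link} points elsewhere")
    if link.exists():
        raise FileExistsError(f"{link} exists and is not a link")
    os.symlink(target, link, target_is_directory=True)
    return link
```

For example, `link_dataset("/path/to/Syn90k", "/path/to/mmocr/data/mixture")` would mirror the two shell commands above, and calling it a second time is harmless.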

SynthText (Synth800k)

  • Step1: Download SynthText.zip from the homepage

  • Step2: Download the annotation file that fits your needs: label.txt (7,266,686 annotations), shuffle_labels.txt (2,400,000 randomly sampled annotations), alphanumeric_labels.txt (7,239,272 annotations with alphanumeric characters only) or instances_train.txt (7,266,686 character-level annotations).

:::{warning}
Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
:::

  • Step3:
mkdir SynthText && cd SynthText
mv /path/to/SynthText.zip .
unzip SynthText.zip
mv SynthText synthtext

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .
mv /path/to/alphanumeric_labels.txt .
mv /path/to/instances_train.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthText SynthText
  • Step4: Generate cropped images and labels:
cd /path/to/mmocr

python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8
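
Because Step2 offers several annotation files with different sizes, a quick count can confirm which one actually ended up on disk. This is a rough sketch under one stated assumption: each annotation occupies exactly one line of the file. The counts are the ones quoted in Step2; `instances_train.txt` is left out because its character-level layout may not follow that assumption. Both function names are hypothetical.

```python
from pathlib import Path

# Annotation counts quoted in Step2 above.
EXPECTED_COUNTS = {
    "label.txt": 7_266_686,
    "shuffle_labels.txt": 2_400_000,
    "alphanumeric_labels.txt": 7_239_272,
}


def count_annotations(path):
    """Count non-empty lines, assuming one annotation per line."""
    with open(path, "rb") as f:
        return sum(1 for line in f if line.strip())


def check_label_file(path, expected=EXPECTED_COUNTS):
    """Return (actual_count, expected_count); expected is None for unknown names."""
    return count_annotations(path), expected.get(Path(path).name)


if __name__ == "__main__":
    label = Path("data/mixture/SynthText/shuffle_labels.txt")
    if label.is_file():
        actual, wanted = check_label_file(label)
        print(f"{label.name}: {actual} annotations (expected {wanted})")
```

A mismatch does not necessarily mean the file is broken, only that it is worth checking against the dataset specs in Model Zoo.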

SynthAdd

  • Step1: Download SynthText_Add.zip from SynthAdd (code: 627x)
  • Step2: Download label.txt
  • Step3:
mkdir SynthAdd && cd SynthAdd

mv /path/to/SynthText_Add.zip .

unzip SynthText_Add.zip

mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/SynthAdd SynthAdd

:::{tip} To convert a label file from txt format to lmdb format, run:

python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>

For example,

python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb

:::
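
When several datasets need the conversion described in the tip, the commands can be generated rather than typed by hand. The sketch below only builds and optionally runs the same `txt2lmdb.py` invocation with the `-i` and `-o` options shown above. The helper names are hypothetical, and writing `label.lmdb` next to `label.txt` is an assumption based on the directory tree in the overview.

```python
import subprocess
from pathlib import Path


def txt2lmdb_command(txt_label_path):
    """Build the txt2lmdb.py command, placing the .lmdb next to the .txt file."""
    txt = Path(txt_label_path)
    lmdb = txt.with_suffix(".lmdb")
    return ["python", "tools/data/utils/txt2lmdb.py",
            "-i", str(txt), "-o", str(lmdb)]


def convert_all(mixture_root, datasets, dry_run=True):
    """Build (and, unless dry_run, run) the command for each <dataset>/label.txt."""
    commands = []
    for name in datasets:
        label = Path(mixture_root) / name / "label.txt"
        if not label.is_file():
            continue  # skip datasets that have not been prepared yet
        cmd = txt2lmdb_command(label)
        commands.append(cmd)
        if not dry_run:
            # Must be run from the MMOCR root so the relative tool path resolves.
            subprocess.run(cmd, check=True)
    return commands
```

With the default `dry_run=True`, `convert_all("data/mixture", ["Syn90k", "SynthText", "SynthAdd"])` only returns the commands, which makes it easy to review them before running anything.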

TextOCR

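  • Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to textocr/.
mkdir textocr && cd textocr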

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
  • Step2: Generate train_label.txt, val_label.txt and crop images using 4 processes with the following command:
python tools/data/textrecog/textocr_converter.py /path/to/textocr 4

Totaltext

  • Step1: Download totaltext.zip from the GitHub dataset repository and groundtruth_text.zip from the GitHub Groundtruth directory (our totaltext_converter.py supports ground truth in both .mat and .txt formats).
mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For annotations
unzip groundtruth_text.zip
cd Groundtruth
mv Polygon/Train ../annotations/training
mv Polygon/Test ../annotations/test
  • Step2: Generate cropped images, train_label.txt and test_label.txt with the following command (the cropped images will be saved to data/totaltext/dst_imgs/):
python tools/data/textrecog/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test

OpenVINO

  • Step0: Install awscli.
  • Step1: Download Open Images subsets train_1, train_2, train_5, train_f, and validation to openvino/.
mkdir openvino && cd openvino

# Download Open Images subsets
for s in 1 2 5 f; do
  aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
done
aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .

# Download annotations
for s in 1 2 5 f; do
  wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
done
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json

# Extract images
mkdir -p openimages_v5/val
for s in 1 2 5 f; do
  tar zxf train_${s}.tar.gz -C openimages_v5
done
tar zxf validation.tar.gz -C openimages_v5/val
  • Step2: Generate train_{1,2,5,f}_label.txt, val_label.txt and crop images using 4 processes with the following command:
python tools/data/textrecog/openvino_converter.py /path/to/openvino 4

FUNSD

  • Step1: Download dataset.zip to funsd/.
mkdir funsd && cd funsd

# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip

# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/

# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test

rm dataset.zip && rm -rf dataset
  • Step2: Generate train_label.txt and test_label.txt and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts):
python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
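
The FUNSD download step moves the training and testing images into a single imgs/ directory, so a file name that appeared in both splits would be silently overwritten by `mv`. The sketch below is a precaution one could run after unzipping and before the `mv` commands. It is not known to be needed for the FUNSD archive, and `name_collisions` is a hypothetical helper.

```python
from pathlib import Path


def name_collisions(dir_a, dir_b):
    """Return the sorted file names that are present in both directories."""
    names_a = {p.name for p in Path(dir_a).iterdir() if p.is_file()}
    names_b = {p.name for p in Path(dir_b).iterdir() if p.is_file()}
    return sorted(names_a & names_b)


if __name__ == "__main__":
    train = Path("dataset/training_data/images")
    test = Path("dataset/testing_data/images")
    if train.is_dir() and test.is_dir():
        clashes = name_collisions(train, test)
        print(f"{len(clashes)} file name(s) appear in both splits")
```

An empty list means the two image directories can be merged without one file replacing another.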