This software was developed by the National Diet Library under contract to Morpho AI Solutions, Inc. |
This software is largely based on the following repositories. |
The newly developed portion of this program is released by the National Diet Library under a CC BY 4.0 license. For more information, see [LICENSE](./LICENSE).
# What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis |
| [paper](https://arxiv.org/abs/1904.01906) | [training and evaluation data](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here) | [failure cases and cleansed label](https://github.com/clovaai/deep-text-recognition-benchmark#download-failure-cases-and-cleansed-label-from-here) | [pretrained model](https://www.dropbox.com/sh/j3xmli4di1zuv3s/AAArdcPgz7UFxIHUuKNOeKv_a?dl=0) | [Baidu ver(passwd:rryk)](https://pan.baidu.com/s/1KSNLv4EY3zFWHpBYlpFCBQ) | |
[The original repository is here](https://github.com/clovaai/deep-text-recognition-benchmark).
Official PyTorch implementation of our four-stage STR framework, which most existing STR models fit into. <br>
This framework makes it possible to measure module-wise contributions to performance in terms of accuracy, speed, and memory demand, under one consistent set of training and evaluation datasets. <br>
Such analyses remove the obstacles that inconsistent comparisons pose to understanding the performance gains of existing modules. <br><br>
<img src="./figures/trade-off.png" width="1000" title="trade-off"> |
## Honors |
Based on this framework, we took 1st place in [ICDAR2013 focused scene text](https://rrc.cvc.uab.es/?ch=2&com=evaluation&task=3) and [ICDAR2019 ArT](https://rrc.cvc.uab.es/files/ICDAR2019-ArT.pdf), and 3rd place in [ICDAR2017 COCO-Text](https://rrc.cvc.uab.es/?ch=5&com=evaluation&task=2) and [ICDAR2019 ReCTS (task1)](https://rrc.cvc.uab.es/files/ICDAR2019-ReCTS.pdf). <br>
The differences between our paper and the ICDAR challenges are summarized [here](https://github.com/clovaai/deep-text-recognition-benchmark/issues/13).
## Updates |
**Aug 3, 2020**: added [guideline to use Baidu warpctc](https://github.com/clovaai/deep-text-recognition-benchmark/pull/209) which reproduces CTC results of our paper. <br> |
**Dec 27, 2019**: added [FLOPS](https://github.com/clovaai/deep-text-recognition-benchmark/issues/125) in our paper, and minor updates such as log_dataset.txt and [ICDAR2019-NormalizedED](https://github.com/clovaai/deep-text-recognition-benchmark/blob/86451088248e0490ff8b5f74d33f7d014f6c249a/test.py#L139-L165). <br> |
**Oct 22, 2019**: added [confidence score](https://github.com/clovaai/deep-text-recognition-benchmark/issues/82), and arranged the output form of training logs. <br> |
**Jul 31, 2019**: The paper is accepted at International Conference on Computer Vision (ICCV), Seoul 2019, as an oral talk. <br> |
**Jul 25, 2019**: Code for 16-bit floating-point calculation is available; check [@YacobBY's](https://github.com/YacobBY) [pull request](https://github.com/clovaai/deep-text-recognition-benchmark/pull/36) <br>
**Jul 16, 2019**: added the [ST_spe.zip](https://drive.google.com/drive/folders/192UfE9agQUMNq6AgU3_E05_FcPZK4hyt) dataset, which consists of word images containing special characters from the SynthText (ST) dataset; see [this issue](https://github.com/clovaai/deep-text-recognition-benchmark/issues/7#issuecomment-511727025) <br>
**Jun 24, 2019**: added gt.txt of failure cases that contains path and label of each image, see [image_release_190624.zip](https://drive.google.com/open?id=1VAP9l5GL5fgptgKDLio_h3nMe7X9W0Mf) <br> |
**May 17, 2019**: uploaded resources in Baidu Netdisk also, added [Run demo](https://github.com/clovaai/deep-text-recognition-benchmark#run-demo-with-pretrained-model). (check [@sharavsambuu's](https://github.com/sharavsambuu) [colab demo also](https://colab.research.google.com/drive/1PHnc_QYyf9b1_KJ1r15wYXaOXkdm1Mrk)) <br> |
**May 9, 2019**: PyTorch version updated from 1.0.1 to 1.1.0, switched to torch.nn.CTCLoss instead of torch-baidu-ctc, and made various minor updates.
## Getting Started |
### Dependency |
- This work was tested with PyTorch 1.3.1, CUDA 10.1, python 3.6 and Ubuntu 16.04. <br> You may need `pip3 install torch==1.3.1`. <br> |
In the paper, experiments were performed with **PyTorch 0.4.1, CUDA 9.0**.
- requirements : lmdb, pillow, torchvision, nltk, natsort |
```
pip3 install lmdb pillow torchvision nltk natsort
```
### Download lmdb dataset for training and evaluation from [here](https://www.dropbox.com/sh/i39abvnefllx2si/AAAbAYRvxzRp3cIE5HzqUw3ra?dl=0)
data_lmdb_release.zip contains the following. <br>
training datasets : [MJSynth (MJ)](http://www.robots.ox.ac.uk/~vgg/data/text/)[1] and [SynthText (ST)](http://www.robots.ox.ac.uk/~vgg/data/scenetext/)[2] \
validation datasets : the union of the training sets [IC13](http://rrc.cvc.uab.es/?ch=2)[3], [IC15](http://rrc.cvc.uab.es/?ch=4)[4], [IIIT](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html)[5], and [SVT](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset)[6]. \
evaluation datasets : benchmark evaluation datasets, consisting of [IIIT](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html)[5], [SVT](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset)[6], [IC03](http://www.iapr-tc11.org/mediawiki/index.php/ICDAR_2003_Robust_Reading_Competitions)[7], [IC13](http://rrc.cvc.uab.es/?ch=2)[3], [IC15](http://rrc.cvc.uab.es/?ch=4)[4], [SVTP](http://openaccess.thecvf.com/content_iccv_2013/papers/Phan_Recognizing_Text_with_2013_ICCV_paper.pdf)[8], and [CUTE](http://cs-chan.com/downloads_CUTE80_dataset.html)[9].
### Run demo with pretrained model |
1. Download pretrained model from [here](https://drive.google.com/drive/folders/15WPsuPJDCzhp2SvYZLRj8mAlT3zmoAMW) |
2. Add image files to test into `demo_image/` |
3. Run demo.py (add the `--sensitive` option if you use a case-sensitive model)
```
CUDA_VISIBLE_DEVICES=0 python3 demo.py \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--image_folder demo_image/ \
--saved_model TPS-ResNet-BiLSTM-Attn.pth
```
#### prediction results |
| demo images | [TRBA (**T**PS-**R**esNet-**B**iLSTM-**A**ttn)](https://drive.google.com/open?id=1b59rXuGGmKne1AuHnkgDzoYgKeETNMv9) | [TRBA (case-sensitive version)](https://drive.google.com/open?id=1ajONZOgiG9pEYsQ-eBmgkVbMDuHgPCaY) | |
| --- | --- | --- | |
| <img src="./demo_image/demo_1.png" width="300"> | available | Available | |
| <img src="./demo_image/demo_2.jpg" width="300"> | shakeshack | SHARESHACK | |
| <img src="./demo_image/demo_3.png" width="300"> | london | Londen | |
| <img src="./demo_image/demo_4.png" width="300"> | greenstead | Greenstead | |
| <img src="./demo_image/demo_5.png" width="300" height="100"> | toast | TOAST | |
| <img src="./demo_image/demo_6.png" width="300" height="100"> | merry | MERRY | |
| <img src="./demo_image/demo_7.png" width="300"> | underground | underground | |
| <img src="./demo_image/demo_8.jpg" width="300"> | ronaldo | RONALDO | |
| <img src="./demo_image/demo_9.jpg" width="300" height="100"> | bally | BALLY | |
| <img src="./demo_image/demo_10.jpg" width="300" height="100"> | university | UNIVERSITY | |
### Training and evaluation |
1. Train CRNN[10] model |
```
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC
```
2. Test CRNN[10] model. If you want to evaluate IC15-2077, check [data filtering part](https://github.com/clovaai/deep-text-recognition-benchmark/blob/c27abe6b4c681e2ee0784ad966602c056a0dd3b5/dataset.py#L148). |
```
CUDA_VISIBLE_DEVICES=0 python3 test.py \
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation None --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC \
--saved_model saved_models/None-VGG-BiLSTM-CTC-Seed1111/best_accuracy.pth
```
3. You can also train and test our best-accuracy model, TRBA (**T**PS-**R**esNet-**B**iLSTM-**A**ttn). ([download pretrained model](https://drive.google.com/drive/folders/15WPsuPJDCzhp2SvYZLRj8mAlT3zmoAMW))
```
CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data data_lmdb_release/training --valid_data data_lmdb_release/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
```
```
CUDA_VISIBLE_DEVICES=0 python3 test.py \
--eval_data data_lmdb_release/evaluation --benchmark_all_eval \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn \
--saved_model saved_models/TPS-ResNet-BiLSTM-Attn-Seed1111/best_accuracy.pth
```
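The `--saved_model` paths in the commands above suggest that each run is stored in a folder named after the four selected modules plus a seed. The sketch below only reproduces that naming pattern so you can locate a checkpoint for a given module combination. The helper name `experiment_dir` and the default seed `1111` are assumptions inferred from the example paths, not a documented API of `train.py`.

```python
def experiment_dir(transformation, feature_extraction, sequence_modeling,
                   prediction, seed=1111, root="saved_models"):
    """Build a checkpoint folder path in the style of the example commands.

    Hypothetical helper: the pattern is inferred from paths such as
    saved_models/None-VGG-BiLSTM-CTC-Seed1111 and may not match every setup.
    """
    # Join the four stage names in the order Trans-Feat-Seq-Pred.
    name = "-".join([transformation, feature_extraction,
                     sequence_modeling, prediction])
    return f"{root}/{name}-Seed{seed}"


# Locate the checkpoints used in the CRNN and TRBA test commands.
crnn_checkpoint = experiment_dir("None", "VGG", "BiLSTM", "CTC") + "/best_accuracy.pth"
trba_checkpoint = experiment_dir("TPS", "ResNet", "BiLSTM", "Attn") + "/best_accuracy.pth"
print(crnn_checkpoint)
print(trba_checkpoint)
```

If your runs were saved under a different seed or output root, pass `seed=` and `root=` accordingly.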
### Arguments |
* `--train_data`: folder path to training lmdb dataset. |
* `--valid_data`: folder path to validation lmdb dataset. |
* `--eval_data`: folder path to evaluation (with test.py) lmdb dataset. |
* `--select_data`: select training data. The default is MJ-ST, which means MJ and ST are used as training data.
* `--batch_ratio`: assign the ratio of each selected dataset in the batch. The default is 0.5-0.5, which means 50% of the batch is filled with MJ and the other 50% is filled with ST.
* `--data_filtering_off`: skip [data filtering](https://github.com/clovaai/deep-text-recognition-benchmark/blob/f2c54ae2a4cc787a0f5859e9fdd0e399812c76a3/dataset.py#L126-L146) when creating LmdbDataset. |
* `--Transformation`: select Transformation module [None | TPS]. |
* `--FeatureExtraction`: select FeatureExtraction module [VGG | RCNN | ResNet]. |
* `--SequenceModeling`: select SequenceModeling module [None | BiLSTM]. |
* `--Prediction`: select Prediction module [CTC | Attn]. |
* `--saved_model`: path to the saved model to evaluate.
* `--benchmark_all_eval`: evaluate on the 10 evaluation dataset versions, the same as in Table 1 of our paper.
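To make the relationship between `--select_data` and `--batch_ratio` concrete, here is a minimal sketch of how the two dash-separated strings could be turned into per-dataset sample counts for one batch. The function `split_batch` is hypothetical, and the total batch size of 192 is chosen only for illustration; the actual logic in the training code may differ in rounding and validation.

```python
def split_batch(select_data, batch_ratio, batch_size):
    """Map each selected dataset to the number of samples it gets per batch.

    Hypothetical helper illustrating the '--select_data MJ-ST' and
    '--batch_ratio 0.5-0.5' conventions described above.
    """
    names = select_data.split("-")
    ratios = [float(r) for r in batch_ratio.split("-")]
    if len(names) != len(ratios):
        raise ValueError("select_data and batch_ratio need the same number of parts")
    # Give every dataset at least one sample, even with a tiny ratio.
    return {name: max(round(batch_size * ratio), 1)
            for name, ratio in zip(names, ratios)}


print(split_batch("MJ-ST", "0.5-0.5", 192))  # {'MJ': 96, 'ST': 96}
```

With a custom dataset you would list your own dataset names in `--select_data` and give one ratio per name, in the same order.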
## Download failure cases and cleansed label from [here](https://www.dropbox.com/s/5knh1gb1z593fxj/image_release_190624.zip?dl=0) |
image_release.zip contains failure case images and benchmark evaluation images with cleansed labels.
<img src="./figures/failure-case.jpg" width="1000" title="failure cases"> |
## When you need to train on your own dataset or Non-Latin language datasets. |
1. Create your own lmdb dataset. |
```
pip3 install fire
python3 create_lmdb_dataset.py --inputPath data/ --gtFile data/gt.txt --outputPath result/
```
The data folder is structured as follows.
```
data
├── gt.txt
└── test
    ├── word_1.png
    ├── word_2.png
    ├── word_3.png
    └── ...
```
Each line of `gt.txt` should have the form `{imagepath}\t{label}\n` <br>
For example
```
test/word_1.png Tiredness
test/word_2.png kills
test/word_3.png A
...
```
2. Modify `--select_data`, `--batch_ratio`, and `opt.character`, see [this issue](https://github.com/clovaai/deep-text-recognition-benchmark/issues/85). |
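The `gt.txt` format from step 1 and the `opt.character` change from step 2 can be sketched together. The example below parses tab-separated `{imagepath}\t{label}` lines and collects the distinct characters that appear in the labels, which is one way to derive a character set for a non-Latin dataset. Both function names are hypothetical, and the sketch assumes `opt.character` is a plain string of the characters the model should predict.

```python
def parse_gt_lines(lines):
    """Return (image_path, label) pairs from gt.txt-style lines.

    Each line is expected to be '<imagepath><TAB><label>' with a trailing
    newline. Hypothetical helper, not part of create_lmdb_dataset.py.
    """
    samples = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        # Split on the first tab only, so labels may contain later tabs.
        image_path, separator, label = line.partition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: expected a tab between path and label")
        samples.append((image_path, label))
    return samples


def character_set(samples):
    """Collect every distinct label character into one sorted string."""
    return "".join(sorted({ch for _, label in samples for ch in label}))


example = [
    "test/word_1.png\tTiredness\n",
    "test/word_2.png\tkills\n",
    "test/word_3.png\tA\n",
]
samples = parse_gt_lines(example)
print(samples[0])              # ('test/word_1.png', 'Tiredness')
print(character_set(samples))  # ATdeiklnrs
```

For a real file, pass an open file handle, for example `parse_gt_lines(open("data/gt.txt", encoding="utf-8"))`. Whether labels should be lower-cased before collecting characters depends on whether you train a case-sensitive model.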
## Acknowledgements |
This implementation is based on these repositories: [crnn.pytorch](https://github.com/meijieru/crnn.pytorch), [ocr_attention](https://github.com/marvis/ocr_attention).
## Reference |
[1] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. In Workshop on Deep Learning, NIPS, 2014. <br>
[2] A. Gupta, A. Vedaldi, and A. Zisserman. Synthetic data for text localisation in natural images. In CVPR, 2016. <br>
[3] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras. ICDAR 2013 robust reading competition. In ICDAR, pages 1484–1493, 2013. <br>
[4] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. ICDAR 2015 competition on robust reading. In ICDAR, pages 1156–1160, 2015. <br>
[5] A. Mishra, K. Alahari, and C. Jawahar. Scene text recognition using higher order language priors. In BMVC, 2012. <br>
[6] K. Wang, B. Babenko, and S. Belongie. End-to-end scene text recognition. In ICCV, pages 1457–1464, 2011. <br>
[7] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. ICDAR 2003 robust reading competitions. In ICDAR, pages 682–687, 2003. <br>
[8] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan. Recognizing text with perspective distortion in natural scenes. In ICCV, pages 569–576, 2013. <br>
[9] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan. A robust arbitrary text detection system for natural scene images. In ESWA, volume 41, pages 8027–8048, 2014. <br>
[10] B. Shi, X. Bai, and C. Yao. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. In TPAMI, volume 39, pages 2298–2304, 2017.
## Links |
- WebDemo : https://demo.ocr.clova.ai/ <br> |
Combination of Clova AI detection and recognition, additional/advanced features used for KOR/JPN. |
- Repo of detection : https://github.com/clovaai/CRAFT-pytorch |
## Citation |
Please consider citing this work in your publications if it helps your research. |
```
@inproceedings{baek2019STRcomparisons,
  title={What Is Wrong With Scene Text Recognition Model Comparisons? Dataset and Model Analysis},
  author={Baek, Jeonghun and Kim, Geewook and Lee, Junyeop and Park, Sungrae and Han, Dongyoon and Yun, Sangdoo and Oh, Seong Joon and Lee, Hwalsuk},
  booktitle = {International Conference on Computer Vision (ICCV)},
  year={2019},
  pubstate={published},
  tppubtype={inproceedings}
}
```
## Contact |
Feel free to contact us if you have any questions: <br>
for code/paper, Jeonghun Baek ku21fang@gmail.com; for collaboration, hwalsuk.lee@navercorp.com (our team leader).
## License |
Copyright (c) 2019-present NAVER Corp. |
Licensed under the Apache License, Version 2.0 (the "License"); |
you may not use this file except in compliance with the License. |
You may obtain a copy of the License at |
http://www.apache.org/licenses/LICENSE-2.0 |
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.