Document Domain Randomization for Deep Learning Document Layout Extraction

Description:

We present document domain randomization (DDR), the first successful transfer of CNNs trained only on graphically rendered pseudo-paper pages to real-world document segmentation. DDR renders pseudo-document pages by modeling randomized textual and non-textual contents of interest, with user-defined layout and font styles to support joint learning of fine-grained classes. We demonstrate competitive results using our DDR approach to extract nine document classes from the benchmark CS-150 and from papers published in two domains, namely the annual meetings of the Association for Computational Linguistics (ACL) and IEEE Visualization (VIS). We compare DDR to conditions of style mismatch and of fewer or noisier samples, which are more easily obtained in the real world. We show that high-fidelity semantic information is not necessary to label semantic classes, but that style mismatch between train and test data can lower model accuracy. Using smaller training samples had a slightly detrimental effect. Finally, network models still achieved high test accuracy when correct labels were diluted towards confusing labels; this behavior holds across several classes.
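To make the idea of rendering randomized pseudo-pages more concrete, the following is a minimal sketch of how labeled layout regions might be sampled for one page. It is an illustration under stated assumptions, not the authors' implementation: the class names, the page size (US Letter in points), the margin and gutter values, and the helper names `Box` and `random_page` are all hypothetical, and actual text and figure rendering is omitted.

```python
import random
from dataclasses import dataclass

# Hypothetical class names; the nine classes used in the paper are not listed here.
CLASSES = ["title", "body_text", "figure", "table", "caption"]


@dataclass
class Box:
    label: str
    x: float
    y: float
    w: float
    h: float


def random_page(rng, page_w=612.0, page_h=792.0, margin=54.0, gutter=18.0):
    """Return labeled, non-overlapping boxes for one randomized pseudo-page."""
    n_cols = rng.choice([1, 2])  # randomize between one- and two-column layouts
    col_w = (page_w - 2 * margin - (n_cols - 1) * gutter) / n_cols
    boxes = []
    for c in range(n_cols):
        x = margin + c * (col_w + gutter)
        y = margin
        while True:
            h = rng.uniform(40.0, 200.0)  # random region height
            if y + h > page_h - margin:   # stop when the column is full
                break
            boxes.append(Box(rng.choice(CLASSES), x, y, col_w, h))
            y += h + rng.uniform(6.0, 24.0)  # random vertical spacing
    return boxes


if __name__ == "__main__":
    page = random_page(random.Random(0))
    for box in page:
        print(box)
```

Each returned box could serve both as a region to fill with rendered content and as a ground-truth label for training, which is what makes synthetic pages of this kind inexpensive to annotate.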

Paper download: https://tobias.isenberg.cc/personal/papers/Ling_2021_DDR.pdf (12.8 MB)

Data: available at DOI 10.21227/326q-bf39

Cross-Reference:

This work is based on our past work on the VIS30K image dataset.

Paper Reference:

Meng Ling, Jian Chen, Torsten Möller, Petra Isenberg, Tobias Isenberg, Michael Sedlmair, Robert S. Laramee, Han-Wei Shen, Jian Wu, and C. Lee Giles (2021) Document Domain Randomization for Deep Learning Document Layout Extraction. In Josep Llados, Daniel Lopresti, and Seiichi Uchida, eds., Proceedings of the 16th International Conference on Document Analysis and Recognition (ICDAR, September 5–10, Lausanne, Switzerland). Springer, Cham, Switzerland, pages 497–513, 2021.

BibTeX entry:



@INPROCEEDINGS{Ling:2021:DDR,
  author      = {Meng Ling and Jian Chen and Torsten M{\"o}ller and Petra Isenberg and Tobias Isenberg and Michael Sedlmair and Robert S. Laramee and Han-Wei Shen and Jian Wu and Giles, C. Lee},
  title       = {Document Domain Randomization for Deep Learning Document Layout Extraction},
  booktitle   = {Proceedings of the 16\textsuperscript{th} International Conference on Document Analysis and Recognition (ICDAR, September 5--10, Lausanne, Switzerland)},
  OPTeditor   = {Josep Llados and Daniel Lopresti and Seiichi Uchida},
  year        = {2021},
  volume      = {1},
  pages       = {497--513},
  publisher   = {Springer},
  address     = {Cham, Switzerland},
  doi         = {10.1007/978-3-030-86549-8_32},
  doi_url     = {https://doi.org/10.1007/978-3-030-86549-8_32},
  shortdoi    = {10/kd2s},
  oa_hal_url  = {https://hal.science/hal-03336444},
  preprint    = {https://doi.org/10.48550/arXiv.2105.14931},
  url         = {https://tobias.isenberg.cc/p/Ling2021DDR},
  pdf         = {https://tobias.isenberg.cc/personal/papers/Ling_2021_DDR.pdf},
}

Dataset Reference:

This work was done in collaboration primarily with the Interactive Visual Computing Lab of The Ohio State University, USA, as well as with other research labs.