Recognition of Concordances for Indexing in Digital Libraries

  • Simone Marinai
  • Samuele Capobianco
  • Zahra Ziran
  • Andrea Giuntini
  • Pierluigi Mansueto
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1177)


We describe a system for the automatic transcription of books with concordances. Although OCR tools have nearly solved the recognition of printed text in high-quality documents, recognizing structured text, where dictionaries and other linguistic tools are of little help, remains difficult. In this work, we propose several techniques for correcting the imperfect OCR output that take into account both the physical features of the documents and the redundancy of information implicit in concordances.
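To illustrate what exploiting redundancy can mean in practice, the sketch below assumes that the same reference or context fragment is printed several times in a concordance (for example under different headwords) and that each occurrence has been read independently by the OCR software. The function names `similarity` and `consensus`, the sample readings, and the voting rule itself are hypothetical and chosen for illustration; this is not the method described in the paper. The sketch keeps the reading that is best supported by all the others, using string similarity from Python's standard `difflib` module.

```python
from collections import Counter
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def consensus(readings: list[str]) -> str:
    """Pick the OCR reading best supported by the other noisy readings.

    Assumption (illustrative only): all readings are meant to be the
    same printed string, repeated at different places in the book.
    """
    if not readings:
        raise ValueError("need at least one reading")
    counts = Counter(readings)

    def support(candidate: str) -> float:
        # Sum of similarities to every reading; exact repeats add 1.0 each.
        return sum(similarity(candidate, other) for other in readings)

    # Highest support wins; frequency breaks ties. Sorting the candidates
    # first makes the result deterministic.
    return max(sorted(counts), key=lambda c: (support(c), counts[c]))


if __name__ == "__main__":
    # Hypothetical OCR readings of one repeated reference.
    noisy = ["Lettera 3, 12", "Lettera 3, 12", "Leltera 3, 12", "Lettera 8, 12"]
    print(consensus(noisy))
```

A vote of this kind needs no dictionary, which matters for structured text where linguistic tools offer little help. A real system would likely combine it with layout cues such as column position or font style.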



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Simone Marinai (1, corresponding author)
  • Samuele Capobianco (1)
  • Zahra Ziran (1)
  • Andrea Giuntini (1)
  • Pierluigi Mansueto (1)

  1. University of Florence, Firenze, Italy
