Detection and Analysis of Table of Contents Based on Content Association
Lin, Xiaofan; Xiong, Yan
HPL-2005-105
Keyword(s): table of contents; document structure analysis; table recognition; optical character recognition; algorithm combination
Abstract: As a special type of table understanding, the detection and analysis of tables of contents (TOCs) play an important role in the digitization of multi-page documents. Most previous TOC analysis methods concentrate only on the TOC itself, without taking into account the other pages in the same document. In addition, they often require manual coding or at least machine learning of document-specific models. This paper introduces a new method to detect and analyze TOCs based on content association. It fully leverages the text information throughout the whole multi-page document and can be applied directly to a wide range of documents without the need to build or learn models for individual documents. The associations of general text and of page numbers are combined to make the TOC analysis more accurate, and natural language processing and layout analysis are integrated to improve TOC functional tagging. Applications of the proposed method in a large-scale digital library project are also discussed.
Notes: To be published in the International Journal on Document Analysis and Recognition, 2005
21 Pages