Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP

HP.com home


Technical Reports



» 

HP Labs

» Research
» News and events
» Technical reports
» About HP Labs
» Careers @ HP Labs
» People
» Worldwide sites
» Downloads
Content starts here

 
Click here for full text: PDF

Detection and Analysis of Table of Contents Based on Content Association

Lin, Xiaofan; Xiong, Yan

HPL-2005-105

Keyword(s): table of contents; document structure analysis; table recognition; optical character recognition; algorithm combination

Abstract: As a special type of table understanding, the detection and analysis of tables of contents (TOCs) play an important role in the digitization of multi- page documents. Most previous TOC analysis methods only concentrate on the TOC itself without taking into account the other pages in the same document. Besides, they often require manual coding or at least machine learning of document-specific models. This paper introduces a new method to detect and analyze TOCs based on content association. It fully leverages the text information throughout the whole multi-page document and can be directly applied to a wide range of documents without the need to build or learn the models for individual documents. In addition, the associations of general text and page numbers are combined to make the TOC analysis more accurate. Natural language processing and layout analysis are integrated to improve the TOC functional tagging. The applications of the proposed method in a large-scale digital library project are also discussed. Notes: To be published in the International Journal on Document Analysis and Recognition, 2005

21 Pages

Back to Index

»Technical Reports

» 2009
» 2008
» 2007
» 2006
» 2005
» 2004
» 2003
» 2002
» 2001
» 2000
» 1990 - 1999

Heritage Technical Reports

» Compaq & DEC Technical Reports
» Tandem Technical Reports
Printable version
Privacy statement Using this site means you accept its terms Feedback to HP Labs
© 2009 Hewlett-Packard Development Company, L.P.