Text Extraction and Compound Document Image Compression

Identifying text pixels in a scanned image is crucial for many image processing applications including OCR (Optical Character Recognition), page classification, image enhancement and compound document image compression. Compound document images contain both uniform-colored text characters and continuous tone pictorial regions.

Edge-based text extraction

We have developed an edge-based text extraction algorithm with the following major characteristics:

accurate at character boundaries
symmetric performance for regular text (darker than its surroundings) and inverse text (lighter than its surroundings)
largely independent of text orientation, font, size, language or alphabet, and layout

For algorithmic details, please refer to the following paper:

Jian Fan, "Text extraction via an edge-bounded averaging and a parametric character model", HP Labs Tech Report HPL 2002-294. A shorter version was published in Proc. of the SPIE, Document Recognition and Retrieval X, vol. 5010, pp. 8-19, Jan. 2003, Santa Clara, CA, USA.
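
The paper gives the actual algorithm. As a rough illustration only, the sketch below shows why an edge-based cue can treat regular and inverse text symmetrically: the absolute gradient of an image is unchanged when its intensities are inverted. The functions `edge_strength` and `edge_mask`, the threshold of 100, and the tiny test image are all assumptions made for this example; this is a plain gradient threshold, not the edge-bounded averaging method of the report.

```python
def edge_strength(img):
    """Per-pixel sum of absolute horizontal and vertical central differences.

    `img` is a list of rows of grayscale values (0-255). Borders are handled
    by clamping the neighbor index to the image.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            left = img[y][max(x - 1, 0)]
            right = img[y][min(x + 1, w - 1)]
            up = img[max(y - 1, 0)][x]
            down = img[min(y + 1, h - 1)][x]
            out[y][x] = abs(right - left) + abs(down - up)
    return out


def edge_mask(img, threshold=100):
    """Mark pixels whose edge strength reaches the (illustrative) threshold."""
    return [[1 if g >= threshold else 0 for g in row]
            for row in edge_strength(img)]


# A dark vertical stroke on a light background ("regular" text) ...
dark_on_light = [[250, 250, 20, 20, 250, 250] for _ in range(3)]
# ... and the same stroke with inverted intensities ("inverse" text).
light_on_dark = [[255 - v for v in row] for row in dark_on_light]

regular_edges = edge_mask(dark_on_light)
inverse_edges = edge_mask(light_on_dark)
print(regular_edges[0])                # [0, 1, 1, 1, 1, 0]
print(regular_edges == inverse_edges)  # True
```

Because only absolute differences are used, the two polarities produce the same edge map, so any later step that works from the edges needs no separate case for light-on-dark text.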

An example of text extraction:

The scanned image (300 dpi scan, original TIFF file size 24,656 KB, displayed here as a JPEG-compressed PDF, 691 KB)

Our text extraction result

Compound document image compression in PDF

Using the text extraction algorithm, we developed a three-layer compound document image compression scheme under the PDF 1.3 framework. Currently we compress the binary mask/text layer with CCITT G4 at its original resolution, and the color/gray background and foreground layers with JPEG, at 100 dpi and 50 dpi, respectively.
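
A minimal sketch of the three-layer idea, assuming a grayscale image stored as a list of rows and a text mask that is already available. The mask stays at scan resolution, while the background and foreground are block-averaged down to 100 dpi and 50 dpi (factors of 3 and 6 for a 300 dpi scan). The helper names, the fallback values used for blocks with no pixels of a layer, and the nearest-neighbor reconstruction are assumptions for illustration; the page does not describe how the layers are actually filled, and the CCITT G4 and JPEG coding steps are omitted here.

```python
def downsample_layer(img, mask, select, factor, fallback):
    """Block-average the pixels whose mask value equals `select`.

    Blocks that contain no such pixel get `fallback`, a placeholder
    hole-filling rule chosen only for this sketch.
    """
    h, w = len(img), len(img[0])
    out = []
    for by in range(0, h, factor):
        row = []
        for bx in range(0, w, factor):
            vals = [img[y][x]
                    for y in range(by, min(by + factor, h))
                    for x in range(bx, min(bx + factor, w))
                    if mask[y][x] == select]
            row.append(sum(vals) // len(vals) if vals else fallback)
        out.append(row)
    return out


def split_layers(img, mask, scan_dpi=300, bg_dpi=100, fg_dpi=50):
    """Return (mask, foreground, background); the mask keeps scan resolution."""
    bg = downsample_layer(img, mask, 0, scan_dpi // bg_dpi, fallback=255)
    fg = downsample_layer(img, mask, 1, scan_dpi // fg_dpi, fallback=0)
    return mask, fg, bg


def reconstruct(mask, fg, bg, scan_dpi=300, bg_dpi=100, fg_dpi=50):
    """Pick the foreground where the mask is 1 and the background elsewhere."""
    fb, ff = scan_dpi // bg_dpi, scan_dpi // fg_dpi
    return [[fg[y // ff][x // ff] if mask[y][x] else bg[y // fb][x // fb]
             for x in range(len(mask[0]))]
            for y in range(len(mask))]


# A 6x6 patch: a dark stroke (30) on a light background (200).
img = [[30 if (x in (2, 3) and 1 <= y <= 4) else 200 for x in range(6)]
       for y in range(6)]
mask = [[1 if v < 128 else 0 for v in row] for row in img]  # stand-in for text extraction

mask, fg, bg = split_layers(img, mask)
print(fg)                                # [[30]]
print(bg)                                # [[200, 200], [200, 200]]
print(reconstruct(mask, fg, bg) == img)  # True
```

The full-resolution mask is what keeps character boundaries sharp: the low-resolution layers only need to carry smooth color, which is why they tolerate heavy downsampling.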

Here is a simple comparison (zoom in to view or print out):

Our result (150 KB) achieves a 164:1 compression ratio with excellent text legibility.

The result with very high JPEG compression alone (191 KB) shows unacceptable artifacts even though it still falls short of the same compression ratio.
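
The ratios in this comparison follow directly from the file sizes quoted on this page. The short check below recomputes them; the function name `compression_ratio` is simply a label chosen for this example.

```python
def compression_ratio(original_kb, compressed_kb):
    """Ratio of the original file size to the compressed file size."""
    return original_kb / compressed_kb


ORIGINAL_TIFF_KB = 24656  # original scan size quoted above
three_layer = compression_ratio(ORIGINAL_TIFF_KB, 150)  # three-layer PDF
jpeg_only = compression_ratio(ORIGINAL_TIFF_KB, 191)    # high-compression JPEG

print(f"three-layer: {three_layer:.0f}:1")  # three-layer: 164:1
print(f"JPEG only:   {jpeg_only:.0f}:1")    # JPEG only:   129:1
```

So the JPEG-only file reaches roughly 129:1, still short of the 164:1 of the three-layer result.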

Contact

For more information about the technology, please contact Jian Fan (jian.fan@hp.com) in the Imaging Technology Dept., HP Labs.


© 2009 Hewlett-Packard Development Company, L.P.