Identifying text pixels in a scanned image is crucial for many
image processing applications, including OCR (Optical Character
Recognition), page classification, image enhancement, and compound
document image compression. Compound document images contain
both uniform-colored text characters and continuous-tone pictorial
regions.
Edge-based text extraction
We have developed an edge-based text extraction algorithm with
the following major characteristics:
- accurate at character boundaries
- symmetric performance for regular text (darker than surrounding) and inverse text (lighter than surrounding)
- largely independent of text orientation, font and size, language alphabet, and layout
For algorithmic details, please refer to the following paper:
Jian Fan, "Text extraction via an edge-bounded averaging and
a parametric character model", HP Labs Tech Report HPL-2002-294.
A shorter version was published in Proc. of the SPIE, Document
Recognition and Retrieval X, vol. 5010, pp. 8-19, Jan. 2003,
Santa Clara, CA, USA.
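To make the general idea concrete, here is a minimal Python sketch of edge-based text extraction. It is not the algorithm from the paper: it substitutes a Canny edge detector plus an Otsu threshold over an edge band for the paper's edge-bounded averaging and parametric character model, and all function and parameter names are illustrative assumptions.

    # Minimal sketch: locate text pixels by thresholding only near strong
    # edges. Working from edges keeps the method symmetric for regular
    # (dark-on-light) and inverse (light-on-dark) text.
    import cv2
    import numpy as np

    def extract_text_mask(gray: np.ndarray) -> np.ndarray:
        # 1. Strong edges mark candidate character boundaries.
        edges = cv2.Canny(gray, 50, 150)

        # 2. Restrict attention to a band around the edges; characters
        #    are thin, so a small dilation covers their interiors.
        band = cv2.dilate(edges, np.ones((5, 5), np.uint8))

        # 3. Split the two sides of each edge with a threshold computed
        #    from the banded pixels only (Otsu here, as a crude stand-in
        #    for the paper's edge-bounded averaging).
        pixels = gray[band > 0].reshape(1, -1)
        t, _ = cv2.threshold(pixels, 0, 255,
                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)

        # 4. Keep the darker side of the band as regular text; the
        #    lighter side would be kept instead for inverse text.
        mask = np.zeros_like(gray)
        mask[(band > 0) & (gray < t)] = 255
        return mask

    # Usage:
    # mask = extract_text_mask(cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE))

Because the decision is anchored to edges rather than to a global intensity model, the same machinery handles both text polarities and is largely insensitive to font, size, and orientation.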
An example of text extraction:
The scanned image (300 dpi scan; original TIFF file size 24,656 KB; displayed here as a JPEG-compressed PDF, 691 KB)
Our text extraction result
Compound document image compression in PDF
Using the text extraction algorithm, we developed a three-layer
compound document image compression scheme under the PDF 1.3 framework.
Currently we compress the binary mask/text layer with CCITT G4 at its
original resolution, and the color/gray background and foreground
layers with JPEG at 100 dpi and 50 dpi, respectively.
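The following Python sketch, using Pillow, illustrates the three-layer decomposition with the resolutions stated above. It is a simplification under stated assumptions: both color layers are downsampled copies of the original, whereas a real encoder would fill text areas in the background and average character colors into the foreground, and the final assembly of the layers into a PDF 1.3 file is omitted.

    # Three-layer MRC-style decomposition (simplified sketch).
    from PIL import Image

    def make_layers(img: Image.Image, mask: Image.Image, dpi: int = 300):
        # Mask/text layer: binary, kept at the original resolution,
        # compressed losslessly with CCITT Group 4.
        mask.convert("1").save("mask.tiff",
                               compression="group4", dpi=(dpi, dpi))

        # Background layer: pictorial content, downsampled to 100 dpi
        # and JPEG-compressed.
        bg = (img.width * 100 // dpi, img.height * 100 // dpi)
        img.resize(bg, Image.LANCZOS).save("background.jpg", quality=75)

        # Foreground layer: text colors, downsampled further to 50 dpi,
        # since character color varies slowly across the page.
        fg = (img.width * 50 // dpi, img.height * 50 // dpi)
        img.resize(fg, Image.LANCZOS).save("foreground.jpg", quality=75)

    # Usage (assumes an RGB scan and a grayscale text mask):
    # make_layers(Image.open("scan.png").convert("RGB"),
    #             Image.open("mask.png"))

The split pays off because each layer gets the codec it suits best: the sharp binary mask compresses extremely well under G4, while the smooth color layers tolerate aggressive JPEG downsampling without harming text legibility.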
Here is a simple comparison (zoom in to view or print out):
Our result (150 KB) achieves a 164:1 compression
ratio with excellent text legibility.
A plain JPEG at extra-high compression (191 KB)
shows unacceptable artifacts while still failing to reach the same
compression ratio.
Contact
For more information about the technology, please contact Jian
Fan (jian.fan@hp.com) at the Imaging Technology Department, HP Labs.