Biblio: Automatic meta-data extraction
Staelin, Carl; Elad, Michael; Greig, Darryl; Shmueli, Oded; Vans, Marie
HPL-2004-190
Keyword(s): document understanding; learning; support vector machines; neural networks
Abstract: Biblio is an adaptive system that automatically extracts meta-data from semi-structured and structured scanned documents. Instead of using hand-coded templates or other methods manually customized for each given document format, it uses example-based machine learning to adapt to customer-defined document and meta-data types. We provide results from two document corpuses, a set of scanned journal articles and a set of scanned legal documents. The first set is semi-structured, as the different journals use a variety of flexible layouts. The second set is largely free-form text based on poor quality scans of FAX-quality legal documents. We demonstrate accuracy on the semi-structured document set roughly comparable to hand-coded systems, and much worse performance on the legal documents.
26 Pages