Technical Reports

HPL-2008-203

Effective Metadata Extraction from Irregularly Structured Web Content

Zhou, Baoyao; Liu, Wei; Yang, Yu; Wang, Weichun; Zhang, Ming
HP Laboratories

Keyword(s): Information Extraction, Metadata, Online Course Organization, Logical Structure Model.

Abstract: Metadata extraction is a crucial module for domain-specific Web content discovery and management, because the accuracy and completeness of the extracted metadata directly affect the quality of subsequent domain information services. Our Online Course Organization project aims to build an online course portal that serves course information obtained from the Web. Because most course pages are irregularly structured, most existing approaches are not effective for extracting course metadata. In this paper, we propose a novel hierarchical clustering approach that generates a semantic structure model of a web page, called the Logical Structure Model, from its DOM tree, so that hidden patterns and knowledge can be revealed and used to help identify course metadata. Experimental results show that our solution achieves effective metadata extraction.
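
The abstract gives no algorithmic detail, so the following is only an illustrative sketch of the general idea of deriving a grouped, hierarchical view from a DOM tree. It is not the paper's Logical Structure Model or its clustering procedure. It assumes a much simpler stand-in technique: adjacent sibling nodes with identical tag structure are merged into one group, applied recursively. The names `Node`, `DomBuilder`, `parse`, `group_siblings`, and `build_model`, as well as the sample course snippet, are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not become the open parent.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """A minimal DOM node: tag name, parent, children, and direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Builds a tree of Node objects from HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def parse(html):
    builder = DomBuilder()
    builder.feed(html)
    return builder.root


def signature(node):
    """Structural fingerprint: the tag plus the fingerprints of all children."""
    return (node.tag, tuple(signature(c) for c in node.children))


def text_of(node):
    """All text under a node, in document order."""
    parts = [node.text] + [text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)


def group_siblings(node):
    """Merge adjacent children that share a signature (assumed similarity rule)."""
    groups = []
    for child in node.children:
        sig = signature(child)
        if groups and groups[-1][0] == sig:
            groups[-1][1].append(child)
        else:
            groups.append((sig, [child]))
    return groups


def build_model(node):
    """Recursively turn the DOM into nested groups of repeated structures."""
    return [
        {
            "tag": sig[0],
            "count": len(members),
            "texts": [text_of(m) for m in members],
            # Members of a group share one structure, so the first one is
            # used as the representative for the nested level.
            "children": build_model(members[0]),
        }
        for sig, members in group_siblings(node)
    ]


# Made-up sample page fragment, for illustration only.
SAMPLE = (
    "<div><h2>CS101</h2>"
    "<ul><li>Instructor: A. Smith</li><li>Credits: 3</li></ul></div>"
)
model = build_model(parse(SAMPLE))
```

In this sketch the two `li` items collapse into a single group with `count` 2, which is the kind of repeated pattern a field extractor could then inspect for label and value pairs. Real course pages would need a tolerant similarity measure rather than exact signature equality.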

9 Pages

External Posting Date: November 21, 2008. Approved for External Publication.
Internal Posting Date: November 21, 2008
