Technical Reports
HPL-2008-203
Effective Metadata Extraction from Irregularly Structured Web Content
Zhou, Baoyao; Liu, Wei; Yang, Yu; Wang, Weichun; Zhang, Ming
HP Laboratories
Keyword(s): Information Extraction, Metadata, Online Course Organization, Logical Structure Model.
Abstract: Metadata extraction is a crucial module for domain-specific Web content discovery and management, because the accuracy and completeness of the extracted metadata directly affect the quality of subsequent domain information services. Our Online Course Organization project aims to build an online course portal that serves course information obtained from the Web. Because most course pages are irregularly structured, most existing approaches are not effective for extracting course metadata. In this paper, we propose a novel hierarchical clustering approach that generates a semantic structure model of a web page, called the Logical Structure Model, from its DOM tree, so that hidden patterns and knowledge can be revealed and used to help identify course metadata. Experimental results show that our solution achieves effective metadata extraction.
9 Pages
External Posting Date: November 21, 2008. Approved for External Publication
Internal Posting Date: November 21, 2008