Technical Reports

HPL-2009-185

Web Article Extraction for Web Printing: a DOM+Visual based Approach

Luo, Ping; Fan, Jian; Liu, Sam; Lin, Fen; Xiong, Yuhong; Liu, Jerry
HP Laboratories

Keyword(s): Article extraction, maximal scoring subsequence

Abstract: This work studies the problem of extracting articles from Web pages for better printing. Compared with existing approaches to article extraction, Web printing poses several unique requirements: 1) Identifying just the boundary surrounding the text body is not an ideal solution; it is highly desirable to also filter out uninformative links and advertisements within this boundary. 2) Paragraphs, which may not be readily separated as DOM nodes, must be identified so that the article can be laid out well. 3) Performance should be independent of content domain, written language, and Web page template. Toward these goals we propose a novel method of article extraction that uses both DOM (Document Object Model) and visual features. The main components of our method are: 1) a text segment/paragraph identification algorithm based on line-breaking features, 2) a global optimization method, Maximum Scoring Subsequence, that operates on text segments to identify the boundary of the article body, and 3) an outlier elimination step based on left or right alignment of text segments with the article body. Our experiments showed that the proposed method is effective in terms of precision and recall at the level of text segments.
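
As an illustration of components 2) and 3), here is a minimal sketch, not the authors' implementation. It assumes each text segment already carries a numeric score (positive for article-like text, negative otherwise) and pixel coordinates for its left and right edges; the abstract does not say how scores are computed. The names Segment, max_scoring_subsequence, filter_by_alignment, and the tolerance tol are hypothetical. The boundary search uses the standard linear-time maximum-sum contiguous run (Kadane's algorithm), and the outlier step keeps segments whose left or right edge matches the most common edge inside the selected run.

```python
from collections import Counter
from typing import List, NamedTuple, Sequence, Tuple


class Segment(NamedTuple):
    """Hypothetical text segment: score and edge positions are assumed inputs."""
    text: str
    left: int      # x-coordinate of the left edge, in pixels
    right: int     # x-coordinate of the right edge, in pixels
    score: float   # positive = article-like, negative = noise (assumed scoring)


def max_scoring_subsequence(scores: Sequence[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) of the contiguous run with the largest sum.

    `end` is exclusive. Standard Kadane's algorithm, O(n) time.
    """
    if not scores:
        return 0, 0, 0.0
    best_total = float("-inf")
    best_start = best_end = 0
    run_total = 0.0
    run_start = 0
    for i, s in enumerate(scores):
        if run_total <= 0:
            # A non-positive prefix can never help; start a new run here.
            run_total = s
            run_start = i
        else:
            run_total += s
        if run_total > best_total:
            best_total = run_total
            best_start, best_end = run_start, i + 1
    return best_start, best_end, best_total


def filter_by_alignment(body: Sequence[Segment], tol: int = 5) -> List[Segment]:
    """Drop segments aligned with neither the dominant left nor right edge.

    The dominant edge is taken to be the most frequent coordinate in `body`;
    this is an assumption made for the sketch.
    """
    if not body:
        return []
    left_mode = Counter(s.left for s in body).most_common(1)[0][0]
    right_mode = Counter(s.right for s in body).most_common(1)[0][0]
    return [
        s for s in body
        if abs(s.left - left_mode) <= tol or abs(s.right - right_mode) <= tol
    ]


def extract_article(segments: Sequence[Segment]) -> List[Segment]:
    """Pick the best-scoring contiguous run, then remove alignment outliers."""
    start, end, _ = max_scoring_subsequence([s.score for s in segments])
    return filter_by_alignment(segments[start:end])
```

In this sketch, a short advertisement sitting between two paragraphs can still fall inside the best-scoring run, because the paragraphs around it outweigh its small negative score; the alignment filter is what removes it afterwards.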

4 Pages

Additional Publication Information: To be published in the 9th ACM Symposium on Document Engineering, DocEng'09, Munich, Germany, September 16-18, 2009.

External Posting Date: August 21, 2009. Approved for External Publication.
Internal Posting Date: August 21, 2009