Technical Reports

HPL-2009-160

Web Page Layout Via Visual Segmentation

Pnueli, Ayelet; Bergman, Ruth; Schein, Sagi; Barkol, Omer
HP Laboratories

Keyword(s): Layout understanding, Layout analysis, Web page segmentation, HTML, DOM

Abstract: Web page segmentation is required by any application that observes, manipulates, interacts with, or summarizes web content or web services. Although segmentation is a non-trivial task, until recently it could be performed reasonably well by analyzing the HTML structure. Today, the dynamic content of web pages no longer fits the assumptions made by those HTML-based algorithms, and the HTML structure does not contain enough information to extract the important regions. Yet, visually, the page itself remains understandable to a human user, so we believe it contains all the information needed to understand its content. We propose adding computer vision methods to the analysis of the page: when the HTML does not contain the needed object hierarchy information, the visual information can be used instead. Moreover, visual segmentation allows us to correct the HTML structure or to simplify its hierarchy, which in many cases is not semantic. We perform top-down segmentation, yielding first the large-scale layout of the page and then refining it down to the required degree of detail.

4 Pages

External Posting Date: July 21, 2009. Approved for External Publication
Internal Posting Date: July 21, 2009
