Technical Reports
HPL-2009-160
Web Page Layout Via Visual Segmentation
Pnueli, Ayelet; Bergman, Ruth; Schein, Sagi; Barkol, Omer
HP Laboratories
Keyword(s): Layout understanding, Layout analysis, Web page segmentation, HTML, DOM
Abstract: Web page segmentation is required by any application that observes, manipulates, interacts with, or summarizes web content or web services. Although segmentation is a non-trivial task, until recently it could be performed reasonably well by analyzing the HTML structure. Today, the dynamic content of web pages violates the assumptions made by those algorithms, and the HTML structure no longer contains enough information to extract the important regions. Visually, however, the page remains understandable to a human user, so we believe the rendered page contains all the information needed to understand its content. We propose adding computer vision methods to the analysis of the page. When the HTML does not contain the needed object hierarchy information, the visual information can be used instead. Moreover, visual segmentation allows us to correct the HTML structure or to simplify its hierarchy, which in many cases is not semantic. We perform top-down segmentation, yielding first the large-scale layout of the page and then refining it down to the required degree of detail.
4 Pages
External Posting Date: July 21, 2009. Approved for External Publication
Internal Posting Date: July 21, 2009
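Illustrative sketch: the abstract describes top-down visual segmentation that yields the large-scale layout first and then finer detail. The record above does not give the report's algorithm, so the code below is not the authors' method. It is a minimal sketch of a classic recursive XY-cut, assuming the rendered page has already been reduced to a binary grid (1 = content pixel, 0 = background). The names `segment`, `_bands`, `min_gap`, and `max_depth` are hypothetical and chosen for this example only.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def _bands(occupied: List[bool], min_gap: int) -> List[Tuple[int, int]]:
    """Split a 1-D occupancy profile into runs separated by >= min_gap empty cells."""
    bands = []
    start = None
    end = 0
    gap = 0
    for i, filled in enumerate(occupied):
        if filled:
            if start is None:
                start = i
            gap = 0
            end = i + 1
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                bands.append((start, end))
                start = None
    if start is not None:
        bands.append((start, end))
    return bands


def segment(page, box=None, min_gap=2, max_depth=3, depth=0) -> List[Box]:
    """Recursive XY-cut: split on wide empty rows first, then on wide empty columns.

    max_depth is this sketch's stand-in for the "required degree of detail".
    """
    if box is None:
        box = (0, 0, len(page), len(page[0]))
    top, left, bottom, right = box

    # Horizontal profile: which rows of the region contain any content?
    rows = [any(page[r][left:right]) for r in range(top, bottom)]
    row_bands = _bands(rows, min_gap)
    if not row_bands:
        return []  # region is pure background
    row_off = top
    top, bottom = row_off + row_bands[0][0], row_off + row_bands[-1][1]

    # Vertical profile over the trimmed rows.
    cols = [any(page[r][c] for r in range(top, bottom)) for c in range(left, right)]
    col_bands = _bands(cols, min_gap)
    col_off = left
    left, right = col_off + col_bands[0][0], col_off + col_bands[-1][1]

    if depth < max_depth:
        if len(row_bands) > 1:
            children = [(row_off + s, left, row_off + e, right) for s, e in row_bands]
        elif len(col_bands) > 1:
            children = [(top, col_off + s, bottom, col_off + e) for s, e in col_bands]
        else:
            children = []
        if children:
            leaves = []
            for child in children:
                leaves.extend(segment(page, child, min_gap, max_depth, depth + 1))
            return leaves
    return [(top, left, bottom, right)]


# Toy page: a full-width header above two side-by-side columns.
page = [[0] * 10 for _ in range(8)]
page[0] = [1] * 10
for r in range(3, 8):
    for c in (0, 1, 2, 5, 6, 7, 8, 9):
        page[r][c] = 1

print(segment(page, max_depth=1))  # [(0, 0, 1, 10), (3, 0, 8, 10)]
print(segment(page, max_depth=3))  # [(0, 0, 1, 10), (3, 0, 8, 3), (3, 5, 8, 10)]
```

With a small `max_depth` the sketch reports only the coarse layout (header and body); raising it exposes the two columns inside the body. A real system would presumably work on rendered screenshots and richer visual cues than whitespace gaps.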