Technical Reports
HPL-2011-212
Rapid Change Detection and Text Mining
Balinsky, Alexander; Balinsky, Helen; Simske, Steven
HP Laboratories
Keyword(s): text mining; rapid change detection; helmholtz principle; text summarization
Abstract: In this presentation we review a novel approach to text data mining and automatic text summarization. The approach has several steps. First, we apply a rapid change detection algorithm for data streams and documents, introduced in [1, 2]. It is based on ideas from image processing, and especially on the Helmholtz Principle from the Gestalt Theory of human perception. Applied to the problem of keyword extraction, it delivers fast and effective tools for identifying meaningful words using parameter-free methods. We also define levels of meaningfulness for the words of a document, which allows the size of the selected keyword set to be controlled to suit different application needs. Then, based on the introduced level of meaningfulness, we model a document as a one-parameter family of graphs, with its sentences or paragraphs defining the vertex set and with edges defined by Helmholtz's principle. We demonstrate that, for some range of the parameter, the resulting graph becomes a small-world network [3]. Such a remarkable structure opens the possibility of applying many measures and tools from the theory of social networks to the problem of extracting the most important sentences and structures from text documents [4]. We also present our new software for document analysis and automatic text summarization.
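The two modeling steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state the meaningfulness test or the edge rule, so everything below is an assumption: the sketch uses a number-of-false-alarms (NFA) score of the form C(K, m) / N^(m-1), where a word occurs K times in the whole document and m times in one of N equal-sized parts, and treats a word as meaningful in that part when NFA < epsilon. It then links two parts when they share a word that is meaningful somewhere in the document. The function names `tokenize`, `meaningful_words`, and `keyword_graph` are hypothetical and are not taken from the authors' software.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase word tokens; a deliberately simple, hypothetical tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def meaningful_words(parts, epsilon=1.0):
    """Map each part index to the set of words with NFA < epsilon.

    parts: list of token lists, e.g. one per paragraph or sentence.
    Assumed score: NFA = C(K, m) / N**(m - 1), with K occurrences in the
    whole document, m occurrences in the part, and N parts.
    """
    n_parts = len(parts)
    total = Counter(word for part in parts for word in part)
    result = {}
    for index, part in enumerate(parts):
        local = Counter(part)
        keep = set()
        for word, m in local.items():
            nfa = math.comb(total[word], m) / n_parts ** (m - 1)
            if nfa < epsilon:
                keep.add(word)
        result[index] = keep
    return result


def keyword_graph(parts, epsilon=1.0):
    """Return (keywords, edges) for one value of the parameter epsilon.

    Assumed edge rule: parts i and j are linked when both contain at
    least one word that is meaningful in some part of the document.
    """
    by_part = meaningful_words(parts, epsilon)
    keywords = set().union(*by_part.values())
    edges = set()
    for i, j in combinations(range(len(parts)), 2):
        if keywords & set(parts[i]) & set(parts[j]):
            edges.add((i, j))
    return keywords, edges


# Toy document: "alpha" bursts in the first part, "the" is spread evenly.
toy_parts = [
    tokenize("alpha alpha alpha the"),
    tokenize("the beta"),
    tokenize("alpha the gamma"),
    tokenize("the delta"),
]
toy_keywords, toy_edges = keyword_graph(toy_parts, epsilon=1.0)
```

Under these assumptions, varying `epsilon` yields the one-parameter family of graphs: a larger value admits more keywords and therefore more edges, while a smaller value keeps only the strongest bursts. A word that occurs only once in a part always has NFA of at least 1, so it is never selected at `epsilon=1.0`.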
5 Pages
External Posting Date: November 6, 2011 [Abstract]. Approved for External Publication
Internal Posting Date: November 6, 2011 [Fulltext]