Technical Reports

HPL-2008-91R1

Extremely Fast Text Feature Extraction for Classification and Indexing

Forman, George; Kirshenbaum, Evan
HP Laboratories

Keyword(s): text mining, text indexing, bag-of-words, feature engineering, feature extraction, document categorization, text tokenization

Abstract: Most research in speeding up text mining involves algorithmic improvements to induction algorithms, yet for many large-scale applications, such as classifying or indexing large document repositories, the time spent extracting word features from texts can itself greatly exceed the initial training time. This paper describes a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word boundary detection, and string hash computation. We show empirically that our integer hash features result in classifiers with equivalent statistical performance to those built using string word features, but require far less computation and less memory.
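
The idea of folding lowercasing, word boundary detection, and hashing into one pass can be illustrated with a minimal Python sketch. This is not the authors' implementation: the hash function (32-bit FNV-1a), the word-character rule (`str.isalnum`), and the names `hash_features` and `hash_token` are assumptions chosen for illustration, and the input is assumed to be an already decoded string rather than raw bytes.

```python
from collections import Counter

# 32-bit FNV-1a constants. The hash choice is an assumption made for this
# sketch; the abstract does not say which hash function the report uses.
FNV_OFFSET = 0x811C9DC5
FNV_PRIME = 0x01000193
MASK32 = 0xFFFFFFFF


def _mix(h, ch):
    """Fold one character, lowercased, into the running hash."""
    for b in ch.lower().encode("utf-8"):
        h = ((h ^ b) * FNV_PRIME) & MASK32
    return h


def hash_features(text):
    """Return a Counter mapping integer token hashes to counts.

    Single pass over the text: each word character is lowercased and
    mixed into the running hash immediately, so no token string is
    ever built. A non-word character ends the current token.
    """
    counts = Counter()
    h = FNV_OFFSET
    in_word = False
    for ch in text:
        if ch.isalnum():          # assumed word-character rule
            h = _mix(h, ch)
            in_word = True
        elif in_word:             # word boundary: emit hash and reset
            counts[h] += 1
            h = FNV_OFFSET
            in_word = False
    if in_word:                   # flush a token that ends the text
        counts[h] += 1
    return counts


def hash_token(token):
    """Hash one already-extracted token, for comparison and lookup."""
    h = FNV_OFFSET
    for ch in token:
        h = _mix(h, ch)
    return h
```

For example, `hash_features("The cat saw THE other cat.")[hash_token("the")]` counts both "The" and "THE" under one integer key. The resulting integer-keyed counts take the place of a string-keyed bag of words as classifier input.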

15 Pages

Additional Publication Information: To be published and presented at the Conference on Information & Knowledge Management (CIKM), Napa, CA, Oct 27, 2008

External Posting Date: August 21, 2008. Approved for External Publication
Internal Posting Date: August 21, 2008
