Technical reports

HP Labs



Scaling Up Text Classification for Large File Systems

Forman, George; Rajaram, Shyamsundar
HP Laboratories

HPL-2008-29R1

Keyword(s): machine learning, text classification, document categorization, information retrieval, enterprise scalability, forensic search.

Abstract: We combine the speed and scalability of information retrieval with the generally superior classification accuracy offered by machine learning, yielding a two-phase text classifier that can scale to very large document corpora. We investigate the effect of different methods of formulating the query from the training set, as well as varying the query size. In empirical tests on the Reuters RCV1 corpus of 806,000 documents, we find runtime was easily reduced by a factor of 27, with a somewhat surprising gain in F-measure compared with traditional text classification.
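
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the query formulation (document-frequency difference between positive and negative training documents), the Naive Bayes stand-in classifier, and all function names are assumptions chosen for brevity. The report itself compares several query-formulation methods and query sizes.

```python
from collections import Counter, defaultdict
import math


def tokenize(text):
    return text.lower().split()


def build_index(corpus):
    """Inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(corpus):
        for term in set(tokenize(text)):
            index[term].add(doc_id)
    return index


def form_query(positives, negatives, size):
    """Assumed heuristic: keep the `size` terms whose document
    frequency most favors the positive training documents."""
    pos_df = Counter(t for d in positives for t in set(tokenize(d)))
    neg_df = Counter(t for d in negatives for t in set(tokenize(d)))
    score = {t: pos_df[t] / len(positives) - neg_df[t] / len(negatives)
             for t in pos_df}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(score, key=lambda t: (-score[t], t))[:size]


def train_nb(positives, negatives):
    """Stand-in classifier: multinomial Naive Bayes, add-one smoothing."""
    counts = {1: Counter(), 0: Counter()}
    for label, docs in ((1, positives), (0, negatives)):
        for d in docs:
            counts[label].update(tokenize(d))
    vocab = set(counts[0]) | set(counts[1])
    prior = {1: math.log(len(positives)), 0: math.log(len(negatives))}

    def predict(text):
        log_prob = {}
        for label in (0, 1):
            total = sum(counts[label].values()) + len(vocab)
            log_prob[label] = prior[label] + sum(
                math.log((counts[label][t] + 1) / total)
                for t in tokenize(text) if t in vocab)
        return log_prob[1] > log_prob[0]

    return predict


def two_phase_classify(corpus, positives, negatives, query_size=2):
    index = build_index(corpus)
    query = form_query(positives, negatives, query_size)
    # Phase 1: cheap index lookup of documents matching any query term.
    candidates = set().union(*(index.get(t, set()) for t in query))
    # Phase 2: run the costlier classifier on the candidates only.
    predict = train_nb(positives, negatives)
    return sorted(d for d in candidates if predict(corpus[d]))
```

In this sketch the classifier never sees documents that the query fails to retrieve, which is where the runtime saving would come from; the trade-off is that recall is bounded by the query, so the query size matters.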

8 Pages

Additional Publication Information: Submitted to 14th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD'08), August 2008.

External Posting Date: June 21, 2008. Approved for External Publication.
Internal Posting Date: June 21, 2008


© 2009 Hewlett-Packard Development Company, L.P.
