A Method for Discovering the Insignificance of One's Best Classifier and the Unlearnability of a Classification Task
Forman, George
HPL-2002-123R2
Keyword(s): supervised machine learning; overfitting; 2001 KDD Cup thrombin classification competition
Abstract: Consider the following common scenario: a data mining practitioner tries various specialized classification algorithms on a new dataset of unknown difficulty and selects the apparent best. Supposing its accuracy were 70% on a held-out test set, how can one know whether this is a significant result or not? It can be difficult to tell in the absence of standard benchmark results for the dataset. Surprisingly, it can also be difficult to tell even when the dataset has hundreds of benchmark results. This paper presents a method to address this question by comparing the chosen best classifier to the distribution of performance scores obtained by many simple classifiers that are randomly generated. This can also serve to discover when a classification problem appears nearly unlearnable. It is demonstrated on the results of the 2001 KDD Cup thrombin competition.
Notes: To be published in and presented at the Data Mining Lessons Learned Workshop at the 19th International Conference on Machine Learning (ICML), 8-12 July 2002, Sydney, Australia
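The core idea, comparing the selected classifier's score against the empirical distribution of scores from many randomly generated simple classifiers, can be sketched as follows. This is a minimal illustration only, assuming binary features, accuracy as the performance metric, and random single-feature classifiers as the "simple classifiers"; the function names are hypothetical and the paper's own construction of random classifiers may differ.

    import numpy as np

    def random_simple_classifier_scores(X_test, y_test, n_classifiers=1000, seed=None):
        # Score many randomly generated single-feature classifiers on the
        # held-out test set.  Each classifier predicts the positive class
        # when one randomly chosen binary feature equals a randomly chosen
        # value (0 or 1).
        rng = np.random.default_rng(seed)
        n_samples, n_features = X_test.shape
        scores = np.empty(n_classifiers)
        for i in range(n_classifiers):
            f = rng.integers(n_features)          # random feature
            polarity = rng.integers(2)            # random predicted-positive value
            preds = (X_test[:, f] == polarity).astype(int)
            scores[i] = np.mean(preds == y_test)  # accuracy of this trivial classifier
        return scores

    def empirical_significance(best_score, random_scores):
        # Fraction of random simple classifiers scoring at least as well as
        # the chosen "best" classifier: an empirical p-value-like quantity.
        return float(np.mean(random_scores >= best_score))

    # Hypothetical usage: suppose the selected classifier reached 70% accuracy.
    # scores = random_simple_classifier_scores(X_test, y_test)
    # p = empirical_significance(0.70, scores)
    # If p is large, many trivial classifiers score as well as the chosen one,
    # so the 70% result is not significant and the task may be nearly unlearnable.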
5 Pages