Learning from Little: Comparison of Classifiers Given Little Training
Forman, George; Cohen, Ira
HPL-2004-19R1
Keyword(s): benchmark comparison; text classification; information retrieval; F-measure; precision in the top 10; small training sets; skewed/unbalanced class distribution
Abstract: Many real-world machine learning tasks are faced with the problem of small training sets. Additionally, the class distribution of the training set often does not match the target distribution. In this paper we compare the performance of many learning models on a substantial benchmark of binary text classification tasks having small training sets. We vary the training size and class distribution to examine the learning surface, as opposed to the traditional learning curve. The models tested include various feature selection methods, each coupled with four learning algorithms: Support Vector Machines (SVM), Logistic Regression, Naive Bayes, and Multinomial Naive Bayes. Different models excel in different regions of the learning surface, yielding meta-knowledge about which model to apply in which situation. This meta-knowledge helps guide researchers and practitioners in choosing a model and feature selection method, for example in information retrieval settings.
Notes: Copyright Springer-Verlag. To be published in and presented at the 15th European Conference on Machine Learning and the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, 20-24 September 2004, Pisa, Italy
14 Pages
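
The sketch below is not from the report; it is a minimal illustration, assuming scikit-learn, of the kind of experiment the abstract describes: sweeping both training-set size and class balance (the "learning surface") for the four named learners, each coupled with a feature selection step. The dataset, feature count k, and grid values are placeholders.

    # Hypothetical sketch (not the authors' code): vary training size and
    # positive-class fraction for a binary text task, and compare SVM,
    # Logistic Regression, Bernoulli NB, and Multinomial NB with chi-squared
    # feature selection. Dataset and grid values are illustrative choices.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB
    from sklearn.metrics import f1_score

    # Stand-in binary task (the report uses its own benchmark of text tasks).
    cats = ["sci.space", "rec.autos"]
    train = fetch_20newsgroups(subset="train", categories=cats)
    test = fetch_20newsgroups(subset="test", categories=cats)

    classifiers = {
        "SVM": LinearSVC(),
        "LogReg": LogisticRegression(max_iter=1000),
        "NaiveBayes": BernoulliNB(),
        "MultinomialNB": MultinomialNB(),
    }

    rng = np.random.default_rng(0)
    y_all = np.asarray(train.target)
    pos, neg = np.flatnonzero(y_all == 1), np.flatnonzero(y_all == 0)

    # Learning surface: sweep total training size and positive-class fraction.
    for n_train in (20, 50, 100):
        for pos_frac in (0.1, 0.3, 0.5):
            n_pos = max(1, int(round(n_train * pos_frac)))
            n_neg = n_train - n_pos
            idx = np.concatenate([rng.choice(pos, n_pos, replace=False),
                                  rng.choice(neg, n_neg, replace=False)])
            X_txt = [train.data[i] for i in idx]
            y = y_all[idx]
            for name, clf in classifiers.items():
                # Feature selection (chi-squared top-k) coupled with each learner.
                model = make_pipeline(TfidfVectorizer(),
                                      SelectKBest(chi2, k=100), clf)
                model.fit(X_txt, y)
                f1 = f1_score(test.target, model.predict(test.data))
                print(f"n={n_train:3d}  pos_frac={pos_frac:.1f}  "
                      f"{name:14s}  F1={f1:.3f}")

Plotting F1 over the (training size, class fraction) grid for each classifier would give one learning surface per model, which is the comparison the paper reports; precision in the top 10 could be substituted as the metric in an information retrieval setting.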