A Pitfall and Solution in Multi-Class Feature Selection for Text Classification
Forman, George
HPL-2004-86
Keyword(s): benchmark comparison; text classification; information retrieval; F-measure; precision in the top 10; small training sets; skewed/unbalanced class distribution
Abstract: Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we find that a large class of feature scoring methods suffers a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
Notes: Published in and presented at the 21st International Conference on Machine Learning, 4-8 July 2004, Banff, Alberta, Canada
8 Pages
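Illustration: the abstract describes the pitfall and the round-robin remedy only in words, so the following is a minimal sketch of the general idea, not the paper's actual algorithm. It assumes each class already has its own feature scores (for example, from a one-vs-rest scoring such as Information Gain); the function name `round_robin_select`, the input layout, and the toy scores are hypothetical and chosen only for illustration. Classes take turns contributing their best not-yet-selected feature, so a class with only weak features still gets represented.

```python
from typing import Dict, List


def round_robin_select(per_class_scores: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Pick up to k features by letting each class take turns (hypothetical helper)."""
    # Rank each class's features from best to worst; ties broken by feature name.
    rankings = {
        cls: sorted(scores, key=lambda f: (-scores[f], f))
        for cls, scores in per_class_scores.items()
    }
    positions = {cls: 0 for cls in rankings}
    selected: List[str] = []
    chosen = set()
    progress = True
    while len(selected) < k and progress:
        progress = False
        for cls, ranked in rankings.items():
            if len(selected) >= k:
                break
            pos = positions[cls]
            # Skip features that another class has already contributed.
            while pos < len(ranked) and ranked[pos] in chosen:
                pos += 1
            if pos < len(ranked):
                chosen.add(ranked[pos])
                selected.append(ranked[pos])
                pos += 1
                progress = True
            positions[cls] = pos
    return selected


# Toy, made-up scores: "easy" has a surplus of strong features, "hard" has only weak ones.
demo_scores = {
    "easy": {"goal": 0.9, "match": 0.8, "team": 0.7, "vote": 0.05, "tax": 0.04},
    "hard": {"goal": 0.01, "match": 0.02, "team": 0.01, "vote": 0.3, "tax": 0.2},
}
print(round_robin_select(demo_scores, 3))
```

With these toy scores, ranking all features by their single best score would fill the three slots with "goal", "match", and "team", leaving the "hard" class with no useful feature. Taking turns gives the "hard" class a slot for "vote" instead.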