Technical Reports

HPL-2007-32R1

BNS Scaling: An Improved Representation over TF·IDF for SVM Text Classification

Forman, George
HP Laboratories

Keyword(s): text classification; topic identification; machine learning; feature selection; Support Vector Machine; TF*IDF text representation

Abstract: In the realm of machine learning for text classification, TF·IDF is the most widely used representation for real-valued feature vectors. Unfortunately, it is oblivious to the training class labels, and naturally scales some features inappropriately. We replace IDF with Bi-Normal Separation (BNS), which was previously found to be excellent at ranking words for feature selection filtering. Empirical evaluation on a benchmark of 237 binary text classification tasks shows substantially better accuracy and F-measure for a Support Vector Machine (SVM) by using the BNS scaling representation. A wide variety of other feature scaling methods were found inferior, including binary features. Furthermore, BNS scaling yielded better performance without feature selection, obviating the complexities of feature selection.
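
As a rough illustration of the idea in the abstract, the sketch below computes a BNS weight per word and uses it in place of IDF. It assumes the usual definition of Bi-Normal Separation from the earlier feature selection work, |F^-1(tpr) - F^-1(fpr)|, where F^-1 is the inverse standard Normal CDF, tpr is the fraction of positive training documents containing the word, and fpr is the fraction of negative ones. The function names, the clipping bound eps, and the use of raw term counts as the TF component are assumptions made for this example, not details taken from the report.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Normal distribution: mean 0, stdev 1


def bns_weight(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation score for one word (hypothetical helper).

    tp, fp:   positive / negative training documents that contain the word
    pos, neg: total positive / negative training documents
    eps:      assumed clipping bound that keeps the inverse CDF finite
    """
    tpr = min(max(tp / pos, eps), 1.0 - eps)
    fpr = min(max(fp / neg, eps), 1.0 - eps)
    return abs(_STD_NORMAL.inv_cdf(tpr) - _STD_NORMAL.inv_cdf(fpr))


def bns_scale(term_counts, weights):
    """Scale a sparse document vector {word: count} by per-word BNS weights."""
    return {w: c * weights.get(w, 0.0) for w, c in term_counts.items()}


# Toy training statistics: 50 positive and 950 negative documents.
weights = {
    "ball": bns_weight(tp=40, fp=5, pos=50, neg=950),    # concentrated in positives
    "the": bns_weight(tp=50, fp=950, pos=50, neg=950),   # equally common in both
}
scaled = bns_scale({"ball": 2, "the": 7}, weights)
```

Unlike IDF, the weight depends on the class labels: a word that occurs at the same rate in both classes receives weight zero however rare or common it is, while a word concentrated in one class is scaled up.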

8 Pages

Additional Publication Information: To be presented and published at the ACM 17th Conference on Information and Knowledge Management, Napa Valley, CA, October 26-30, 2008

External Posting Date: August 6, 2008. Approved for External Publication
Internal Posting Date: August 6, 2008
