Quantifying Counts, Costs, and Trends Accurately via Machine Learning
Forman, George
HPL-2007-164R1
Keyword(s): supervised machine learning, classification, prevalence estimation, class distribution estimation, cost quantification, quantification research methodology, minimizing training effort, detecting and tracking trends, concept drift, class imbalance, text mining
Abstract: In many business and science applications, it is important to track trends over historical data, for example, measuring the monthly prevalence of influenza incidents at a hospital. In situations where a machine learning classifier is needed to identify the relevant incidents from among all cases in the database, anything less than perfect classification accuracy will result in a consistent and potentially substantial bias in estimating the class prevalence. There is an assumption, ubiquitous in machine learning, that the class distribution of the training set matches that of the test set, but this is certainly not the case for applications whose goal is to measure changes or trends in the distribution over time.

The paper defines two research challenges for machine learning that address this distribution-mismatch problem. The 'quantification' task is to accurately estimate the number of positive cases (or the class distribution) in an unlabeled test set via machine learning, using a limited training set that may have a substantially different class distribution. The 'cost quantification' task is to estimate the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the hours of labor needed to resolve the case. Obtaining a precise quantification estimate over a set of cases has a very different utility model from traditional classification research, whose goal is to obtain an accurate classification for each individual case.

For both forms of quantification, the paper describes a suitable experiment methodology and evaluates a variety of methods. It reveals which methods give more reliable estimates, even when training data is scarce and the test class distribution differs widely from training. Some methods function well even under high class imbalance, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.

Publication Info: To be published in the international journal Data Mining and Knowledge Discovery, in a special issue on Utility-Based Data Mining.
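For illustration only, a minimal Python sketch of the kind of adjusted-count correction studied in this line of work: it removes the bias described above by rescaling the raw positive-prediction rate with the classifier's estimated true- and false-positive rates. The scikit-learn usage, the adjusted_count helper name, and the choice of classifier are assumptions made for the sketch, not details taken from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def adjusted_count(clf, X_train, y_train, X_test, cv=10):
    """Estimate the prevalence of the positive class in X_test.

    Naive 'classify and count' is biased whenever the classifier is
    imperfect; correcting with the classifier's true- and false-positive
    rates (estimated via cross-validation on the training set) removes
    that bias in expectation. Labels are assumed to be 0/1 numpy arrays.
    """
    # Estimate tpr and fpr via cross-validated predictions on the training set.
    y_cv = cross_val_predict(clf, X_train, y_train, cv=cv)
    tpr = np.mean(y_cv[y_train == 1] == 1)
    fpr = np.mean(y_cv[y_train == 0] == 1)

    # Raw fraction of test cases the trained classifier labels positive.
    clf.fit(X_train, y_train)
    observed = np.mean(clf.predict(X_test) == 1)

    # Adjusted count: p = (observed - fpr) / (tpr - fpr), clipped to [0, 1].
    if tpr == fpr:
        return float(observed)  # uninformative classifier; no correction possible
    return float(np.clip((observed - fpr) / (tpr - fpr), 0.0, 1.0))

# Example usage (hypothetical data):
#   p_hat = adjusted_count(LogisticRegression(max_iter=1000), X_tr, y_tr, X_te)

Note that this is only one candidate estimator; the paper's contribution is a methodology for comparing such methods under scarce training data and shifting class distributions, not this particular sketch.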
25 Pages