HP Labs Technical Reports
Model-Independent Measure of Regression Difficulty
Zhang, Bin; Elkan, Charles; Dayal, Umeshwar; Hsu, Meichun
HPL-2000-5
Keyword(s): data mining; machine learning; model fitting;
regression; exploratory data analysis
Abstract: We prove an inequality bound for the variance of the
error of a regression function plus its non-smoothness, as
quantified by the Uniform Lipschitz condition. The
coefficients in the inequality are calculated based on
training data with no assumptions about how the
regression function is learned. This inequality,
called the Unpredictability Inequality, allows us to
evaluate the difficulty of the regression problem for
a given dataset, before applying any regression
method. The Inequality gives information on the
tradeoff between prediction error and how sensitive
predictions must be to predictor values. The
Unpredictability Inequality can be applied to any
convex subregion of the space X of predictors. We
improve the effectiveness of the Inequality by
partitioning X into multiple convex subregions via
clustering, and then applying the Inequality on each
subregion. Experimental results on genuine data from a
manufacturing line show that, combined with
clustering, the Unpredictability Inequality provides
considerable insight and help in selecting a
regression method.
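
The abstract does not state the Unpredictability Inequality itself, so the
sketch below does not reproduce it. It illustrates the same kind of tradeoff
with an elementary pairwise argument: if a function f satisfies
|f(x) - f(x')| <= L * ||x - x'|| on a region, then for any two training points
in that region the larger of the two absolute errors is at least
(|y_i - y_j| - L * ||x_i - x_j||) / 2. The function names, the sample data, and
the region labels (standing in for the output of a clustering step) are all
hypothetical.

```python
from itertools import combinations
from math import dist


def max_error_lower_bound(xs, ys, lipschitz):
    """Lower bound on the largest absolute training error of ANY function
    whose Lipschitz constant on these points is at most `lipschitz`.

    Elementary pairwise bound, not the paper's inequality.
    """
    bound = 0.0
    for (xi, yi), (xj, yj) in combinations(zip(xs, ys), 2):
        # The part of the target gap that the allowed slope cannot explain
        # must be absorbed by the errors at the two points.
        gap = abs(yi - yj) - lipschitz * dist(xi, xj)
        bound = max(bound, gap / 2.0)
    return bound


def bound_per_region(xs, ys, labels, lipschitz):
    """Apply the pairwise bound separately inside each labelled subregion."""
    regions = {}
    for x, y, label in zip(xs, ys, labels):
        region_xs, region_ys = regions.setdefault(label, ([], []))
        region_xs.append(x)
        region_ys.append(y)
    return {
        label: max_error_lower_bound(region_xs, region_ys, lipschitz)
        for label, (region_xs, region_ys) in regions.items()
    }


# Hypothetical data: one predictor, four observations, two subregions.
xs = [(0.0,), (1.0,), (10.0,), (10.5,)]
ys = [0.0, 1.0, 5.0, 9.0]
labels = [0, 0, 1, 1]

print(max_error_lower_bound(xs, ys, 2.0))     # 1.5
print(bound_per_region(xs, ys, labels, 2.0))  # {0: 0.0, 1: 1.5}
print(max_error_lower_bound(xs, ys, 8.0))     # 0.0
```

In this toy data the bound is driven entirely by the second subregion: with a
sensitivity limit of 2 no function can get every absolute error below 1.5,
while allowing a limit of 8 removes the obstruction. Evaluating the bound per
subregion shows where the difficulty is located.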
19 Pages