>>11724269
https://www3.nd.edu/~nchawla/papers/ECML03.pdf
SMOTEBoost: Improving Prediction
of the Minority Class in Boosting
1 Motivation and Introduction
Rare events are events that occur very infrequently, i.e., with a frequency ranging from, say, 5% down to less than 0.1%, depending on the application. Classification of rare events
is a common problem in many domains, such as detecting fraudulent transactions,
network intrusion detection, Web mining, direct marketing, and medical diagnostics.
For example, in the network intrusion detection domain, the number of intrusions on
the network is typically a very small fraction of the total network traffic. In medical
databases, when classifying the pixels in mammogram images as cancerous or not [1],
abnormal (cancerous) pixels represent only a very small fraction of the entire image.
The nature of the application requires a fairly high detection rate of the minority class
and allows for a small error rate in the majority class since the cost of misclassifying
a cancerous patient as non-cancerous can be very high.
In all these scenarios, where the majority class typically represents 98-99% of the
entire population, a trivial classifier that labels everything with the majority class can
achieve high accuracy. It is apparent that for domains with imbalanced and/or skewed
distributions, classification accuracy is not sufficient as a standard performance measure. ROC analysis [2] and metrics such as precision, recall and F-value [3, 4] have
been used to understand the performance of the learning algorithm on the minority
class. The prevalence of class imbalance in various scenarios has caused a surge in
research dealing with the minority classes. Several approaches for dealing with
imbalanced data sets were recently introduced [1, 2, 4, 9-15].
A confusion matrix as shown in Table 1 is typically used to evaluate performance
of a machine learning algorithm for rare class problems. In classification problems,
assuming class “C” as the minority class of interest, and “NC” as a conjunction of
all the other classes, there are four possible outcomes when detecting class “C”.
Table 1. Confusion matrix defines four possible scenarios when classifying class “C”
                     Predicted Class “C”      Predicted Class “NC”
Actual class “C”     True Positives (TP)      False Negatives (FN)
Actual class “NC”    False Positives (FP)     True Negatives (TN)
From Table 1, recall, precision and F-value may be defined as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F-value = ((1 + β²) · Recall · Precision) / (β² · Recall + Precision),
where β corresponds to the relative importance of precision vs. recall and is usually set
to 1. The main focus of all learning algorithms is to improve recall without sacrificing precision. However, the recall and precision goals are often conflicting, and
attacking them simultaneously may not work well, especially when one class is rare.
The F-value incorporates both precision and recall, and the “goodness” of a learning
algorithm for the minority class can be measured by the F-value. While ROC curves
represent the trade-off between the true positive and false positive rates, the F-value incorporates the relative effects/costs of recall and precision into a single number.
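As a minimal sketch (not from the paper), the snippet below shows how the Table 1 counts and the precision, recall, and F-value definitions above can be computed for a binary rare-class problem. The labels “C”/“NC”, the helper names, and the toy 5% minority data are illustrative assumptions only.

# Minimal sketch: confusion-matrix counts and F-value for a rare class "C".
# All names and the toy data below are illustrative assumptions, not the paper's code.

def confusion_counts(y_true, y_pred, positive="C"):
    # Count TP, FP, FN, TN treating `positive` as the minority class of interest.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    return tp, fp, fn, tn

def f_value(tp, fp, fn, beta=1.0):
    # F-value = (1 + beta^2) * Recall * Precision / (beta^2 * Recall + Precision)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * recall * precision / (beta**2 * recall + precision)

# Toy 5% minority example: 5 rare "C" instances among 100 samples.
y_true = ["C"] * 5 + ["NC"] * 95
y_pred = ["C", "C", "C", "NC", "NC"] + ["NC"] * 93 + ["C"] * 2  # 3 hits, 2 misses, 2 false alarms
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)                  # 3 2 2 93
print(round(f_value(tp, fp, fn), 3))   # 0.6 (precision = recall = 0.6 with beta = 1)

Note that the trivial classifier discussed above, which labels everything “NC”, would score 95% accuracy on this toy data yet an F-value of 0 for the rare class, which is exactly why accuracy alone is an insufficient measure here.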
It is well known in machine learning that a combination of classifiers can be an effective technique for improving prediction accuracy. As one of the most popular combining techniques, boosting [5] uses adaptive sampling of instances to generate a
highly accurate ensemble of classifiers whose individual global accuracy is only moderate. There has been significant interest in the recent literature in embedding cost-sensitivity into the boosting algorithm. The CSB [6] and AdaCost [7] boosting algorithms
update the weights of examples according to the misclassification costs. Karakoulas
and Shawe-Taylor’s ThetaBoost adjusts the margins in the presence of unequal loss
functions [8]. Alternatively, Rare-Boost [4, 9] updates the weights of the examples
differently for all four entries shown in Table 1.