Abstract - Undersampling is a popular method for dealing with class-imbalance problems, which uses only a subset of the majority class and thus is very efficient. The main deficiency is that many majority class examples are ignored. We propose two algorithms to overcome this deficiency. EasyEnsemble samples several subsets from the majority class, trains a learner using each of them, and combines the outputs of those learners. BalanceCascade trains the learners sequentially, where in each step, the majority class examples that are correctly classified by the current trained learners are removed from further consideration. Experimental results show that both methods have higher Area Under the ROC Curve, F-measure, and G-mean values than many existing class-imbalance learning methods. Moreover, they have approximately the same training time as that of undersampling when the same number of weak classifiers is used, which is significantly faster than other methods.

Index Terms - Class-imbalance learning, data mining, ensemble learning, machine learning, undersampling

I. INTRODUCTION

In many real-world problems, the data sets are typically imbalanced, i.e., some classes have many more instances than others. The level of imbalance (ratio of size of the majority class to minority class) can be extremely large. It is noteworthy that class imbalance is emerging as an important issue in designing classifiers.

Imbalance has a serious impact on the performance of classifiers. Learning algorithms that do not consider class imbalance tend to be overwhelmed by the majority class and ignore the minority class. For example, in a problem with an imbalance level of r, a learning algorithm that minimizes error rate could decide to classify all examples as the majority class in order to achieve a low error rate of 1/(r + 1). However, all minority class examples will be misclassified in this case. In problems where the imbalance level is huge, class imbalance must be carefully handled to build a good classifier.
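The arithmetic behind this example is easy to check. The snippet below is a minimal sketch, assuming the imbalance level is the majority-to-minority size ratio defined above. It computes the error rate of a classifier that always predicts the majority class. The function name and the data set sizes are illustrative and do not come from the paper.

```python
def majority_baseline_error(n_majority: int, n_minority: int) -> float:
    """Error rate of a classifier that labels every example as majority."""
    # Every minority example is misclassified; every majority example is correct.
    return n_minority / (n_majority + n_minority)


# Hypothetical data set with imbalance level 990 / 10 = 99.
error = majority_baseline_error(n_majority=990, n_minority=10)
minority_recall = 0.0  # no minority example is ever predicted as minority
print(f"error rate = {error:.2%}, minority recall = {minority_recall:.0%}")
```

The error rate looks excellent while the minority class is never detected, which is why error rate alone is a poor guide in this setting.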

Class imbalance is also closely related to cost-sensitive learning, another important issue in machine learning. Misclassifying a minority class instance is usually more serious than misclassifying a majority class one. For example, approving a fraudulent credit card application is more costly than declining a credible one. Breiman et al. pointed out that training set size, class priors, cost of errors in different classes, and placement of decision boundaries are all closely connected. In fact, many existing methods for dealing with class imbalance rely on connections among these four components. Sampling methods handle class imbalance by varying the minority and majority class sizes in the training set. Cost-sensitive learning deals with class imbalance by incurring different costs for the two classes and is considered an important class of methods for handling class imbalance. More details about class-imbalance learning methods are presented in Section II.

In this paper, we examine only binary classification problems by ensembling classifiers built from multiple undersampled training sets. Undersampling is an efficient method for class-imbalance learning. This method uses a subset of the majority class to train the classifier. Since many majority class examples are ignored, the training set becomes more balanced and the training process becomes faster. However, the main drawback of undersampling is that potentially useful information contained in these ignored examples is neglected. The intuition of our proposed methods is then to wisely explore these ignored data while keeping the fast training speed of undersampling.
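Plain random undersampling, as described in this paragraph, can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes the sampled majority subset is made the same size as the minority class, and the function name, label encoding (1 for minority, 0 for majority), and toy data are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Example = Sequence[float]


def random_undersample(
    minority: List[Example], majority: List[Example], seed: int = 0
) -> Tuple[List[Example], List[int]]:
    """Build a balanced training set from all minority examples plus a
    random majority subset of equal size (sampled without replacement)."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    examples = list(minority) + kept
    labels = [1] * len(minority) + [0] * len(kept)
    return examples, labels


# Toy data: 1000 majority examples, 10 minority examples.
majority = [[float(i)] for i in range(1000)]
minority = [[-float(i)] for i in range(1, 11)]
X, y = random_undersample(minority, majority)
ignored = len(majority) - (len(X) - len(minority))
print(f"training set size = {len(X)}, ignored majority examples = {ignored}")
```

In this toy setting almost all majority examples are ignored, which is the information loss the proposed methods try to recover.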

We propose two ways to use these data. One straightforward way is to sample several subsets independently from the majority class, use these subsets to train classifiers separately, and combine the trained classifiers. Another method is to use trained classifiers to guide the sampling process for subsequent classifiers. After we have trained classifiers, examples correctly classified by them will be removed from the majority class. Experiments on 16 UCI data sets show that both methods have higher Area Under the receiver operating characteristics (ROC) Curve (AUC), F-measure, and G-mean values than many existing class-imbalance learning methods.
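For readers unfamiliar with the three evaluation measures named here, the sketch below computes them from their standard definitions. It makes two assumptions not stated in the text above: the minority class is treated as the positive class, and "F-measure" means the balanced F1 score. The function names are illustrative.

```python
import math
from typing import Sequence


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (F1), written in count form."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(tp: int, fn: int, fp: int, tn: int) -> float:
    """Geometric mean of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(tpr * tnr)


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative
    (ties count one half), which equals the area under the ROC curve."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# A classifier that predicts "majority" for everything scores zero on both
# F-measure and G-mean, despite its low error rate.
print(f_measure(tp=0, fp=0, fn=10), g_mean(tp=0, fn=10, fp=0, tn=990))
```

Unlike error rate, all three measures penalize a classifier that ignores the minority class.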

III. EasyEnsemble AND BalanceCascade

As was shown by Drummond and Holte, undersampling is an efficient strategy to deal with class imbalance. However, the drawback of undersampling is that it throws away much potentially useful data. In this section, we propose two strategies to explore the majority class examples ignored by undersampling: EasyEnsemble and BalanceCascade.

A. EasyEnsemble

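Given the minority training set and the majority training set, undersampling randomly samples a subset of the majority training set and ignores the rest.

EasyEnsemble is probably the most straightforward way to further exploit the majority class examples ignored by undersampling, i.e., the examples left out of the sampled subset. The pseudocode for EasyEnsemble is shown in Algorithm 1.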

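Algorithm 1 The EasyEnsemble algorithm

1: {Input: A set of minority class examples, a set of majority class examples, and the number of subsets to sample from the majority class}

2: Initialize the subset counter to zero

3: repeat

4: Increment the subset counter

5: Randomly sample a subset from the majority class, of the same size as the minority class

6: Learn a boosted ensemble from the minority class examples and the sampled subset

7: until the required number of subsets has been sampled

8: Output: An ensemble that combines the outputs of all the learned ensembles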
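The loop in Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The base learner is pluggable, and the default `centroid_trainer` is a hypothetical nearest-centroid stand-in used only to keep the example dependency-free, where the paper trains boosted ensembles. Each learner returns a real-valued score (positive means "minority"), and the final ensemble simply sums those scores. All names are illustrative.

```python
import random
from typing import Callable, List, Sequence

Example = Sequence[float]
Scorer = Callable[[Example], float]  # score > 0 means "predict minority"
Trainer = Callable[[List[Example], List[Example]], Scorer]


def centroid_trainer(minority: List[Example], majority: List[Example]) -> Scorer:
    """Stand-in base learner (an assumption of this sketch, not boosting)."""
    def mean(rows: List[Example]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_min, c_maj = mean(minority), mean(majority)

    def sq_dist(x: Example, c: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))

    # Positive when x lies closer to the minority centroid.
    return lambda x: sq_dist(x, c_maj) - sq_dist(x, c_min)


def easy_ensemble(minority: List[Example], majority: List[Example],
                  n_subsets: int, train: Trainer = centroid_trainer,
                  seed: int = 0) -> Scorer:
    rng = random.Random(seed)
    scorers = []
    for _ in range(n_subsets):
        # Balanced subproblem: every minority example plus an independently
        # drawn majority subset of the same size.
        subset = rng.sample(majority, len(minority))
        scorers.append(train(minority, subset))
    # Combine the learners by summing their scores.
    return lambda x: sum(s(x) for s in scorers)


# Toy data: two Gaussian blobs, 500 majority and 20 minority examples.
data_rng = random.Random(1)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(500)]
minority = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(20)]
model = easy_ensemble(minority, majority, n_subsets=4)
print(model([3.0, 3.0]) > 0, model([0.0, 0.0]) > 0)
```

Because every subproblem has the size of an undersampled training set, the total cost grows only linearly with the number of subsets, and the subsets could be trained in parallel since they are drawn independently.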

The idea behind EasyEnsemble is simple. Similar to Balanced Random Forests, EasyEnsemble generates several balanced subproblems, one for each subset sampled from the majority class.

The output of EasyEnsemble is a single ensemble, but it looks like an “ensemble of ensembles”. It is known that boosting mainly reduces bias, while bagging mainly reduces variance. Several works combine different ensemble strategies to achieve stronger generalization. MultiBoosting combines boosting with bagging/wagging by using boosted ensembles as base learners. Stochastic Gradient Boosting and Cocktail Ensemble also combine different ensemble strategies. It is evident that EasyEnsemble has benefited from the combination of boosting and a bagging-like strategy with balanced class distribution.

Both EasyEnsemble and Balanced Random Forests try to use balanced bootstrap samples; however, the former uses the samples to generate boosted ensembles, while the latter uses the samples to train randomized decision trees. Costing also uses multiple samples of the original training set. Costing was initially proposed as a cost-sensitive learning method, while EasyEnsemble is proposed to deal with class imbalance directly. Moreover, the working style of EasyEnsemble is quite different from that of costing. For example, the costing method samples the examples with probability in proportion to their costs (rejection sampling). Since this is a probability-based sampling method, no positive example is guaranteed to appear in all the samples (in fact, the probability of a positive example appearing in all the samples is small). In EasyEnsemble, by contrast, every positive example appears in every sample. When the size of the minority class is very small, it is important to utilize every minority class example.

B. BalanceCascade

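EasyEnsemble is an unsupervised strategy to explore the majority class, since its subsets are sampled independently of the classifiers already trained. BalanceCascade instead explores the majority class in a supervised manner: the classifiers trained so far determine which majority class examples remain available for sampling. The pseudocode for BalanceCascade is shown in Algorithm 2.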

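Algorithm 2 The BalanceCascade algorithm

1: {Input: A set of minority class examples, a set of majority class examples, and the number of subsets to sample from the majority class}

2: Initialize the round counter to zero

3: repeat

4: Increment the round counter

5: Randomly sample a subset from the remaining majority class examples, of the same size as the minority class

6: Learn a boosted ensemble from the minority class examples and the sampled subset. The ensemble has an adjustable decision threshold

7: Adjust the ensemble's threshold, which controls how many majority class examples are removed in the next step

8: Remove from the majority class all examples that are correctly classified by the current ensemble

9: until the required number of rounds has been completed

10: Output: A single ensemble that combines all the learned ensembles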
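The cascade in Algorithm 2 can be sketched in the same style. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions. The default base learner is a hypothetical nearest-centroid stand-in where the paper trains boosted ensembles. The threshold schedule is also assumed: each round keeps the same fraction `keep_rate` of the remaining majority examples, chosen so that the pool shrinks to roughly the minority size by the last round. The final ensemble sums each stage's score minus its threshold. All names are illustrative.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Example = Sequence[float]
Scorer = Callable[[Example], float]  # higher score means "more minority-like"
Trainer = Callable[[List[Example], List[Example]], Scorer]


def centroid_trainer(minority: List[Example], majority: List[Example]) -> Scorer:
    """Stand-in base learner (an assumption of this sketch, not boosting)."""
    def mean(rows: List[Example]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_min, c_maj = mean(minority), mean(majority)

    def sq_dist(x: Example, c: List[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))

    return lambda x: sq_dist(x, c_maj) - sq_dist(x, c_min)


def balance_cascade(minority: List[Example], majority: List[Example],
                    n_rounds: int, train: Trainer = centroid_trainer,
                    seed: int = 0) -> Tuple[Scorer, List[int]]:
    rng = random.Random(seed)
    pool = list(majority)  # majority examples still under consideration
    # Assumed schedule: the fraction of the pool that survives each round.
    keep_rate = (len(minority) / len(pool)) ** (1.0 / max(n_rounds - 1, 1))
    stages: List[Tuple[Scorer, float]] = []
    pool_sizes = [len(pool)]
    for _ in range(n_rounds):
        subset = rng.sample(pool, min(len(minority), len(pool)))
        scorer = train(minority, subset)
        # Adjust the threshold so that about keep_rate of the pool scores at
        # or above it, i.e. is still mistaken for the minority class.
        scores = sorted((scorer(x) for x in pool), reverse=True)
        n_keep = min(len(pool), max(len(minority), math.ceil(keep_rate * len(pool))))
        theta = scores[n_keep - 1]
        stages.append((scorer, theta))
        # Remove the majority examples that this stage classifies correctly.
        pool = [x for x in pool if scorer(x) >= theta]
        pool_sizes.append(len(pool))

    def model(x: Example) -> float:
        return sum(s(x) - t for s, t in stages)

    return model, pool_sizes


# Toy data: two Gaussian blobs, 500 majority and 20 minority examples.
data_rng = random.Random(1)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(500)]
minority = [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(20)]
model, pool_sizes = balance_cascade(minority, majority, n_rounds=4)
print(pool_sizes, model([3.0, 3.0]) > 0, model([0.0, 0.0]) > 0)
```

The shrinking `pool_sizes` list shows the supervised part of the method: later stages are trained on the majority examples that earlier stages found hardest, rather than on fresh independent samples.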

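This method is called BalanceCascade since it is somewhat similar to the cascade classifier. The majority training set shrinks after each classifier is trained, because the examples that the classifier already handles correctly are removed.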

BalanceCascade is similar to EasyEnsemble in structure. The main difference between them lies in lines 7 and 8 of Algorithm 2. Line 8 removes the correctly classified majority class examples from the majority training set.
