Abstract – In many practical data mining applications, such as web page classification, unlabeled training examples are readily available but labeled ones are fairly expensive to obtain. Therefore, semi-supervised learning algorithms such as co-training have attracted much attention. In this paper, a new co-training style semi-supervised learning algorithm named tri-training is proposed. This algorithm generates three classifiers from the original labeled example set. These classifiers are then refined using unlabeled examples in the tri-training process. In detail, in each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. Since tri-training neither requires the instance space to be described with sufficient and redundant views nor puts any constraints on the supervised learning algorithm, its applicability is broader than that of previous co-training style algorithms. Experiments on UCI data sets and application to the web page classification task indicate that tri-training can effectively exploit unlabeled data to improve learning performance.

Index Terms – Data Mining, Machine Learning, Learning from Unlabeled Data, Semi-supervised Learning, Co-training, Tri-training, Web Page Classification

I. INTRODUCTION

IN many practical data mining applications such as web page classification, unlabeled training examples are readily available but labeled ones are fairly expensive to obtain because they require human effort. Therefore, semi-supervised learning that exploits unlabeled examples in addition to labeled ones has become a hot topic.

Many current semi-supervised learning algorithms use a generative model for the classifier and employ Expectation Maximization (EM) to model the label estimation or parameter estimation process. For example, mixture of Gaussians, mixture of experts, and naive Bayes have been respectively used as the generative model, while EM is used to combine labeled and unlabeled data for classification. There are also many other algorithms such as using transductive inference for support vector machines to optimize performance on a specific test set, constructing a graph on the examples such that minimum cut on the graph yields an optimal labeling of the unlabeled examples according to certain optimization functions, etc.

A prominent achievement in this area is the co-training paradigm proposed by Blum and Mitchell, which trains two classifiers separately on two different views, i.e. two independent sets of attributes, and uses the predictions of each classifier on unlabeled examples to augment the training set of the other. Such an idea of utilizing the natural redundancy in the attributes has been employed in some other works. For example, Yarowsky performed word sense disambiguation by constructing a sense classifier using the local context of the word and a classifier based on the senses of other occurrences of that word in the same document; Riloff and Jones classified a noun phrase for geographic locations by considering both the noun phrase itself and the linguistic context in which the noun phrase appears; Collins and Singer performed named entity classification using both the spelling of the entity itself and the context in which the entity occurs. It is noteworthy that the co-training paradigm has already been used in many domains such as statistical parsing and noun phrase identification.

The standard co-training algorithm requires two sufficient and redundant views, that is, the attributes must be naturally partitioned into two sets, each of which is sufficient for learning and conditionally independent of the other given the class label. Dasgupta et al. have shown that when this requirement is met, the co-trained classifiers can make fewer generalization errors by maximizing their agreement over the unlabeled data. Unfortunately, such a requirement can hardly be met in most scenarios. Goldman and Zhou proposed an algorithm which does not exploit attribute partition. However, it requires using two different supervised learning algorithms that partition the instance space into a set of equivalence classes, and employing a time-consuming cross-validation technique to determine how to label the unlabeled examples and how to produce the final hypothesis.

In this paper, a new co-training style algorithm named tri-training is proposed. Tri-training requires neither sufficient and redundant views nor different supervised learning algorithms whose hypotheses partition the instance space into a set of equivalence classes. Therefore it can easily be applied to common data mining scenarios. In contrast to previous algorithms that utilize two classifiers, tri-training uses three classifiers. This setting tackles the problem of determining how to label the unlabeled examples and how to produce the final hypothesis, which contributes much to the efficiency of the algorithm. Moreover, better generalization ability can be achieved through combining these three classifiers. Experiments on UCI data sets and application to the web page classification task show that tri-training can effectively exploit unlabeled data, and the generalization ability of its final hypothesis is quite good, sometimes even outperforming that of the ensemble of three classifiers provided with the labels of all the unlabeled examples.

II. TRI-TRAINING

Let L denote the labeled example set with size |L| and U denote the unlabeled example set with size |U|. In previous co-training style algorithms, two classifiers are initially trained from L, each of which is then re-trained with the help of unlabeled examples that are labeled by the latest version of the other classifier. In order to determine which example in U should be labeled and which classifier should be biased in prediction, the confidence of the labeling of each classifier must be explicitly measured. Sometimes such a measuring process is quite time-consuming.

Assume that besides these two classifiers, i.e. h_1 and h_2, a classifier h_3 is initially trained from L. Then, for any classifier, an unlabeled example can be labeled for it as long as the other two classifiers agree on the labeling of this example, and the confidence of the labeling of the classifiers need not be explicitly measured. For instance, if h_1 and h_2 agree on the labeling of an example x in U, then x can be labeled for h_3. It is obvious that in such a scheme, if the prediction of h_1 and h_2 on x is correct, then h_3 receives a valid new example for further training; otherwise h_3 gets an example with a noisy label. However, even in the worst case, the increase in the classification noise rate can be compensated if the amount of newly labeled examples is sufficient, under certain conditions, as shown below.
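
To make the agreement-based labeling rule concrete, the following is a minimal sketch (my own illustration, not the pseudo-code of Table I) of how one round could collect new examples for h_3 from the predictions of h_1 and h_2; the classifier objects and their scikit-learn style predict method are assumptions.

```python
def agreed_examples(h1, h2, U):
    """Collect the unlabeled examples on which h1 and h2 agree.

    h1, h2 are assumed to expose a scikit-learn style predict();
    U is a list of unlabeled feature vectors. Returns the examples and
    their pseudo-labels to be added, for this round only, to the
    training set of the third classifier h3.
    """
    examples, labels = [], []
    for x in U:
        y1 = h1.predict([x])[0]
        y2 = h2.predict([x])[0]
        if y1 == y2:             # the other two classifiers agree
            examples.append(x)   # x is labeled for h3 with the agreed label
            labels.append(y1)
    return examples, labels
```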

Inspired by Goldman and Zhou, the finding of Angluin and Laird is used in the following analysis. That is, if a sequence σ of m samples is drawn, where the sample size m satisfies Eq. 1:

\[
m \ge \frac{2}{\varepsilon^{2}\left(1-2\eta\right)^{2}} \ln\frac{2N}{\delta} \tag{1}
\]

where ε is the hypothesis worst-case classification error rate, η (< 0.5) is an upper bound on the classification noise rate, N is the number of hypotheses, and δ is the confidence, then a hypothesis H_i that minimizes disagreement with σ will have the PAC property:

\[
\Pr\left[d\left(H_{i}, H^{*}\right) \ge \varepsilon\right] \le \delta \tag{2}
\]

where d(·,·) is the sum over the probability of elements from the symmetric difference between the two hypothesis sets H_i and H* (the ground-truth). Let c = 2 ln(2N/δ) and let the sample size make Eq. 1 hold with equality; then Eq. 1 becomes Eq. 3:

\[
m = \frac{c}{\varepsilon^{2}\left(1-2\eta\right)^{2}} \tag{3}
\]

To simplify the computation, it is helpful to compute the quotient of the constant c divided by the square of the error ε, i.e. Eq. 4:

\[
u = \frac{c}{\varepsilon^{2}} = m\left(1-2\eta\right)^{2} \tag{4}
\]
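
As a quick sanity check of how Eqs. 1–4 relate, the following short Python snippet uses made-up illustrative numbers (none of these values come from the paper):

```python
import math

# Illustrative (made-up) values for the quantities in Eqs. 1-4.
N, delta = 1000, 0.05           # number of hypotheses, confidence
epsilon, eta = 0.1, 0.2         # target error bound, noise-rate upper bound

c = 2 * math.log(2 * N / delta)             # c = 2 ln(2N / delta)
m = c / (epsilon**2 * (1 - 2 * eta)**2)     # Eq. 3: sample size at equality
u = c / epsilon**2                          # Eq. 4, first form: c / eps^2
assert math.isclose(u, m * (1 - 2 * eta)**2)  # Eq. 4, second form: m(1-2eta)^2

print(f"required samples m ~ {m:.0f}, utility u ~ {u:.0f}")
```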

In each round of tri-training, the classifiers h_2 and h_3 choose some examples in U to label for h_1 (and analogously for the other two classifiers). Since the classifiers are refined in the tri-training process, the amount as well as the concrete unlabeled examples chosen to label may differ in different rounds. Let L_t and L_{t-1} denote the sets of examples that are labeled for h_1 in the t-th round and the (t-1)-th round, respectively. Then the training sets for h_1 in the t-th round and the (t-1)-th round are respectively L ∪ L_t and L ∪ L_{t-1}, whose sizes m_t and m_{t-1} are |L ∪ L_t| and |L ∪ L_{t-1}|, respectively. Note that the unlabeled examples labeled in the (t-1)-th round, i.e. L_{t-1}, won't be put into the original labeled example set, i.e. L. Instead, in the t-th round all the examples in L_{t-1} will be regarded as unlabeled and put into U again.

Let η_L denote the classification noise rate of L; that is, the number of examples in L that are mislabeled is η_L |L|. Let e_t denote the upper bound of the classification error rate of h_2 & h_3 in the t-th round, i.e. the error rate of the hypothesis derived from the combination of h_2 and h_3. Assuming there are z examples on which the classification made by h_2 agrees with that made by h_3, and among these examples both h_2 and h_3 make correct classifications on z' examples, then e_t can be estimated as (z − z')/z. Thus, the number of examples in L_t that are mislabeled is e_t |L_t|. Therefore the classification noise rate in the t-th round is:

\[
\eta_{t} = \frac{\eta_{L}\lvert L\rvert + e_{t}\lvert L_{t}\rvert}{\lvert L \cup L_{t}\rvert} \tag{5}
\]

Then, according to Eq. 4, u_t can be computed as:

\[
u_{t} = m_{t}\left(1-2\eta_{t}\right)^{2} = \lvert L \cup L_{t}\rvert \left(1 - \frac{2\left(\eta_{L}\lvert L\rvert + e_{t}\lvert L_{t}\rvert\right)}{\lvert L \cup L_{t}\rvert}\right)^{2} \tag{6}
\]
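
A minimal sketch of these per-round estimates follows (my own illustrative code, not the paper's pseudo-code; the variable names and the assumption that L and L_t are disjoint are mine). By Eq. 3, a larger u_t corresponds to a smaller worst-case error bound ε, so u_t is the quantity one would like to see grow from round to round.

```python
def estimate_error_rate(num_agree, num_agree_correct):
    """e_t: error of the combined hypothesis, estimated on the examples
    where the two teaching classifiers agree (z and z' in the text)."""
    return (num_agree - num_agree_correct) / num_agree

def noise_rate(eta_L, size_L, e_t, size_Lt):
    """Eq. 5: classification noise rate of the round-t training set L u L_t
    (assuming L and L_t are disjoint, so m_t = |L| + |L_t|)."""
    m_t = size_L + size_Lt
    return (eta_L * size_L + e_t * size_Lt) / m_t

def utility(eta_L, size_L, e_t, size_Lt):
    """Eq. 6: u_t = m_t * (1 - 2*eta_t)^2."""
    m_t = size_L + size_Lt
    eta_t = noise_rate(eta_L, size_L, e_t, size_Lt)
    return m_t * (1 - 2 * eta_t) ** 2

# Illustrative (made-up) numbers: 100 labeled examples assumed noise-free,
# 400 newly labeled examples with an estimated 10% error rate.
print(utility(eta_L=0.0, size_L=100, e_t=0.10, size_Lt=400))
```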

The pseudo-code of tri-training is presented in Table I. The error-measuring function attempts to estimate the classification error rate of the hypothesis derived from the combination of h_j and h_k. Since it is difficult to estimate the classification error on the unlabeled examples, here only the original labeled examples are used, heuristically based on the assumption that the unlabeled examples hold the same distribution as that held by the labeled ones. In detail, the classification error of the hypothesis is approximated through dividing the number of labeled examples on which both h_j and h_k make an incorrect classification by the number of labeled examples on which the classification made by h_j is the same as that made by h_k. The subsampling function randomly removes examples from L_t so that only s of them remain, where s is computed according to Eq. 10.
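
The error-measuring heuristic described above can be sketched as follows (a minimal illustration over the original labeled set, not the Table I pseudo-code itself; scikit-learn style classifier objects with a predict method are assumed, and the fallback value when the two classifiers never agree is my own choice):

```python
import numpy as np

def measure_error(hj, hk, X_labeled, y_labeled):
    """Approximate the error of the combined hypothesis hj & hk on the
    original labeled examples: among the examples where hj and hk agree,
    return the fraction that both classify incorrectly."""
    pred_j = hj.predict(X_labeled)
    pred_k = hk.predict(X_labeled)
    agree = pred_j == pred_k
    if not np.any(agree):
        return 0.5                          # no agreement: fall back (assumption)
    both_wrong = agree & (pred_j != y_labeled)
    return both_wrong.sum() / agree.sum()
```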

It is noteworthy that the initial classifiers in tri-training should be diverse, because if all the classifiers are identical, then for any of these classifiers, the unlabeled examples labeled by the other two classifiers will be the same as those labeled by the classifier for itself. Thus, tri-training degenerates to self-training with a single classifier. In the standard co-training algorithm, the use of sufficient and redundant views enables the classifiers to be different. In fact, previous research has shown that even when there is no natural attribute partition, if there is sufficient redundancy among the attributes then a fairly reasonable attribute partition will enable co-training to exhibit advantages. In the extended co-training algorithm which does not require sufficient and redundant views, the diversity among the classifiers is achieved through using different supervised learning algorithms. Since the tri-training algorithm assumes neither sufficient and redundant views nor different supervised learning algorithms, the diversity of the classifiers has to be sought from other channels. Indeed, here the diversity is obtained through manipulating the original labeled example set. In detail, the initial classifiers are trained from data sets generated via bootstrap sampling from the original labeled example set. These classifiers are then refined in the tri-training process, and the final hypothesis is produced via majority voting. The generation of the initial classifiers resembles training an ensemble from the labeled example set with a popular ensemble learning algorithm, that is, Bagging.
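
The initialization and final prediction described above can be sketched as follows (a minimal illustration assuming scikit-learn style classifiers; the decision tree is only a placeholder base learner, and the refinement rounds in between are omitted):

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def init_classifiers(base, X_labeled, y_labeled, rng):
    """Train three initial classifiers on bootstrap samples of the labeled
    set (as in Bagging) so that they start out diverse."""
    classifiers = []
    n = len(X_labeled)
    for _ in range(3):
        idx = rng.integers(0, n, size=n)   # bootstrap sample with replacement
        clf = clone(base).fit(X_labeled[idx], y_labeled[idx])
        classifiers.append(clf)
    return classifiers

def predict_majority(classifiers, X):
    """Final hypothesis: majority vote of the three refined classifiers
    (assumes integer-coded class labels)."""
    votes = np.stack([clf.predict(X) for clf in classifiers]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

For instance, init_classifiers(DecisionTreeClassifier(), X, y, np.random.default_rng(0)) would yield three trees trained on different bootstrap replicates, which the tri-training rounds would then refine before predict_majority combines them.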

Tri-training can be regarded as a new extension to the co-training algorithms. As mentioned before, Blum and Mitchell's algorithm requires the instance space to be described by two sufficient and redundant views, which can hardly be satisfied in common data mining scenarios. Since tri-training does not rely on different views, its applicability is broader. Goldman and Zhou's algorithm does not rely on different views either. However, their algorithm requires two different supervised learning algorithms that partition the instance space into a set of equivalence classes. Moreover, their algorithm frequently uses 10-fold cross validation on the original labeled example set to determine how to label the unlabeled examples and how to produce the final hypothesis. If the original labeled example set is rather small, cross validation will exhibit high variance and is not helpful for model selection. Also, the frequently used cross validation makes the learning process time-consuming. Since tri-training neither puts any constraint on the supervised learning algorithm nor employs time-consuming cross validation processes, both its applicability and efficiency are better.
