Abstract – In many practical data mining applications such as web page classification, unlabeled training examples are readily available but labeled ones are fairly expensive to obtain. Therefore, semi-supervised learning algorithms such as co-training have attracted much attention. In this paper, a new co-training style semi-supervised learning algorithm named tri-training is proposed. This algorithm generates three classifiers from the original labeled example set. These classifiers are then refined using unlabeled examples in the tri-training process. In detail, in each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. Since tri-training neither requires the instance space to be described with sufficient and redundant views nor puts any constraints on the supervised learning algorithm, its applicability is broader than that of previous co-training style algorithms. Experiments on UCI data sets and application to the web page classification task indicate that tri-training can effectively exploit unlabeled data to enhance the learning performance.

Index Terms – Data Mining, Machine Learning, Learning from Unlabeled Data, Semi-supervised Learning, Co-training, Tri-training, Web Page Classification

I. INTRODUCTION

In many practical data mining applications such as web page classification, unlabeled training examples are readily available but labeled ones are fairly expensive to obtain because they require human effort. Therefore, semi-supervised learning that exploits unlabeled examples in addition to labeled ones has become a hot topic.

Many current semi-supervised learning algorithms use a generative model for the classifier and employ Expectation Maximization (EM) to model the label estimation or parameter estimation process. For example, mixture of Gaussians, mixture of experts, and naive Bayes have been respectively used as the generative model, while EM is used to combine labeled and unlabeled data for classification. There are also many other algorithms such as using transductive inference for support vector machines to optimize performance on a specific test set, constructing a graph on the examples such that minimum cut on the graph yields an optimal labeling of the unlabeled examples according to certain optimization functions, etc.

A prominent achievement in this area is the co-training paradigm proposed by Blum and Mitchell, which trains two classifiers separately on two different views, i.e. two independent sets of attributes, and uses the predictions of each classifier on unlabeled examples to augment the training set of the other. Such an idea of utilizing the natural redundancy in the attributes has been employed in some other works. For example, Yarowsky performed word sense disambiguation by constructing a sense classifier using the local context of the word and a classifier based on the senses of other occurrences of that word in the same document; Riloff and Jones classified a noun phrase for geographic locations by considering both the noun phrase itself and the linguistic context in which the noun phrase appears; Collins and Singer performed named entity classification using both the spelling of the entity itself and the context in which the entity occurs. It is noteworthy that the co-training paradigm has already been used in many domains such as statistical parsing and noun phrase identification.

The standard co-training algorithm requires two sufficient and redundant views, that is, the attributes must be naturally partitioned into two sets, each of which is sufficient for learning and conditionally independent of the other given the class label. Dasgupta et al. have shown that when this requirement is met, the co-trained classifiers can make fewer generalization errors by maximizing their agreement over the unlabeled data. Unfortunately, such a requirement can hardly be met in most scenarios. Goldman and Zhou proposed an algorithm which does not exploit an attribute partition. However, it requires using two different supervised learning algorithms that partition the instance space into a set of equivalence classes, and employing a time-consuming cross validation technique to determine how to label the unlabeled examples and how to produce the final hypothesis.

In this paper, a new co-training style algorithm named tri-training is proposed. Tri-training does not require sufficient and redundant views, nor does it require the use of different supervised learning algorithms whose hypotheses partition the instance space into a set of equivalence classes. Therefore it can be easily applied to common data mining scenarios. In contrast to previous algorithms that utilize two classifiers, tri-training uses three classifiers. This setting tackles the problem of determining how to label the unlabeled examples and how to produce the final hypothesis, which contributes much to the efficiency of the algorithm. Moreover, better generalization ability can be achieved through combining these three classifiers. Experiments on UCI data sets and application to the web page classification task show that tri-training can effectively exploit unlabeled data, and the generalization ability of its final hypothesis is quite good, sometimes even outperforming that of an ensemble of three classifiers provided with the labels of all the unlabeled examples.

II. TRI-TRAINING

Let $L$ denote the labeled example set with size $|L|$ and $U$ denote the unlabeled example set with size $|U|$. In previous co-training style algorithms, two classifiers are initially trained from $L$, each of which is then re-trained with the help of unlabeled examples that are labeled by the latest version of the other classifier. In order to determine which example in $U$ should be labeled and which classifier should be biased in prediction, the confidence of the labeling of each classifier must be explicitly measured. Sometimes such a measuring process is quite time-consuming.

Assume that besides these two classifiers, i.e. $h_1$ and $h_2$, a third classifier $h_3$ is initially trained from $L$. Then, for any classifier, an unlabeled example can be labeled for it as long as the other two classifiers agree on the labeling of this example, while the confidence of the labeling of the individual classifiers does not need to be explicitly measured. For instance, if $h_2$ and $h_3$ agree on the labeling of an example $x$ in $U$, then $x$ can be labeled for $h_1$. It is obvious that in such a scheme, if the prediction of $h_2$ and $h_3$ on $x$ is correct, then $h_1$ receives a valid new example; otherwise $h_1$ will get an example with a noisy label. However, even in the worst case, the increase in the classification noise rate can be compensated if the amount of newly labeled examples is sufficient, under certain conditions, as shown below.
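To make the agreement-based labeling concrete, the following minimal Python sketch collects, for one of the three classifiers, the unlabeled examples on which the other two classifiers agree. The function name and the scikit-learn-style predict() interface are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

def agreement_labels_for(target_idx, classifiers, U_X):
    """Collect unlabeled examples on which the two classifiers other than
    classifiers[target_idx] agree, together with the agreed-upon label.
    These (example, label) pairs are the candidates added to the target
    classifier's training set in one tri-training round."""
    others = [h for i, h in enumerate(classifiers) if i != target_idx]
    pred_a = others[0].predict(U_X)
    pred_b = others[1].predict(U_X)
    agree = pred_a == pred_b          # the other two classifiers agree here
    return U_X[agree], pred_a[agree]  # newly labeled set L_t and its labels
```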

Inspired by Goldman and Zhou, the finding of Angluin and Laird is used in the following analysis. That is, if a sequence $\sigma$ of $m$ samples is drawn, where the sample size $m$ satisfies Eq. 1:

$$m \ge \frac{2}{\epsilon^2 (1 - 2\eta)^2} \ln\!\left(\frac{2N}{\delta}\right) \qquad (1)$$

where $\epsilon$ is the hypothesis worst-case classification error rate, $\eta < 0.5$ is an upper bound on the classification noise rate, $N$ is the number of hypotheses, and $\delta$ is the confidence, then a hypothesis $H_i$ that minimizes disagreement with $\sigma$ will have the PAC property:

$$\Pr\!\left[d(H_i, H^{*}) \ge \epsilon\right] \le \delta \qquad (2)$$

where $d(\cdot,\cdot)$ is the sum over the probability of elements from the symmetric difference between the two hypothesis sets $H_i$ and $H^{*}$ (the ground truth). Let $c = 2\ln(2N/\delta)$ and let $m$ be the sample size that makes Eq. 1 hold with equality; then Eq. 1 becomes Eq. 3:

$$m = \frac{c}{\epsilon^2 (1 - 2\eta)^2} \qquad (3)$$

To simplify the computation, it is helpful to compute the quotient of the constant $c$ divided by the square of the error, which by Eq. 3 can be written as Eq. 4:

$$u = \frac{c}{\epsilon^2} = m\,(1 - 2\eta)^2 \qquad (4)$$
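To see why a larger but slightly noisier training set can still tighten the worst-case error bound, consider a purely hypothetical numeric illustration (the numbers are not from the paper). With $m = 100$ examples and noise rate $\eta = 0.10$, Eq. 4 gives

$$u = m(1 - 2\eta)^2 = 100 \times 0.8^2 = 64,$$

whereas with $m = 200$ and $\eta = 0.15$,

$$u = 200 \times 0.7^2 = 98.$$

Since $c$ is fixed and $u = c/\epsilon^2$ grows from 64 to 98, the worst-case error rate $\epsilon = \sqrt{c/u}$ decreases even though the noise rate has risen.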

In each round of tri-training, the classifiers $h_2$ and $h_3$ choose some examples in $U$ to label for $h_1$. Since the classifiers are refined in the tri-training process, the amount as well as the concrete unlabeled examples chosen to label may be different in different rounds. Let $L_t$ and $L_{t-1}$ denote the set of examples that are labeled for $h_1$ in the $t$-th round and the $(t-1)$-th round, respectively. Then the training sets for $h_1$ in the $t$-th round and the $(t-1)$-th round are respectively $L \cup L_t$ and $L \cup L_{t-1}$, whose sample sizes are $|L| + |L_t|$ and $|L| + |L_{t-1}|$, respectively. Note that the unlabeled examples labeled in the $(t-1)$-th round, i.e. $L_{t-1}$, won't be put into the original labeled example set, i.e. $L$. Instead, in the $t$-th round all the examples in $L_{t-1}$ will be regarded as unlabeled and put into $U$ again.

Let $\eta_L$ denote the classification noise rate of $L$; that is, the number of examples in $L$ that are mislabeled is $\eta_L |L|$. Let $\hat{e}_t$ denote the upper bound of the classification error rate of $h_2\,\&\,h_3$ in the $t$-th round, i.e. the error rate of the hypothesis derived from the combination of $h_2$ and $h_3$. Assuming there are $z$ examples on which the classification made by $h_2$ agrees with that made by $h_3$, and among these examples both $h_2$ and $h_3$ make correct classifications on $z'$ examples, then $\hat{e}_t$ can be estimated as $(z - z')/z$. Thus, the number of examples in $L_t$ that are mislabeled is $\hat{e}_t |L_t|$. Therefore the classification noise rate in the $t$-th round is:

$$\eta_t = \frac{\eta_L |L| + \hat{e}_t |L_t|}{|L| + |L_t|} \qquad (5)$$

Then, according to Eq. 4, $u_t$ can be computed as:

$$u_t = \frac{c}{\epsilon_t^2} = \left(|L| + |L_t|\right)\left(1 - 2\eta_t\right)^2 \qquad (6)$$
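The following Python sketch implements Eqs. 5 and 6 directly and checks, on made-up numbers, that a round which enlarges the newly labeled set while lowering the estimated error $\hat{e}_t$ yields a larger $u$, i.e. a tighter worst-case error bound; the helper names and the concrete numbers are illustrative assumptions.

```python
def noise_rate(eta_L, size_L, e_t, size_Lt):
    """Eq. 5: classification noise rate of L union L_t in round t."""
    return (eta_L * size_L + e_t * size_Lt) / (size_L + size_Lt)

def u_value(eta_L, size_L, e_t, size_Lt):
    """Eq. 6: u_t = c / eps_t^2 = (|L| + |L_t|) * (1 - 2 * eta_t)^2."""
    eta_t = noise_rate(eta_L, size_L, e_t, size_Lt)
    return (size_L + size_Lt) * (1.0 - 2.0 * eta_t) ** 2

# Hypothetical check with 100 noise-free labeled examples (eta_L = 0):
u_prev = u_value(eta_L=0.0, size_L=100, e_t=0.30, size_Lt=40)   # round t-1
u_curr = u_value(eta_L=0.0, size_L=100, e_t=0.10, size_Lt=100)  # round t
assert u_curr > u_prev  # the error bound improves from round t-1 to round t
```

Note that in this example $\hat{e}_t|L_t| = 10 < \hat{e}_{t-1}|L_{t-1}| = 12$ while $|L_t| > |L_{t-1}|$ and $\hat{e}_t < \hat{e}_{t-1}$, which is the kind of situation in which tri-training accepts the newly labeled set.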

The pseudo-code of tri-training is presented in Table I. The error estimation function attempts to estimate the classification error rate of the hypothesis derived from the combination of $h_j$ and $h_k$. Since it is difficult to estimate the classification error on the unlabeled examples, here only the original labeled examples are used, heuristically based on the assumption that the unlabeled examples hold the same distribution as that held by the labeled ones. In detail, the classification error of the hypothesis is approximated through dividing the number of labeled examples on which both $h_j$ and $h_k$ make an incorrect classification by the number of labeled examples on which the classification made by $h_j$ is the same as that made by $h_k$. The subsampling function randomly removes $|L_t| - s$ examples from $L_t$, where $s$ is computed according to Eq. 10.
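Since Table I itself is not reproduced here, the sketch below gives a hypothetical Python rendering of the two helpers just described: the error estimate computed on the original labeled examples, and the subsampling step. The function names and the scikit-learn-style predict() interface are assumptions for illustration.

```python
import numpy as np

def measure_error(h_j, h_k, L_X, L_y):
    """Approximate the error rate of the combined hypothesis of h_j and h_k:
    (# labeled examples both misclassify) / (# labeled examples where they agree)."""
    pred_j = h_j.predict(L_X)
    pred_k = h_k.predict(L_X)
    agree = pred_j == pred_k
    if agree.sum() == 0:
        return 0.5                 # no agreement at all: fall back to a pessimistic guess
    both_wrong = agree & (pred_j != L_y)
    return both_wrong.sum() / agree.sum()

def subsample(Lt_X, Lt_y, s, rng=None):
    """Randomly keep s of the newly labeled examples, i.e. remove |L_t| - s of them."""
    rng = np.random.default_rng(0) if rng is None else rng
    keep = rng.choice(len(Lt_y), size=s, replace=False)
    return Lt_X[keep], Lt_y[keep]
```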

It is noteworthy that the initial classifiers in tri-training should be diverse, because if all the classifiers are identical, then for any of these classifiers, the unlabeled examples labeled by the other two classifiers will be the same as those labeled by the classifier for itself. Thus, tri-training degenerates to self-training with a single classifier. In the standard co-training algorithm, the use of sufficient and redundant views enables the classifiers to be different. In fact, previous research has shown that even when there is no natural attribute partition, if there is sufficient redundancy among the attributes then a fairly reasonable attribute partition will enable co-training to exhibit advantages. In the extended co-training algorithm which does not require sufficient and redundant views, the diversity among the classifiers is achieved through using different supervised learning algorithms. Since the tri-training algorithm assumes neither sufficient and redundant views nor different supervised learning algorithms, the diversity of the classifiers has to be sought from other channels. Indeed, here the diversity is obtained through manipulating the original labeled example set. In detail, the initial classifiers are trained from data sets generated via bootstrap sampling from the original labeled example set. These classifiers are then refined in the tri-training process, and the final hypothesis is produced via majority voting. The generation of the initial classifiers resembles training an ensemble from the labeled example set with a popular ensemble learning algorithm, namely Bagging.
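A minimal sketch of this initialization and of the majority-vote final hypothesis is given below, assuming a decision tree as the base learner and integer class labels; both choices, as well as the helper names, are illustrative and not prescribed by the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def init_classifiers(L_X, L_y, n=3, seed=0):
    """Train n initial classifiers on bootstrap samples of the labeled set
    (Bagging-style), so that the classifiers start out diverse."""
    classifiers = []
    for i in range(n):
        X_b, y_b = resample(L_X, L_y, random_state=seed + i)  # bootstrap sample of (L_X, L_y)
        classifiers.append(DecisionTreeClassifier(random_state=seed + i).fit(X_b, y_b))
    return classifiers

def predict_majority(classifiers, X):
    """Final hypothesis: majority vote of the refined classifiers (integer labels assumed)."""
    preds = np.stack([h.predict(X) for h in classifiers])  # shape: (n_classifiers, n_samples)
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, preds)
```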

Tri-training can be regarded as a new extension to the co-training algorithms. As mentioned before, Blum and Mitchell's algorithm requires that the instance space be described by two sufficient and redundant views, which can hardly be satisfied in common data mining scenarios. Since tri-training does not rely on different views, its applicability is broader. Goldman and Zhou's algorithm does not rely on different views either. However, their algorithm requires two different supervised learning algorithms that partition the instance space into a set of equivalence classes. Moreover, their algorithm frequently uses 10-fold cross validation on the original labeled example set to determine how to label the unlabeled examples and how to produce the final hypothesis. If the original labeled example set is rather small, cross validation will exhibit high variance and is not helpful for model selection. Also, the frequently used cross validation makes the learning process time-consuming. Since tri-training neither puts any constraint on the supervised learning algorithm nor employs time-consuming cross validation processes, both its applicability and efficiency are better.
