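[Paper] A Link-Based Approach to the Cluster Ensemble Problem

Authors: Natthakan Iam-On, Tossapon Boongoen, Simon Garrett, and Chris Price

Next time I should again write the paper summary before presenting; otherwise some points are easy to forget during the talk. Whether I back-fill summaries for papers I read earlier depends on whether I find the time.

Preface:

This paper is about cluster ensembles. The established cluster ensemble framework pools the results of several clustering algorithms and then applies a consensus function to obtain the final clustering. The paper argues that the pooled result handed from the first step to the second is raw and rather coarse, so it proposes an algorithm that further processes the pooled clustering results before the consensus function is applied.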

Summary:

This paper presents a new link-based approach to improve the conventional matrix.
Three new link-based algorithms are proposed for the underlying similarity assessment.
The final clustering result is generated from the refined matrix using two different consensus functions of feature-based and graph-based partitioning.

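The conventional matrix is the pooled result mentioned in the preface.

The goal of the algorithm is to uncover the relationship (similarity) between a sample and the clusters it does not belong to within a base clustering.

The refined matrix is called the RA matrix. Two kinds of consensus operation can be run on it: feature-based and graph-partitioning-based.

There are three methods for refining the pooled matrix.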

It aims to refine the ensemble-information matrix using the similarity between clusters in the ensemble under examination.

　　◦Weighted Connected-Triple (WCT)

　　◦Weighted Triple-Quality (WTQ)

　　◦Combined Similarity Measure (CSM)

There are two consensus functions:

Two new consensus methods are proposed to derive the ultimate clustering result:

　　◦ feature-based partitioning (FBP)

　　◦ bipartite graph partitioning (BGP)

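Below is some notation; the figure makes it clearer. There are N samples, the cluster ensemble framework uses M clustering methods, each producing a result π, and each result π may contain a different number of clusters, which are denoted by C: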

X = {x1 . . . xN} be a set of N data points

Π = {π1 . . . πM} be a cluster ensemble with M base clusterings

Each base clustering returns a set of clusters

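Panel (a) shows two clusterings of the samples, π₁ and π₂. The pooled result can then be expressed in three ways, panels (b)-(d). Panel (d) is the one used later; this matrix is what the authors regard as the coarse clustering result.

N = 5: total number of samples

M = 2: number of clustering methods in the ensemble framework

K1 = 3, K2 = 2: number of clusters produced by each clustering method

The cluster ensemble problem:

The problem is to find a new partition π* of a data set X that summarizes the information from the cluster ensemble Π.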
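To make the coarse matrix of panel (d) concrete, here is a minimal sketch that builds a binary cluster-association matrix from label vectors. The label vectors are hypothetical (the figure's actual assignments are not reproduced in these notes); they only match the stated sizes N = 5, M = 2, K1 = 3, K2 = 2. The function name and the `C<clustering><cluster>` column naming are my own choices.

```python
# Hypothetical label vectors: labels[m][i] is the cluster id of point x_{i+1}
# in base clustering pi_{m+1}. These are NOT taken from the paper's figure.
labels = [
    [0, 0, 1, 1, 2],  # pi_1 with K1 = 3 clusters
    [0, 1, 0, 1, 1],  # pi_2 with K2 = 2 clusters
]


def binary_association_matrix(labels):
    """Return (matrix, names): one row per point, one column per cluster."""
    n_points = len(labels[0])
    # Every cluster of every base clustering becomes one column.
    columns = [(m, c) for m, base in enumerate(labels) for c in sorted(set(base))]
    matrix = [
        [1 if labels[m][i] == c else 0 for (m, c) in columns]
        for i in range(n_points)
    ]
    names = [f"C{m + 1}{c + 1}" for (m, c) in columns]
    return matrix, names


matrix, names = binary_association_matrix(labels)
print(names)
for row in matrix:
    print(row)
```

Each row contains exactly one 1 per base clustering; every other entry is 0, which is the information the refinement step later replaces with similarities.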

This metalevel method involves two major tasks of:

◦1) generating a cluster ensemble

◦2) producing the final partition (normally referred to as a “consensus function”).

To obtain diverse clustering results, the ways of generating the ensemble can be summarized roughly as follows:

Cluster models:

◦Homogeneous ensembles

◦Different-k

　　One of the most successful techniques is randomly selecting the number of clusters (k) for each ensemble member

◦Data subspace/subsample

◦Heterogeneous ensembles

◦Mixed heuristics

　　In addition to using one of the aforementioned methods, any combination of them can be applied

The consensus functions are summarized as follows:

Consensus methods:

◦Feature-based approach

　　It transforms the problem of cluster ensembles to the clustering of categorical data.

◦Direct approach

◦Pairwise similarity approach

◦Graph-based approach
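As a small illustration of the pairwise similarity approach in the list above, the sketch below computes a co-association matrix: the fraction of base clusterings in which two points share a cluster. The label vectors are hypothetical and the function name is mine; this is the generic construction, not something specific to this paper.

```python
# Hypothetical ensemble: labels[m][i] is the cluster id of point i in clustering m.
labels = [
    [0, 0, 1, 1, 2],
    [0, 1, 0, 1, 1],
]


def co_association(labels):
    """Entry (i, j) is the share of base clusterings that put i and j together."""
    n_points = len(labels[0])
    n_clusterings = len(labels)
    return [
        [
            sum(1 for base in labels if base[i] == base[j]) / n_clusterings
            for j in range(n_points)
        ]
        for i in range(n_points)
    ]


co = co_association(labels)
for row in co:
    print(row)
```

Any similarity-based clustering method can then be run on this matrix to obtain the final partition.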

The paper's novelty is the refinement step inserted between these two steps:

NOVEL LINK-BASED APPROACH:

◦1) generating a cluster ensemble

◦2) creating the refined ensemble-information matrix using a link-based similarity algorithm

◦3) producing the final partition (normally referred to as a “consensus function”).

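Computing the RA matrix: from the coarse matrix we already know the entries equal to 1. RA simply replaces each 0 in panel (d) with the similarity between xi and the cluster C; that is what refinement means. The similarity between the cluster xi belongs to and the target cluster C is computed, and that value is used as the similarity between xi and C, replacing the 0.

Before this can be done, the similarity between clusters of π₁ and clusters of π₂ has to be computed first, that is, across two clusterings. How to compute the similarity between clusters within a single π is the problem this algorithm solves.

L_z ⊂ X denotes the set of data points belonging to cluster C_z ∈ π.

The formula is as follows:

w_xy = |L_x ∩ L_y| / |L_x ∪ L_y|

Illustration:

Cluster C11 contains samples x1, x2; cluster C21 contains samples x1, x3

<C11,C21> = |{x1}| / |{x1, x2, x3}| = 1/3

Building on the above, here is the algorithm itself. There are three ways to compute the similarity between clusters within one clustering: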
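The worked example above is the ratio of shared members to combined members of the two clusters. A minimal sketch that reproduces it with Python sets (the function name `cluster_weight` is my own):

```python
def cluster_weight(members_x, members_y):
    """Shared members divided by combined members of two clusters."""
    union = members_x | members_y
    if not union:
        return 0.0  # two empty clusters share nothing
    return len(members_x & members_y) / len(union)


C11 = {"x1", "x2"}
C21 = {"x1", "x3"}
w = cluster_weight(C11, C21)
print(w)  # one shared point out of three distinct points
```

Two clusters from the same base clustering are disjoint, so their weight is always 0; that is why a separate similarity measure is needed for them.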

Weighted Connected-Triple (WCT):

　　◦WCT extends the Connected-Triple method.

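　　◦Formally, a triple, Triple = (V_Triple, E_Triple), is a subgraph of G’ containing three vertices V_Triple = {v_x, v_y, v_z} ⊂ V and two edges E_Triple = {e_xz, e_yz} ⊂ E, with e_xy ∉ E.

　　◦DC ∈ [0,1] is a constant decay factor

The first formula computes the similarity of x and y with respect to z, where x and y are cluster labels from the same base clustering and z is a cluster label from another base clustering.

The second formula sums the results of the first.

The third formula normalizes the sum and then applies the decay factor: entries known directly in the RA matrix are 1, so a computed similarity should be somewhat smaller.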
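The three formulas themselves were images and are missing from these notes, so the following is only a sketch of how I understand WCT, under stated assumptions: the per-triple value is min(w_xz, w_yz), the values are summed over all z, and the sum is divided by the largest sum and multiplied by DC. Taking the maximum over all within-clustering pairs of the whole ensemble is my assumption, and the ensemble itself is a hypothetical example (only C11 and C21 match the illustration earlier in the post).

```python
from itertools import combinations


def jaccard(a, b):
    """Weight between two clusters: shared members over combined members."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def wct_similarity(ensemble, dc=0.9):
    """ensemble: list of base clusterings, each a dict {cluster name: set of points}.

    Returns {(x, y): similarity} for cluster pairs inside the same base clustering.
    """
    clusters = {name: pts for base in ensemble for name, pts in base.items()}
    raw = {}
    for base in ensemble:
        for x, y in combinations(sorted(base), 2):
            total = 0.0
            for z, pts in clusters.items():
                if z in (x, y):
                    continue
                # Assumed per-triple value: the weaker of the two links to z.
                total += min(jaccard(clusters[x], pts), jaccard(clusters[y], pts))
            raw[(x, y)] = total
    wct_max = max(raw.values(), default=0.0)
    # Normalise by the largest sum, then damp with the decay factor DC.
    return {p: (v / wct_max * dc if wct_max > 0 else 0.0) for p, v in raw.items()}


# Hypothetical ensemble with N = 5, K1 = 3, K2 = 2.
ensemble = [
    {"C11": {"x1", "x2"}, "C12": {"x3", "x4"}, "C13": {"x5"}},
    {"C21": {"x1", "x3"}, "C22": {"x2", "x4", "x5"}},
]
sim = wct_similarity(ensemble, dc=0.9)
for pair, value in sorted(sim.items()):
    print(pair, round(value, 4))
```

With this normalization the most strongly connected pair always receives exactly DC, so no computed similarity can reach the value 1 reserved for known memberships.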

Illustration: this completes the RA matrix. For example, the entry for x3 and C11 is the similarity between the cluster x3 belongs to (C12) and C11, i.e. 0.9.

Weighted Triple-Quality (WTQ):

　　◦WTQ is inspired by an earlier measure that evaluates the association between personal home pages.
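The filling rule can be sketched as follows: an entry is 1 when the point belongs to the cluster, otherwise it is the similarity between the point's own cluster in that base clustering and the target cluster. The assignments and most similarity values are hypothetical; only the 0.9 between C12 and C11 comes from the example in the post.

```python
def refined_association_matrix(assignments, similarity):
    """assignments[m][i]: name of the cluster of point i in base clustering m.
    similarity: {(a, b): value} for cluster pairs of the same clustering
    (looked up in either order, 0.0 when missing).
    """
    def sim(a, b):
        return similarity.get((a, b), similarity.get((b, a), 0.0))

    columns = [c for base in assignments for c in sorted(set(base))]
    owner = {c: m for m, base in enumerate(assignments) for c in set(base)}
    n_points = len(assignments[0])
    ra = []
    for i in range(n_points):
        row = []
        for c in columns:
            own = assignments[owner[c]][i]  # cluster of x_i in c's base clustering
            row.append(1.0 if own == c else sim(own, c))
        ra.append(row)
    return columns, ra


# Hypothetical assignments (N = 5, K1 = 3, K2 = 2) and similarity values.
assignments = [
    ["C11", "C11", "C12", "C12", "C13"],
    ["C21", "C22", "C21", "C22", "C22"],
]
similarity = {
    ("C11", "C12"): 0.9,  # value quoted in the post
    ("C11", "C13"): 0.4,  # assumed
    ("C12", "C13"): 0.4,  # assumed
    ("C21", "C22"): 0.8,  # assumed
}
columns, ra = refined_association_matrix(assignments, similarity)
print(columns)
for row in ra:
    print(row)
```

The row for x3 keeps its 1 under C12 and C21 and receives 0.9 under C11, matching the example above.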

　　◦Note that the method gives high weights to rare features and low weights to features that are common to most of the pages.

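N_z ⊂ V denotes the set of vertices that are directly linked to the vertex v_z, such that ∀ v_t ∈ N_z, w_zt > 0.

The first formula is the weight of x and y with respect to z; its denominator is simply the sum of the weights w attached to z.

The rest is the same as above.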
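The WTQ formulas are also missing from these notes, so this sketch rests on my reading of the description: every common neighbour z of x and y contributes 1 / W_z, where W_z is the sum of the weights on the edges attached to z, and the sum is then normalized by its maximum and multiplied by DC, as for WCT. The ensemble is hypothetical and the choice of maximum (over all within-clustering pairs) is an assumption.

```python
from itertools import combinations


def jaccard(a, b):
    """Weight between two clusters: shared members over combined members."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def wtq_similarity(ensemble, dc=0.9):
    """ensemble: list of base clusterings, each a dict {cluster name: set of points}."""
    clusters = {name: pts for base in ensemble for name, pts in base.items()}
    # W_z: total weight of the edges attached to cluster z.
    strength = {
        z: sum(jaccard(clusters[z], clusters[t]) for t in clusters if t != z)
        for z in clusters
    }
    raw = {}
    for base in ensemble:
        for x, y in combinations(sorted(base), 2):
            total = 0.0
            for z in clusters:
                if z in (x, y):
                    continue
                linked_to_x = jaccard(clusters[x], clusters[z]) > 0
                linked_to_y = jaccard(clusters[y], clusters[z]) > 0
                if linked_to_x and linked_to_y:
                    # A neighbour with little total weight (a rare one) counts more.
                    total += 1.0 / strength[z]
            raw[(x, y)] = total
    wtq_max = max(raw.values(), default=0.0)
    return {p: (v / wtq_max * dc if wtq_max > 0 else 0.0) for p, v in raw.items()}


# Hypothetical ensemble with N = 5, K1 = 3, K2 = 2.
ensemble = [
    {"C11": {"x1", "x2"}, "C12": {"x3", "x4"}, "C13": {"x5"}},
    {"C21": {"x1", "x3"}, "C22": {"x2", "x4", "x5"}},
]
sim = wtq_similarity(ensemble, dc=0.9)
for pair, value in sorted(sim.items()):
    print(pair, round(value, 4))
```

The 1 / W_z term is what gives rare neighbours a high weight and common ones a low weight, in line with the note about rare and common features.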

Combined Similarity Measure (CSM):

　　With the objective of obtaining a robust similarity evaluation, this particular algorithm combines the WCT and WTQ measures previously described.

The two methods above are combined into a third.

Choice of consensus method:

Consensus Methods for the RA Matrix：

　　◦Feature-Based Partitioning

　　　　k-means (KM)

　　　　k-medoids (PAM)

　　◦Bipartite Graph Partitioning

　　　　weight SPEC graph-partitioning
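For the feature-based route, each row of the RA matrix is treated as a feature vector and handed to an ordinary clustering algorithm. Below is a minimal sketch using a plain Lloyd-style k-means written in pure Python; the RA values are hypothetical, and the deterministic initialization (the first k distinct rows) is my simplification, not something taken from the paper.

```python
def kmeans(rows, k, iterations=100):
    """Plain Lloyd k-means on a list of equal-length numeric rows."""
    # Simplified deterministic initialisation: the first k distinct rows.
    centers = []
    for r in rows:
        if list(r) not in centers:
            centers.append(list(r))
        if len(centers) == k:
            break
    if len(centers) < k:
        raise ValueError("fewer than k distinct rows")

    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: dist2(r, centers[c])) for r in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical RA matrix: rows x1..x5, columns C11, C12, C13, C21, C22.
ra = [
    [1.0, 0.9, 0.4, 1.0, 0.8],
    [1.0, 0.9, 0.4, 0.8, 1.0],
    [0.9, 1.0, 0.4, 1.0, 0.8],
    [0.9, 1.0, 0.4, 0.8, 1.0],
    [0.4, 0.4, 1.0, 0.8, 1.0],
]
final_labels = kmeans(ra, k=2)
print(final_labels)
```

Because the refined entries are real-valued, distance-based methods such as k-means or PAM can be applied directly, which would be much less informative on the 0/1 matrix.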

I will skip the experimental results; download the paper if you are interested.
