Introduction to the Hierarchical Clustering Algorithm

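I suddenly felt like writing up a few clustering algorithms. Since my ability is limited, I'll start with the hierarchical clustering algorithm. The idea behind it is simple, but implementing it feels fairly involved. I first read about it in "Programming Collective Intelligence", where it is implemented in Python; Python's lists and dictionaries are convenient, so the algorithm is manageable there. Here I rewrote it in C++, and the code feels rather bloated, probably because my C++ isn't good enough and I'm still not fluent with the containers. I'm posting it in the hope that some expert will see it and offer a pointer or two, for which I'd be very grateful. Enough preamble, on to the topic.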

************************************************************************************************************

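Overview of the Hierarchical Clustering Algorithm

Hierarchical clustering can be implemented in two ways. The first (agglomerative) initially treats every data sample as its own cluster and works by merging: each iteration merges the two clusters with the smallest "distance" between them, until everything has been merged into a single cluster (you can of course stop earlier, once the number of clusters you want has been reached). The second (divisive) is the exact opposite: it starts with all data samples in one cluster and works by splitting. It is not implemented here, so I won't say more about it.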

************************************************************************************************************

Algorithm Steps and Related Questions

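Steps: (1) Initially, treat each data sample as a cluster (and choose a way to measure the distance between two clusters).

       (2) Compute the distance between every pair of clusters, and select the two clusters with the smallest distance.

       (3) Merge the two clusters selected in (2), add the resulting new cluster to the cluster set, and delete the two merged clusters.

       (4) Repeat (2) and (3) until only one element is left in the cluster set.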
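
As a compact illustration of steps (1) to (4), here is a minimal sketch of the merge loop that also supports stopping early at a chosen number of clusters. It is not the implementation given later in this post: the names `euclidean` and `agglomerate` are made up for this example, it assumes Euclidean distance instead of Pearson, and it represents each cluster only by its averaged attribute vector.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<double> Point;

// Hypothetical helper: Euclidean distance between two equal-length vectors.
double euclidean(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Hypothetical function: repeatedly merge the closest pair of clusters
// until at most k clusters remain (k = 1 gives the full hierarchy).
std::vector<Point> agglomerate(std::vector<Point> clusters, std::size_t k) {
    while (clusters.size() > k && clusters.size() > 1) {
        // step (2): find the pair with the smallest distance
        std::size_t bi = 0, bj = 1;
        double best = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i < clusters.size(); ++i) {
            for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                double d = euclidean(clusters[i], clusters[j]);
                if (d < best) { best = d; bi = i; bj = j; }
            }
        }
        // step (3): the new cluster is the element-wise average of the pair
        Point merged(clusters[bi].size());
        for (std::size_t t = 0; t < merged.size(); ++t)
            merged[t] = (clusters[bi][t] + clusters[bj][t]) / 2.0;
        // erase the larger index first so the smaller one stays valid
        clusters.erase(clusters.begin() + bj);
        clusters.erase(clusters.begin() + bi);
        clusters.push_back(merged);
    }
    return clusters;  // step (4) is the enclosing while loop
}
```

For example, calling `agglomerate` on the one-dimensional samples 0, 1, 10 and 11 with `k = 2` first merges 0 and 1 into 0.5, then 10 and 11 into 10.5, and stops with those two clusters.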

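Related questions: (1) Which distance should be used to measure how far apart two clusters are? "Programming Collective Intelligence" uses the Pearson correlation; you could also simply use the Euclidean distance.

                   (2) How should two clusters be merged, that is, how are the attribute values of the new cluster represented? Here the new cluster is represented by the average of the two merged clusters.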
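
For reference, the two choices above can be written out as formulas. This is the standard Pearson correlation in the sum form that the `distPearson` function in the listing computes; the symbols x, y (the attribute vectors of the two clusters), n (their length) and c (the merged cluster) are notation introduced only for this illustration.

```latex
r(x, y) = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}
               {\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)
                      \left(\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right)}},
\qquad d(x, y) = 1 - r(x, y)

c_i = \frac{x_i + y_i}{2}, \quad i = 1, \dots, n
```

Since r lies in [-1, 1], the distance d lies in [0, 2]: strongly correlated clusters get a distance near 0, so they are merged first.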

******************************************************************************************************************

/**
 ** Hierarchical Clustering Algorithm
 ** Steps:  (1) First, regard each sample as a cluster.
 **         (2) Each time, merge the two clusters with the smallest distance between them.
 **         (3) Add the new cluster to the cluster set, and delete the two merged clusters from it.
 ** Method: (1) When merging, replace the two old clusters with their average.
 **         (2) Measure the distance with the Pearson correlation.
 ** Time: 2013/7/10
 **/

#include <iostream>
#include <map>
#include <vector>
#include <string>
#include <fstream>
#include <cstring>
#include <sstream>
#include <cmath>
#include <iterator>
using namespace std;

//a cluster
typedef struct bicluster{
    vector<double> attri;//attribute values
    int  cid;//cluster id
}Bicluster;

//a pair of clusters and the distance between them
typedef struct lowpair{
    int leftid;
    int rightid;
    double dist;
}Lpair;

/*****************************************************************
** convert a string (char*) to double (or another type)
** <sstream> must be included before using stringstream
******************************************************************/
double str2double(const char* str){
    stringstream ss;
    ss << str;
    double tmp = 0.0;//stays 0 if the text is not a number
    ss >> tmp;
    return tmp;
}

/*****************************************************************
** split a delimited line: the first field is returned as the
** row label, the remaining fields are appended to dvec as doubles
** (only the first character of tok is used as the delimiter)
******************************************************************/
string split(string &str, vector<double>& dvec, const char* tok){
    stringstream ss(str);
    string name, field;
    getline(ss, name, tok[0]);
    while( getline(ss, field, tok[0]) ){
        if( !field.empty() )//skip empty fields, as strtok would
            dvec.push_back(str2double(field.c_str()));
    }
    return name;
}

/******************************************************************
** read data from 'blogdata.txt'
** @is ------- a reference to an ifstream object (input)
** @mydata --- a map used to store the data (output)
******************************************************************/
bool readfile(ifstream &is, map<string, vector<double> >& mydata){
    if( is.fail() ){
        cerr << "can't open the file !!!" << endl;
        return false;
    }
    //ignore the first line of the file
    string str;
    getline(is, str);
    //store the data read from the file into mydata
    //(loop on getline itself: testing eof() first would process a bogus last line)
    while( getline(is, str) ){
        if( str.empty() )
            continue;//skip blank lines, e.g. at the end of the file
        vector<double> dtmp;
        string tmp = split(str, dtmp, "\t");
        mydata.insert(pair<string,vector<double> >(tmp, dtmp));
    }
    return true;
}

/*****************************************************************
** compute the distance between two clusters
** Note that the Pearson value measures the similarity between
    two clusters, that is, the greater the Pearson value, the
    lower the distance between them.
*****************************************************************/
double distPearson(vector<double>& left, vector<double>& right){
    double sum1 = 0;
    double sum2 = 0;
    int len = left.size();
    for(int i=0; i<len; ++i){
        sum1 += left[i];
        sum2 += right[i];
    }
    /**
    ** if this feels too complex,
    **  Pearson could be replaced with the Euclidean distance here
    **/
    double sum1Sq = 0;
    double sum2Sq = 0;
    for(int j=0; j<len; ++j){
        sum1Sq += pow(left[j], 2);
        sum2Sq += pow(right[j], 2);
    }
    double pSum = 0, num, den;
    for(int k=0; k<len; ++k)
        pSum += left[k]*right[k];
    num = pSum - sum1*sum2 / len;
    //the second factor must use sum2Sq (the sum of squares of 'right')
    den = sqrt((sum1Sq - pow(sum1,2)/len) * (sum2Sq - pow(sum2,2)/len));
    if( den == 0 )
        return 0;
    return 1.0 - num/den;
}

/*************************************************************
** Given two clusters, check whether the distance between them
    has already been computed before computing it again.
**************************************************************/
bool isExist(vector<Lpair> &lp, int leftid, int rightid, double &d){
    vector<Lpair>::iterator it = lp.begin();
    for(; it!=lp.end(); ++it){
        if( (it->leftid==leftid) && (it->rightid==rightid) ){
            d = it->dist;//if the distance has been computed, assign its value to d
            return true;
        }
    }
    d = 0;
    return false;
}

/*************************************************************
** Given a cluster's id, delete the cluster from the cluster set
**************************************************************/
void Del(vector<Bicluster> &cvec, int clusterid){
    vector<Bicluster>::iterator it = cvec.begin();
    for(; it!=cvec.end(); ++it){
        if( it->cid == clusterid )
            break;
    }
    if( it != cvec.end() )//erasing end() would be undefined behavior
        cvec.erase(it);
}

/*************************************************************
** Hierarchical Cluster Algorithm
**************************************************************/
void HierarchicalCluster(map<string, vector<double> > &mydata){
    vector<Lpair> distances;//cache of the distances computed so far
    //firstly, regard each sample as a cluster
    vector<Bicluster> cvec;
    map<string, vector<double> >::iterator it = mydata.begin();
    int myid = 0;
    for(; it!= mydata.end(); ++it){
        Bicluster btmp;
        btmp.attri = it->second;
        btmp.cid = myid++;//original samples get ids 0, 1, 2, ...
        cvec.push_back(btmp);
    }
    myid = -1;//merged clusters get ids -1, -2, -3, ...
    //search for the closest pair until a single cluster is left
    while( cvec.size() > 1 ){
        Lpair lowp;
        double closedis = distPearson(cvec[0].attri,cvec[1].attri);
        lowp.leftid = cvec[0].cid, lowp.rightid = cvec[1].cid;
        lowp.dist = closedis;
        int leftps = 0, rightps = 1;
        for(int ix=0; ix<(int)cvec.size(); ++ix){
            for(int iy=ix+1; iy<(int)cvec.size(); ++iy){
                double d;
                int lid = cvec[ix].cid, rid = cvec[iy].cid;
                if( !isExist(distances,lid,rid,d) ){
                    Lpair lptmp;
                    lptmp.dist = distPearson(cvec[ix].attri, cvec[iy].attri);
                    lptmp.leftid = lid;
                    lptmp.rightid= rid;
                    distances.push_back(lptmp);
                    d = lptmp.dist;
                }
                if( d < lowp.dist ){
                    lowp.leftid = lid;
                    lowp.rightid = rid;
                    leftps = ix;
                    rightps = iy;
                    lowp.dist = d;
                }
            }
        }
        //create a new cluster as the average of the closest pair
        Bicluster ncluster;
        for(int i=0; i<(int)cvec[0].attri.size(); ++i){
            double av;
            av = (cvec[leftps].attri[i] + cvec[rightps].attri[i]) / 2.0;
            ncluster.attri.push_back(av);
        }
        ncluster.cid = myid--;//assign a negative id to the new cluster
        cout << "leftid: " << lowp.leftid <<  ", rightid: " << lowp.rightid << endl;
        //delete the pair
        Del(cvec, lowp.leftid);
        Del(cvec, lowp.rightid);
        cvec.push_back(ncluster);
    }
}

int main()
{
    ifstream is("blogdata.txt");
    if( is.fail() ){
        cerr << "error!!!" << endl;
        return -1;
    }
    map<string, vector<double> > mydata;
    if(readfile(is, mydata))
        HierarchicalCluster(mydata);
    return 0;
}

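The code is a bit messy and complicated, and the final output is not a dendrogram (which is easy to produce in Python); it simply prints the ids of the two clusters merged at each step. The data used by the code can be downloaded from http://kiwitobes.com/clusters/blog.txt (the program reads it from a file named blogdata.txt).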
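
One possible way to get a tree-like view in plain text is to record every merge as a node with two children and print the result with indentation. The sketch below is only an illustration under that assumption: `Node`, `addLeaf`, `addMerge`, `printTree` and `renderTree` are hypothetical names that do not appear in the code above, and a branch is shown as a single `-`.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tree node: a leaf carries a label, a merged cluster two children.
struct Node {
    std::string label;  // sample name for a leaf, empty for a merged cluster
    int left;           // index of the left child in the node vector, -1 for a leaf
    int right;          // index of the right child in the node vector, -1 for a leaf
};

// Append a leaf and return its index.
int addLeaf(std::vector<Node>& nodes, const std::string& label) {
    Node n;
    n.label = label;
    n.left = -1;
    n.right = -1;
    nodes.push_back(n);
    return static_cast<int>(nodes.size()) - 1;
}

// Append a merged cluster whose children are two existing nodes; return its index.
int addMerge(std::vector<Node>& nodes, int left, int right) {
    assert(left >= 0 && right >= 0);
    Node n;
    n.left = left;
    n.right = right;
    nodes.push_back(n);
    return static_cast<int>(nodes.size()) - 1;
}

// Write the subtree rooted at `root`, indenting each level by two spaces.
void printTree(const std::vector<Node>& nodes, int root, std::ostream& os, int depth = 0) {
    os << std::string(depth * 2, ' ');
    const Node& n = nodes[root];
    if (n.left < 0) {  // leaf: print its label
        os << n.label << '\n';
        return;
    }
    os << "-\n";       // branch: print a marker, then both children one level deeper
    printTree(nodes, n.left, os, depth + 1);
    printTree(nodes, n.right, os, depth + 1);
}

// Convenience wrapper that returns the rendered tree as a string.
std::string renderTree(const std::vector<Node>& nodes, int root) {
    std::ostringstream os;
    printTree(nodes, root, os);
    return os.str();
}
```

To use something like this with the listing above, one would add a leaf per sample at the start and call `addMerge` each time a pair is merged, keeping track of which node index corresponds to which cluster id; the last node created is the root passed to `printTree(nodes, root, std::cout)`.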
