Weka: calling the EM algorithm to perform clustering (EM algorithm)
The EM algorithm
In Eclipse, write the code that reads the data file and then calls the EM algorithm to compute and print the clustering result:
package EMAlg;

import java.io.BufferedReader;
import java.io.FileReader;

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.EM;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class EMAlg {

    public EMAlg() {
        System.out.println("this is the EMAlg");
    }

    public static void main(String[] args) throws Exception {
        // Load the ARFF file shipped with the Weka installation.
        String file = "C:/Program Files/DataMining/Weka-3-6-10/data/labor.arff";
        BufferedReader reader = new BufferedReader(new FileReader(file));
        Instances data = new Instances(reader);
        reader.close();

        // Use the last attribute as the class attribute.
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("data.classIndex() = " + data.classIndex());
        System.out.println("Number of attributes in the loaded data: " + data.numAttributes());

        // Remove the class attribute before clustering.
        Remove filter = new Remove();
        filter.setAttributeIndices("" + (data.classIndex() + 1));
        /*
         * setAttributeIndices(String rangeList)
         * Sets which attributes are to be deleted (or kept if invert is true).
         * Parameters:
         *   rangeList - a string representing the list of attributes. Since the string
         *               will typically come from a user, attributes are indexed from 1,
         *               e.g. first-3,5,6-last
         */
        filter.setInputFormat(data);
        /*
         * public boolean setInputFormat(Instances instanceInfo) throws java.lang.Exception
         * Sets the format of the input instances. If the filter is able to determine the
         * output format before seeing any input instances, it does so here. This default
         * implementation clears the output format and output queue, and the new batch flag
         * is set. Overriders should call super.setInputFormat(Instances).
         * Parameters:
         *   instanceInfo - an Instances object containing the input instance structure
         *                  (any instances contained in the object are ignored - only the
         *                  structure is required).
         * Returns:
         *   true if the outputFormat may be collected immediately
         * Throws:
         *   java.lang.Exception - if the inputFormat can't be set successfully
         */
        Instances dataCluster = Filter.useFilter(data, filter);
        /*
         * public static Instances useFilter(Instances data, Filter filter) throws java.lang.Exception
         * Filters an entire set of instances through a filter and returns the new set.
         * It takes the data to be filtered and the filter to apply, and returns the new data set.
         * Parameters:
         *   data - the data to be filtered
         *   filter - the filter to be used
         * Returns:
         *   the filtered set of data
         * Throws:
         *   java.lang.Exception - if the filter can't be used successfully
         */

        EM clusterer = new EM();
        /*
         * public class EM extends RandomizableDensityBasedClusterer
         *   implements NumberOfClustersRequestable, WeightedInstancesHandler
         *
         * Simple EM (expectation maximisation) class. EM assigns a probability distribution
         * to each instance which indicates the probability of it belonging to each of the
         * clusters. EM can decide how many clusters to create by cross validation, or you
         * may specify apriori how many clusters to generate.
         *
         * The cross validation performed to determine the number of clusters is done in the
         * following steps:
         *   1. the number of clusters is set to 1
         *   2. the training set is split randomly into 10 folds
         *   3. EM is performed 10 times using the 10 folds the usual CV way
         *   4. the loglikelihood is averaged over all 10 results
         *   5. if the loglikelihood has increased, the number of clusters is increased by 1
         *      and the program continues at step 2
         * The number of folds is fixed to 10, as long as the number of instances in the
         * training set is not smaller than 10. If it is, the number of folds is set equal
         * to the number of instances.
         *
         * Valid options are:
         *   -N <num>  number of clusters. If omitted or -1 is specified, cross validation
         *             is used to select the number of clusters.
         *   -I <num>  max iterations (default 100)
         *   -V        verbose
         *   -M <num>  minimum allowable standard deviation for normal density computation
         *             (default 1e-6)
         *   -O        display model in old format (good when there are many clusters)
         *   -S <num>  random number seed (default 100)
         */
        String[] options = new String[4];
        options[0] = "-I";   // maximum number of iterations
        options[1] = "100";
        options[2] = "-N";   // number of clusters
        options[3] = "2";
        clusterer.setOptions(options);
        clusterer.buildClusterer(dataCluster);

        // Evaluate the clusterer. The evaluation data still contains the class attribute,
        // so ClusterEvaluation can perform a classes-to-clusters comparison.
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(clusterer);
        eval.evaluateClusterer(data);

        // Print the results.
        System.out.println("Number of instances: " + data.numInstances()
                + ", number of attributes: " + data.numAttributes());
        System.out.println(eval.clusterResultsToString());
    }
}
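Beyond the aggregated report from ClusterEvaluation, the trained model can also be queried instance by instance. The snippet below is a minimal sketch that would sit at the end of main() above; it reuses the clusterer and dataCluster variables from the listing, and clusterInstance()/distributionForInstance() are part of the standard Weka clusterer API. (The -I/-N options could equally be set through the setters setMaxIterations(100) and setNumClusters(2) on the EM object.)

        // Per-instance view of the clustering: the hard assignment plus the
        // cluster-membership probabilities for every training instance.
        for (int i = 0; i < dataCluster.numInstances(); i++) {
            int assigned = clusterer.clusterInstance(dataCluster.instance(i));
            double[] membership = clusterer.distributionForInstance(dataCluster.instance(i));
            System.out.println("instance " + i + " -> cluster " + assigned
                    + "  " + java.util.Arrays.toString(membership));
        }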
The data used is the labor.arff file in the data folder of the Weka installation directory.
The output is:
data.classIndex() = 16
Number of attributes in the loaded data: 17
Number of instances: 57, number of attributes: 17

EM
==

Number of clusters: 2


                                  Cluster
Attribute                               0        1
                                   (0.14)   (0.86)
==================================================
duration
  mean                             1.5702   2.2532
  std. dev.                        0.4953   0.6764

wage-increase-first-year
  mean                             3.0708   3.9184
  std. dev.                        1.0028   1.3571

wage-increase-second-year
  mean                             3.8141   3.9964
  std. dev.                        0.8153   1.0624

wage-increase-third-year
  mean                             3.9133   3.9133
  std. dev.                        0.6522   0.6952

cost-of-living-adjustment
  none                             7.0614  36.9386
  tcf                              1.3707   8.6293
  tc                               2.2872   6.7128
  [total]                         10.7192  52.2808

working-hours
  mean                            39.4412  37.8196
  std. dev.                        0.8911   2.4268

pension
  none                             6.4515   6.5485
  ret_allw                         2.3211   3.6789
  empl_contr                       1.9466  42.0534
  [total]                         10.7192  52.2808

standby-pay
  mean                             6.7945   7.5462
  std. dev.                        1.4912   1.918

shift-differential
  mean                             3.4074   5.1002
  std. dev.                        1.6629   3.4277

education-allowance
  yes                              3.167    8.833
  no                               6.5522  42.4478
  [total]                          9.7192  51.2808

statutory-holidays
  mean                            10.555   11.1788
  std. dev.                        0.572    1.2533

vacation
  below_average                    4.657   21.343
  average                          4.0313  14.9687
  generous                         2.0309  15.9691
  [total]                         10.7192  52.2808

longterm-disability-assistance
  yes                              2.9977  48.0023
  no                               6.7215   3.2785
  [total]                          9.7192  51.2808

contribution-to-dental-plan
  none                             7.1218   3.8782
  half                             2.5419  34.4581
  full                             1.0556  13.9444
  [total]                         10.7192  52.2808

bereavement-assistance
  yes                              5.7192  50.2808
  no                               4        1
  [total]                          9.7192  51.2808

contribution-to-health-plan
  none                             6.2887   3.7113
  half                             1.8752   9.1248
  full                             2.5554  39.4446
  [total]                         10.7192  52.2808


Clustered Instances

0        8 ( 14%)
1       49 ( 86%)


Log likelihood: -18.37167


Class attribute: class
Classes to Clusters:

  0  1  <-- assigned to cluster
  8 12 | bad
  0 37 | good

Cluster 0 <-- bad
Cluster 1 <-- good

Incorrectly clustered instances :  12.0   21.0526 %
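Reading the classes-to-clusters table: cluster 0 is mapped to class bad and cluster 1 to class good, so the only mis-assignments are the 12 bad contracts that fell into cluster 1, giving 12/57 ≈ 21.05 % incorrectly clustered instances. Individual figures can also be read from the ClusterEvaluation object instead of parsing the text report; the following is a small sketch that reuses the eval object from the listing above (getLogLikelihood(), getNumClusters() and getClusterAssignments() are accessors of ClusterEvaluation):

        // Pull selected statistics from the evaluation object directly.
        System.out.println("log likelihood     = " + eval.getLogLikelihood());
        System.out.println("number of clusters = " + eval.getNumClusters());
        double[] assignments = eval.getClusterAssignments(); // one entry per evaluated instance
        System.out.println("first instance is in cluster " + (int) assignments[0]);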