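Weka Data Mining Gleanings (1) ---- Generating ARFF Files

I. What is an ARFF file

  1. ARFF stands for Attribute-Relation File Format, and the name already hints at what the file holds: a relation described by its attributes. It is the file format used by Weka, the open-source data mining toolkit. Because Weka is an excellent and widely used project, its storage format has spread along with it.

  2. Below is an example ARFF file that ships with Weka (weather.arff):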

 @relation weather

 @attribute outlook {sunny, overcast, rainy}
 @attribute temperature real
 @attribute humidity real
 @attribute windy {TRUE, FALSE}
 @attribute play {yes, no}

 @data
 sunny,85,85,FALSE,no
 sunny,80,90,TRUE,no
 overcast,83,86,FALSE,yes
 rainy,70,96,FALSE,yes
 rainy,68,80,FALSE,yes
 rainy,65,70,TRUE,no
 overcast,64,65,TRUE,yes
 sunny,72,95,FALSE,no
 sunny,69,70,FALSE,yes
 rainy,75,80,FALSE,yes
 sunny,75,70,TRUE,yes
 overcast,72,90,TRUE,yes
 overcast,81,75,FALSE,yes
 rainy,71,91,TRUE,no

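    a) Line 1 is the relation name. You can choose it freely, but a meaningful name is best.

    b) Lines 3~7 are the attribute (feature) list. The first column is the @attribute keyword, which is mandatory; the second column is the attribute name; the third column is the attribute type or, for nominal attributes, the set of allowed values.

    c) @data (line 9) marks the start of the data section. Everything below it is data, and each line represents one instance.

    d) The data section in this example uses the most basic (dense) representation. In practice, especially with large feature sets such as text, the sparse representation is usually used.

    e) The ARFF format is not explained any further here. If something is unclear, feel free to leave me a comment.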
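Point d) mentions the sparse representation without showing it. As a rough sketch (the class and method names below are made up for illustration and are not part of Weka): a sparse ARFF data row is written as `{index value, ...}` with 0-based attribute indices, and entries whose value is 0 are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: formats one numeric row as a dense or sparse ARFF data line. */
final class ArffRowSketch {

    /** Prints whole numbers without a trailing ".0", other values as Java prints them. */
    static String fmt(double v) {
        boolean whole = !Double.isInfinite(v) && v == Math.rint(v);
        return whole ? Long.toString((long) v) : Double.toString(v);
    }

    /** Dense form: every value, comma separated. */
    static String denseRow(double[] vals) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < vals.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(fmt(vals[i]));
        }
        return sb.toString();
    }

    /** Sparse form: only non-zero values, each prefixed by its 0-based attribute index. */
    static String sparseRow(double[] vals) {
        List<String> parts = new ArrayList<String>();
        for (int i = 0; i < vals.length; i++) {
            if (vals[i] != 0.0) parts.add(i + " " + fmt(vals[i]));
        }
        return "{" + String.join(", ", parts) + "}";
    }

    public static void main(String[] args) {
        double[] row = {0, 2, 0, 0, 1};
        System.out.println(denseRow(row));
        System.out.println(sparseRow(row));
    }
}
```

One thing to keep in mind: values stored internally as 0 are omitted, which for a nominal attribute means its first declared label. A sparse row that seems to lack a class value therefore usually carries the first label.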

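II. Overall approach

  1. Generate the feature file

  2. Convert the file format

III. Implementation

  1. Feature generation

    Here we assume the features have already been generated. (I will cover feature selection in a separate article.)

  2. Generating the ARFF file

    a) The file generation interface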

 package com.lvxinjian.alg.models.generatefile;

 /**
  * @Description : interface for generating a file in a given format
  * @author : Lv Xinjian
  */
 public interface GenerateFile {

     /**
      * @function generates the file
      * @param obj input data
      * @param option options
      * @return whether the file was generated successfully
      */
     boolean GenerFile(Object obj, String option);
 }

    b) The main class that generates the ARFF file

 package com.lvxinjian.alg.models.generatefile;

 import java.io.IOException;
 import java.nio.charset.Charset;
 import java.util.ArrayList;
 import java.util.HashSet;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
 import java.util.concurrent.Future;

 import weka.core.FastVector;
 import weka.core.Instance;
 import weka.core.Instances;

 import com.iminer.alg.models.sampling.SVMSampleBean;
 import com.iminer.alg.models.sampling.SampleBean;
 import com.iminer.alg.models.sampling.SampleUtils;
 import com.iminer.tool.common.util.FileTool;

 /**
  * @Description : generates a file in ARFF format
  * @author : Lv Xinjian
  */
 public class GenerateArffFile implements GenerateFile {

     /**
      * The data set (header and instances) that is written out as the ARFF file
      */
     private static Instances data = null;

     /**
      * The class attribute and its labels
      */
     private ClassifyAttribute classifyAttribute = new ClassifyAttribute();

     /**
      * Relation name used in the ARFF header
      */
     public final String InstanceName = "MyRelation";

     /**
      * Instance extraction method; "one" by default
      */
     private String getInstancesMothed = "one";

     /**
      * The converted SVM sample data
      */
     private ArrayList<SVMSampleBean> listSVMBean = new ArrayList<SVMSampleBean>();

     /**
      * Path of the lexicon used when generating the ARFF file
      */
     private String lexPath = null;

     /**
      * Path where the ARFF file is saved
      */
     private String outputPath = null;

     /**
      * Number of worker threads; 5 by default
      */
     private int threadNum = 5;

     @Override
     public boolean GenerFile(Object obj, String option) {
         String[] paramArray = option.split(" ");
         return GenerFile(obj, paramArray);
     }

     /**
      * Main entry point for generating the ARFF file
      */
     public boolean GenerFile(Object obj, String[] paramArray) {
         try {
             if (!Initialization(obj, paramArray))
                 return false;
             GenerateData();
         } catch (IOException e) {
             e.printStackTrace();
             return false;
         }
         return true;
     }

     /**
      * @function parses the command line and initializes the parameters
      * @param obj the input data
      * @param paramArrayOfString the command line
      * @return whether initialization succeeded, true: success    false: failure
      */
     private boolean Initialization(Object obj, String[] paramArrayOfString) {
         try {
             ArrayList<?> list = (ArrayList<?>) obj;
             for (Object s : list) {
                 listSVMBean.add((SVMSampleBean) s);
             }
             // Lexicon path
             String lexPath = ParameterUtils.getOption("lexPath", paramArrayOfString);
             if (lexPath.length() != 0)
                 this.lexPath = lexPath;
             // Path where the ARFF file is saved
             String outputPath = ParameterUtils.getOption("output", paramArrayOfString);
             if (outputPath.length() != 0)
                 this.outputPath = outputPath;
             // Instance extraction method, "one" by default
             String getInstanceMothed = ParameterUtils.getOption("mothed", paramArrayOfString);
             if (getInstanceMothed.length() != 0)
                 this.getInstancesMothed = getInstanceMothed;
             // Number of worker threads
             String threadNum = ParameterUtils.getOption("thread", paramArrayOfString);
             if (threadNum.length() != 0) {
                 this.threadNum = Integer.parseInt(threadNum);
                 System.out.println("thread count: " + this.threadNum);
             } else {
                 System.out.println("using the default thread count: 5");
             }
             // Class attribute name (stays "CLASS" when the option is not given)
             String className = ParameterUtils.getOption("class", paramArrayOfString);
             if (className.length() != 0)
                 classifyAttribute.setClassname(className);
             // Class labels, required
             String labels = ParameterUtils.getOption("label", paramArrayOfString);
             if (labels.length() != 0) {
                 classifyAttribute.setClassLabel(labels.split(","));
             } else {
                 System.out.println("please Initialize classify labels!");
                 return false;
             }
         } catch (Exception e) {
             e.printStackTrace();
             // A failed cast or a malformed option means initialization failed
             return false;
         }
         return true;
     }

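     /**
      * @function builds the instances and writes the ARFF file
      * @throws IOException
      */
     public void GenerateData() throws IOException {
         ArrayList<String> words = new ArrayList<String>();
         FastVector atts = ArffFileUtils.GetAttributes(this.lexPath, words, this.classifyAttribute);
         data = new Instances(this.InstanceName, atts, 0);
         ExecutorService service = Executors.newFixedThreadPool(this.threadNum);
         // Submit every sample first so the pool can work on several at once.
         // Calling get() right after each submit() would process them one by one.
         // (This assumes the instance extraction method is thread-safe.)
         ArrayList<Future<Instance>> futures = new ArrayList<Future<Instance>>();
         for (SVMSampleBean it : listSVMBean) {
             // Index of the class label within the class attribute
             double labelVal = data.attribute(classifyAttribute.getClassname()).indexOfValue(it.getLabel());
             MyCallableClass task = new MyCallableClass(labelVal, it.getContent(), words, this.getInstancesMothed);
             futures.add(service.submit(task));
         }
         service.shutdown();
         // Collect the results in submission order
         int count = 0;
         for (Future<Instance> is : futures) {
             try {
                 Instance instance = is.get();
                 if (instance != null)
                     data.add(instance);
             } catch (Exception e) {
                 e.printStackTrace();
             }
             if (++count % 1000 == 0)
                 System.out.println("processed " + count + " records");
         }
         // Save the data; report a failed write to the caller
         if (!ArffFileUtils.savaInstances(data, this.outputPath))
             throw new IOException("Failed to write " + this.outputPath);
         // Clear the instances held in data
         data.delete();
     }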

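     /**
      * @function convenience method: generates an ARFF file from raw sample lines
      * @param lstContent raw sample lines (tab-separated)
      * @param lexPath path of the lexicon
      * @param labels comma-separated class labels; if null, the labels found in the samples are used
      * @param outputPath path where the ARFF file is saved
      * @return whether the file was generated successfully
      */
     public static boolean generateArff(ArrayList<String> lstContent, String lexPath, String labels, String outputPath) {
         try {
             ArrayList<SampleBean> list = new ArrayList<SampleBean>();
             HashSet<String> lstLabel = new HashSet<String>();
             for (String str : lstContent) {
                 SVMSampleBean svmBean = new SVMSampleBean(str.replace("\t", SampleUtils.SPECIAL_CHAR));
                 list.add(svmBean);
                 lstLabel.add(svmBean.label);
             }
             if (labels == null) {
                 // Join the collected labels with commas
                 // (the original compared strings with !=, which checks identity, not content)
                 StringBuilder sb = new StringBuilder();
                 for (String label : lstLabel) {
                     if (sb.length() != 0)
                         sb.append(",");
                     sb.append(label);
                 }
                 labels = sb.toString();
             }
             String[] options = new String[]{"-lexPath", lexPath,
                                             "-output", outputPath,
                                             "-mothed", "one",
                                             "-class", "CLASS",
                                             "-label", labels};
             GenerateArffFile genArff = new GenerateArffFile();
             // Pass on the real result instead of always returning true
             return genArff.GenerFile(list, options);
         } catch (Exception e) {
             e.printStackTrace();
             return false;
         }
     }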

     public static void main(String[] args) {
         try {
             String root = "C:\\Users\\Administrator\\Desktop\\12_05\\模型训练\\1219\\";
             ArrayList<SampleBean> list = new ArrayList<SampleBean>();
             ArrayList<String> lstContent = FileTool.LoadListFromFile(root + "不重合的部分.23.txt", 0, Charset.forName("utf8"));
             for (String str : lstContent) {
                 SVMSampleBean svmBean = new SVMSampleBean(str.replace("\t", SampleUtils.SPECIAL_CHAR));
                 list.add(svmBean);
             }
 //        DFFeatureSelector selector = new DFFeatureSelector();
 //        String options = "-maxFeatureNum 1000 -outputPath "+root + "lex.txt";
 //        selector.selectFeature(list, options);
 //        options = "-lexPath "+root +"lex.txt -output genArffTest.arff -mothed one -class CLASS -label -1,1,2";
             String[] options = new String[]{"-lexPath", root + "temp/temp.lex",
                                             "-output", root + "不重合的部分.23.arff",
                                             "-mothed", "one",
                                             "-class", "CLASS",
                                             "-label", "2,1,-1",
                                             "-thread", "10"};
             GenerateArffFile genArff = new GenerateArffFile();
             genArff.GenerFile(list, options);
             CalcAttributeFromArffFile.calcAttribute(root + "不重合的部分.23.arff", root + "calc.txt");
         } catch (Exception e) {
             e.printStackTrace();
         }
     }
 }

    c) The command-line parsing utility

 package com.lvxinjian.alg.models.generatefile;

 /**
  * @Description : extracts options from a parameter array
  * @author : Lv Xinjian
  */
 public class ParameterUtils {

     static public void main(String[] args) {
         String option = "-min 1 -max 100 -w 1.2,1.3  -filter  g";
         try {
             boolean r = getFlag("filter", option.split(" "));
             System.out.println(r);
         } catch (Exception e) {
             e.printStackTrace();
         }
     }

     /**
      * @function checks for a boolean flag given as a single character
      * @param flag the flag character, without the leading '-'
      * @param options the option array
      * @return true if the flag is present
      * @throws Exception not thrown by the current implementation
      */
     public static boolean getFlag(char flag, String[] options) throws Exception
     {
       return getFlag("" + flag, options);
     }

     /**
      * @function checks for a boolean flag given as a string; a flag that is found is blanked out in the array
      * @param flag the flag name, without the leading '-'
      * @param options the option array
      * @return true if the flag is present
      * @throws Exception not thrown by the current implementation
      */
     public static boolean getFlag(String flag, String[] options) throws Exception
     {
       int pos = getOptionPos(flag, options);
       if (pos > -1) {
         options[pos] = "";
       }
       return (pos > -1);
     }

     /**
      * @function extracts the string value that follows a flag; the flag and its value are blanked out in the array
      * @param flag the flag name, without the leading '-'
      * @param options the option array
      * @return the value, or "" if the flag is not present
      * @throws Exception if the flag is the last entry and has no value
      */
     public static String getOption(String flag, String[] options) throws Exception {
         int i = getOptionPos(flag, options);
         if (i > -1) {
             if (i + 1 == options.length) {
                 throw new Exception("No value given for -" + flag
                         + " option.");
             }
             options[i] = "";
             String newString = options[i + 1];
             options[i + 1] = "";
             return newString;
         }
         return "";
     }

     /**
      * @function finds the position of a flag given as a single character
      * @param flag the flag character, without the leading '-'
      * @param options the option array
      * @return the index of the flag, or -1 if it is not present
      */
     public static int getOptionPos(char flag, String[] options)
     {
       return getOptionPos("" + flag, options);
     }

     /**
      * @function finds the position of a flag given as a string
      * @param flag the flag name, without the leading '-'
      * @param options the option array
      * @return the index of the flag, or -1 if it is not present
      */
     public static int getOptionPos(String flag, String[] options) {
         if (options == null) {
             return -1;
         }
         for (int i = 0; i < options.length; ++i) {
             if ((options[i].length() <= 0) || (options[i].charAt(0) != '-'))
                 continue;
             // Entries such as "-1" are negative numbers (values), not flags
             try {
                 Double.valueOf(options[i]);
             } catch (NumberFormatException e) {
                 if (options[i].equals("-" + flag)) {
                     return i;
                 }
                 // An entry starting with "--" ends the search;
                 // the length check keeps a lone "-" from causing an index error
                 if (options[i].length() > 1 && options[i].charAt(1) == '-') {
                     return -1;
                 }
             }
         }
         return -1;
     }

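     /**
      * @function joins an array into a single string
      * @param array the array to join
      * @param splitPunc the separator
      * @return the joined string, or null on error
      */
     public static String array2String(String[] array, String splitPunc) {
         try {
             StringBuilder sb = new StringBuilder();
             for (String para : array) {
                 if (sb.length() != 0)
                     sb.append(splitPunc);
                 sb.append(para);
             }
             return sb.toString();
         } catch (Exception e) {
             e.printStackTrace();
             return null;
         }
     }

     /**
      * @function splits a string into an array
      * @param para the string to split
      * @param splitPunc the separator (interpreted as a regular expression by String.split)
      * @return the parts, or null on error
      */
     public static String[] String2Array(String para, String splitPunc) {
         try {
             return para.split(splitPunc);
         } catch (Exception e) {
             e.printStackTrace();
             return null;
         }
     }

     /**
      * @function replaces the value of one parameter in a parameter string
      * @param para the parameter string
      * @param splitPunc the parameter separator
      * @param key the parameter name
      * @param value the new parameter value
      * @return the updated parameter string, or null on error
      */
     public static String replacePara(String para, String splitPunc, String key, String value) {
         try {
             String[] parameter = String2Array(para, splitPunc);
             replacePara(parameter, key, value);
             return array2String(parameter, splitPunc);
         } catch (Exception e) {
             e.printStackTrace();
             return null;
         }
     }

     /**
      * @function replaces the value of one parameter in a parameter array
      * @param para the parameter array
      * @param key the parameter name
      * @param value the new parameter value
      * @return the updated array, or null on error
      */
     public static String[] replacePara(String[] para, String key, String value) {
         try {
             for (int i = 0; i < para.length; i++) {
                 String paraName = para[i];
                 // Note: this is a substring match, so "out" also matches "-output"
                 if (paraName.contains(key)) {
                     int nextIndex = i + 1;
                     if (nextIndex < para.length) {
                         // Only overwrite a value, never another flag
                         // (values that contain '-' are therefore left untouched)
                         if (!para[nextIndex].contains("-"))
                             para[nextIndex] = value;
                     }
                 }
             }
             return para;
         } catch (Exception e) {
             e.printStackTrace();
             return null;
         }
     }
 }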
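To make the "-flag value" convention easier to follow, here is a reduced, self-contained sketch. `OptionSketch` is a hypothetical stand-in written only for this illustration, not the `ParameterUtils` class itself; it keeps just the behavior that matters to callers: a found value is returned and both entries are blanked out, and a missing flag yields an empty string.

```java
/** Hypothetical, reduced version of the "-flag value" lookup, for illustration only. */
final class OptionSketch {

    /** Returns the value after "-flag" and blanks both entries; "" if the flag is absent. */
    static String getOption(String flag, String[] options) {
        for (int i = 0; i < options.length; i++) {
            if (options[i].equals("-" + flag)) {
                if (i + 1 == options.length) {
                    throw new IllegalArgumentException("No value given for -" + flag + " option.");
                }
                String value = options[i + 1];
                options[i] = "";     // consume the flag
                options[i + 1] = ""; // consume its value
                return value;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        String[] options = "-lexPath lex.txt -output out.arff -label 2,1,-1".split(" ");
        System.out.println(getOption("output", options));
        // Asking again finds nothing, because the entries were blanked out
        System.out.println(getOption("output", options).isEmpty());
        // The label list is split by the caller, as in Initialization
        System.out.println(getOption("label", options).split(",").length);
    }
}
```

Because consumed entries are blanked, each option can be read only once per array. That is why the same array should not be reused for a second round of parsing.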

    d) Helper classes for generating the ARFF file

 package com.lvxinjian.alg.models.generatefile;

 import java.io.BufferedWriter;
 import java.io.FileOutputStream;
 import java.io.IOException;
 import java.io.OutputStreamWriter;
 import java.nio.charset.Charset;
 import java.util.HashMap;
 import java.util.List;
 import java.util.concurrent.Callable;

 import weka.core.Attribute;
 import weka.core.FastVector;
 import weka.core.Instance;
 import weka.core.Instances;
 import weka.core.SparseInstance;

 import com.iminer.alg.models.getInstance.InstanceFactory;
 import com.iminer.tool.common.util.FileTool;

 /**
  * @Description : utility class for generating ARFF files
  * @author : Lv Xinjian
  */
 public class ArffFileUtils {

     /**
      * @function builds the attribute vector
      * @param lexDataPath path of the feature (lexicon) file
      * @param words receives the feature words
      * @param classifyAttribute the class attribute
      * @return the attribute vector
      * @throws IOException if the file cannot be read
      */
     public static FastVector GetAttributes(String lexDataPath, List<String> words, ClassifyAttribute classifyAttribute)
             throws IOException {
         HashMap<Integer, String> termId_term = FileTool.LoadIdStrFromFile(
                 lexDataPath, 0, 0, 1, "\t", Charset.forName("utf8"));
         FastVector atts = new FastVector();
         // One numeric attribute per term; term ids are expected to run from 0 to size - 1
         for (int tid = 0; tid < termId_term.size(); tid++) {
             atts.addElement(new Attribute(termId_term.get(tid)));
             words.add(termId_term.get(tid));
         }
         // The nominal class attribute comes last
         atts.addElement(new Attribute(classifyAttribute.getClassname(), classifyAttribute.getAttClassLabel()));
         System.out.println("attribute size :" + atts.size());
         return atts;
     }

     /**
      * @function saves the ARFF file
      * @param data the data in ARFF form
      * @param outputPath path where the data is saved
      * @return whether the file was written successfully
      */
     public static boolean savaInstances(Instances data, String outputPath)
     {
         BufferedWriter bw = null;
         try {
             bw = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(outputPath), Charset.forName("utf-8")));
             bw.write(data.toString());
             bw.flush();
         } catch (Exception e) {
             e.printStackTrace();
             return false;
         } finally {
             // Close the writer even when writing failed
             if (bw != null) {
                 try {
                     bw.close();
                 } catch (IOException e) {
                     e.printStackTrace();
                 }
             }
         }
         return true;
     }
 }

 /**
  * @Description the class attribute of the classifier
  * @author Administrator
  */
 class ClassifyAttribute {

     /**
      * Name of the class attribute, "CLASS" by default
      */
     private String classname = "CLASS";

     private FastVector attClassLabel = new FastVector();

     public ClassifyAttribute(String[] labels) {
         for (String label : labels) {
             this.attClassLabel.addElement(label);
         }
     }

     public ClassifyAttribute() {}

     /**
      * @function gets the name of the class attribute
      * @return the name of the class attribute
      */
     public String getClassname() {
         return classname;
     }

     /**
      * @function sets the name of the class attribute
      * @param classname the new name
      */
     public void setClassname(String classname) {
         this.classname = classname;
     }

     /**
      * @function gets the class labels
      * @return the class labels
      */
     public FastVector getAttClassLabel() {
         return attClassLabel;
     }

     /**
      * @function sets the class labels
      * @param attClassLabel the class labels
      */
     public void setClassLabel(FastVector attClassLabel) {
         this.attClassLabel = attClassLabel;
     }

     /**
      * @function adds the given class labels to the existing ones
      * @param labels the labels to add
      */
     public void setClassLabel(String[] labels) {
         for (String label : labels) {
             this.attClassLabel.addElement(label);
         }
     }
 }

 /**
  * @Description converts a text string into an Instance
  * @author Administrator
  */
 class MyCallableClass implements Callable<Instance> {

     /**
      * The input text
      */
     private String _text;

     /**
      * The polarity label of the text
      */
     private double _labelVal;

     /**
      * The instance extraction method
      */
     private String _getInstancesMothed = null;

     /**
      * The word list used to initialize the extraction method
      */
     private List<String> _set = null;

     /**
      * @function constructor, initializes the parameters
      * @param labelVal polarity class label
      * @param text the input text
      * @param set the word list used to initialize the extraction method
      * @param getInstancesMothed the instance extraction method
      */
     public MyCallableClass(final double labelVal, String text, List<String> set, String getInstancesMothed)
     {
         this._text = text;
         this._labelVal = labelVal;
         this._getInstancesMothed = getInstancesMothed;
         this._set = set;
     }

     public Instance call() throws Exception {
         double[] vals;
         try {
             // Get the instance as a double array
             vals = InstanceFactory.getInstance(this._getInstancesMothed, this._set).getInstanceFromText(this._text);
             if (vals != null) {
                 // The last slot holds the class label
                 vals[vals.length - 1] = _labelVal;
                 SparseInstance si = new SparseInstance(1.0, vals);
                 return si;
             }
         } catch (Exception e) {
             e.printStackTrace();
         }
         return null;
     }
 }
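GetAttributes depends on `FileTool.LoadIdStrFromFile`, which is not included in this post. Judging only from the call `LoadIdStrFromFile(lexDataPath, 0, 0, 1, "\t", ...)`, the lexicon appears to be a tab-separated file with the term id in column 0 and the term in column 1. That layout is my assumption here, and `LexiconSketch` below is a hypothetical replacement that works on lines already read into memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical lexicon parser; assumes one "id<TAB>term" entry per line. */
final class LexiconSketch {

    /** Parses the lines and returns the terms ordered by their id. */
    static List<String> parse(List<String> lines) {
        TreeMap<Integer, String> byId = new TreeMap<Integer, String>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length < 2) continue; // skip malformed lines
            byId.put(Integer.parseInt(cols[0].trim()), cols[1].trim());
        }
        return new ArrayList<String>(byId.values());
    }

    public static void main(String[] args) {
        List<String> terms = parse(Arrays.asList("1\tbad", "0\tgood", "2\tmovie"));
        System.out.println(terms);
    }
}
```

The position of a term in the returned list is what ends up as its attribute index, so the ids in the file should be contiguous and start at 0, which is also what the loop in GetAttributes expects.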

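    e) A few classes are missing from the code above. They fall under my company's confidentiality policy, so I cannot upload them. If you have questions, leave me a comment. (The missing part is really just a method that generates an instance. My version has some extra things mixed in, but the method itself is simple, a few lines of code. I will add it when I have time.)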
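Until the real classes are available, the sketch below shows what such an extractor could look like. It is not the omitted implementation: the names `TextToVector`, `TermCountExtractor` and `ExtractorFactorySketch` are invented, tokens are assumed to be separated by whitespace, and the feature value is a plain term count. What it does keep from the code above is the contract: a method name selects the extractor, the result is a `double[]` with one slot per lexicon word plus a last slot reserved for the class label, and `null` means no instance.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for the omitted extractor interface. */
interface TextToVector {
    double[] getInstanceFromText(String text);
}

/** Counts how often each lexicon word occurs in a whitespace-separated text. */
final class TermCountExtractor implements TextToVector {
    private final Map<String, Integer> index = new HashMap<String, Integer>();
    private final int size;

    TermCountExtractor(List<String> words) {
        for (int i = 0; i < words.size(); i++) {
            index.put(words.get(i), i);
        }
        size = words.size() + 1; // the last slot is reserved for the class label
    }

    public double[] getInstanceFromText(String text) {
        if (text == null || text.trim().isEmpty()) {
            return null; // nothing to extract
        }
        double[] vals = new double[size];
        for (String token : text.trim().split("\\s+")) {
            Integer pos = index.get(token);
            if (pos != null) {
                vals[pos] += 1.0;
            }
        }
        return vals;
    }
}

/** Hypothetical factory: maps a method name to an extractor. */
final class ExtractorFactorySketch {
    static TextToVector getInstance(String method, List<String> words) {
        if ("one".equals(method)) {
            return new TermCountExtractor(words);
        }
        throw new IllegalArgumentException("Unknown extraction method: " + method);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("good", "bad", "movie");
        double[] vals = getInstance("one", words).getInstanceFromText("good good movie");
        System.out.println(Arrays.toString(vals));
    }
}
```

Adding a new extraction method then only means writing another `TextToVector` implementation and one more branch in the factory.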

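IV. Summary

    1. Instance extraction may have to meet different requirements, so a factory class is used. This keeps the extractors easy to use and makes it easy to add new methods.

    2. For efficiency, the instances are generated with multiple threads.

    3. The code structure, style, and variable naming are weak spots. Criticism and suggestions are very welcome.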
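A note on point 2: a thread pool only helps when several tasks are in flight at the same time. If `Future.get()` is called right after each `submit()`, the loop waits for every task before queuing the next one. The following is a minimal sketch of the "submit everything first, collect afterwards" pattern; `PoolSketch` is a hypothetical class and the task (taking the length of a string) is only a placeholder for the real instance extraction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical example: process all texts in a fixed pool, keep the input order. */
final class PoolSketch {

    static List<Integer> lengths(List<String> texts, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Phase 1: queue every task without waiting for any result
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (final String text : texts) {
                futures.add(pool.submit(new Callable<Integer>() {
                    public Integer call() {
                        return text.length(); // placeholder for the real work
                    }
                }));
            }
            // Phase 2: collect the results in submission order
            List<Integer> results = new ArrayList<Integer>();
            for (Future<Integer> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(lengths(Arrays.asList("a", "bb", "ccc"), 2));
    }
}
```

Collecting in submission order keeps the instances in the same order as the input samples. The price is that all futures are held in memory until they are read, which may matter for very large sample sets.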
