Twenty Newsgroups Classification, Task 2: seq2sparse (2)

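Continuing from the previous post: the output of SequenceFileTokenizerMapper is written to /home/mahout/mahout-work-mahout0/20news-vectors/tokenized-documents/part-m-00000. To inspect it, you can read the file with the code below, which is adapted from the earlier code that read the cluster-center file: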

package mahout.fansy.test.bayes.read;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.common.StringTuple;
import org.apache.mahout.common.iterator.sequencefile.PathFilters;
import org.apache.mahout.common.iterator.sequencefile.PathType;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirValueIterable;

public class ReadFromTokenizedDocuments {

    private static Configuration conf;

    public static void main(String[] args) {
        conf = new Configuration();
        conf.set("mapred.job.tracker", "ubuntu:9001");
        String path = "hdfs://ubuntu:9000/home/mahout/mahout-work-mahout0/20news-vectors/tokenized-documents/part-m-00000";
        getValue(path, conf);
    }

    /**
     * Reads a sequence file into a list.
     *
     * @param path the sequence file
     * @param conf the Configuration
     * @return the values read from the sequence file
     */
    public static List<StringTuple> getValue(String path, Configuration conf) {
        Path hdfsPath = new Path(path);
        List<StringTuple> list = new ArrayList<StringTuple>();
        for (Writable value : new SequenceFileDirValueIterable<Writable>(hdfsPath, PathType.LIST,
                PathFilters.partFilter(), conf)) {
            Class<? extends Writable> valueClass = value.getClass();
            // Every value in tokenized-documents is expected to be a StringTuple.
            if (valueClass.equals(StringTuple.class)) {
                StringTuple st = (StringTuple) value;
                list.add(st);
            } else {
                throw new IllegalStateException("Bad value class: " + valueClass);
            }
        }
        return list;
    }
}

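Reading the file with the code above shows that the first StringTuple contains 1320 words (the count after stop words have been removed).

This is followed by more parameter setup, up to line 267, which checks whether processIdf is false. Because tfidf weighting was selected earlier, processIdf is true and execution enters the else block: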

if (!processIdf) {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
    minLLRValue, norm, logNormalize, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
} else {
  DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath, outputDir, tfDirName, conf, minSupport, maxNGramSize,
    minLLRValue, -1.0f, false, reduceTasks, chunkSize, sequentialAccessOutput, namedVectors);
}

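Both branches call DictionaryVectorizer.createTermFrequencyVectors directly. Stepping into that method (line 145 of DictionaryVectorizer), it first sets some parameters and then reaches startWordCounting. That method is a basic Job configuration whose Mapper, Combiner, and Reducer are TermCountMapper, TermCountCombiner, and TermCountReducer. Let's look at what each part does (it is very similar to the most basic word count example).

TermCountMapper first; here is the code: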

protected void map(Text key, StringTuple value, final Context context) throws IOException, InterruptedException {
  // Count the occurrences of each word within this one document.
  OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
  for (String word : value.getEntries()) {
    if (wordCount.containsKey(word)) {
      wordCount.put(word, wordCount.get(word) + 1);
    } else {
      wordCount.put(word, 1);
    }
  }
  // Emit one (word, count) pair per distinct word.
  wordCount.forEachPair(new ObjectLongProcedure<String>() {
    @Override
    public boolean apply(String first, long second) {
      try {
        context.write(new Text(first), new LongWritable(second));
      } catch (IOException e) {
        context.getCounter("Exception", "Output IO Exception").increment(1);
      } catch (InterruptedException e) {
        context.getCounter("Exception", "Interrupted Exception").increment(1);
      }
      return true;
    }
  });
}

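This code first creates wordCount, an OpenObjectLongHashMap, which is a map class written by the Mahout developers. The for loop then walks through the words in value (the first value, for example, has 1320 words): a word that is not yet in the map is added with a count of 1, otherwise its existing count is incremented by 1. Next comes forEachPair. The method itself is not overridden here; instead, an anonymous class implementing ObjectLongProcedure is created and passed as the argument to forEachPair, and its apply method is called for each entry. Inside apply, context.write emits each word in wordCount (which holds every word and its count) as the key and its count as the value.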
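The per-document counting that TermCountMapper performs can be sketched in plain Java. This is only a minimal illustration, not Mahout's code: it assumes java.util.HashMap in place of OpenObjectLongHashMap and BiConsumer in place of ObjectLongProcedure, and the names TermCountSketch and countTerms are made up for the example. The anonymous class passed to forEach plays the same role as the one passed to forEachPair above: the map calls its method once per (word, count) pair.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for TermCountMapper; uses only the Java standard library.
class TermCountSketch {

  // Count how often each word occurs in one tokenized document.
  static Map<String, Long> countTerms(List<String> tokens) {
    Map<String, Long> wordCount = new HashMap<>();
    for (String word : tokens) {
      // Same effect as the containsKey/put branches in the mapper.
      wordCount.merge(word, 1L, Long::sum);
    }
    return wordCount;
  }

  public static void main(String[] args) {
    Map<String, Long> wordCount = countTerms(Arrays.asList("mahout", "hadoop", "mahout"));
    // An anonymous class used as a callback, analogous to ObjectLongProcedure.apply.
    wordCount.forEach(new BiConsumer<String, Long>() {
      @Override
      public void accept(String word, Long count) {
        // In the real mapper this is where context.write(word, count) happens.
        System.out.println(word + "\t" + count);
      }
    });
  }
}
```

The order in which the pairs are printed is not guaranteed, since a hash map does not keep insertion order.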

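Next are TermCountCombiner and TermCountReducer. Their code is the same as in the first example one meets when learning Hadoop, so I won't go over it here. The log shows that the reduce phase outputs 93563 words in total.

Then comes the createDictionaryChunks function, at line 215 of DictionaryVectorizer: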
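For readers who have not seen the word count example, the combine and reduce steps boil down to summing the partial counts that arrive for each word. The sketch below shows only that idea in plain Java; it does not reproduce TermCountCombiner or TermCountReducer and leaves out anything else those classes may do. The names TermSumSketch and sumCounts are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of what a word-count style reducer computes.
class TermSumSketch {

  // Sum all partial counts emitted by mappers/combiners for one word.
  static long sumCounts(Iterable<Long> partialCounts) {
    long sum = 0;
    for (long count : partialCounts) {
      sum += count;
    }
    return sum;
  }

  public static void main(String[] args) {
    // Suppose three mappers reported these counts for the same word.
    List<Long> partials = Arrays.asList(2L, 5L, 1L);
    System.out.println(sumCounts(partials)); // prints 8
  }
}
```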

List<Path> chunkPaths = Lists.newArrayList();
Configuration conf = new Configuration(baseConf);
FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);
long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
int chunkIndex = 0;
Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
chunkPaths.add(chunkPath);
SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
try {
  long currentChunkSize = 0;
  Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
  int i = 0;
  for (Pair<Writable,Writable> record
       : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
    // Start a new dictionary chunk once the current one exceeds the size limit.
    if (currentChunkSize > chunkSizeLimit) {
      Closeables.closeQuietly(dictWriter);
      chunkIndex++;
      chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
      chunkPaths.add(chunkPath);
      dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
      currentChunkSize = 0;
    }
    Writable key = record.getFirst();
    int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
    currentChunkSize += fieldSize;
    // Assign the next sequential id to this word.
    dictWriter.append(key, new IntWritable(i++));
  }
  maxTermDimension[0] = i;
} finally {
  Closeables.closeQuietly(dictWriter);
}

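This creates a Writer and then iterates over the keys and values of the word-count output, but reads only the key, i.e. the word. Each word is then encoded with an integer id: the first word maps to 0, the second to 1, and so on.

Inspecting the dictWriter variable used above, I could not find any field that stores the words and their ids. So is the mechanism that each append writes to the file right away, or did I simply fail to find the right field? To be investigated...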
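The id assignment and chunk rollover can be sketched without Hadoop as follows. This is an illustration under stated assumptions: in-memory LinkedHashMaps stand in for the SequenceFile chunks, the OVERHEAD constant is an arbitrary placeholder for DICTIONARY_BYTE_OVERHEAD (its real value is not shown in the excerpt), the input words are assumed to be distinct, and all names are invented for the example. As in the original loop, the size check happens before a record is written, so a chunk can end up slightly larger than the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory analogue of createDictionaryChunks.
class DictionaryChunkSketch {

  // Placeholder per-entry overhead; not the real DICTIONARY_BYTE_OVERHEAD value.
  static final int OVERHEAD = 4;

  // Assign ids 0, 1, 2, ... to the words and start a new chunk
  // whenever the estimated size of the current chunk exceeds the limit.
  static List<Map<String, Integer>> buildChunks(List<String> words, long chunkSizeLimit) {
    List<Map<String, Integer>> chunks = new ArrayList<>();
    Map<String, Integer> current = new LinkedHashMap<>();
    chunks.add(current);
    long currentChunkSize = 0;
    int id = 0;
    for (String word : words) {
      if (currentChunkSize > chunkSizeLimit) {
        current = new LinkedHashMap<>();
        chunks.add(current);
        currentChunkSize = 0;
      }
      // Same estimate as the original: overhead + 2 bytes per char + 4 bytes for the int id.
      currentChunkSize += OVERHEAD + word.length() * 2 + Integer.SIZE / 8;
      current.put(word, id++);
    }
    return chunks;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("apple", "banana", "cherry", "date");
    // A tiny limit forces several chunks; ids keep increasing across chunks.
    for (Map<String, Integer> chunk : buildChunks(words, 20)) {
      System.out.println(chunk);
    }
  }
}
```

With the 20-byte limit used in main, the four words end up in two chunks, and the ids continue from one chunk to the next rather than restarting at 0.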

Share, enjoy, grow.

If you repost this article, please credit the source: http://blog.csdn.net/fansy1990
