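Mahout Source Code Analysis: Document Vectorization with TF-IDF

Notes from fesh's own practice; feedback and shared experience are welcome. Blog: http://www.cnblogs.com/fesh/p/3775429.html

Source code analysis of Mahout's SparseVectorsFromSequenceFiles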

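I. Theory

TF-IDF is a statistical measure of how important a term is to one document within a document collection or corpus. A term's importance increases in proportion to the number of times it appears in the document, and decreases in proportion to how frequently it appears across the corpus.

The main idea of TF-IDF: if a word or phrase has a high term frequency (TF) in one article and rarely appears in other articles, it is considered to discriminate well between categories and is suitable for classification.

TF-IDF is simply the product TF * IDF, where TF is the term frequency and IDF is the inverse document frequency.

Term frequency (TF) is the number of times a given term appears in a document. This count is usually normalized so that it is not biased toward long documents. (The same term is likely to have a higher raw count in a long document than in a short one, whether or not the term is important.)

Inverse document frequency (IDF) measures the general importance of a term. The fewer documents contain term t (that is, the smaller n is), the larger the IDF, and the better t discriminates between categories.

For a term $t_i$ in a particular document $d_j$, its importance can be expressed as:

$$ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} $$

Here $n_{i,j}$ is the number of times the term occurs in document $d_j$, and the denominator is the sum of the occurrence counts of all terms in $d_j$ (the denominator can also be the maximum occurrence count of any term).

The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient:

$$ \mathrm{idf}_i = \log \frac{|D|}{|\{ j : t_i \in d_j \}|} $$

where

$|D|$: the total number of documents in the corpus
$|\{ j : t_i \in d_j \}|$: the number of documents containing term $t_i$ (that is, the number of documents with $n_{i,j} \neq 0$). If the term does not occur in the corpus, the denominator would be zero, so $1 + |\{ j : t_i \in d_j \}|$ is generally used instead.

Then

$$ \mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i $$

A high term frequency within a particular document, combined with a low document frequency of that term across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common terms and keep the important ones.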
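As a small worked illustration of the definitions above, the following sketch computes TF, IDF and TF-IDF on a toy corpus. The class name TfIdfSketch and the corpus are made up for this example; it uses the smoothed denominator (1 + document count) and the natural logarithm, and it is not Mahout's implementation.

```java
import java.util.Arrays;
import java.util.List;

final class TfIdfSketch {
    // tf = occurrences of the term in the document / total number of terms in the document
    static double tf(String term, List<String> doc) {
        long n = doc.stream().filter(term::equals).count();
        return doc.isEmpty() ? 0.0 : (double) n / doc.size();
    }

    // idf = ln(|D| / (1 + number of documents containing the term))
    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return Math.log((double) corpus.size() / (1 + df));
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        // Hypothetical toy corpus of four already-tokenized documents.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("mahout", "vector", "mahout"),
                Arrays.asList("hadoop", "vector"),
                Arrays.asList("hadoop", "job"),
                Arrays.asList("lucene", "job"));
        List<String> doc = corpus.get(0);
        System.out.println("mahout: " + tfIdf("mahout", doc, corpus));
        System.out.println("vector: " + tfIdf("vector", doc, corpus));
    }
}
```

In the first document, "mahout" (frequent locally, rare elsewhere) gets a higher weight than "vector" (which also appears in another document). Note that with the smoothed denominator, a term that appears in every document gets a slightly negative IDF.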

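II. Source Code Analysis

Goal: convert a given set of sequence files into SparseVectors.

1. Tokenize the documents

1.1) Create an Analyzer using the latest {@link org.apache.lucene.util.Version}; step 1.2 below uses it for tokenization: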

    Class<? extends Analyzer> analyzerClass = StandardAnalyzer.class;
    if (cmdLine.hasOption(analyzerNameOpt)) {
      String className = cmdLine.getValue(analyzerNameOpt).toString();
      analyzerClass = Class.forName(className).asSubclass(Analyzer.class);
      // try instantiating it, b/c there isn't any point in setting it if
      // you can't instantiate it
      AnalyzerUtils.createAnalyzer(analyzerClass);
    }
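The "load by name, then try to instantiate" check above can be tried out with plain JDK classes. This is only a sketch of the fail-fast idea: AnalyzerCheckSketch and isUsable are hypothetical names, and java.util.List stands in for Lucene's Analyzer so that the example has no external dependencies.

```java
import java.util.List;

final class AnalyzerCheckSketch {
    // Returns true only if the named class exists, is a subtype of `base`,
    // and can be instantiated through a no-argument constructor.
    static boolean isUsable(String className, Class<?> base) {
        try {
            Class<?> cls = Class.forName(className).asSubclass(base);
            cls.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException | ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isUsable("java.util.ArrayList", List.class)); // a List with a no-arg constructor
        System.out.println(isUsable("java.lang.String", List.class));    // exists, but is not a List
        System.out.println(isUsable("no.such.Analyzer", List.class));    // class not found
    }
}
```

Failing at option-parsing time is cheaper than discovering a bad class name inside a running Map/Reduce job.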

1.2) Convert the input documents into token arrays using {@link StringTuple} (the input documents must be in {@link org.apache.hadoop.io.SequenceFile} format):

    DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);

Input: inputDir; output: tokenizedPath

SequenceFileTokenizerMapper:

    // Tokenize the input document with the Analyzer and collect the tokens in a StringTuple
    TokenStream stream = analyzer.tokenStream(key.toString(), new StringReader(value.toString()));
    CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    // StringTuple is an ordered list of Strings that can be used in a Hadoop Map/Reduce job
    StringTuple document = new StringTuple();
    while (stream.incrementToken()) {
      if (termAtt.length() > 0) {
        document.add(new String(termAtt.buffer(), 0, termAtt.length()));
      }
    }
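To see the shape of the mapper's output without pulling in Lucene, here is a dependency-free stand-in. It is not the StandardAnalyzer: TokenizeSketch is a hypothetical class that simply lower-cases the text and splits on anything that is not a letter or digit, and a List<String> plays the role of the StringTuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenizeSketch {
    // Lower-case the text, split on non-alphanumeric runs, and skip empty tokens,
    // mirroring the "termAtt.length() > 0" guard in the mapper.
    static List<String> tokenize(String text) {
        List<String> document = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                document.add(token);
            }
        }
        return document;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, Mahout TF-IDF!")); // prints [hello, mahout, tf, idf]
    }
}
```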

2. Create the TF vectors (Term Frequency Vectors), using multiple Map/Reduce jobs

    DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
            outputDir,
            tfDirName,
            conf,
            minSupport,
            maxNGramSize,
            minLLRValue,
            -1.0f,
            false,
            reduceTasks,
            chunkSize,
            sequentialAccessOutput,
            namedVectors);

2.1) Global term counting (TF)

    startWordCounting(input, dictionaryJobPath, baseConf, minSupport);

Count the global term frequencies in parallel with Map/Reduce. Only the unigram case (maxNGramSize == 1) is considered here.

Input: tokenizedPath; output: wordCountPath

TermCountMapper:

    // Count the term frequencies within one text document
    OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
    for (String word : value.getEntries()) {
      if (wordCount.containsKey(word)) {
        wordCount.put(word, wordCount.get(word) + 1);
      } else {
        wordCount.put(word, 1);
      }
    }
    wordCount.forEachPair(new ObjectLongProcedure<String>() {
      @Override
      public boolean apply(String first, long second) {
        try {
          context.write(new Text(first), new LongWritable(second));
        } catch (IOException e) {
          context.getCounter("Exception", "Output IO Exception").increment(1);
        } catch (InterruptedException e) {
          context.getCounter("Exception", "Interrupted Exception").increment(1);
        }
        return true;
      }
    });

TermCountCombiner: (same as TermCountReducer, but without the minSupport filter)

TermCountReducer:

    // Sum the counts emitted for the same word across all documents
    long sum = 0;
    for (LongWritable value : values) {
      sum += value.get();
    }
    if (sum >= minSupport) { // the TermCountCombiner does not apply this filter
      context.write(key, new LongWritable(sum));
    }
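The counting job above can be mimicked in memory to make the data flow concrete. This is a single-process sketch, not Hadoop code: TermCountSketch and its method names are hypothetical, countDocument plays the mapper and reduce plays the reducer, including the minSupport filter.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TermCountSketch {
    // "Map" step: local term counts for one tokenized document.
    static Map<String, Long> countDocument(List<String> tokens) {
        Map<String, Long> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1L, Long::sum);
        }
        return counts;
    }

    // "Reduce" step: sum the per-document counts and drop terms below minSupport.
    static Map<String, Long> reduce(List<Map<String, Long>> partials, long minSupport) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((term, count) -> totals.merge(term, count, Long::sum));
        }
        totals.values().removeIf(sum -> sum < minSupport);
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> totals = reduce(Arrays.asList(
                countDocument(Arrays.asList("a", "b", "a")),
                countDocument(Arrays.asList("a", "c"))), 2);
        System.out.println(totals); // prints {a=3}
    }
}
```

The filter belongs only in the final reduce step: a combiner sees partial sums, so dropping a term there could discard counts that would have reached minSupport once combined.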

2.2) Create the dictionary

    List<Path> dictionaryChunks;
    dictionaryChunks =
        createDictionaryChunks(dictionaryJobPath, output, baseConf, chunkSizeInMegabytes, maxTermDimension);

Read the feature-frequency list produced by the word count job in 2.1 and assign an id to each term.

Input: wordCountPath (the dictionaryJobPath written in 2.1); output: dictionary chunk files under dictionaryPathBase

    /**
     * Read the feature frequency List which is built at the end of the Word Count Job and assign ids to them.
     * This will use constant memory and will run at the speed of your disk read
     */
    private static List<Path> createDictionaryChunks(Path wordCountPath,
                                                     Path dictionaryPathBase,
                                                     Configuration baseConf,
                                                     int chunkSizeInMegabytes,
                                                     int[] maxTermDimension) throws IOException {
      List<Path> chunkPaths = Lists.newArrayList();
      Configuration conf = new Configuration(baseConf);
      FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);
      long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L; // limit in bytes (the driver's default chunk size is 100 MB)
      int chunkIndex = 0;
      Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
      chunkPaths.add(chunkPath);
      SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
      try {
        long currentChunkSize = 0;
        Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
        int i = 0;
        for (Pair<Writable,Writable> record
             : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
          if (currentChunkSize > chunkSizeLimit) { // start a new dictionary file
            Closeables.close(dictWriter, false);
            chunkIndex++;
            chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
            chunkPaths.add(chunkPath);
            dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
            currentChunkSize = 0;
          }
          Writable key = record.getFirst();
          int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
          currentChunkSize += fieldSize;
          dictWriter.append(key, new IntWritable(i++)); // assign the id
        }
        maxTermDimension[0] = i; // record the total number of terms (the vector dimension)
      } finally {
        Closeables.close(dictWriter, false);
      }
      return chunkPaths;
    }
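The chunking logic is easier to follow on a tiny in-memory example. The sketch below keeps the same control flow (size check before each append, globally increasing ids, estimated size of overhead + 2 bytes per character + 4 bytes for the id) but writes to maps instead of SequenceFiles. DictionaryChunkSketch and its parameters are hypothetical, and the overhead is passed in because the value of DICTIONARY_BYTE_OVERHEAD is not shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DictionaryChunkSketch {
    // Assign consecutive ids to the terms and start a new chunk whenever the
    // estimated size of the current chunk has exceeded chunkSizeLimit (in bytes).
    static List<Map<String, Integer>> chunk(List<String> terms, long chunkSizeLimit, int overheadBytes) {
        List<Map<String, Integer>> chunks = new ArrayList<>();
        Map<String, Integer> current = new LinkedHashMap<>();
        chunks.add(current);
        long currentChunkSize = 0;
        int nextId = 0;
        for (String term : terms) {
            if (currentChunkSize > chunkSizeLimit) {
                current = new LinkedHashMap<>();
                chunks.add(current);
                currentChunkSize = 0;
            }
            currentChunkSize += overheadBytes + term.length() * 2 + Integer.BYTES;
            current.put(term, nextId++); // ids keep increasing across chunks
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Each two-letter term is estimated at 0 + 2 * 2 + 4 = 8 bytes.
        System.out.println(chunk(Arrays.asList("aa", "bb", "cc"), 8, 0));
        // prints [{aa=0, bb=1}, {cc=2}]
    }
}
```

Because the size is checked before appending, a chunk can overshoot the limit by one record; the limit is a soft threshold rather than a hard cap.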

2.3) Build the partial vectors (TF)

    int partialVectorIndex = 0;
    Collection<Path> partialVectorPaths = Lists.newArrayList();
    for (Path dictionaryChunk : dictionaryChunks) {
      Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
      partialVectorPaths.add(partialVectorOutputPath);
      makePartialVectors(input, baseConf, maxNGramSize, dictionaryChunk, partialVectorOutputPath,
        maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
    }

Create a partial vector for each input document using one chunk of features.

(The dictionary is split across several files, so each file can only build one part of the full vector; each such part is called a partial vector.)

Input: tokenizedPath; output: partialVectorPaths

Mapper: (the default identity Mapper)

TFPartialVectorReducer:

    // Read the dictionary file
    // MAHOUT-1247
    Path dictionaryFile = HadoopUtil.getSingleCachedFile(conf);
    // key is word value is id
    for (Pair<Writable, IntWritable> record
            : new SequenceFileIterable<Writable, IntWritable>(dictionaryFile, true, conf)) {
      dictionary.put(record.getFirst().toString(), record.getSecond().get());
    }

    // Convert a document into a sparse vector
    StringTuple value = it.next();
    Vector vector = new RandomAccessSparseVector(dimension, value.length()); // guess at initial size
    for (String term : value.getEntries()) {
      if (!term.isEmpty() && dictionary.containsKey(term)) { // unigram
        int termId = dictionary.get(term);
        vector.setQuick(termId, vector.getQuick(termId) + 1);
      }
    }

2.4) Merge the partial vectors (TF)

    Configuration conf = new Configuration(baseConf);
    Path outputDir = new Path(output, tfVectorsFolderName);
    PartialVectorMerger.mergePartialVectors(partialVectorPaths, outputDir, conf, normPower, logNormalize,
      maxTermDimension[0], sequentialAccess, namedVectors, numReducers);

Merge all partial {@link org.apache.mahout.math.RandomAccessSparseVector}s into complete {@link org.apache.mahout.math.RandomAccessSparseVector}s.

Input: partialVectorPaths; output: tfVectorsFolder

Mapper: (the default identity Mapper)

PartialVectorMergeReducer:

    // Merge the partial vectors into the complete TF vector
    Vector vector = new RandomAccessSparseVector(dimension, 10);
    for (VectorWritable value : values) {
      vector.assign(value.get(), Functions.PLUS); // combine vectors that cover different words into one
    }
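Steps 2.3 and 2.4 together amount to "count against each dictionary chunk, then add the pieces". A minimal in-memory sketch, with a Map<Integer, Double> standing in for RandomAccessSparseVector and hypothetical names throughout (PartialVectorSketch, partialVector, merge):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartialVectorSketch {
    // Build the part of the TF vector covered by one dictionary chunk (term -> id).
    static Map<Integer, Double> partialVector(List<String> tokens, Map<String, Integer> dictionaryChunk) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String term : tokens) {
            Integer termId = dictionaryChunk.get(term);
            if (termId != null) { // terms from other chunks are ignored here
                vector.merge(termId, 1.0, Double::sum);
            }
        }
        return vector;
    }

    // Element-wise sum of the partial vectors, like assign(..., Functions.PLUS).
    static Map<Integer, Double> merge(List<Map<Integer, Double>> partials) {
        Map<Integer, Double> full = new TreeMap<>();
        for (Map<Integer, Double> partial : partials) {
            partial.forEach((id, weight) -> full.merge(id, weight, Double::sum));
        }
        return full;
    }

    public static void main(String[] args) {
        Map<String, Integer> chunk0 = new HashMap<>();
        chunk0.put("a", 0);
        chunk0.put("b", 1);
        Map<String, Integer> chunk1 = new HashMap<>();
        chunk1.put("c", 2);
        List<String> doc = Arrays.asList("a", "b", "a", "c");
        System.out.println(merge(Arrays.asList(partialVector(doc, chunk0), partialVector(doc, chunk1))));
        // prints {0=2.0, 1=1.0, 2=1.0}
    }
}
```

Since each term id lives in exactly one chunk, the partial vectors never overlap, and summing them is equivalent to concatenating their non-zero entries.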

3. Create the IDF vectors (document frequency vectors), using multiple Map/Reduce jobs

    Pair<Long[], List<Path>> docFrequenciesFeatures = null;
    // Should document frequency features be processed
    if (shouldPrune || processIdf) {
      log.info("Calculating IDF");
      docFrequenciesFeatures =
              TFIDFConverter.calculateDF(new Path(outputDir, tfDirName), outputDir, conf, chunkSize);
    }

3.1) Count the document frequencies (DF)

    Path wordCountPath = new Path(output, WORDCOUNT_OUTPUT_FOLDER);
    startDFCounting(input, wordCountPath, baseConf);

Input: tfDir; output: featureCountPath

TermDocumentCountMapper:

    // Emit 1 for every term present in the document, plus 1 for the document itself
    Vector vector = value.get();
    for (Vector.Element e : vector.nonZeroes()) {
      out.set(e.index());
      context.write(out, ONE);
    }
    context.write(TOTAL_COUNT, ONE);

Combiner: (TermDocumentCountReducer)

TermDocumentCountReducer:

    // Sum the document frequency of each term, and the total number of documents
    long sum = 0;
    for (LongWritable value : values) {
      sum += value.get();
    }

3.2) Split the DF counts into chunks

    return createDictionaryChunks(wordCountPath, output, baseConf, chunkSizeInMegabytes);

Store the DF counts in multiple chunk files, and record the total number of terms and the total number of documents.

Input: featureCountPath; output: dictionaryPathBase

    /**
     * Read the document frequency List which is built at the end of the DF Count Job. This will use constant
     * memory and will run at the speed of your disk read
     */
    private static Pair<Long[], List<Path>> createDictionaryChunks(Path featureCountPath,
                                                                   Path dictionaryPathBase,
                                                                   Configuration baseConf,
                                                                   int chunkSizeInMegabytes) throws IOException {
      List<Path> chunkPaths = Lists.newArrayList();
      Configuration conf = new Configuration(baseConf);
      FileSystem fs = FileSystem.get(featureCountPath.toUri(), conf);
      long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
      int chunkIndex = 0;
      Path chunkPath = new Path(dictionaryPathBase, FREQUENCY_FILE + chunkIndex);
      chunkPaths.add(chunkPath);
      SequenceFile.Writer freqWriter =
        new SequenceFile.Writer(fs, conf, chunkPath, IntWritable.class, LongWritable.class);
      try {
        long currentChunkSize = 0;
        long featureCount = 0;
        long vectorCount = Long.MAX_VALUE;
        Path filesPattern = new Path(featureCountPath, OUTPUT_FILES_PATTERN);
        for (Pair<IntWritable,LongWritable> record
             : new SequenceFileDirIterable<IntWritable,LongWritable>(filesPattern,
                                                                     PathType.GLOB,
                                                                     null,
                                                                     null,
                                                                     true,
                                                                     conf)) {
          if (currentChunkSize > chunkSizeLimit) {
            Closeables.close(freqWriter, false);
            chunkIndex++;
            chunkPath = new Path(dictionaryPathBase, FREQUENCY_FILE + chunkIndex);
            chunkPaths.add(chunkPath);
            freqWriter = new SequenceFile.Writer(fs, conf, chunkPath, IntWritable.class, LongWritable.class);
            currentChunkSize = 0;
          }
          int fieldSize = SEQUENCEFILE_BYTE_OVERHEAD + Integer.SIZE / 8 + Long.SIZE / 8;
          currentChunkSize += fieldSize;
          IntWritable key = record.getFirst();
          LongWritable value = record.getSecond();
          if (key.get() >= 0) {
            freqWriter.append(key, value);
          } else if (key.get() == -1) { // key -1 carries the number of documents
            vectorCount = value.get();
          }
          featureCount = Math.max(key.get(), featureCount);
        }
        featureCount++;
        Long[] counts = {featureCount, vectorCount}; // {number of terms, number of documents}
        return new Pair<Long[], List<Path>>(counts, chunkPaths);
      } finally {
        Closeables.close(freqWriter, false);
      }
    }
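Steps 3.1 and 3.2 can be sketched together in memory: count, for each term id, how many documents contain it, carry the document total under the special key -1, then read back the two counts. DocFrequencySketch and its method names are hypothetical, and chunking is left out to keep the example short.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DocFrequencySketch {
    static final int TOTAL_COUNT = -1; // special key that carries the number of documents

    // For each TF vector, add 1 per term present and 1 under TOTAL_COUNT.
    static Map<Integer, Long> countDf(List<Map<Integer, Double>> tfVectors) {
        Map<Integer, Long> df = new TreeMap<>();
        for (Map<Integer, Double> vector : tfVectors) {
            for (Map.Entry<Integer, Double> e : vector.entrySet()) {
                if (e.getValue() != 0.0) {
                    df.merge(e.getKey(), 1L, Long::sum);
                }
            }
            df.merge(TOTAL_COUNT, 1L, Long::sum);
        }
        return df;
    }

    // Returns {featureCount, vectorCount}: largest term id + 1, and the number of documents.
    static long[] counts(Map<Integer, Long> df) {
        long featureCount = 0;
        long vectorCount = Long.MAX_VALUE;
        for (Map.Entry<Integer, Long> e : df.entrySet()) {
            int key = e.getKey();
            if (key == TOTAL_COUNT) {
                vectorCount = e.getValue();
            }
            featureCount = Math.max(key, featureCount);
        }
        return new long[] {featureCount + 1, vectorCount};
    }

    public static void main(String[] args) {
        Map<Integer, Double> doc0 = new TreeMap<>();
        doc0.put(0, 2.0);
        doc0.put(1, 1.0);
        Map<Integer, Double> doc1 = new TreeMap<>();
        doc1.put(0, 1.0);
        Map<Integer, Long> df = countDf(Arrays.asList(doc0, doc1));
        System.out.println(df);                          // prints {-1=2, 0=2, 1=1}
        System.out.println(Arrays.toString(counts(df))); // prints [2, 2]
    }
}
```

Term ids start at 0, so the number of features is the largest id plus one, which is why featureCount is incremented after the loop.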

4. Create the TF-IDF vectors (Term Frequency-Inverse Document Frequency Vectors)

    TFIDFConverter.processTfIdf(
            new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
            outputDir, conf, docFrequenciesFeatures, minDf, maxDF, norm, logNormalize,
            sequentialAccessOutput, namedVectors, reduceTasks);

4.1) Build the partial vectors (TF-IDF)

    int partialVectorIndex = 0;
    List<Path> partialVectorPaths = Lists.newArrayList();
    List<Path> dictionaryChunks = datasetFeatures.getSecond();
    for (Path dictionaryChunk : dictionaryChunks) {
      Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
      partialVectorPaths.add(partialVectorOutputPath);
      makePartialVectors(input,
                         baseConf,
                         datasetFeatures.getFirst()[0],
                         datasetFeatures.getFirst()[1],
                         minDf,
                         maxDF,
                         dictionaryChunk,
                         partialVectorOutputPath,
                         sequentialAccessOutput,
                         namedVector);
    }

Create a partial TF-IDF vector using one chunk of features.

Input: tfVectorsFolder; output: partialVectorOutputPath

    DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf); // cache the DF chunk file

Mapper: (the default identity Mapper)

TFIDFPartialVectorReducer:

    // Compute the TF-IDF value of every term in each document
    Vector value = it.next().get();
    Vector vector = new RandomAccessSparseVector((int) featureCount, value.getNumNondefaultElements());
    for (Vector.Element e : value.nonZeroes()) {
      if (!dictionary.containsKey(e.index())) {
        continue;
      }
      long df = dictionary.get(e.index());
      if (maxDf > -1 && (100.0 * df) / vectorCount > maxDf) {
        continue;
      }
      if (df < minDf) {
        df = minDf;
      }
      vector.setQuick(e.index(), tfidf.calculate((int) e.get(), (int) df, (int) featureCount, (int) vectorCount));
    }
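The pruning rules in this reducer (skip terms outside the current DF chunk, skip terms whose document frequency exceeds maxDf percent of the documents, raise df to at least minDf) can be sketched in memory as follows. TfIdfWeightSketch and its methods are hypothetical names. The weight formula sqrt(tf) * (ln(numDocs / (df + 1)) + 1) is an assumption modelled on Lucene's classic similarity; the body of tfidf.calculate is not shown above, so treat the formula as illustrative only.

```java
import java.util.Map;
import java.util.TreeMap;

final class TfIdfWeightSketch {
    // ASSUMPTION: an illustrative weight, not confirmed to be what tfidf.calculate returns.
    static double weight(double tf, long df, long numDocs) {
        return Math.sqrt(tf) * (Math.log((double) numDocs / (df + 1)) + 1.0);
    }

    // tfVector: term id -> term frequency; dfChunk: term id -> document frequency (one chunk).
    static Map<Integer, Double> tfidfVector(Map<Integer, Double> tfVector, Map<Integer, Long> dfChunk,
                                            long minDf, long maxDfPercent, long vectorCount) {
        Map<Integer, Double> out = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : tfVector.entrySet()) {
            Long df = dfChunk.get(e.getKey());
            if (df == null) {
                continue; // the term belongs to another DF chunk
            }
            if (maxDfPercent > -1 && (100.0 * df) / vectorCount > maxDfPercent) {
                continue; // too common: present in more than maxDfPercent % of the documents
            }
            long effectiveDf = Math.max(df, minDf); // clamp rare terms up to minDf
            out.put(e.getKey(), weight(e.getValue(), effectiveDf, vectorCount));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Double> tf = new TreeMap<>();
        tf.put(0, 4.0);
        tf.put(1, 1.0);
        tf.put(2, 1.0);
        Map<Integer, Long> df = new TreeMap<>();
        df.put(0, 1L);  // rare term: kept
        df.put(1, 10L); // appears in all 10 documents: pruned by maxDfPercent = 90
        // term 2 is not in this chunk: skipped
        System.out.println(tfidfVector(tf, df, 1, 90, 10).keySet()); // prints [0]
    }
}
```

Note that maxDf is a percentage of the document count while minDf is an absolute count, which is why the two checks look different.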

4.2) Merge the partial vectors (TF-IDF)

    Configuration conf = new Configuration(baseConf);
    Path outputDir = new Path(output, DOCUMENT_VECTOR_OUTPUT_FOLDER);
    PartialVectorMerger.mergePartialVectors(partialVectorPaths,
                                            outputDir,
                                            baseConf,
                                            normPower,
                                            logNormalize,
                                            datasetFeatures.getFirst()[0].intValue(),
                                            sequentialAccessOutput,
                                            namedVector,
                                            numReducers);

Merge all partial vectors into one complete document vector.

Input: partialVectorOutputPath; output: outputDir

Mapper: (the default identity Mapper)

PartialVectorMergeReducer:

    // Merge the partial TF-IDF vectors into the complete vector
    Vector vector = new RandomAccessSparseVector(dimension, 10);
    for (VectorWritable value : values) {
      vector.assign(value.get(), Functions.PLUS);
    }
