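Mahout Source Code Analysis: Document Vectorization with TF-IDF
Personal notes by fesh; feedback and shared experience are welcome. Blog: http://www.cnblogs.com/fesh/p/3775429.html
Source analysis of Mahout's SparseVectorsFromSequenceFiles
I. Principles
TF-IDF is a statistical measure of how important a term is to one document within a document collection or corpus. A term's importance increases in proportion to the number of times it appears in the document, but decreases as the term becomes more frequent across the corpus.
The main idea of TF-IDF: if a word or phrase occurs with a high frequency (TF) in one article and rarely in other articles, it is considered to discriminate well between categories and is suitable for classification.
TF-IDF is simply TF * IDF, where TF is the term frequency and IDF is the inverse document frequency.
Term frequency (TF) is the number of times a given term appears in a document. This count is usually normalized to avoid a bias toward long documents (the same term tends to have a higher raw count in a long document than in a short one, whether or not the term is important).
Inverse document frequency (IDF) measures how informative a term is in general. The main idea: the fewer documents contain term t, that is, the smaller n is, the larger the IDF, and the better term t discriminates between categories.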
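For a term $t_i$ in a particular document $d_j$, its importance can be expressed as:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
In this formula, $n_{i,j}$ is the number of occurrences of the term in document $d_j$, and the denominator is the total number of occurrences of all terms in $d_j$ (the denominator can also be the maximum occurrence count of any term). The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of the quotient:
$$\mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where
- $|D|$: the total number of documents in the corpus
- $|\{j : t_i \in d_j\}|$: the number of documents containing the term $t_i$ (that is, the number of documents for which $n_{i,j} \neq 0$). If the term does not occur in the corpus, this denominator is zero, so $1 + |\{j : t_i \in d_j\}|$ is generally used instead.
Then
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$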
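A high term frequency within a particular document, combined with a low document frequency of that term across the whole collection, yields a high TF-IDF weight. TF-IDF therefore tends to filter out common terms and keep the important ones.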
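To make the weighting concrete, here is a minimal plain-Java sketch of the textbook definition described in this section (normalized term frequency times a log IDF with the 1 + df smoothing). The class name `TfIdfSketch`, its methods, and the toy corpus are made up for illustration; this is not Mahout code, and Mahout's own weighting class may use a different formula.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: textbook TF-IDF on an in-memory toy corpus.
final class TfIdfSketch {

  /** tf = occurrences of the term in the document / total tokens in the document. */
  static double tf(String term, List<String> doc) {
    long n = doc.stream().filter(term::equals).count();
    return doc.isEmpty() ? 0.0 : (double) n / doc.size();
  }

  /** idf = ln(|D| / (1 + number of documents containing the term)). */
  static double idf(String term, List<List<String>> corpus) {
    long df = corpus.stream().filter(d -> d.contains(term)).count();
    return Math.log((double) corpus.size() / (1 + df));
  }

  static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
    return tf(term, doc) * idf(term, corpus);
  }

  public static void main(String[] args) {
    List<List<String>> corpus = Arrays.asList(
        Arrays.asList("mahout", "builds", "sparse", "vectors"),
        Arrays.asList("hadoop", "runs", "jobs"),
        Arrays.asList("hadoop", "stores", "files"),
        Arrays.asList("lucene", "tokenizes", "text"));
    // "mahout" occurs in one document, "hadoop" in two, so "mahout" gets the larger idf.
    System.out.println(tfIdf("mahout", corpus.get(0), corpus));
    System.out.println(tfIdf("hadoop", corpus.get(1), corpus));
  }
}
```

Note that with the 1 + df smoothing, a term that occurs in every document gets a slightly negative IDF; implementations often add a constant or clamp the value to avoid this.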
II. Source Code Analysis
Goal: convert a given set of sequence files into SparseVectors.
1. Tokenize the documents
1.1) Create an Analyzer using the latest {@link org.apache.lucene.util.Version}; it is used for tokenization in step 1.2:
Class<? extends Analyzer> analyzerClass = StandardAnalyzer.class;
if (cmdLine.hasOption(analyzerNameOpt)) {
  String className = cmdLine.getValue(analyzerNameOpt).toString();
  analyzerClass = Class.forName(className).asSubclass(Analyzer.class);
  // try instantiating it, b/c there isn't any point in setting it if
  // you can't instantiate it
  AnalyzerUtils.createAnalyzer(analyzerClass);
}
1.2) Use {@link StringTuple} to convert the input documents into token arrays (the input documents must be in {@link org.apache.hadoop.io.SequenceFile} format):
DocumentProcessor.tokenizeDocuments(inputDir, analyzerClass, tokenizedPath, conf);
Input: inputDir  Output: tokenizedPath
SequenceFileTokenizerMapper:
// Tokenize the input document with the Analyzer and collect the tokens in a StringTuple
TokenStream stream = analyzer.tokenStream(key.toString(), new StringReader(value.toString()));
CharTermAttribute termAtt = stream.addAttribute(CharTermAttribute.class);
stream.reset();
// StringTuple is an ordered list of Strings that can be used in a Hadoop Map/Reduce job
StringTuple document = new StringTuple();
while (stream.incrementToken()) {
  if (termAtt.length() > 0) {
    document.add(new String(termAtt.buffer(), 0, termAtt.length()));
  }
}
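The shape of this mapper (stream tokens, skip empty ones, append the rest to an ordered list) can be tried without Lucene or Hadoop. The sketch below is a stand-in under stated assumptions: the hypothetical `tokenize` method lowercases the text and splits on non-alphanumeric characters, which only roughly approximates what a real Analyzer does, and a plain `List<String>` plays the role of the StringTuple.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: a regex split standing in for a Lucene Analyzer.
final class TokenizeSketch {

  /** Returns the non-empty tokens of the text, in document order. */
  static List<String> tokenize(String text) {
    List<String> document = new ArrayList<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
      if (token.length() > 0) { // same guard as termAtt.length() > 0
        document.add(token);
      }
    }
    return document;
  }
}
```

For example, `TokenizeSketch.tokenize("Hello, Mahout world!")` yields the three tokens `hello`, `mahout`, `world`.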
2. Create TF vectors (Term Frequency Vectors): multiple Map/Reduce jobs
DictionaryVectorizer.createTermFrequencyVectors(tokenizedPath,
outputDir,
tfDirName,
conf,
minSupport,
maxNGramSize,
minLLRValue,
-1.0f,
false,
reduceTasks,
chunkSize,
sequentialAccessOutput,
namedVectors);
2.1) Global term counting (TF)
startWordCounting(input, dictionaryJobPath, baseConf, minSupport);
Count term frequencies across the whole corpus in parallel with Map/Reduce; only the case maxNGramSize == 1 is considered here.
Input: tokenizedPath  Output: wordCountPath
TermCountMapper:
// Count the term frequencies within one text document
OpenObjectLongHashMap<String> wordCount = new OpenObjectLongHashMap<String>();
for (String word : value.getEntries()) {
  if (wordCount.containsKey(word)) {
    wordCount.put(word, wordCount.get(word) + 1);
  } else {
    wordCount.put(word, 1);
  }
}
wordCount.forEachPair(new ObjectLongProcedure<String>() {
  @Override
  public boolean apply(String first, long second) {
    try {
      context.write(new Text(first), new LongWritable(second));
    } catch (IOException e) {
      context.getCounter("Exception", "Output IO Exception").increment(1);
    } catch (InterruptedException e) {
      context.getCounter("Exception", "Interrupted Exception").increment(1);
    }
    return true;
  }
});
TermCountCombiner: (same as TermCountReducer, but without the minSupport filter)
TermCountReducer:
// Aggregate all words with their weights, summing the weights of each word
long sum = 0;
for (LongWritable value : values) {
  sum += value.get();
}
if (sum >= minSupport) { // TermCountCombiner does not apply this filter
  context.write(key, new LongWritable(sum));
}
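The map and reduce halves of this job can be imitated in memory to see how minSupport prunes rare terms. This is a rough sketch, not the Hadoop job: the class `TermCountSketch` and its two methods are hypothetical, and ordinary Java maps replace the Writable types.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: in-memory imitation of the term count map and reduce steps.
final class TermCountSketch {

  /** Map step: count the terms of one tokenized document. */
  static Map<String, Long> mapDocument(List<String> tokens) {
    Map<String, Long> counts = new HashMap<>();
    for (String token : tokens) {
      counts.merge(token, 1L, Long::sum);
    }
    return counts;
  }

  /** Reduce step: sum the per-document counts and keep terms with sum >= minSupport. */
  static Map<String, Long> reduce(List<Map<String, Long>> partials, long minSupport) {
    Map<String, Long> totals = new TreeMap<>();
    for (Map<String, Long> partial : partials) {
      partial.forEach((term, count) -> totals.merge(term, count, Long::sum));
    }
    // The filter is applied only after all partial counts are summed; a combiner
    // sees only part of the data, so filtering there could drop terms too early.
    totals.values().removeIf(sum -> sum < minSupport);
    return totals;
  }
}
```

With the documents `[a, b, a]` and `[a, c]` and `minSupport = 2`, only `a` (total 3) survives.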
2.2) Create the dictionary
List<Path> dictionaryChunks;
dictionaryChunks =
    createDictionaryChunks(dictionaryJobPath, output, baseConf, chunkSizeInMegabytes, maxTermDimension);
Read the feature frequency list produced by the word count job in 2.1 and assign an id to each feature.
Input: wordCountPath  Output: dictionaryJobPath
/**
* Read the feature frequency List which is built at the end of the Word Count Job and assign ids to them.
* This will use constant memory and will run at the speed of your disk read
*/
private static List<Path> createDictionaryChunks(Path wordCountPath,
                                                 Path dictionaryPathBase,
                                                 Configuration baseConf,
                                                 int chunkSizeInMegabytes,
                                                 int[] maxTermDimension) throws IOException {
  List<Path> chunkPaths = Lists.newArrayList();
  Configuration conf = new Configuration(baseConf);
  FileSystem fs = FileSystem.get(wordCountPath.toUri(), conf);
  long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L; // 64 MB by default
  int chunkIndex = 0;
  Path chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
  chunkPaths.add(chunkPath);
  SequenceFile.Writer dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
  try {
    long currentChunkSize = 0;
    Path filesPattern = new Path(wordCountPath, OUTPUT_FILES_PATTERN);
    int i = 0;
    for (Pair<Writable,Writable> record
         : new SequenceFileDirIterable<Writable,Writable>(filesPattern, PathType.GLOB, null, null, true, conf)) {
      if (currentChunkSize > chunkSizeLimit) { // start a new dictionary file
        Closeables.close(dictWriter, false);
        chunkIndex++;
        chunkPath = new Path(dictionaryPathBase, DICTIONARY_FILE + chunkIndex);
        chunkPaths.add(chunkPath);
        dictWriter = new SequenceFile.Writer(fs, conf, chunkPath, Text.class, IntWritable.class);
        currentChunkSize = 0;
      }
      Writable key = record.getFirst();
      int fieldSize = DICTIONARY_BYTE_OVERHEAD + key.toString().length() * 2 + Integer.SIZE / 8;
      currentChunkSize += fieldSize;
      dictWriter.append(key, new IntWritable(i++)); // assign the id
    }
    maxTermDimension[0] = i; // record the total number of words
  } finally {
    Closeables.close(dictWriter, false);
  }
  return chunkPaths;
}
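The chunking rule (keep appending until the running size exceeds the limit, then open a new chunk, while ids keep counting globally) is easy to check on a small in-memory example. The sketch below assumes a made-up per-entry overhead constant `ENTRY_OVERHEAD_BYTES` and uses maps instead of SequenceFiles; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: assigns global ids to terms and splits them into size-bounded chunks.
final class DictionaryChunkSketch {

  // Hypothetical per-entry overhead; the real constant in Mahout may differ.
  static final int ENTRY_OVERHEAD_BYTES = 10;

  static List<Map<String, Integer>> createChunks(List<String> terms, long chunkSizeLimit) {
    List<Map<String, Integer>> chunks = new ArrayList<>();
    Map<String, Integer> current = new LinkedHashMap<>();
    chunks.add(current);
    long currentChunkSize = 0;
    int nextId = 0;
    for (String term : terms) {
      if (currentChunkSize > chunkSizeLimit) { // limit exceeded: start a new chunk
        current = new LinkedHashMap<>();
        chunks.add(current);
        currentChunkSize = 0;
      }
      // Estimated entry size: overhead + 2 bytes per char + 4 bytes for the int id.
      currentChunkSize += ENTRY_OVERHEAD_BYTES + term.length() * 2 + Integer.BYTES;
      current.put(term, nextId++); // ids are global, not per chunk
    }
    return chunks;
  }
}
```

Because the size check happens before an entry is appended, a chunk can overshoot the limit by up to one entry.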
2.3) Build partial vectors (TF)
int partialVectorIndex = 0;
Collection<Path> partialVectorPaths = Lists.newArrayList();
for (Path dictionaryChunk : dictionaryChunks) {
  Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
  partialVectorPaths.add(partialVectorOutputPath);
  makePartialVectors(input, baseConf, maxNGramSize, dictionaryChunk, partialVectorOutputPath,
      maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
}
Create a partial vector from the input documents using one chunk of features.
(The dictionary is split across several files, and each file can build only part of the full vector; each part is called a partial vector.)
Input: tokenizedPath  Output: partialVectorPaths
Mapper: (Mapper, the default identity mapper)
TFPartialVectorReducer:
// Read the dictionary file
// MAHOUT-1247
Path dictionaryFile = HadoopUtil.getSingleCachedFile(conf);
// key is word value is id
for (Pair<Writable, IntWritable> record
     : new SequenceFileIterable<Writable, IntWritable>(dictionaryFile, true, conf)) {
  dictionary.put(record.getFirst().toString(), record.getSecond().get());
}
// Convert a document into a sparse vector
StringTuple value = it.next();
Vector vector = new RandomAccessSparseVector(dimension, value.length()); // guess at initial size
for (String term : value.getEntries()) {
  if (!term.isEmpty() && dictionary.containsKey(term)) { // unigram
    int termId = dictionary.get(term);
    vector.setQuick(termId, vector.getQuick(termId) + 1);
  }
}
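A small stand-alone sketch of the reducer's core loop: only terms present in the current dictionary chunk contribute to the partial vector. A sorted map from term id to count stands in for RandomAccessSparseVector, and the names `PartialTfSketch` and `partialVector` are made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: builds a sparse TF vector from one dictionary chunk.
final class PartialTfSketch {

  /** Returns termId -> raw count for the tokens that appear in this dictionary chunk. */
  static Map<Integer, Double> partialVector(List<String> tokens, Map<String, Integer> dictionaryChunk) {
    Map<Integer, Double> vector = new TreeMap<>();
    for (String term : tokens) {
      Integer termId = dictionaryChunk.get(term);
      if (!term.isEmpty() && termId != null) {
        vector.merge(termId, 1.0, Double::sum); // same effect as getQuick(termId) + 1
      }
    }
    return vector;
  }
}
```

Terms that live in a different dictionary chunk are simply skipped here; they are picked up when the same document is processed against that other chunk.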
2.4) Merge partial vectors (TF)
Configuration conf = new Configuration(baseConf);
Path outputDir = new Path(output, tfVectorsFolderName);
PartialVectorMerger.mergePartialVectors(partialVectorPaths, outputDir, conf, normPower, logNormalize,
    maxTermDimension[0], sequentialAccess, namedVectors, numReducers);
Merge all partial {@link org.apache.mahout.math.RandomAccessSparseVector}s into one complete {@link org.apache.mahout.math.RandomAccessSparseVector}.
Input: partialVectorPaths  Output: tfVectorsFolder
Mapper: (Mapper, the default identity mapper)
PartialVectorMergeReducer:
// Merge the partial vectors into the complete TF vector
Vector vector = new RandomAccessSparseVector(dimension, 10);
for (VectorWritable value : values) {
  vector.assign(value.get(), Functions.PLUS); // combine vectors covering different words into one
}
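Since each partial vector covers a disjoint slice of the dictionary, merging by element-wise addition just places every entry into one full vector. A minimal sketch with sorted maps in place of Mahout vectors (the class `MergePartialSketch` is hypothetical):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: element-wise sum of sparse vectors stored as termId -> value maps.
final class MergePartialSketch {

  static Map<Integer, Double> merge(List<Map<Integer, Double>> partials) {
    Map<Integer, Double> full = new TreeMap<>();
    for (Map<Integer, Double> partial : partials) {
      partial.forEach((index, value) -> full.merge(index, value, Double::sum));
    }
    return full;
  }
}
```

If two partials did share an index, their values would be added, which matches the PLUS semantics of the reducer above.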
3. Create IDF vectors (document frequency vectors): multiple Map/Reduce jobs
Pair<Long[], List<Path>> docFrequenciesFeatures = null;
// Should document frequency features be processed
if (shouldPrune || processIdf) {
log.info("Calculating IDF");
docFrequenciesFeatures =
TFIDFConverter.calculateDF(new Path(outputDir, tfDirName), outputDir, conf, chunkSize);
}
3.1) Count document frequencies (DF)
Path wordCountPath = new Path(output, WORDCOUNT_OUTPUT_FOLDER);
startDFCounting(input, wordCountPath, baseConf);
Input: tfDir  Output: featureCountPath
TermDocumentCountMapper:
// Emit a count of 1 for every word in the document, plus a count of 1 for the document itself
Vector vector = value.get();
for (Vector.Element e : vector.nonZeroes()) {
  out.set(e.index());
  context.write(out, ONE);
}
context.write(TOTAL_COUNT, ONE);
Combiner: (TermDocumentCountReducer)
TermDocumentCountReducer:
// Sum the document frequency of each word, and the total number of documents
long sum = 0;
for (LongWritable value : values) {
  sum += value.get();
}
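The trick in this job is that the total document count travels through the same key space as the per-term counts, under a reserved key. The in-memory sketch below assumes -1 as that reserved key and uses maps instead of Mahout vectors; `DfCountSketch` and `countDf` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: document frequency per term id, plus the document total under key -1.
final class DfCountSketch {

  static final int TOTAL_COUNT = -1; // reserved key, assumed here; never a valid term id

  static Map<Integer, Long> countDf(List<Map<Integer, Double>> tfVectors) {
    Map<Integer, Long> df = new TreeMap<>();
    for (Map<Integer, Double> vector : tfVectors) {
      for (Map.Entry<Integer, Double> e : vector.entrySet()) {
        if (e.getValue() != 0.0) {
          df.merge(e.getKey(), 1L, Long::sum); // one per document, regardless of the TF value
        }
      }
      df.merge(TOTAL_COUNT, 1L, Long::sum); // one per document
    }
    return df;
  }
}
```

A term with TF 3.0 in one document still adds only 1 to its document frequency.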
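3.2) Split the document frequencies into chunks
return createDictionaryChunks(wordCountPath, output, baseConf, chunkSizeInMegabytes);
Store the document frequencies in several chunk files, and record the total number of words and the total number of documents.
Input: featureCountPath  Output: dictionaryPathBase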
/**
* Read the document frequency List which is built at the end of the DF Count Job. This will use constant
* memory and will run at the speed of your disk read
*/
private static Pair<Long[], List<Path>> createDictionaryChunks(Path featureCountPath,
                                                               Path dictionaryPathBase,
                                                               Configuration baseConf,
                                                               int chunkSizeInMegabytes) throws IOException {
  List<Path> chunkPaths = Lists.newArrayList();
  Configuration conf = new Configuration(baseConf);
  FileSystem fs = FileSystem.get(featureCountPath.toUri(), conf);
  long chunkSizeLimit = chunkSizeInMegabytes * 1024L * 1024L;
  int chunkIndex = 0;
  Path chunkPath = new Path(dictionaryPathBase, FREQUENCY_FILE + chunkIndex);
  chunkPaths.add(chunkPath);
  SequenceFile.Writer freqWriter =
      new SequenceFile.Writer(fs, conf, chunkPath, IntWritable.class, LongWritable.class);
  try {
    long currentChunkSize = 0;
    long featureCount = 0;
    long vectorCount = Long.MAX_VALUE;
    Path filesPattern = new Path(featureCountPath, OUTPUT_FILES_PATTERN);
    for (Pair<IntWritable,LongWritable> record
         : new SequenceFileDirIterable<IntWritable,LongWritable>(filesPattern,
                                                                 PathType.GLOB,
                                                                 null,
                                                                 null,
                                                                 true,
                                                                 conf)) {
      if (currentChunkSize > chunkSizeLimit) {
        Closeables.close(freqWriter, false);
        chunkIndex++;
        chunkPath = new Path(dictionaryPathBase, FREQUENCY_FILE + chunkIndex);
        chunkPaths.add(chunkPath);
        freqWriter = new SequenceFile.Writer(fs, conf, chunkPath, IntWritable.class, LongWritable.class);
        currentChunkSize = 0;
      }
      int fieldSize = SEQUENCEFILE_BYTE_OVERHEAD + Integer.SIZE / 8 + Long.SIZE / 8;
      currentChunkSize += fieldSize;
      IntWritable key = record.getFirst();
      LongWritable value = record.getSecond();
      if (key.get() >= 0) {
        freqWriter.append(key, value);
      } else if (key.get() == -1) { // total number of documents
        vectorCount = value.get();
      }
      featureCount = Math.max(key.get(), featureCount);
    }
    featureCount++;
    Long[] counts = {featureCount, vectorCount}; // number of words, number of documents
    return new Pair<Long[], List<Path>>(counts, chunkPaths);
  } finally {
    Closeables.close(freqWriter, false);
  }
}
4. Create TF-IDF vectors (Term Frequency-Inverse Document Frequency)
TFIDFConverter.processTfIdf(
new Path(outputDir, DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER),
outputDir, conf, docFrequenciesFeatures, minDf, maxDF, norm, logNormalize,
sequentialAccessOutput, namedVectors, reduceTasks);
4.1) Generate partial vectors (TF-IDF)
int partialVectorIndex = 0;
List<Path> partialVectorPaths = Lists.newArrayList();
List<Path> dictionaryChunks = datasetFeatures.getSecond();
for (Path dictionaryChunk : dictionaryChunks) {
  Path partialVectorOutputPath = new Path(output, VECTOR_OUTPUT_FOLDER + partialVectorIndex++);
  partialVectorPaths.add(partialVectorOutputPath);
  makePartialVectors(input,
                     baseConf,
                     datasetFeatures.getFirst()[0],
                     datasetFeatures.getFirst()[1],
                     minDf,
                     maxDF,
                     dictionaryChunk,
                     partialVectorOutputPath,
                     sequentialAccessOutput,
                     namedVector);
}
Create a partial TF-IDF vector using one chunk of features.
Input: tfVectorsFolder  Output: partialVectorOutputPath
DistributedCache.setCacheFiles(new URI[] {dictionaryFilePath.toUri()}, conf); // cache the DF chunk file
Mapper: (Mapper, the default identity mapper)
TFIDFPartialVectorReducer:
// Compute the TF-IDF value of every word in every document
Vector value = it.next().get();
Vector vector = new RandomAccessSparseVector((int) featureCount, value.getNumNondefaultElements());
for (Vector.Element e : value.nonZeroes()) {
  if (!dictionary.containsKey(e.index())) {
    continue;
  }
  long df = dictionary.get(e.index());
  if (maxDf > -1 && (100.0 * df) / vectorCount > maxDf) {
    continue;
  }
  if (df < minDf) {
    df = minDf;
  }
  vector.setQuick(e.index(), tfidf.calculate((int) e.get(), (int) df, (int) featureCount, (int) vectorCount));
}
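The pruning logic in this reducer can be read as three rules: skip terms outside the current DF chunk, skip terms whose document frequency exceeds maxDf percent of all documents, and raise very small document frequencies to minDf. The sketch below replays those rules on plain maps. The weight itself uses the textbook formula from Section I, which is an assumption made for illustration; the `tfidf.calculate` call above may weight terms differently. All names in the sketch are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: DF-based pruning plus a textbook TF-IDF weight.
final class TfIdfPartialSketch {

  static Map<Integer, Double> weigh(Map<Integer, Double> tfVector,
                                    Map<Integer, Long> dfChunk,
                                    long vectorCount,
                                    long minDf,
                                    long maxDfPercent) {
    Map<Integer, Double> out = new TreeMap<>();
    for (Map.Entry<Integer, Double> e : tfVector.entrySet()) {
      Long boxedDf = dfChunk.get(e.getKey());
      if (boxedDf == null) {
        continue; // term belongs to another DF chunk
      }
      long df = boxedDf;
      if (maxDfPercent > -1 && (100.0 * df) / vectorCount > maxDfPercent) {
        continue; // term is too common: drop it
      }
      if (df < minDf) {
        df = minDf; // clamp rare terms to the minimum document frequency
      }
      // Assumed weight: tf * ln(N / (1 + df)); not necessarily Mahout's formula.
      out.put(e.getKey(), e.getValue() * Math.log((double) vectorCount / (1 + df)));
    }
    return out;
  }
}
```

With 10 documents and `maxDfPercent = 80`, a term found in 9 documents (90%) is dropped, while a term found in 1 document is kept.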
4.2) Merge partial vectors (TF-IDF)
Configuration conf = new Configuration(baseConf);
Path outputDir = new Path(output, DOCUMENT_VECTOR_OUTPUT_FOLDER);
PartialVectorMerger.mergePartialVectors(partialVectorPaths,
                                        outputDir,
                                        baseConf,
                                        normPower,
                                        logNormalize,
                                        datasetFeatures.getFirst()[0].intValue(),
                                        sequentialAccessOutput,
                                        namedVector,
                                        numReducers);
Merge all partial vectors into one complete document vector.
Input: partialVectorOutputPath  Output: outputDir
Mapper: (Mapper, the default identity mapper)
PartialVectorMergeReducer:
// Sum up the partial TF-IDF vectors
Vector vector = new RandomAccessSparseVector(dimension, 10);
for (VectorWritable value : values) {
  vector.assign(value.get(), Functions.PLUS);
}