Hadoop: Chinese Word Segmentation, Word Frequency Counting, and Sorting

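The requirement is as follows.

The input file has the following layout.

The first column is the IP address. After it, the even-numbered columns hold search terms and the numbers (odd-numbered columns) are search counts. Columns are separated by "\t". The task is to segment the search terms into words and count word frequencies. Search counts are ignored here, since they may only reflect paging, and searches for links are ignored as well.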
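The column layout described above can be sketched in plain Java, without Hadoop. This is only an illustration: the class name `SearchLineParser`, the method `extractTerms`, and the sample line are hypothetical and not part of the original program. It assumes the layout `IP \t term \t count \t term \t count ...` and skips terms that start with "http".

```java
import java.util.ArrayList;
import java.util.List;

class SearchLineParser {
    // Returns the search terms found in one tab-separated log line.
    // Assumed layout: IP \t term \t count \t term \t count ...
    static List<String> extractTerms(String line) {
        List<String> terms = new ArrayList<>();
        String[] cols = line.trim().split("\t");
        // cols[0] is the IP address; terms sit at indices 1, 3, 5, ...
        // (the even-numbered columns when counting from 1).
        for (int i = 1; i < cols.length; i += 2) {
            if (cols[i].startsWith("http")) {
                continue; // a searched link, not a search term
            }
            terms.add(cols[i]);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hypothetical sample line, made up for this sketch.
        String line = "10.0.0.1\t青岛 天气\t3\thttp://example.com\t1\t陈奕迅\t2";
        System.out.println(extractTerms(line));
    }
}
```

Each extracted term would then be handed to the segmenter, and every resulting word counted once regardless of the search count next to it.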


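Chinese word segmentation is done with the IK segmentation package; its source code is placed directly in src.

Thanks to the IK project.

The program is as follows: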

package seg;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

/**
 * @author zhf
 * @version created 2014-08-16 15:04:40
 */
public class SegmentTool extends Configured implements Tool{

	public static void main(String[] args) throws Exception {
		int exitCode = ToolRunner.run(new SegmentTool(), args);
		System.exit(exitCode);
	}

	@Override
	public int run(String[] args) throws Exception {
		// ToolRunner has already parsed the generic options into getConf()
		// and passes only the remaining arguments here.
		Configuration conf = getConf();
		if(args.length != 2){
			System.err.println("Usage: seg.SegmentTool <input> <output>");
			return 2;
		}
		Job job = Job.getInstance(conf,"nseg.jar");
		// Remove a stale output directory so the job can be rerun.
		FileSystem fs = FileSystem.get(conf);
		if(fs.exists(new Path(args[1])))
			fs.delete(new Path(args[1]),true);
		job.setJarByClass(SegmentTool.class);
		job.setMapperClass(SegmentMapper.class);
		// Summing is associative, so the reducer also serves as the combiner.
		job.setCombinerClass(SegReducer.class);
		job.setReducerClass(SegReducer.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static class SegmentMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
		private IKSegmenter iks = new IKSegmenter(true);
		private Text word = new Text();
		private final static IntWritable one = new IntWritable(1);

		@Override
		public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException{
			String line = value.toString().trim();
			String[] str = line.split("\t");
			// Index 0 is the IP address; search terms are at indices 1, 3, 5, ...
			for(int i=1;i<str.length;i+=2){
				String tmp = str[i];
				// Skip searches for links.
				if(tmp.startsWith("http"))
					continue;
				List<String> list = segment(tmp);
				for(String s : list){
					word.set(s);
					context.write(word, one);
				}
			}
		}

		// Splits a search term into words with IK.
		private List<String> segment(String str) throws IOException{
			// Read the string directly; a round trip through bytes in the
			// platform default charset can corrupt Chinese text on some nodes.
			iks.reset(new StringReader(str));
			Lexeme lexeme;
			List<String> list = new ArrayList<String>();
			while((lexeme = iks.next()) != null){
				String text = lexeme.getLexemeText();
				list.add(text);
			}
			return list;
		}
	}

	public static class SegReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
		private IntWritable result = new IntWritable();

		@Override
		public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException, InterruptedException{
			int sum = 0;
			for(IntWritable val : values)
				sum += val.get();
			result.set(sum);
			context.write(key, result);
		}
	}
}

The Hadoop environment used is Hadoop 2.3.0-cdh5.0.0.

Three Hadoop-related JARs need to be added to the project: hadoop-mapreduce-client-core-2.0.0-cdh4.6.0.jar, hadoop-common-2.0.0-cdh4.6.0.jar, and commons-cli-1.2.jar.

After packaging, run the command: yarn jar seg.jar seg.SegmentTool /test/user/zhf/input /test/user/zhf/output

Part of the output is shown below:

阿迪达斯        1
附近    2
陈      22
陈乔恩  1
陈奕迅  1
陈毅    2
限额    4
陕西    4
除个别  1
隐私    1
隔壁    1
集成    4
集锦    1
雨中    2
雪      5
露      1
青      7
青岛    2

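The output is not sorted by frequency, however. If the data set is fairly small, it can be sorted with the Linux command: sort -k2 -n -r kw_result.txt > kw_freq.txt

For large data sets, the result can be loaded into Hive, since it now has only two columns. Running hive -e "select key,count from kw_table sort by count desc;" > kw_freq.txt then produces ordered results. Note that in Hive, sort by only orders rows within each reducer; order by is needed for a guaranteed global order when more than one reducer runs.

Alternatively, the output of the previous job can be used as the input of a second job that performs the sort. This requires swapping the key and value in the map output, so that the framework sorts by count.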
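Before the full MapReduce job, the swap-and-sort idea can be sketched in plain Java. This is a simplified in-memory illustration, not the job itself: the class `SwapAndSortSketch` and method `sortByCountDesc` are hypothetical names, and a `TreeMap` with a reversed comparator stands in for the shuffle phase with a descending sort comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SwapAndSortSketch {
    // Takes "word \t count" lines and returns them ordered by count, descending.
    static List<String> sortByCountDesc(List<String> lines) {
        // "Map" step: swap to (count, word). The TreeMap keeps keys in
        // descending order, playing the role of the sorted shuffle.
        Map<Integer, List<String>> grouped =
                new TreeMap<>(Comparator.<Integer>reverseOrder());
        for (String line : lines) {
            String[] kv = line.split("\t");
            int count = Integer.parseInt(kv[1].trim());
            grouped.computeIfAbsent(count, k -> new ArrayList<>()).add(kv[0]);
        }
        // "Reduce" step: swap back to (word, count) while emitting.
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : grouped.entrySet()) {
            for (String word : e.getValue()) {
                out.add(word + "\t" + e.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> counts = Arrays.asList("陈\t22", "青岛\t2", "雪\t5", "附近\t2");
        for (String row : sortByCountDesc(counts)) {
            System.out.println(row);
        }
    }
}
```

Words that share a count land in the same group, just as they would arrive at a single reduce call under the same key.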

The MapReduce code is as follows:

package seg;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author zhf
 * @version created 2014-08-16 16:51:00
 */
public class SortByFrequency extends Configured implements Tool{

	public static void main(String[] args) throws Exception {
		int exitCode = ToolRunner.run(new SortByFrequency(), args);
		System.exit(exitCode);
	}

	@Override
	public int run(String[] args) throws Exception {
		// ToolRunner has already parsed the generic options into getConf()
		// and passes only the remaining arguments here.
		Configuration conf = getConf();
		if(args.length != 2){
			System.err.println("Usage: seg.SortByFrequency <input> <output>");
			return 2;
		}
		Job job = Job.getInstance(conf,"nseg.jar");
		// Remove a stale output directory so the job can be rerun.
		FileSystem fs = FileSystem.get(conf);
		if(fs.exists(new Path(args[1])))
			fs.delete(new Path(args[1]),true);
		job.setJarByClass(SortByFrequency.class);
		job.setMapperClass(SortMapper.class);
		job.setReducerClass(SortReducer.class);
		// Sort the map output keys (the counts) in descending order.
		job.setSortComparatorClass(DescComparator.class);
		// A single reducer is needed for one globally ordered output file.
		job.setNumReduceTasks(1);
		job.setMapOutputKeyClass(IntWritable.class);
		job.setMapOutputValueClass(Text.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static class SortMapper extends Mapper<LongWritable,Text,IntWritable,Text>{
		@Override
		public void map(LongWritable key,Text value,Context context) throws IOException, InterruptedException{
			// Input lines are "word \t count"; emit (count, word).
			String str[] = value.toString().split("\t");
			if(str.length < 2)
				return; // skip malformed lines
			context.write(new IntWritable(Integer.parseInt(str[1].trim())), new Text(str[0]));
		}
	}

	public static class SortReducer extends Reducer<IntWritable,Text,Text,IntWritable>{
		private Text result = new Text();

		@Override
		public void reduce(IntWritable key,Iterable<Text> values,Context context) throws IOException, InterruptedException{
			// Swap back to (word, count) for the final output.
			for(Text val : values){
				result.set(val);
				context.write(result, key);
			}
		}
	}

	// Reverses the natural IntWritable order so larger counts come first.
	public static class DescComparator extends WritableComparator{
		protected DescComparator() {
			super(IntWritable.class,true);
		}

		@Override
		public int compare(byte[] arg0, int arg1, int arg2, byte[] arg3,
				int arg4, int arg5) {
			return -super.compare(arg0, arg1, arg2, arg3, arg4, arg5);
		}

		@Override
		public int compare(Object a,Object b){
			return -super.compare(a, b);
		}
	}
}

Viewing the result with head gives the following:
