Hadoop MapReduce basics, example 1: word count

This example uses MapReduce to implement a simple word-count job.

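1. Preparation: install the Hadoop plugin for Eclipse

Download the hadoop-eclipse-plugin jar that matches your Hadoop version (here hadoop-eclipse-plugin-2.2.0.jar) and put it in eclipse/plugins.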

2. Implementation

Create a new MapReduce project.

The map phase splits each line into words, and the reduce phase counts them.

package tank.demo;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author tank
 * @date: 2015-01-05 10:03:43
 * @description: word counter
 * @version: 0.1
 */
public class WordCount {

    // Mapper: splits each input line into tokens and emits (word, 1) for every token.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all the 1s emitted for a word and writes (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }

        // Job.getInstance replaces the deprecated "new Job(conf, name)" constructor.
        Job job = Job.getInstance(conf, "word count");

        // main class, used to locate the jar
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);

        // map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
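To see what the two phases compute without a cluster, the same logic can be sketched in plain Java. This is only an illustration, not how Hadoop actually runs the job: the class LocalWordCount and its methods map, shuffle, reduce and wordCount are hypothetical names, and an in-memory TreeMap stands in for Hadoop's shuffle-and-sort step.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free sketch of the map -> shuffle -> reduce flow.
public class LocalWordCount {

    // Map phase: emit one (word, 1) pair per whitespace-separated token.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle (simulated): group the emitted values by key, with keys kept sorted.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    // Reduce phase: sum the values collected for a single key.
    public static int reduce(Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three steps over all input lines and returns word -> count.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Arrays.asList("to be or not to be")));
    }
}
```

In the real job, Hadoop performs the grouping between the Mapper and the Reducer itself, which is why WordCount only has to supply the map and reduce methods.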

Package the project as world-count.jar.

3. Prepare the input data

# create the input directory
hadoop fs -mkdir /user/hadoop/input

# write a couple of small data files
echo hello my hadoop this is my first application > file1

echo hello world my deer my applicaiton > file2

# copy them into HDFS
hadoop fs -put file* /user/hadoop/input

# list the directory to check
hadoop fs -ls /user/hadoop/input

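4. Run

Upload the jar to the cluster and run it. Because WordCount is declared in the package tank.demo, pass the fully qualified class name:

hadoop jar world-count.jar tank.demo.WordCount input output

An excerpt of the output: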

15/01/05 11:14:36 INFO mapred.Task: Task:attempt_local1938802295_0001_r_000000_0 is done. And is in the process of committing
15/01/05 11:14:36 INFO mapred.LocalJobRunner:
15/01/05 11:14:36 INFO mapred.Task: Task attempt_local1938802295_0001_r_000000_0 is allowed to commit now
15/01/05 11:14:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1938802295_0001_r_000000_0' to hdfs://192.168.183.130:9000/user/hadoop/output/_temporary/0/task_local1938802295_0001_r_000000
15/01/05 11:14:36 INFO mapred.LocalJobRunner: reduce > reduce
15/01/05 11:14:36 INFO mapred.Task: Task 'attempt_local1938802295_0001_r_000000_0' done.
15/01/05 11:14:36 INFO mapreduce.Job: Job job_local1938802295_0001 running in uber mode : false
15/01/05 11:14:36 INFO mapreduce.Job: map 100% reduce 100%
15/01/05 11:14:36 INFO mapreduce.Job: Job job_local1938802295_0001 completed successfully
15/01/05 11:14:36 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=17706
                FILE: Number of bytes written=597506
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=205
                HDFS: Number of bytes written=85
                HDFS: Number of read operations=25
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=5
        Map-Reduce Framework
                Map input records=2
                Map output records=14
                Map output bytes=136
                Map output materialized bytes=176
                Input split bytes=232
                Combine input records=0
                Combine output records=0
                Reduce input groups=10
                Reduce shuffle bytes=0
                Reduce input records=14
                Reduce output records=10
                Spilled Records=28
                Shuffled Maps =0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=67
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=456536064
        File Input Format Counters
                Bytes Read=80
        File Output Format Counters
                Bytes Written=85
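Some of these counters can be checked by hand against the two sample files. The sketch below is a plain-Java cross-check, not part of the job; the class CounterCheck and its counters method are hypothetical names. It assumes the default text input format, where each line is one map input record, and it hard-codes the two lines written with echo earlier.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical helper that recomputes three counters from the raw input lines.
public class CounterCheck {

    // Returns {map input records, map output records, reduce input groups}.
    public static int[] counters(List<String> lines) {
        int mapOutputRecords = 0;
        Set<String> distinctKeys = new HashSet<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                distinctKeys.add(itr.nextToken()); // one reduce group per distinct word
                mapOutputRecords++;                // one (word, 1) pair per token
            }
        }
        return new int[] { lines.size(), mapOutputRecords, distinctKeys.size() };
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hello my hadoop this is my first application",
                "hello world my deer my applicaiton");
        int[] c = counters(lines);
        System.out.println("Map input records=" + c[0]);   // 2
        System.out.println("Map output records=" + c[1]);  // 14
        System.out.println("Reduce input groups=" + c[2]); // 10
    }
}
```

The values 2, 14 and 10 match the "Map input records", "Map output records" and "Reduce input groups" lines in the log above. No combiner is configured, so all 14 map output records reach the reducer ("Reduce input records=14").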

Check the file in the output directory:

[hadoop@tank1 ~]$ hadoop fs -cat /user/hadoop/output/part-r-00000
applicaiton     1
application     1
deer    1
first   1
hadoop 1
hello   2
is      1
my      4
this    1
world   1

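The word counts are correct! Note that the misspelled "applicaiton" in file2 is counted as a separate word from "application".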
