Hadoop Learning Notes 1: First Look at Hadoop
Background
I need Hadoop for work, so I'm learning it, and this post records the process: how to quickly stand up a Hadoop environment and run a demo.
Setting up the environment
I've seen quite a few Hadoop setup guides online, but they all look fairly involved: install Java, install Hadoop, then work through all kinds of configuration, with plenty of parameters and variables whose meaning is unclear to a beginner. My goal is simple: stand up a working environment in the easiest way possible. The configuration details don't matter much to me at this stage; I just need to run one simple demo of my own. And in practice I won't be the one maintaining the environment anyway, so simple is all I need.
I've been reading up on Docker lately anyway, so I decided to use Docker to build this environment and learn Hadoop and Docker at the same time.
First, install Docker. It's very simple, so I won't go into detail here; the official site provides a one-line install script.
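For reference, that convenience script can be fetched and run in one line (it's Docker's official script, but inspect anything before piping it to a shell):
curl -fsSL https://get.docker.com | sh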
Docker Hub has a ready-made Hadoop image:
https://hub.docker.com/r/sequenceiq/hadoop-docker/
I tweaked the suggested command slightly:
I mounted an extra directory, because I want to copy my own demo jar into the container and run it with Hadoop.
I also named the container hadoop2. I run a lot of containers, so names help tell them apart, and later I may use several Hadoop containers to build a cluster.
docker run -it -v /dockerVolumes/hadoop2:/dockerVolume --name hadoop2 sequenceiq/hadoop-docker:2.7.0 /etc/bootstrap.sh -bash
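This command drops you straight into a shell inside the container. Two standard Docker commands are handy here as well (a sketch; the container name hadoop2 comes from the command above):
docker ps --filter name=hadoop2   # from another host shell: confirm the container is up
docker exec -it hadoop2 bash      # reattach with a new shell later if you exit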
Once that command completes, the container is up and running. Let's try the official example.
cd $HADOOP_PREFIX
# run the mapreduce grep example
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
# check the output
bin/hdfs dfs -cat output/*
Output (note that the grep example actually submits two MapReduce jobs, the search itself followed by a job that sorts the results, which is why two job IDs appear below):
bash-4.1# clear
bash-4.1# bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0.jar grep input output 'dfs[a-z.]+'
18/06/11 07:35:38 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/11 07:35:39 INFO input.FileInputFormat: Total input paths to process : 31
18/06/11 07:35:39 INFO mapreduce.JobSubmitter: number of splits:31
18/06/11 07:35:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0007
18/06/11 07:35:40 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0007
18/06/11 07:35:40 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0007/
18/06/11 07:35:40 INFO mapreduce.Job: Running job: job_1528635021541_0007
18/06/11 07:35:45 INFO mapreduce.Job: Job job_1528635021541_0007 running in uber mode : false
18/06/11 07:35:45 INFO mapreduce.Job: map 0% reduce 0%
18/06/11 07:36:02 INFO mapreduce.Job: map 10% reduce 0%
18/06/11 07:36:03 INFO mapreduce.Job: map 19% reduce 0%
18/06/11 07:36:19 INFO mapreduce.Job: map 35% reduce 0%
18/06/11 07:36:20 INFO mapreduce.Job: map 39% reduce 0%
18/06/11 07:36:33 INFO mapreduce.Job: map 42% reduce 0%
18/06/11 07:36:35 INFO mapreduce.Job: map 55% reduce 0%
18/06/11 07:36:36 INFO mapreduce.Job: map 55% reduce 15%
18/06/11 07:36:39 INFO mapreduce.Job: map 55% reduce 18%
18/06/11 07:36:45 INFO mapreduce.Job: map 58% reduce 18%
18/06/11 07:36:46 INFO mapreduce.Job: map 61% reduce 18%
18/06/11 07:36:47 INFO mapreduce.Job: map 65% reduce 18%
18/06/11 07:36:48 INFO mapreduce.Job: map 65% reduce 22%
18/06/11 07:36:49 INFO mapreduce.Job: map 71% reduce 22%
18/06/11 07:36:51 INFO mapreduce.Job: map 71% reduce 24%
18/06/11 07:36:57 INFO mapreduce.Job: map 74% reduce 24%
18/06/11 07:36:59 INFO mapreduce.Job: map 77% reduce 24%
18/06/11 07:37:00 INFO mapreduce.Job: map 77% reduce 26%
18/06/11 07:37:01 INFO mapreduce.Job: map 84% reduce 26%
18/06/11 07:37:03 INFO mapreduce.Job: map 87% reduce 28%
18/06/11 07:37:06 INFO mapreduce.Job: map 87% reduce 29%
18/06/11 07:37:08 INFO mapreduce.Job: map 90% reduce 29%
18/06/11 07:37:09 INFO mapreduce.Job: map 94% reduce 29%
18/06/11 07:37:11 INFO mapreduce.Job: map 100% reduce 29%
18/06/11 07:37:12 INFO mapreduce.Job: map 100% reduce 100%
18/06/11 07:37:12 INFO mapreduce.Job: Job job_1528635021541_0007 completed successfully
18/06/11 07:37:12 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=345
FILE: Number of bytes written=3697476
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=80529
HDFS: Number of bytes written=437
HDFS: Number of read operations=96
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=31
Launched reduce tasks=1
Data-local map tasks=31
Total time spent by all maps in occupied slots (ms)=400881
Total time spent by all reduces in occupied slots (ms)=52340
Total time spent by all map tasks (ms)=400881
Total time spent by all reduce tasks (ms)=52340
Total vcore-seconds taken by all map tasks=400881
Total vcore-seconds taken by all reduce tasks=52340
Total megabyte-seconds taken by all map tasks=410502144
Total megabyte-seconds taken by all reduce tasks=53596160
Map-Reduce Framework
Map input records=2060
Map output records=24
Map output bytes=590
Map output materialized bytes=525
Input split bytes=3812
Combine input records=24
Combine output records=13
Reduce input groups=11
Reduce shuffle bytes=525
Reduce input records=13
Reduce output records=11
Spilled Records=26
Shuffled Maps =31
Failed Shuffles=0
Merged Map outputs=31
GC time elapsed (ms)=2299
CPU time spent (ms)=11090
Physical memory (bytes) snapshot=8178929664
Virtual memory (bytes) snapshot=21830377472
Total committed heap usage (bytes)=6461849600
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=76717
File Output Format Counters
Bytes Written=437
18/06/11 07:37:12 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/11 07:37:12 INFO input.FileInputFormat: Total input paths to process : 1
18/06/11 07:37:12 INFO mapreduce.JobSubmitter: number of splits:1
18/06/11 07:37:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0008
18/06/11 07:37:12 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0008
18/06/11 07:37:12 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0008/
18/06/11 07:37:12 INFO mapreduce.Job: Running job: job_1528635021541_0008
18/06/11 07:37:24 INFO mapreduce.Job: Job job_1528635021541_0008 running in uber mode : false
18/06/11 07:37:24 INFO mapreduce.Job: map 0% reduce 0%
18/06/11 07:37:29 INFO mapreduce.Job: map 100% reduce 0%
18/06/11 07:37:35 INFO mapreduce.Job: map 100% reduce 100%
18/06/11 07:37:35 INFO mapreduce.Job: Job job_1528635021541_0008 completed successfully
18/06/11 07:37:35 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=291
FILE: Number of bytes written=230541
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=569
HDFS: Number of bytes written=197
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3210
Total time spent by all reduces in occupied slots (ms)=3248
Total time spent by all map tasks (ms)=3210
Total time spent by all reduce tasks (ms)=3248
Total vcore-seconds taken by all map tasks=3210
Total vcore-seconds taken by all reduce tasks=3248
Total megabyte-seconds taken by all map tasks=3287040
Total megabyte-seconds taken by all reduce tasks=3325952
Map-Reduce Framework
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=132
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=291
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=55
CPU time spent (ms)=1090
Physical memory (bytes) snapshot=415494144
Virtual memory (bytes) snapshot=1373601792
Total committed heap usage (bytes)=354942976
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=437
File Output Format Counters
Bytes Written=197
As you can see, with Docker, installing Hadoop takes a single command and the official example runs successfully. Extremely simple.
Running my own demo
I wrote a small demo of my own: it reads the text in a txt file and counts the characters.
1. First, put a txt file into HDFS:
For HDFS command usage, see https://blog.csdn.net/zhaojw_420/article/details/53161624
hdfs dfs -put in.txt /myinput/in.txt
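Note that -put does not create missing parent directories. A sketch of the surrounding steps, assuming you are in $HADOOP_PREFIX inside the container:
bin/hdfs dfs -mkdir -p /myinput   # create the target directory first
bin/hdfs dfs -ls /myinput         # verify the file is there after -put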
2. Write your own mapper and reducer
The code is at https://gitee.com/abcwt112/hadoopDemo
See MyFirstMapper, MyFirstReducer, and MyFirstStarter in that repo. Here are the three classes, lightly cleaned up:
package demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MyFirstReducer extends Reducer<IntWritable, IntWritable, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // sum the per-line character counts emitted by the mapper
        int total = 0;
        for (IntWritable value : values) {
            total += value.get();
        }
        context.write(new IntWritable(1), new IntWritable(total));
    }
}
package demo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MyFirstMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // emit the length of each line (TextInputFormat strips the newline) under one shared key
        String line = value.toString();
        context.write(new IntWritable(0), new IntWritable(line.length()));
    }
}
package demo;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MyFirstStarter {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // new Job() is deprecated in Hadoop 2.x; Job.getInstance() is the newer form, but this works
        Job job = new Job();
        job.setJarByClass(MyFirstStarter.class);
        job.setJobName("============ My First Job ==============");

        FileInputFormat.addInputPath(job, new Path("/myinput/in.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/myout"));

        job.setMapperClass(MyFirstMapper.class);
        job.setReducerClass(MyFirstReducer.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
After running mvn package, drop the resulting jar into /dockerVolumes/hadoop2 on the Linux host. Because that directory is mounted into the container, the jar automatically appears inside hadoop2.
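Concretely, that step looks something like this on the host (a sketch, assuming the default Maven project layout; the jar name matches the run command in step 3):
mvn clean package
cp target/hadoopDemo-1.0-SNAPSHOT.jar /dockerVolumes/hadoop2/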
One more note: the MANIFEST.MF in the jar produced by mvn package originally specified no main class, so Hadoop kept failing to find the entry point. With a colleague's help I learned this can be fixed through Maven configuration (there is also a command-line workaround, noted after the snippet):
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>${mainClass}</mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
<properties>
    <mainClass>demo.MyFirstStarter</mainClass>
</properties>
Also, the JDK inside this Hadoop Docker image is 1.7 while my own environment is 1.8, so in the pom I additionally set the compiler to source/target 1.7 (the maven-compiler-plugin section above).
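As an aside, if you would rather not touch the manifest at all, hadoop jar also accepts the main class as an argument after the jar path, and you can check the image's JDK from the host. A sketch (assuming java is on the container's non-interactive PATH):
docker exec hadoop2 java -version   # on the host: confirm the bundled JDK is 1.7
bin/hadoop jar /dockerVolume/hadoopDemo-1.0-SNAPSHOT.jar demo.MyFirstStarter   # inside the container: pass the main class explicitly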
3. Run the demo inside the hadoop2 container.
From the $HADOOP_PREFIX directory, run: bin/hadoop jar /dockerVolume/hadoopDemo-1.0-SNAPSHOT.jar
bash-4.1# bin/hadoop jar /dockerVolume/hadoopDemo-1.0-SNAPSHOT.jar
18/06/11 07:54:11 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/06/11 07:54:12 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
18/06/11 07:54:13 INFO input.FileInputFormat: Total input paths to process : 1
18/06/11 07:54:13 INFO mapreduce.JobSubmitter: number of splits:1
18/06/11 07:54:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1528635021541_0009
18/06/11 07:54:13 INFO impl.YarnClientImpl: Submitted application application_1528635021541_0009
18/06/11 07:54:13 INFO mapreduce.Job: The url to track the job: http://e1bed6899d06:8088/proxy/application_1528635021541_0009/
18/06/11 07:54:13 INFO mapreduce.Job: Running job: job_1528635021541_0009
18/06/11 07:54:20 INFO mapreduce.Job: Job job_1528635021541_0009 running in uber mode : false
18/06/11 07:54:20 INFO mapreduce.Job: map 0% reduce 0%
18/06/11 07:54:25 INFO mapreduce.Job: map 100% reduce 0%
18/06/11 07:54:31 INFO mapreduce.Job: map 100% reduce 100%
18/06/11 07:54:31 INFO mapreduce.Job: Job job_1528635021541_0009 completed successfully
18/06/11 07:54:31 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=1606
FILE: Number of bytes written=232725
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=6940
HDFS: Number of bytes written=7
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=3059
Total time spent by all reduces in occupied slots (ms)=3265
Total time spent by all map tasks (ms)=3059
Total time spent by all reduce tasks (ms)=3265
Total vcore-seconds taken by all map tasks=3059
Total vcore-seconds taken by all reduce tasks=3265
Total megabyte-seconds taken by all map tasks=3132416
Total megabyte-seconds taken by all reduce tasks=3343360
Map-Reduce Framework
Map input records=160
Map output records=160
Map output bytes=1280
Map output materialized bytes=1606
Input split bytes=104
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=1606
Reduce input records=160
Reduce output records=1
Spilled Records=320
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=43
CPU time spent (ms)=1140
Physical memory (bytes) snapshot=434499584
Virtual memory (bytes) snapshot=1367728128
Total committed heap usage (bytes)=354942976
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=6836
File Output Format Counters
Bytes Written=7
It ran successfully!
Checking the output:
bash-4.1# bin/hdfs dfs -ls /myout
Found 2 items
-rw-r--r-- 1 root supergroup 0 2018-06-11 07:54 /myout/_SUCCESS
-rw-r--r-- 1 root supergroup 7 2018-06-11 07:54 /myout/part-r-00000
bash-4.1# bin/hdfs dfs -cat /myout/part-r-00000
1 6676
bash-4.1#
6676 characters in total.

Where does that number come from? The job read 6836 bytes, and the mapper's line.length() does not count the newline at the end of each of the 160 input lines: 6836 - 160 newline characters = 6676.
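You can sanity-check those numbers with wc, assuming a copy of in.txt is still sitting in the mounted directory (a sketch; the path depends on where you kept the file):
wc -c /dockerVolumes/hadoop2/in.txt   # total bytes, newlines included (6836 here)
wc -l /dockerVolumes/hadoop2/in.txt   # newline count (160 here)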
My own demo runs successfully!