I. Custom InputFormat/OutputFormat

  1. Requirement

  Some raw logs need to be enhanced during parsing. The workflow:

    1. Read the data from the raw log files

    2. Use the URL field in each log record to look up information in an external knowledge base and append it to the original record

    3. If the enhancement succeeds, write the record to the enhanced-result directory; if it fails, extract the URL field from the raw record and write it to the to-crawl directory

1374609560.11    1374609560.16    1374609560.16    1374609560.16    110    5    8615038208365    460023383869133    8696420056841778    2    460    0    14615            54941    10.188.77.252    61.145.116.27    35020    80    6    cmnet    1    221.177.218.34    221.177.217.161    221.177.218.34    221.177.217.167    ad.veegao.com    http://ad.veegao.com/veegao/iris.action        Apache-HttpClient/UNAVAILABLE (java 1.4)    POST    200    593    310    4    3    0    0    4    3    0    0    0    0    http://ad.veegao.com/veegao/iris.action    5903903079251243019    5903903103500771339    5980728
1374609558.91 1374609558.97 1374609558.97 1374609559.31 112 461 8615038208365 460023383869133 8696420056841778 2 460 0 14615 54941 10.188.77.252 101.226.76.175 37293 80 6 cmnet 1 221.177.218.34 221.177.217.161 221.177.218.34 221.177.217.167 short.weixin.qq.com http://short.weixin.qq.com/cgi-bin/micromsg-bin/getcdndns Android QQMail HTTP Client POST 200 543 563 2 3 0 0 2 3 0 0 0 0 http://short.weixin.qq.com/cgi-bin/micromsg-bin/getcdndns 5903903079251243019 5903903097240039435 5980728
1374609514.70 1374609514.75 1374609514.75 1374609515.58 110 5 8613674976196 460004901700207 8623350100353878 2 460 0 14694 58793 10.184.80.32 111.13.13.222 36181 80 6 cmnet 1 221.177.156.4 221.177.217.145 221.177.156.4 221.177.217.156 retype.wenku.bdimg.com http://retype.wenku.bdimg.com/img/97308d2b7375a417866f8f09 AMB_400 GET 200 345 4183 5 5 0 0 5 5 0 0 0 0 http://retype.wenku.bdimg.com/img/97308d2b7375a417866f8f09 5903900710696611851 5903902908140003339 5937307
1374609511.98 1374609512.02 1374609512.02 1374609512.48 110 362 8613674976196 460004901700207 8623350100353878 2 460 0 14694 58793 10.184.80.32 120.204.207.160 33548 80 6 cmnet 1 221.177.156.4 221.177.217.145 221.177.156.4 221.177.217.156 t4.qpic.cn http://t4.qpic.cn/mblogpic/217cf24d43f1f19255e2/120 AMB_400 GET 200 346 3184 4 4 0 0 4 4 0 0 0 0 http://t4.qpic.cn/mblogpic/217cf24d43f1f19255e2/120 5903900710696611851 5903902896317288459 5937307
1374609518.14 1374609518.24 1374609518.24 1374609518.72 110 362 8613674976196 460004901700207 8623350100353878 2 460 0 14694 58793 10.184.80.32 120.204.207.160 33548 80 6 cmnet 1 221.177.156.4 221.177.217.145 221.177.156.4 221.177.217.156 t4.qpic.cn http://t4.qpic.cn/mblogpic/96e02ad781c9be6f5ad2/120 AMB_400 GET 200 346 3328 4 4 0 0 4 4 0 0 0 0 http://t4.qpic.cn/mblogpic/96e02ad781c9be6f5ad2/120 5903900710696611851 5903902896317288459 5937307

Log sample

  2. Analysis

    The key point is that, within a single MapReduce job, two kinds of results must be written to different directories depending on the data. This kind of flexible output requirement can be implemented with a custom OutputFormat.

    What differs from earlier examples is that information has to be pulled from a database; the example below uses plain JDBC. To keep things simple we could also use DbUtils, initializing the lookup in the mapper's setup() method.
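    As a sketch only (the complete code later in this section sticks to plain JDBC), a loader based on Apache Commons DbUtils might look roughly like the following. It reuses the url_rule table and connection settings of the JDBC version; the class name DbUtilsLoader and the use of DbUtils itself are illustrative assumptions, not part of the original example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.List;
import java.util.Map;

import org.apache.commons.dbutils.DbUtils;
import org.apache.commons.dbutils.QueryRunner;
import org.apache.commons.dbutils.handlers.MapListHandler;

public class DbUtilsLoader {

    /** Load url -> content rules into ruleMap using commons-dbutils (illustrative sketch). */
    public static void load(Map<String, String> ruleMap) throws Exception {
        Connection conn = null;
        try {
            Class.forName("com.mysql.jdbc.Driver");
            conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/urldb", "root", "root");
            QueryRunner runner = new QueryRunner();
            // MapListHandler returns one Map per row, keyed by column name
            List<Map<String, Object>> rows =
                    runner.query(conn, "select url, content from url_rule", new MapListHandler());
            for (Map<String, Object> row : rows) {
                ruleMap.put((String) row.get("url"), (String) row.get("content"));
            }
        } finally {
            DbUtils.closeQuietly(conn); // closes the connection, swallowing close-time exceptions
        }
    }
}

    Called from the mapper's setup(), this replaces the hand-written ResultSet loop and the nested try/finally of the JDBC version.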

  3. Code

  To save some effort, no test database is actually set up here. The key point is the custom OutputFormat.

  By default TextOutputFormat is used, so before writing our own it is worth studying the default implementation first:

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
 */
package org.apache.hadoop.mapreduce.lib.output;

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.util.*;

/** An {@link OutputFormat} that writes plain text files. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
  public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";

  protected static class LineRecordWriter<K, V> extends RecordWriter<K, V> {
    private static final String utf8 = "UTF-8";
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    protected DataOutputStream out;
    private final byte[] keyValueSeparator;

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }

    public LineRecordWriter(DataOutputStream out) {
      this(out, "\t");
    }

    /**
     * Write the object to the byte stream, handling Text as a special
     * case.
     * @param o the object to print
     * @throws IOException if the write throws, we pass it on
     */
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength());
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }

    public synchronized void write(K key, V value) throws IOException {
      boolean nullKey = key == null || key instanceof NullWritable;
      boolean nullValue = value == null || value instanceof NullWritable;
      if (nullKey && nullValue) {
        return;
      }
      if (!nullKey) {
        writeObject(key);
      }
      if (!(nullKey || nullValue)) {
        out.write(keyValueSeparator);
      }
      if (!nullValue) {
        writeObject(value);
      }
      out.write(newline);
    }

    public synchronized void close(TaskAttemptContext context) throws IOException {
      out.close();
    }
  }

  public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job)
      throws IOException, InterruptedException {
    Configuration conf = job.getConfiguration();
    boolean isCompressed = getCompressOutput(job);
    String keyValueSeparator = conf.get(SEPERATOR, "\t");
    CompressionCodec codec = null;
    String extension = "";
    if (isCompressed) {
      Class<? extends CompressionCodec> codecClass =
          getOutputCompressorClass(job, GzipCodec.class);
      codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
      extension = codec.getDefaultExtension();
    }
    Path file = getDefaultWorkFile(job, extension);
    FileSystem fs = file.getFileSystem(conf);
    if (!isCompressed) {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
    } else {
      FSDataOutputStream fileOut = fs.create(file, false);
      return new LineRecordWriter<K, V>(
          new DataOutputStream(codec.createOutputStream(fileOut)),
          keyValueSeparator);
    }
  }
}

TextOutputFormat

  Some references and examples: https://www.cnblogs.com/liuming1992/p/4758504.html

       http://blog.csdn.net/woshisap/article/details/42320129

        http://chengjianxiaoxue.iteye.com/blog/2163284  --> recommended

  4. Custom InputFormat

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    // Mark every small file as non-splittable so each file becomes exactly one key-value pair
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        WholeFileRecordReader reader = new WholeFileRecordReader();
        reader.initialize(split, context);
        return reader;
    }
}

WholeFileInputFormat

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
    private FileSplit fileSplit;
    private Configuration conf;
    private BytesWritable value = new BytesWritable();
    private boolean processed = false;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            // Read the whole file into memory as the single value of this split
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException {
        return processed ? 1.0f : 0.0f;
    }

    @Override
    public void close() throws IOException {
        // nothing to do
    }
}

WholeFileRecordReader
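
  For context, here is a minimal, illustrative sketch of how such a whole-file InputFormat is typically wired into a job — in this case packing many small files into one SequenceFile. The class name SmallFilesToSequenceFile and the input/output paths are assumptions made for illustration, and the code assumes it sits alongside the WholeFileInputFormat shown above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SmallFilesToSequenceFile {

    static class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
        private Text filenameKey = new Text();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Use the full path of the source file as the output key
            InputSplit split = context.getInputSplit();
            Path path = ((FileSplit) split).getPath();
            filenameKey.set(path.toString());
        }

        @Override
        protected void map(NullWritable key, BytesWritable value, Context context)
                throws IOException, InterruptedException {
            context.write(filenameKey, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(SmallFilesToSequenceFile.class);

        // Each whole file becomes a single record thanks to the custom InputFormat above
        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setMapperClass(SequenceFileMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);
        job.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(job, new Path("D:/srcdata/smallfiles/"));
        FileOutputFormat.setOutputPath(job, new Path("D:/temp/seqout/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}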

    References: http://blog.csdn.net/woshixuye/article/details/53557487

       http://irwenqiang.iteye.com/blog/1448164

       http://m635674608.iteye.com/blog/2243076

  Complete code:

package cn.itcast.bigdata.mr.logenhance;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class DBLoader {

    public static void dbLoader(Map<String, String> ruleMap) throws Exception {
        Connection conn = null;
        Statement st = null;
        ResultSet res = null;
        try {
            Class.forName("com.mysql.jdbc.Driver");
            conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/urldb", "root", "root");
            st = conn.createStatement();
            res = st.executeQuery("select url,content from url_rule");
            while (res.next()) {
                ruleMap.put(res.getString(1), res.getString(2));
            }
        } finally {
            try {
                if (res != null) {
                    res.close();
                }
                if (st != null) {
                    st.close();
                }
                if (conn != null) {
                    conn.close();
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

DBLoader

package cn.itcast.bigdata.mr.logenhance;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogEnhance {

    static class LogEnhanceMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        Map<String, String> ruleMap = new HashMap<String, String>();
        Text k = new Text();
        NullWritable v = NullWritable.get();

        // Load the rule data from the database into ruleMap
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            try {
                DBLoader.dbLoader(ruleMap);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Get a counter to record malformed log lines; the arguments are group name, counter name
            Counter counter = context.getCounter("malformed", "malformedline");
            String line = value.toString();
            String[] fields = StringUtils.split(line, "\t");
            try {
                String url = fields[26];
                String content_tag = ruleMap.get(url);
                // If no content tag is found, only output the url to the to-crawl list;
                // otherwise output the enhanced log line
                if (content_tag == null) {
                    k.set(url + "\t" + "tocrawl" + "\n");
                    context.write(k, v);
                } else {
                    k.set(line + "\t" + content_tag + "\n");
                    context.write(k, v);
                }
            } catch (Exception exception) {
                counter.increment(1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(LogEnhance.class);
        job.setMapperClass(LogEnhanceMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // To route different records to different target paths, use the custom OutputFormat
        job.setOutputFormatClass(LogEnhanceOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path("D:/srcdata/webloginput/"));
        // Although a custom OutputFormat is used, it extends FileOutputFormat,
        // which must write a _SUCCESS file, so an output path still has to be set
        FileOutputFormat.setOutputPath(job, new Path("D:/temp/output/"));

        // No reducer is needed
        job.setNumReduceTasks(0);

        job.waitForCompletion(true);
        System.exit(0);
    }
}

LogEnhance

package cn.itcast.bigdata.mr.logenhance;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * When a map task or reduce task writes its final output, it first calls
 * the OutputFormat's getRecordWriter() method to obtain a RecordWriter,
 * then calls RecordWriter.write(k, v) to write each record out.
 *
 * @author
 */
public class LogEnhanceOutputFormat extends FileOutputFormat<Text, NullWritable> {

    @Override
    public RecordWriter<Text, NullWritable> getRecordWriter(TaskAttemptContext context) throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());

        // Note: fixed paths like these only work while a single task writes output;
        // a production version would derive per-task paths (e.g. from the task attempt id)
        Path enhancePath = new Path("D:/temp/en/log.dat");
        Path tocrawlPath = new Path("D:/temp/crw/url.dat");

        FSDataOutputStream enhancedOs = fs.create(enhancePath);
        FSDataOutputStream tocrawlOs = fs.create(tocrawlPath);

        return new EnhanceRecordWriter(enhancedOs, tocrawlOs);
    }

    /**
     * Our own RecordWriter implementation
     *
     * @author
     */
    static class EnhanceRecordWriter extends RecordWriter<Text, NullWritable> {
        FSDataOutputStream enhancedOs = null;
        FSDataOutputStream tocrawlOs = null;

        public EnhanceRecordWriter(FSDataOutputStream enhancedOs, FSDataOutputStream tocrawlOs) {
            super();
            this.enhancedOs = enhancedOs;
            this.tocrawlOs = tocrawlOs;
        }

        @Override
        public void write(Text key, NullWritable value) throws IOException, InterruptedException {
            String result = key.toString();
            // If the record is a to-crawl url, write it to the to-crawl list file /logenhance/tocrawl/url.dat
            if (result.contains("tocrawl")) {
                tocrawlOs.write(result.getBytes());
            } else {
                // Otherwise it is an enhanced log line; write it to the enhanced log file /logenhance/enhancedlog/log.dat
                enhancedOs.write(result.getBytes());
            }
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            if (tocrawlOs != null) {
                tocrawlOs.close();
            }
            if (enhancedOs != null) {
                enhancedOs.close();
            }
        }
    }
}

LogEnhanceOutputFormat

  Others: to be added...

II. Counters and multi-job chaining

 1. Counters

 MapReduce counters (Counter) give us a window onto all kinds of detailed runtime data of a MapReduce job. They are very helpful for performance tuning: most evaluation of MapReduce performance optimizations is based on the values these counters report. They can also be used to record global statistics.
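
 As a small illustration (not part of the original example), the "malformed"/"malformedline" counter incremented in the LogEnhance mapper above can be read back in the driver once the job has finished; the helper class name here is made up.

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CounterReport {

    /** Print the malformed-line counter that the LogEnhance mapper increments. */
    public static void report(Job job) throws Exception {
        // job.getCounters() aggregates the counters from all map and reduce tasks
        Counters counters = job.getCounters();
        Counter malformed = counters.findCounter("malformed", "malformedline");
        System.out.println("malformed lines: " + malformed.getValue());
    }
}

 In the LogEnhance driver this would be called between job.waitForCompletion(true) and System.exit(0).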

  Related introductions and references: http://blog.csdn.net/xw_classmate/article/details/50954384

           https://www.cnblogs.com/codeOfLife/p/5521356.html

  2. Multi-job chaining

    A slightly more complex processing flow often needs several MapReduce programs chained together. Multi-job chaining can be implemented with the JobControl class provided by the MapReduce framework.

      — It is rarely used in practice, because dependencies hard-coded into the jobs are inflexible; driving the jobs from a shell script is usually recommended instead.

    A hand-rolled implementation can tell the jobs apart by jobName and manage submission and dependencies itself; a dependency simply means that the output of one M/R job is the input of another.

    For an example of such a hand-rolled implementation, see https://www.cnblogs.com/yjmyzz/p/4540469.html

    Controlling job dependencies with JobControl:

    Core code:

ControlledJob cJob1 = new ControlledJob(job1.getConfiguration());
ControlledJob cJob2 = new ControlledJob(job2.getConfiguration());
ControlledJob cJob3 = new ControlledJob(job3.getConfiguration());

cJob1.setJob(job1);
cJob2.setJob(job2);
cJob3.setJob(job3);

// Declare the dependencies between the jobs
cJob2.addDependingJob(cJob1);
cJob3.addDependingJob(cJob2);

JobControl jobControl = new JobControl("RecommendationJob");
jobControl.addJob(cJob1);
jobControl.addJob(cJob2);
jobControl.addJob(cJob3);

// Run the jobs added to the JobControl in a new thread, then wait for them all to finish
Thread jobControlThread = new Thread(jobControl);
jobControlThread.start();
while (!jobControl.allFinished()) {
    Thread.sleep(500);
}
jobControl.stop();

return 0;

    More complete examples: http://blog.csdn.net/sven119/article/details/78806380

           http://mntms.iteye.com/blog/2096456  --> recommended

III. Data compression

  

   Typical usage:

    Compressing mapper output:

new API:

Configuration conf = new Configuration();
conf.setBoolean(Job.MAP_OUTPUT_COMPRESS, true);
conf.setClass(Job.MAP_OUTPUT_COMPRESS_CODEC, GzipCodec.class, CompressionCodec.class);
Job job = new Job(conf);

old API:

conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);

    Compressing reducer output:

      Via configuration (the values shown are the defaults):

mapreduce.output.fileoutputformat.compress=false
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
mapreduce.output.fileoutputformat.compress.type=RECORD

     Via code:

// Compress the reduce output files
FileOutputFormat.setCompressOutput(job, true);                    // enable output compression for this job
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);  // choose the compression codec
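
     One detail worth noting: the compress.type property above only matters when the output is a SequenceFile. A hedged sketch of setting it programmatically follows; the helper class name and the choice of Gzip are illustrative assumptions.

import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileCompression {

    /** Enable block-compressed SequenceFile output for the given job (illustrative sketch). */
    public static void configure(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // RECORD compresses values one by one; BLOCK compresses batches of records
        SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    }
}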

   More references: https://www.cnblogs.com/ggjucheng/archive/2012/04/22/2465580.html

IV. Common MR configuration parameters for tuning

  1. Resource-related parameters

// The following parameters take effect when set in the user's own MR application
(1) mapreduce.map.memory.mb: the resource limit (in MB) for one Map Task, default 1024. If a Map Task actually uses more than this, it is killed.
(2) mapreduce.reduce.memory.mb: the resource limit (in MB) for one Reduce Task, default 1024. If a Reduce Task actually uses more than this, it is killed.
(3) mapreduce.map.java.opts: JVM options for Map Tasks, where you can set the Java heap size etc., e.g.
"-Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc" (@taskid@ is automatically replaced by the framework with the actual task id). Default: ""
(4) mapreduce.reduce.java.opts: JVM options for Reduce Tasks, where you can set the Java heap size etc., e.g.
"-Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc". Default: ""
(5) mapreduce.map.cpu.vcores: maximum number of CPU cores each Map Task may use, default 1
(6) mapreduce.reduce.cpu.vcores: maximum number of CPU cores each Reduce Task may use, default 1

// The following must be set in the server configuration files before YARN starts
(7) yarn.scheduler.minimum-allocation-mb    1024    minimum memory allocated to an application container
(8) yarn.scheduler.maximum-allocation-mb    8192    maximum memory allocated to an application container
(9) yarn.scheduler.minimum-allocation-vcores    1
(10) yarn.scheduler.maximum-allocation-vcores   32
(11) yarn.nodemanager.resource.memory-mb    8192

// Key parameters for shuffle performance tuning, to be configured before YARN starts
(12) mapreduce.task.io.sort.mb    100    // size of the shuffle's circular buffer, default 100 MB
(13) mapreduce.map.sort.spill.percent    0.8    // spill threshold of the circular buffer, default 80%
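
  As a rough illustration of setting a few of the job-level parameters per job in the driver (the class name and the numeric values are assumptions, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ResourceTuningExample {

    /** Build a Job with illustrative per-job resource settings. */
    public static Job build() throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.memory.mb", 2048);         // container memory for each map task
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");     // heap must fit inside the container
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        conf.setInt("mapreduce.task.io.sort.mb", 200);        // larger spill buffer for the shuffle
        return Job.getInstance(conf);
    }
}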

  2. Fault-tolerance parameters

(1) mapreduce.map.maxattempts: maximum number of attempts for each Map Task; once the retries exceed this value, the Map Task is considered failed. Default: 4.
(2) mapreduce.reduce.maxattempts: maximum number of attempts for each Reduce Task; once the retries exceed this value, the Reduce Task is considered failed. Default: 4.
(3) mapreduce.map.failures.maxpercent: if the proportion of failed Map Tasks exceeds this value, the whole job fails. Default: 0. If your application can tolerate dropping part of the input data, set it to a value greater than 0, e.g. 5, meaning the job is still considered successful as long as fewer than 5% of the Map Tasks fail (a Map Task counts as failed when its retries exceed mapreduce.map.maxattempts; its input data then produces no output at all).
(4) mapreduce.reduce.failures.maxpercent: if the proportion of failed Reduce Tasks exceeds this value, the whole job fails. Default: 0.
(5) mapreduce.task.timeout: the task timeout, a parameter that often needs tuning. If a task makes no progress for this long — it neither reads new data nor produces any output — it is considered blocked, possibly stuck forever. To prevent a user program from blocking forever without exiting, this timeout (in milliseconds) is enforced; the default is 300000. If your program takes a long time per input record (for example it queries a database or pulls data over the network), raise this value. A value that is too small typically shows up as the error "AttemptID:attempt_14267829456721_123456_m_000224_0 Timed out after 300 secs Container killed by the ApplicationMaster.".
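
A small, illustrative sketch of raising the retry count and the timeout for a job that does slow per-record work, such as the database lookups in the LogEnhance example; the class name and the values are assumptions.

import org.apache.hadoop.conf.Configuration;

public class FaultToleranceTuning {

    /** Illustrative settings for a job whose records are slow to process. */
    public static void apply(Configuration conf) {
        conf.setInt("mapreduce.map.maxattempts", 6);     // allow a couple more retries for flaky map tasks
        conf.setInt("mapreduce.task.timeout", 600000);   // 10 minutes without progress before the task is killed
    }
}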

  3. Local-mode job parameters

Set the following parameters:
mapreduce.framework.name=local
mapreduce.jobtracker.address=local
fs.defaultFS=local
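
  For example, a hedged sketch of building such a local-mode Configuration in code for debugging; note that newer Hadoop versions usually expect file:/// rather than the older local alias for fs.defaultFS, and the class name is illustrative.

import org.apache.hadoop.conf.Configuration;

public class LocalModeConfig {

    /** Run the job in-process against the local file system for debugging (illustrative sketch). */
    public static Configuration local() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");   // "local" is the older alias used in the list above
        return conf;
    }
}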

  4. Efficiency and stability parameters

(1) mapreduce.map.speculative: whether speculative execution is enabled for Map Tasks, default true
(2) mapreduce.reduce.speculative: whether speculative execution is enabled for Reduce Tasks, default true
(3) mapreduce.job.user.classpath.first & mapreduce.task.classpath.user.precedence: when the same class appears in both the user jar and the Hadoop jars, which one takes precedence; default false, meaning the class in the Hadoop jars wins.
(4) mapreduce.input.fileinputformat.split.minsize: minimum split size used by FileInputFormat when computing splits
(5) mapreduce.input.fileinputformat.split.maxsize: maximum split size used by FileInputFormat when computing splits
(by default a split is the block size, i.e. 134217728 bytes)
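
A minimal sketch of controlling the split-size bounds in the driver instead of via configuration files; the 64 MB bound and the class name are just examples.

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {

    /** With these illustrative settings each input split is at most 64 MB. */
    public static void apply(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 1);
        FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024L);
    }
}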
