A Hadoop job may consist of many map tasks and reduce tasks, which makes debugging it a complicated process. It is good practice to first test a Hadoop job with unit tests, running it against a subset of the data.
However, sometimes it is necessary to debug a Hadoop job in distributed mode. To support such cases, Hadoop provides a mechanism called debug scripts. This recipe explains how to use them.

A debug script is a shell script, and Hadoop executes it whenever a task encounters an error. The script has access to the $script, $stdout, $stderr, $syslog, and $jobconf properties as environment variables populated by Hadoop. You can find a sample script in resources/chapter3/debugscript. We can use debug scripts to copy all the logfiles to a single location, e-mail them to a single e-mail account, or perform some analysis. For example, the following script appends the task's error information to a single logfile:
#!/bin/sh
LOG_FILE=$HADOOP_HOME/error.log
echo "Run the script" >> $LOG_FILE
echo $script >> $LOG_FILE
echo $stdout >> $LOG_FILE
echo $stderr >> $LOG_FILE
echo $syslog >> $LOG_FILE
echo $jobconf >> $LOG_FILE

When you execute this job, pay attention to the path from which you run it: the script location resources/chapter3/debugscript is resolved relative to the current working directory, so Hadoop will not find the debug script if you launch the job from anywhere else.
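For instance, assuming the job has been packaged into a jar (the jar name and the input/output paths below are illustrative, not from the book), run it from the directory that contains the resources folder:

# Run from the directory that contains resources/chapter3/debugscript,
# or the driver will fail to upload the debug script to HDFS.
# hadoop-cookbook-chapter3.jar and the paths are hypothetical examples.
$HADOOP_HOME/bin/hadoop jar hadoop-cookbook-chapter3.jar \
    chapter3.WordcountWithDebugScript /data/input1 /data/output1

The full driver class is listed below: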

package chapter3;

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountWithDebugScript {
    private static final String scriptFileLocation = "resources/chapter3/debugscript";
    private static final String HDFS_ROOT = "/debug";

    public static void setupFailedTaskScript(JobConf conf) throws Exception {
        // Create a directory on HDFS where we'll upload the fail script
        FileSystem fs = FileSystem.get(conf);
        Path debugDir = new Path(HDFS_ROOT);

        // Who knows what's already in this directory; let's just clear it...
        if (fs.exists(debugDir)) {
            fs.delete(debugDir, true);
        }
        // ...and then make sure it exists again
        fs.mkdirs(debugDir);

        // Upload the local script into HDFS
        fs.copyFromLocalFile(new Path(scriptFileLocation),
                new Path(HDFS_ROOT + "/fail-script"));

        FileStatus[] list = fs.listStatus(new Path(HDFS_ROOT));
        if (list == null || list.length == 0) {
            System.out.println("No file found");
        } else {
            for (FileStatus f : list) {
                System.out.println("File found " + f.getPath());
            }
        }

        // Run the script (via its symlink name) whenever a map or reduce task fails
        conf.setMapDebugScript("./fail-script");
        conf.setReduceDebugScript("./fail-script");

        // This creates a symlink from the task's working directory to the
        // distributed cache directory on the worker node
        DistributedCache.createSymlink(conf);

        URI fsUri = fs.getUri();
        String mapUriStr = fsUri.toString() + HDFS_ROOT + "/fail-script#fail-script";
        System.out.println("added " + mapUriStr + " to distributed cache");
        URI mapUri = new URI(mapUriStr);
        // The following distributes the script to the cache directory of each task node
        DistributedCache.addCacheFile(mapUri, conf);
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        setupFailedTaskScript(conf);

        Job job = new Job(conf, "word count");
        job.setJarByClass(FaultyWordCount.class);
        job.setMapperClass(FaultyWordCount.TokenizerMapper.class);
        job.setReducerClass(FaultyWordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Delete any old output so the job can be rerun
        FileSystem.get(conf).delete(new Path(args[1]), true);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
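The FaultyWordCount class referenced above is not listed in this digest; in the cookbook it is a word count that fails on purpose so that the debug script actually gets invoked. A minimal sketch along those lines follows, assuming the standard word count mapper and reducer; the failure condition (failing after a few records) is invented here for illustration:

package chapter3;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reconstruction: a word count whose mapper deliberately
// throws an exception, causing the task to fail and the debug script to run.
public class FaultyWordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        private int count = 0;

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Deliberate fault (illustrative): fail after a few records
            if (++count > 3) {
                throw new RuntimeException("Simulated failure to trigger the debug script");
            }
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}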

Digested from the Hadoop MapReduce Cookbook.
