hadoop 输出中文乱码问题

本文转载至：

　　http://www.aboutyun.com/thread-7358-1-1.html

hadoop涉及输出文本的默认输出编码统一用没有BOM的UTF-8的形式，但是对于中文的输出window系统默认的是GBK，有些格式文件例如CSV格式的文件用excel打开输出编码为没有BOM的UTF-8文件时，输出的结果为乱码，只能由UE或者记事本打开才能正常显示。因此将hadoop默认输出编码更改为GBK成为非常常见的需求。
默认的情况下MR主程序中，设定输出编码的设置语句为：

job.setOutputFormatClass(TextOutputFormat.class);

复制代码

TextOutputFormat.class

复制代码

的代码如下：

/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.mapreduce.lib.output;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.util.*;
/** An {@link OutputFormat} that writes plain text files. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";
protected static class LineRecordWriter<K, V>
extends RecordWriter<K, V> {
private static final String utf8 = "UTF-8"; // 将UTF-8转换成GBK
private static final byte[] newline;
static {
try {
newline = "\n".getBytes(utf8);
} catch (UnsupportedEncodingException uee) {
throw new IllegalArgumentException("can't find " + utf8 + " encoding");
}
}
protected DataOutputStream out;
private final byte[] keyValueSeparator;
public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
this.out = out;
try {
this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
} catch (UnsupportedEncodingException uee) {
throw new IllegalArgumentException("can't find " + utf8 + " encoding");
}
}
public LineRecordWriter(DataOutputStream out) {
this(out, "\t");
}
/**
* Write the object to the byte stream, handling Text as a special
* case.
* @param o the object to print
* @throws IOException if the write throws, we pass it on
*/
private void writeObject(Object o) throws IOException {
if (o instanceof Text) {
Text to = (Text) o; // 将此行代码注释掉
out.write(to.getBytes(), 0, to.getLength()); // 将此行代码注释掉
} else { // 将此行代码注释掉
out.write(o.toString().getBytes(utf8));
}
}
public synchronized void write(K key, V value)
throws IOException {
boolean nullKey = key == null || key instanceof NullWritable;
boolean nullValue = value == null || value instanceof NullWritable;
if (nullKey && nullValue) {
return;
}
if (!nullKey) {
writeObject(key);
}
if (!(nullKey || nullValue)) {
out.write(keyValueSeparator);
}
if (!nullValue) {
writeObject(value);
}
out.write(newline);
}
public synchronized
void close(TaskAttemptContext context) throws IOException {
out.close();
}
}
public RecordWriter<K, V>
getRecordWriter(TaskAttemptContext job
) throws IOException, InterruptedException {
Configuration conf = job.getConfiguration();
boolean isCompressed = getCompressOutput(job);
String keyValueSeparator= conf.get(SEPERATOR, "\t");
CompressionCodec codec = null;
String extension = "";
if (isCompressed) {
Class<? extends CompressionCodec> codecClass =
getOutputCompressorClass(job, GzipCodec.class);
codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
extension = codec.getDefaultExtension();
}
Path file = getDefaultWorkFile(job, extension);
FileSystem fs = file.getFileSystem(conf);
if (!isCompressed) {
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
} else {
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<K, V>(new DataOutputStream
(codec.createOutputStream(fileOut)),
keyValueSeparator);
}
}
}

复制代码

从上述代码的第48行可以看出hadoop已经限定此输出格式统一为UTF-8，因此为了改变hadoop的输出代码的文本编码只需定义一个和TextOutputFormat相同的类GbkOutputFormat同样继承FileOutputFormat（注意是org.apache.hadoop.mapreduce.lib.output.FileOutputFormat）即可，如下代码：

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.*;
@InterfaceAudience.Public
@InterfaceStability.Stable
public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";
protected static class LineRecordWriter<K, V>
extends RecordWriter<K, V> {
private static final String utf8 = "GBK";
private static final byte[] newline;
static {
try {
newline = "\n".getBytes(utf8);
} catch (UnsupportedEncodingException uee) {
throw new IllegalArgumentException("can't find " + utf8 + " encoding");
}
}
protected DataOutputStream out;
private final byte[] keyValueSeparator;
public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
this.out = out;
try {
this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
} catch (UnsupportedEncodingException uee) {
throw new IllegalArgumentException("can't find " + utf8 + " encoding");
}
}
public LineRecordWriter(DataOutputStream out) {
this(out, "\t");
}
/**
* Write the object to the byte stream, handling Text as a special
* case.
* @param o the object to print
* @throws IOException if the write throws, we pass it on
*/
private void writeObject(Object o) throws IOException {
if (o instanceof Text) {
// Text to = (Text) o;
// out.write(to.getBytes(), 0, to.getLength());
// } else {
out.write(o.toString().getBytes(utf8));
}
}
public synchronized void write(K key, V value)
throws IOException {
boolean nullKey = key == null || key instanceof NullWritable;
boolean nullValue = value == null || value instanceof NullWritable;
if (nullKey && nullValue) {
return;
}
if (!nullKey) {
writeObject(key);
}
if (!(nullKey || nullValue)) {
out.write(keyValueSeparator);
}
if (!nullValue) {
writeObject(value);
}
out.write(newline);
}
public synchronized
void close(TaskAttemptContext context) throws IOException {
out.close();
}
}
public RecordWriter<K, V>
getRecordWriter(TaskAttemptContext job
) throws IOException, InterruptedException {
Configuration conf = job.getConfiguration();
boolean isCompressed = getCompressOutput(job);
String keyValueSeparator= conf.get(SEPERATOR, "\t");
CompressionCodec codec = null;
String extension = "";
if (isCompressed) {
Class<? extends CompressionCodec> codecClass =
getOutputCompressorClass(job, GzipCodec.class);
codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
extension = codec.getDefaultExtension();
}
Path file = getDefaultWorkFile(job, extension);
FileSystem fs = file.getFileSystem(conf);
if (!isCompressed) {
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
} else {
FSDataOutputStream fileOut = fs.create(file, false);
return new LineRecordWriter<K, V>(new DataOutputStream
(codec.createOutputStream(fileOut)),
keyValueSeparator);
}
}
}

复制代码

最后将输出编码类型设置成GbkOutputFormat.class，如：

job.setOutputFormatClass(GbkOutputFormat.class);

复制代码

参考：

http://semantic.iteye.com/blog/1846238

复制代码

hadoop 输出中文乱码问题的更多相关文章

.Net Core 控制台输出中文乱码
Net Core 控制台输出中文乱码的解决方法: public static void Main(string[] args) { Console.Output ...
在Servlet中出现一个输出中文乱码的问题(已经解)。
在Servlet中出现一个输出中文乱码的问题,已经解. @Override public void doPost(HttpServletRequest reqeust, HttpServletResp ...
idea 控制台输出中文乱码解决方法
使用intellij idea 14.1时,console 会输出中文乱码.下面分两种情况解决这种问题:一种是maven构建项目.一种是tomcat(不以maven构建)构建项目. 1.tomcat输 ...
编码(ACSII unicod UTF-8)、QT输出中文乱码深入分析
总结: 1. qt输出中文乱码原因分析 qt的编程环境默认是utf-8编码格式(关于编码见下文知识要点一): cout << "中文" << endl; 程 ...
使用WebLogic时控制台输出中文乱码解决方法
使用WebLogic时控制台输出中文乱码解决方法 1.找到weblogic安装目录,当前项目配置的domain 2.找到bin下的setDomainEnv.cmd文件 3.打开文件,从文件最后搜索第一 ...
二十一、IntelliJ IDEA 控制台输出中文乱码问题的解决方法
首先,找到 IntelliJ IDEA 的安装目录,进入bin目录下,定位到idea.vmoptions文件,如下图所示: 双击打开idea.vmoptions文件,如下图所示: 然后,在其中追加-D ...
解决phantomjs输出中文乱码
解决phantomjs输出中文乱码,可以在js文件里添加如下语句: phantom.outputEncoding="gb2312"; // 解决输出乱码
resin后台输出中文乱码的解决办法！
resin后台输出中文乱码的解决办法! 学习了:https://blog.csdn.net/kobeguang/article/details/34116429 编辑conf/resin.con文件: ...
resin后台输出中文乱码的解决的方法！
近期从tomcat移植到resin,发现这东西不错啊! 仅仅是后台输出时有时候中文会乱码. 如今找到resin后台输出中文乱码的解决的方法: 编辑conf/resin.con文件: <!--ja ...

随机推荐

coreData笔记
1. CDVehicle *vehicle = (CDVehicle *)[[NSManagedObject alloc] initWithEntity:entity insertIntoMan ...
【BZOJ1812】[Ioi2005]riv 树形DP
[BZOJ1812][Ioi2005]riv Description 几乎整个Byteland王国都被森林和河流所覆盖.小点的河汇聚到一起,形成了稍大点的河.就这样,所有的河水都汇聚并流进了一条大河, ...
160816、webpack 入门指南
什么是 webpack? webpack是近期最火的一款模块加载器兼打包工具,它能把各种资源,例如JS(含JSX).coffee.样式(含less/sass).图片等都作为模块来使用和处理. 我们可以 ...
ASP-Command-SQL格式
conn.open constrSet c=Server.CreateObject("ADODB.Command")With cSet .ActiveConnection = co ...
【转】 HMC与VIOS对新LPAR提供存储与网络虚拟化的支持
前面的几篇博文的操作环境都是在IVM下,IVM可以看作是VIOS的一部分,或者是对VIOS功能的一个扩展,一个IVM只能管理1台物理服务器,而HMC则是一对多.在有HMC来管理物理服务器的情形下,VI ...
接口测试工具 — postman（post请求）
1.登录接口 2.添加学生信息,这个接口是用来讲入参是json类型的 3.学生金币充值接口,这个接口是为了讲添加cookie以及身份验证的 4.上传文件接口
python作用域和JavaScript作用域
JavaScript 一.JavaScript中无块级作用域一个大括号一个作用域,就属于块级作用域,在Java和c#才存在块级作用域 function Main(){ if(1==1){ var n ...
关于python代码是编译执行还是解释执行
Python 是编译型语言还是解释型语言?回答这个问题前,应该先弄清楚什么是编译型语言,什么是解释型语言. 所谓编译执行就是源代码经过编译器编译处理,生成目标机器码,就是机器能直接运行的二进制代码,下 ...
Mac下最好用的文本编辑器
友情提醒:图多杀猫. 曾经在Windows下一直用gVim.能够用键盘控制一切,操作起来是又快又爽,还支持一大堆插件.想怎么玩就怎么玩.后来转Mac后,也沿袭着之前的习惯.一直在用终端的Vim.偶尔会 ...
(转)js获取内网ip地址，操作系统，浏览器版本等信息
这次呢,说一下使用js获取用户电脑的ip信息,刚开始只是想获取用户ip,后来就顺带着获取了操作系统和浏览器信息. 先说下获取用户ip地址,包括像ipv4,ipv6,掩码等内容,但是大部分都要根据浏览器 ...

hadoop 输出中文乱码问题

hadoop 输出中文乱码问题的更多相关文章

随机推荐

热门专题