Garbled Chinese characters in Hadoop output
This article is reposted from:
http://www.aboutyun.com/thread-7358-1-1.html
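By default, Hadoop writes all text output as UTF-8 without a BOM. Chinese-language Windows systems, however, default to GBK. When a file such as a CSV is written as BOM-less UTF-8 and then opened in Excel, the Chinese text comes out garbled; it displays correctly only in an editor such as UltraEdit (UE) or Notepad. Changing Hadoop's default output encoding to GBK is therefore a very common requirement.
By default, the MapReduce (MR) driver program selects the output format, and with it the output encoding, using the following statement: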
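The mismatch can be reproduced outside Hadoop. The following is a minimal sketch and not part of the original article: the class name `EncodingMismatchDemo` and the helper `reinterpret` are hypothetical, and it assumes a JDK that ships the GBK charset. It encodes a Chinese string as UTF-8 and decodes the bytes as GBK, which is roughly what happens when a program that defaults to GBK reads BOM-less UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Assumption: the running JDK provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    /** Encodes text with one charset and decodes the resulting bytes with another. */
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as Unicode escapes so that the
        // encoding of this source file does not matter.
        String original = "\u4e2d\u6587";

        // Written as UTF-8 but read as GBK: the result differs from the input.
        String garbled = reinterpret(original, StandardCharsets.UTF_8, GBK);
        System.out.println("UTF-8 read as GBK matches input: " + garbled.equals(original));

        // Written and read as GBK: the text round-trips unchanged.
        String intact = reinterpret(original, GBK, GBK);
        System.out.println("GBK read as GBK matches input: " + intact.equals(original));
    }
}
```

The second call is the situation the rest of this article aims for: the bytes on disk are already in the encoding the reader assumes.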
- job.setOutputFormatClass(TextOutputFormat.class);
The source code of TextOutputFormat.class is as follows:
- /**
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
- package org.apache.hadoop.mapreduce.lib.output;
- import java.io.DataOutputStream;
- import java.io.IOException;
- import java.io.UnsupportedEncodingException;
- import org.apache.hadoop.classification.InterfaceAudience;
- import org.apache.hadoop.classification.InterfaceStability;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.FileSystem;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.fs.FSDataOutputStream;
- import org.apache.hadoop.io.NullWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.io.compress.CompressionCodec;
- import org.apache.hadoop.io.compress.GzipCodec;
- import org.apache.hadoop.mapreduce.OutputFormat;
- import org.apache.hadoop.mapreduce.RecordWriter;
- import org.apache.hadoop.mapreduce.TaskAttemptContext;
- import org.apache.hadoop.util.*;
- /** An {@link OutputFormat} that writes plain text files. */
- @InterfaceAudience.Public
- @InterfaceStability.Stable
- public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
- public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";
- protected static class LineRecordWriter<K, V>
- extends RecordWriter<K, V> {
- private static final String utf8 = "UTF-8"; // change UTF-8 to GBK
- private static final byte[] newline;
- static {
- try {
- newline = "\n".getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- protected DataOutputStream out;
- private final byte[] keyValueSeparator;
- public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
- this.out = out;
- try {
- this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- public LineRecordWriter(DataOutputStream out) {
- this(out, "\t");
- }
- /**
- * Write the object to the byte stream, handling Text as a special
- * case.
- * @param o the object to print
- * @throws IOException if the write throws, we pass it on
- */
- private void writeObject(Object o) throws IOException {
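- if (o instanceof Text) { // remove this line
- Text to = (Text) o; // remove this line
- out.write(to.getBytes(), 0, to.getLength()); // remove this line
- } else { // remove this line
- out.write(o.toString().getBytes(utf8)); // keep: this must now run for every object
- } // remove this line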
- }
- public synchronized void write(K key, V value)
- throws IOException {
- boolean nullKey = key == null || key instanceof NullWritable;
- boolean nullValue = value == null || value instanceof NullWritable;
- if (nullKey && nullValue) {
- return;
- }
- if (!nullKey) {
- writeObject(key);
- }
- if (!(nullKey || nullValue)) {
- out.write(keyValueSeparator);
- }
- if (!nullValue) {
- writeObject(value);
- }
- out.write(newline);
- }
- public synchronized
- void close(TaskAttemptContext context) throws IOException {
- out.close();
- }
- }
- public RecordWriter<K, V>
- getRecordWriter(TaskAttemptContext job
- ) throws IOException, InterruptedException {
- Configuration conf = job.getConfiguration();
- boolean isCompressed = getCompressOutput(job);
- String keyValueSeparator= conf.get(SEPERATOR, "\t");
- CompressionCodec codec = null;
- String extension = "";
- if (isCompressed) {
- Class<? extends CompressionCodec> codecClass =
- getOutputCompressorClass(job, GzipCodec.class);
- codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
- extension = codec.getDefaultExtension();
- }
- Path file = getDefaultWorkFile(job, extension);
- FileSystem fs = file.getFileSystem(conf);
- if (!isCompressed) {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
- } else {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(new DataOutputStream
- (codec.createOutputStream(fileOut)),
- keyValueSeparator);
- }
- }
- }
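The utf8 constant declared at the top of LineRecordWriter in the code above shows that Hadoop hard-codes UTF-8 for this output format. To change the text encoding of Hadoop's output, it is therefore enough to define a class GbkOutputFormat that duplicates TextOutputFormat and likewise extends FileOutputFormat (note: this is org.apache.hadoop.mapreduce.lib.output.FileOutputFormat), changing the constant to GBK and replacing the raw Text byte copy in writeObject with o.toString().getBytes(utf8), as in the following code: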
- import java.io.DataOutputStream;
- import java.io.IOException;
- import java.io.UnsupportedEncodingException;
- import org.apache.hadoop.classification.InterfaceAudience;
- import org.apache.hadoop.classification.InterfaceStability;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.FileSystem;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.fs.FSDataOutputStream;
- import org.apache.hadoop.io.NullWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.io.compress.CompressionCodec;
- import org.apache.hadoop.io.compress.GzipCodec;
- import org.apache.hadoop.mapreduce.OutputFormat;
- import org.apache.hadoop.mapreduce.RecordWriter;
- import org.apache.hadoop.mapreduce.TaskAttemptContext;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- import org.apache.hadoop.util.*;
- @InterfaceAudience.Public
- @InterfaceStability.Stable
- public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
- public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";
- protected static class LineRecordWriter<K, V>
- extends RecordWriter<K, V> {
- private static final String utf8 = "GBK"; // name kept from TextOutputFormat; the charset is now GBK
- private static final byte[] newline;
- static {
- try {
- newline = "\n".getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- protected DataOutputStream out;
- private final byte[] keyValueSeparator;
- public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
- this.out = out;
- try {
- this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- public LineRecordWriter(DataOutputStream out) {
- this(out, "\t");
- }
- /**
- * Write the object to the byte stream, handling Text as a special
- * case.
- * @param o the object to print
- * @throws IOException if the write throws, we pass it on
- */
- private void writeObject(Object o) throws IOException {
- // Text holds its content as UTF-8 bytes, so the original fast path that
- // copied to.getBytes() straight to the stream is dropped. Every object,
- // Text or not, is converted to a String and re-encoded as GBK.
- out.write(o.toString().getBytes(utf8));
- }
- public synchronized void write(K key, V value)
- throws IOException {
- boolean nullKey = key == null || key instanceof NullWritable;
- boolean nullValue = value == null || value instanceof NullWritable;
- if (nullKey && nullValue) {
- return;
- }
- if (!nullKey) {
- writeObject(key);
- }
- if (!(nullKey || nullValue)) {
- out.write(keyValueSeparator);
- }
- if (!nullValue) {
- writeObject(value);
- }
- out.write(newline);
- }
- public synchronized
- void close(TaskAttemptContext context) throws IOException {
- out.close();
- }
- }
- public RecordWriter<K, V>
- getRecordWriter(TaskAttemptContext job
- ) throws IOException, InterruptedException {
- Configuration conf = job.getConfiguration();
- boolean isCompressed = getCompressOutput(job);
- String keyValueSeparator= conf.get(SEPERATOR, "\t");
- CompressionCodec codec = null;
- String extension = "";
- if (isCompressed) {
- Class<? extends CompressionCodec> codecClass =
- getOutputCompressorClass(job, GzipCodec.class);
- codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
- extension = codec.getDefaultExtension();
- }
- Path file = getDefaultWorkFile(job, extension);
- FileSystem fs = file.getFileSystem(conf);
- if (!isCompressed) {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
- } else {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(new DataOutputStream
- (codec.createOutputStream(fileOut)),
- keyValueSeparator);
- }
- }
- }
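The reason the Text special case cannot stay is that Hadoop's Text stores its content as UTF-8 bytes, so copying those bytes to the stream produces UTF-8 no matter what the charset constant says. The sketch below illustrates this with plain JDK classes and no Hadoop dependency. The class `TextFastPathDemo` and its two methods are hypothetical stand-ins: `writeRaw` mimics the original `out.write(to.getBytes(), 0, to.getLength())`, and `writeTranscoded` mimics `o.toString().getBytes("GBK")`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TextFastPathDemo {
    // Assumption: the running JDK provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    /** Stand-in for the original Text branch: copies the stored UTF-8 bytes unchanged. */
    static byte[] writeRaw(byte[] utf8Bytes, int length) {
        return Arrays.copyOf(utf8Bytes, length);
    }

    /** Stand-in for o.toString().getBytes("GBK"): decode from UTF-8, then re-encode as GBK. */
    static byte[] writeTranscoded(byte[] utf8Bytes, int length) {
        String decoded = new String(utf8Bytes, 0, length, StandardCharsets.UTF_8);
        return decoded.getBytes(GBK);
    }

    public static void main(String[] args) {
        // Two Chinese characters as Unicode escapes, held as UTF-8 bytes
        // the way a Text object would hold them.
        byte[] stored = "\u4e2d\u6587".getBytes(StandardCharsets.UTF_8);

        byte[] raw = writeRaw(stored, stored.length);
        byte[] transcoded = writeTranscoded(stored, stored.length);

        // The raw copy is still UTF-8; only the transcoded bytes are GBK.
        System.out.println("raw copy is unchanged UTF-8: " + Arrays.equals(raw, stored));
        System.out.println("transcoded bytes are GBK: "
            + Arrays.equals(transcoded, "\u4e2d\u6587".getBytes(GBK)));
    }
}
```

The cost of this choice is one extra decode and encode per record, which is the price of not writing UTF-8 bytes into a file that is supposed to be GBK.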
Finally, set the job's output format class to GbkOutputFormat.class, for example:
- job.setOutputFormatClass(GbkOutputFormat.class);
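After the job runs, it can be useful to confirm that the bytes of an output file really are GBK rather than UTF-8. The following is a rough sketch of such a check and is not from the original article: `OutputEncodingCheck` and `decodesCleanly` are hypothetical names, and the demo uses in-memory bytes in place of a downloaded part file. A clean decode is a necessary condition only; it does not prove which encoding was intended.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class OutputEncodingCheck {
    /** Returns true if the bytes decode in the given charset without malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, Charset charset) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: the running JDK provides the GBK charset.
        Charset gbk = Charset.forName("GBK");

        // Stand-in for the contents of an output file written by GbkOutputFormat.
        byte[] sample = "\u4e2d\u6587\n".getBytes(gbk);

        System.out.println("decodes as GBK: " + decodesCleanly(sample, gbk));
        System.out.println("decodes as UTF-8: " + decodesCleanly(sample, StandardCharsets.UTF_8));
    }
}
```

For a real check, the sample bytes would come from a part file copied out of HDFS, for instance read with `java.nio.file.Files.readAllBytes`.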
Reference:
- http://semantic.iteye.com/blog/1846238
可以用来点击广告.... 部分代码: function AutoClick() { var DivLink=document.getElementById("divLink"); ...