hadoop 输出中文乱码问题
本文转载至:
http://www.aboutyun.com/thread-7358-1-1.html
hadoop涉及输出文本的默认输出编码统一用没有BOM的UTF-8的形式,但是对于中文的输出window系统默认的是GBK,有些格式文件例如CSV格式的文件用excel打开输出编码为没有BOM的UTF-8文件时,输出的结果为乱码,只能由UE或者记事本打开才能正常显示。因此将hadoop默认输出编码更改为GBK成为非常常见的需求。
默认的情况下MR主程序中,设定输出编码的设置语句为:
- job.setOutputFormatClass(TextOutputFormat.class);
复制代码
- TextOutputFormat.class
复制代码
的代码如下:
- /**
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
- package org.apache.hadoop.mapreduce.lib.output;
- import java.io.DataOutputStream;
- import java.io.IOException;
- import java.io.UnsupportedEncodingException;
- import org.apache.hadoop.classification.InterfaceAudience;
- import org.apache.hadoop.classification.InterfaceStability;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.FileSystem;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.fs.FSDataOutputStream;
- import org.apache.hadoop.io.NullWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.io.compress.CompressionCodec;
- import org.apache.hadoop.io.compress.GzipCodec;
- import org.apache.hadoop.mapreduce.OutputFormat;
- import org.apache.hadoop.mapreduce.RecordWriter;
- import org.apache.hadoop.mapreduce.TaskAttemptContext;
- import org.apache.hadoop.util.*;
- /** An {@link OutputFormat} that writes plain text files. */
- @InterfaceAudience.Public
- @InterfaceStability.Stable
- public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
- public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";
- protected static class LineRecordWriter<K, V>
- extends RecordWriter<K, V> {
- private static final String utf8 = "UTF-8"; // 将UTF-8转换成GBK
- private static final byte[] newline;
- static {
- try {
- newline = "\n".getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- protected DataOutputStream out;
- private final byte[] keyValueSeparator;
- public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
- this.out = out;
- try {
- this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- public LineRecordWriter(DataOutputStream out) {
- this(out, "\t");
- }
- /**
- * Write the object to the byte stream, handling Text as a special
- * case.
- * @param o the object to print
- * @throws IOException if the write throws, we pass it on
- */
- private void writeObject(Object o) throws IOException {
- if (o instanceof Text) {
- Text to = (Text) o; // 将此行代码注释掉
- out.write(to.getBytes(), 0, to.getLength()); // 将此行代码注释掉
- } else { // 将此行代码注释掉
- out.write(o.toString().getBytes(utf8));
- }
- }
- public synchronized void write(K key, V value)
- throws IOException {
- boolean nullKey = key == null || key instanceof NullWritable;
- boolean nullValue = value == null || value instanceof NullWritable;
- if (nullKey && nullValue) {
- return;
- }
- if (!nullKey) {
- writeObject(key);
- }
- if (!(nullKey || nullValue)) {
- out.write(keyValueSeparator);
- }
- if (!nullValue) {
- writeObject(value);
- }
- out.write(newline);
- }
- public synchronized
- void close(TaskAttemptContext context) throws IOException {
- out.close();
- }
- }
- public RecordWriter<K, V>
- getRecordWriter(TaskAttemptContext job
- ) throws IOException, InterruptedException {
- Configuration conf = job.getConfiguration();
- boolean isCompressed = getCompressOutput(job);
- String keyValueSeparator= conf.get(SEPERATOR, "\t");
- CompressionCodec codec = null;
- String extension = "";
- if (isCompressed) {
- Class<? extends CompressionCodec> codecClass =
- getOutputCompressorClass(job, GzipCodec.class);
- codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
- extension = codec.getDefaultExtension();
- }
- Path file = getDefaultWorkFile(job, extension);
- FileSystem fs = file.getFileSystem(conf);
- if (!isCompressed) {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
- } else {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(new DataOutputStream
- (codec.createOutputStream(fileOut)),
- keyValueSeparator);
- }
- }
- }
复制代码
从上述代码的第48行可以看出hadoop已经限定此输出格式统一为UTF-8,因此为了改变hadoop的输出代码的文本编码只需定义一个和TextOutputFormat相同的类GbkOutputFormat同样继承FileOutputFormat(注意是org.apache.hadoop.mapreduce.lib.output.FileOutputFormat)即可,如下代码:
- import java.io.DataOutputStream;
- import java.io.IOException;
- import java.io.UnsupportedEncodingException;
- import org.apache.hadoop.classification.InterfaceAudience;
- import org.apache.hadoop.classification.InterfaceStability;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.FileSystem;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.fs.FSDataOutputStream;
- import org.apache.hadoop.io.NullWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.io.compress.CompressionCodec;
- import org.apache.hadoop.io.compress.GzipCodec;
- import org.apache.hadoop.mapreduce.OutputFormat;
- import org.apache.hadoop.mapreduce.RecordWriter;
- import org.apache.hadoop.mapreduce.TaskAttemptContext;
- import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
- import org.apache.hadoop.util.*;
- @InterfaceAudience.Public
- @InterfaceStability.Stable
- public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
- public static String SEPERATOR = "mapreduce.output.textoutputformat.separator";
- protected static class LineRecordWriter<K, V>
- extends RecordWriter<K, V> {
- private static final String utf8 = "GBK";
- private static final byte[] newline;
- static {
- try {
- newline = "\n".getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- protected DataOutputStream out;
- private final byte[] keyValueSeparator;
- public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
- this.out = out;
- try {
- this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
- } catch (UnsupportedEncodingException uee) {
- throw new IllegalArgumentException("can't find " + utf8 + " encoding");
- }
- }
- public LineRecordWriter(DataOutputStream out) {
- this(out, "\t");
- }
- /**
- * Write the object to the byte stream, handling Text as a special
- * case.
- * @param o the object to print
- * @throws IOException if the write throws, we pass it on
- */
- private void writeObject(Object o) throws IOException {
- if (o instanceof Text) {
- // Text to = (Text) o;
- // out.write(to.getBytes(), 0, to.getLength());
- // } else {
- out.write(o.toString().getBytes(utf8));
- }
- }
- public synchronized void write(K key, V value)
- throws IOException {
- boolean nullKey = key == null || key instanceof NullWritable;
- boolean nullValue = value == null || value instanceof NullWritable;
- if (nullKey && nullValue) {
- return;
- }
- if (!nullKey) {
- writeObject(key);
- }
- if (!(nullKey || nullValue)) {
- out.write(keyValueSeparator);
- }
- if (!nullValue) {
- writeObject(value);
- }
- out.write(newline);
- }
- public synchronized
- void close(TaskAttemptContext context) throws IOException {
- out.close();
- }
- }
- public RecordWriter<K, V>
- getRecordWriter(TaskAttemptContext job
- ) throws IOException, InterruptedException {
- Configuration conf = job.getConfiguration();
- boolean isCompressed = getCompressOutput(job);
- String keyValueSeparator= conf.get(SEPERATOR, "\t");
- CompressionCodec codec = null;
- String extension = "";
- if (isCompressed) {
- Class<? extends CompressionCodec> codecClass =
- getOutputCompressorClass(job, GzipCodec.class);
- codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
- extension = codec.getDefaultExtension();
- }
- Path file = getDefaultWorkFile(job, extension);
- FileSystem fs = file.getFileSystem(conf);
- if (!isCompressed) {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
- } else {
- FSDataOutputStream fileOut = fs.create(file, false);
- return new LineRecordWriter<K, V>(new DataOutputStream
- (codec.createOutputStream(fileOut)),
- keyValueSeparator);
- }
- }
- }
复制代码
最后将输出编码类型设置成GbkOutputFormat.class,如:
- job.setOutputFormatClass(GbkOutputFormat.class);
复制代码
参考:
- http://semantic.iteye.com/blog/1846238
复制代码
hadoop 输出中文乱码问题的更多相关文章
- .Net Core 控制台输出中文乱码
Net Core 控制台输出中文乱码的解决方法: public static void Main(string[] args) { Console.Output ...
- 在Servlet中出现一个输出中文乱码的问题(已经解)。
在Servlet中出现一个输出中文乱码的问题,已经解. @Override public void doPost(HttpServletRequest reqeust, HttpServletResp ...
- idea 控制台输出 中文乱码 解决方法
使用intellij idea 14.1时,console 会输出中文乱码.下面分两种情况解决这种问题:一种是maven构建项目.一种是tomcat(不以maven构建)构建项目. 1.tomcat输 ...
- 编码(ACSII unicod UTF-8)、QT输出中文乱码深入分析
总结: 1. qt输出中文乱码原因分析 qt的编程环境默认是utf-8编码格式(关于编码见下文知识要点一): cout << "中文" << endl; 程 ...
- 使用WebLogic时控制台输出中文乱码解决方法
使用WebLogic时控制台输出中文乱码解决方法 1.找到weblogic安装目录,当前项目配置的domain 2.找到bin下的setDomainEnv.cmd文件 3.打开文件,从文件最后搜索第一 ...
- 二十一、IntelliJ IDEA 控制台输出中文乱码问题的解决方法
首先,找到 IntelliJ IDEA 的安装目录,进入bin目录下,定位到idea.vmoptions文件,如下图所示: 双击打开idea.vmoptions文件,如下图所示: 然后,在其中追加-D ...
- 解决phantomjs输出中文乱码
解决phantomjs输出中文乱码,可以在js文件里添加如下语句: phantom.outputEncoding="gb2312"; // 解决输出乱码
- resin后台输出中文乱码的解决办法!
resin后台输出中文乱码的解决办法! 学习了:https://blog.csdn.net/kobeguang/article/details/34116429 编辑conf/resin.con文件: ...
- resin后台输出中文乱码的解决的方法!
近期从tomcat移植到resin,发现这东西不错啊! 仅仅是后台输出时有时候中文会乱码. 如今找到resin后台输出中文乱码的解决的方法: 编辑conf/resin.con文件: <!--ja ...
随机推荐
- 进程通信(socket)
“一切皆Socket!” 话虽些许夸张,但是事实也是,现在的网络编程几乎都是用的socket. ——有感于实际编程和开源项目研究. 我们深谙信息交流的价值,那网络中进程之间如何通信,如我们每天打开浏览 ...
- [HAOI2012] 容易题[母函数]
794. [HAOI2012] 容易题 ★★☆ 输入文件:easy.in 输出文件:easy.out 简单对比时间限制:1 s 内存限制:128 MB 秒 输入:easy.in 输出: ...
- knockoutJs在移动设备上有时无法更新控件值
最近在用cordova(phonegap)写一个移动app,表单比较复杂,用了knockoutJs作为前端的MVVM框架进行数据绑定. 但是发现有时候(其实是每次)如果最后在input中编辑一个值,然 ...
- 【python】-- Django 分页 、cookie、Session、CSRF
Django 分页 .cookie.Session.CSRF 一.分页 分页功能在每个网站都是必要的,下面主要介绍两种分页方式: 1.Django内置分页 from django.shortcuts ...
- spring web中完成单元测试
对于在springweb总完单元测试,之前找过些资料,摸索了很久,记录下最终自己使用的方法 1,创建测试类,创建测试资源文件夹 src/test/resources/WEB_INFO/conf 将工程 ...
- 数据库垂直拆分,水平拆分利器,cobar升级版mycat(转)
原文:数据库垂直拆分,水平拆分利器,cobar升级版mycat 1,关于Mycat Mycat情报 基于阿里的开源cobar ,可以用于生产系统中,目前在做如下的一些改进: 非阻塞IO的实现,相对于目 ...
- Python3.6全栈开发实例[025]
25.文件a1.txt内容(升级题)name:apple price:10 amount:3 year:2012name:tesla price:100000 amount:1 year:2013通过 ...
- linux 安装zip/unzip/g++/gdb/vi/vim等软件
近期公司新配置了一台64位云server.去部署的时候发现,没有安装zip/unzip压缩解压软件. 于是仅仅好自己安装这两个软件.linux最好用的还是yum. 两个指令就安装好了. 首先把软件安装 ...
- jQuery 查找标签
1 基本选择器 2 基本筛选器 3 属性选择器 4 间接选择 1 基本选择器 //id选择器: $("#id") //标签选择器: $("tagName") / ...
- 瑞丽熵(renyi entropy)
在信息论中,Rényi熵是Hartley熵,Shannon熵,碰撞熵和最小熵的推广.熵能量化了系统的多样性,不确定性或随机性.Rényi熵以AlfrédRényi命名.在分形维数估计的背景下,Rény ...