Hadoop in Practice: Processing Excel Call Records with Hadoop
Project Requirements
We have a set of call records between the author and family members, stored in an Excel file, as shown in the dataset below. Based on this data, we need to count, for each month, how many times each family member called the author, and write the results out grouped by month into separate per-month output files.
Dataset
Below is a sample of the data. Each record has the format: ID, contact name, phone number, timestamp.
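The original post shows the records as a screenshot, which is not reproduced here. As an illustration only (these rows are made up, apart from the one record quoted in the mapper comment further down), the whitespace-separated data looks like this:

1    老爸    13999123786    2014-12-20
2    老妈    13999123788    2014-12-21
3    老爸    13999123786    2014-12-25
4    姐姐    13999123790    2015-01-07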

Project Implementation
Since the input file is in Excel format, we can parse it with the Apache POI library. If you don't have it locally, download poi-3.9.jar and poi-excelant-3.9.jar and add them to the project. With these two jars in place, we first implement an Excel parsing class, ExcelParser.java.
package com.hadoop.phoneStatistics;

import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;

/**
 * @author Zimo
 * Parses the call records in the Excel file.
 */
public class ExcelParser {

    private static final Log LOG = LogFactory.getLog(ExcelParser.class);
    private StringBuilder currentString = null;
    private long bytesRead = 0;

    public String parseExcelData(InputStream is) {
        try {
            // Build the workbook from the input stream
            HSSFWorkbook workbook = new HSSFWorkbook(is);
            // Take the first sheet from the workbook
            HSSFSheet sheet = workbook.getSheetAt(0);
            // Iterate over each row of the first sheet
            Iterator<Row> rowIterator = sheet.iterator();
            currentString = new StringBuilder();
            while (rowIterator.hasNext()) {
                Row row = rowIterator.next();
                // For each row, iterate over its cells
                Iterator<Cell> cellIterator = row.cellIterator();
                while (cellIterator.hasNext()) {
                    Cell cell = cellIterator.next();
                    switch (cell.getCellType()) {
                    case Cell.CELL_TYPE_BOOLEAN:
                        bytesRead++;
                        currentString.append(cell.getBooleanCellValue() + "\t");
                        break;
                    case Cell.CELL_TYPE_NUMERIC:
                        bytesRead++;
                        currentString.append(cell.getNumericCellValue() + "\t");
                        break;
                    case Cell.CELL_TYPE_STRING:
                        bytesRead++;
                        currentString.append(cell.getStringCellValue() + "\t");
                        break;
                    }
                }
                currentString.append("\n");
            }
            is.close();
        } catch (IOException ioe) {
            LOG.error("IO Exception : File not found " + ioe);
        }
        return currentString.toString();
    }

    public long getBytesRead() {
        return bytesRead;
    }
}
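Before wiring this into MapReduce, it can be handy to sanity-check the parser against a local copy of the workbook. The snippet below is only a sketch (this test class and the local path phone.xls are not part of the original project) and assumes the POI jars are on the classpath:

package com.hadoop.phoneStatistics;

import java.io.FileInputStream;
import java.io.InputStream;

public class ExcelParserTest {
    public static void main(String[] args) throws Exception {
        // Hypothetical local copy of the workbook; adjust the path as needed.
        InputStream is = new FileInputStream("phone.xls");
        ExcelParser parser = new ExcelParser();
        // Every spreadsheet row comes back as one tab-separated line.
        System.out.println(parser.parseExcelData(is));
        System.out.println("Cells read: " + parser.getBytesRead());
    }
}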
ExcelPhoneStatistics.java: the driver together with the mapper and reducer. The mapper keys each record on contact name plus month, and the reducer counts the calls for each key.
package com.hadoop.phoneStatistics;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * @author Zimo
 * Processes the call records.
 */
public class ExcelPhoneStatistics extends Configured implements Tool {

    private static Logger logger = LoggerFactory.getLogger(ExcelPhoneStatistics.class);

    public static class ExcelMapper extends Mapper<LongWritable, Text, Text, Text> {

        private static Logger LOG = LoggerFactory.getLogger(ExcelMapper.class);
        private Text pkey = new Text();
        private Text pvalue = new Text();

        /**
         * Each value is one spreadsheet row, already converted into a
         * tab-separated string by ExcelParser.
         */
        public void map(LongWritable key, Text value, Context context)
                throws InterruptedException, IOException {
            // Example row: 1.0  老爸  13999123786  2014-12-20
            String line = value.toString();
            String[] records = line.split("\\s+");
            String[] months = records[3].split("-");   // split the date to get the month
            pkey.set(records[1] + "\t" + months[1]);   // contact name + month
            pvalue.set(records[2]);                    // phone number
            context.write(pkey, pvalue);
            LOG.info("Map processing finished");
        }
    }

    public static class PhoneReducer extends Reducer<Text, Text, Text, Text> {

        private Text pvalue = new Text();

        protected void reduce(Text Key, Iterable<Text> Values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            String phone = null;
            // Count the calls for this (name, month) key; every value carries the same phone number.
            for (Text value : Values) {
                if (phone == null) {
                    phone = value.toString();
                }
                sum++;
            }
            pvalue.set(phone + "\t" + sum);
            context.write(Key, pvalue);
        }
    }

    public static class PhoneOutputFormat extends MailMultipleOutputFormat<Text, Text> {

        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, Configuration conf) {
            // The key is "name \t month"; write one output file per month.
            String[] records = key.toString().split("\t");
            return records[1] + ".txt";
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();        // configuration object
        Path mypath = new Path(args[1]);
        FileSystem hdfs = mypath.getFileSystem(conf);
        if (hdfs.isDirectory(mypath)) {                  // delete the output path if it already exists
            hdfs.delete(mypath, true);
        }
        logger.info("Driver started");

        Job job = new Job();
        job.setJarByClass(ExcelPhoneStatistics.class);
        job.setJobName("Excel Record Reader");

        job.setMapperClass(ExcelMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setInputFormatClass(ExcelInputFormat.class);     // custom input format

        job.setReducerClass(PhoneReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(PhoneOutputFormat.class);   // custom output format

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        String[] args0 = {
                // args[0]: input file, args[1]: output directory
                "hdfs://master:8020/phone/phone.xls",
                "hdfs://master:8020/phone/out/"
        };
        int ec = ToolRunner.run(new Configuration(), new ExcelPhoneStatistics(), args0);
        System.exit(ec);
    }
}
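To try the job yourself, package these classes (together with the POI jars) into a runnable jar and submit it to the cluster with something along these lines. The jar name below is only a placeholder; the input and output HDFS paths are hard-coded in main() for the author's environment and will differ on yours.

hadoop jar phoneStatistics.jar com.hadoop.phoneStatistics.ExcelPhoneStatistics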
ExcelInputFormat.java: the custom input format. It reads the whole workbook through ExcelParser and hands the rows to the mapper one at a time.
package com.hadoop.phoneStatistics;

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * @author Zimo
 * Custom input format.
 *
 * An {@link org.apache.hadoop.mapreduce.InputFormat} for Excel spreadsheet files.
 * Multiple sheets are supported.
 *
 * Keys are the position in the file, and values are the row containing all
 * columns for that particular row.
 */
public class ExcelInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        return new ExcelRecordReader();
    }

    public class ExcelRecordReader extends RecordReader<LongWritable, Text> {

        private LongWritable key;
        private Text value;
        private InputStream is;
        private String[] strArrayofLines;

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                throws IOException, InterruptedException {
            FileSplit split = (FileSplit) genericSplit;
            Configuration job = context.getConfiguration();
            final Path file = split.getPath();

            FileSystem fs = file.getFileSystem(job);
            FSDataInputStream fileIn = fs.open(file);
            is = fileIn;
            // Parse the whole workbook into newline-separated rows.
            String line = new ExcelParser().parseExcelData(is);
            this.strArrayofLines = line.split("\n");
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            if (key == null) {
                // First call: emit row 0.
                key = new LongWritable(0);
                value = new Text(strArrayofLines[0]);
            } else {
                if (key.get() < this.strArrayofLines.length - 1) {
                    // Advance to the next row.
                    long pos = (int) key.get();
                    key.set(pos + 1);
                    value.set(this.strArrayofLines[(int) (pos + 1)]);
                } else {
                    return false;
                }
            }
            return key != null && value != null;
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return key;
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return value;
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return 0;
        }

        @Override
        public void close() throws IOException {
            if (is != null) {
                is.close();
            }
        }
    }
}
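One caveat worth noting: the record reader parses the entire workbook in initialize(), so the .xls file must not be split across multiple mappers. The original post does not show this, but for a file larger than one HDFS block you would likely want to disable splitting, for example with a sketch like the following added to ExcelInputFormat:

    @Override
    protected boolean isSplitable(org.apache.hadoop.mapreduce.JobContext context, Path filename) {
        // The whole workbook is parsed by one record reader, so never split it.
        return false;
    }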
MailMultipleOutputFormat.java: the abstract multi-file output format. It routes each reduce record to a file whose name is produced by generateFileNameForKeyValue().
package com.hadoop.phoneStatistics;

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * @author Zimo
 * Custom output format that writes to multiple files.
 */
public abstract class MailMultipleOutputFormat<K extends WritableComparable<?>, V extends Writable>
        extends FileOutputFormat<K, V> {

    private MultiRecordWriter writer = null;

    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException {
        if (writer == null) {
            writer = new MultiRecordWriter(job, getTaskOutputPath(job));
        }
        return writer;
    }

    private Path getTaskOutputPath(TaskAttemptContext conf) throws IOException {
        Path workPath = null;
        OutputCommitter committer = super.getOutputCommitter(conf);
        if (committer instanceof FileOutputCommitter) {
            workPath = ((FileOutputCommitter) committer).getWorkPath();
        } else {
            Path outputPath = super.getOutputPath(conf);
            if (outputPath == null) {
                throw new IOException("Undefined job output-path");
            }
            workPath = outputPath;
        }
        return workPath;
    }

    // Determine the output file name (including extension) from key, value and conf.
    protected abstract String generateFileNameForKeyValue(K key, V value, Configuration conf);

    public class MultiRecordWriter extends RecordWriter<K, V> {

        // Cache of RecordWriters, one per output file name.
        private HashMap<String, RecordWriter<K, V>> recordWriters = null;
        private TaskAttemptContext job = null;
        // Output directory.
        private Path workPath = null;

        public MultiRecordWriter(TaskAttemptContext job, Path workPath) {
            super();
            this.job = job;
            this.workPath = workPath;
            recordWriters = new HashMap<String, RecordWriter<K, V>>();
        }

        @Override
        public void write(K key, V value) throws IOException, InterruptedException {
            // Work out the output file name for this record.
            String baseName = generateFileNameForKeyValue(key, value, job.getConfiguration());
            RecordWriter<K, V> rw = this.recordWriters.get(baseName);
            if (rw == null) {
                rw = getBaseRecordWriter(job, baseName);
                this.recordWriters.put(baseName, rw);
            }
            rw.write(key, value);
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            Iterator<RecordWriter<K, V>> values = this.recordWriters.values().iterator();
            while (values.hasNext()) {
                values.next().close(context);
            }
            this.recordWriters.clear();
        }

        private RecordWriter<K, V> getBaseRecordWriter(TaskAttemptContext job, String baseName)
                throws IOException {
            Configuration conf = job.getConfiguration();
            boolean isCompressed = getCompressOutput(job);
            String keyValueSeparator = "\t";   // separator between key and value
            RecordWriter<K, V> recordWriter = null;
            if (isCompressed) {
                Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job, GzipCodec.class);
                CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
                Path file = new Path(workPath, baseName + codec.getDefaultExtension());
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                recordWriter = new MailRecordWriter<K, V>(
                        new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
            } else {
                Path file = new Path(workPath, baseName);
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                recordWriter = new MailRecordWriter<K, V>(fileOut, keyValueSeparator);
            }
            return recordWriter;
        }
    }
}
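To make the routing concrete: with the PhoneOutputFormat shown earlier, a reduce key such as 老爸\t12 produces the file name 12.txt, so the job ends up with roughly one text file per month under the output directory, along these lines (listing purely illustrative):

/phone/out/01.txt
/phone/out/11.txt
/phone/out/12.txt

Note that only the month number appears in the key, so records for the same month of different years would share a file.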
MailRecordWriter.java: the line-oriented writer that each per-file RecordWriter delegates to, modeled on Hadoop's TextOutputFormat line writer.
package com.hadoop.phoneStatistics;

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

/**
 * @author Zimo
 * Line-oriented record writer: key, separator, value, newline.
 */
public class MailRecordWriter<K, V> extends RecordWriter<K, V> {

    private static final String utf8 = "UTF-8";
    private static final byte[] newline;
    static {
        try {
            newline = "\n".getBytes(utf8);
        } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
        }
    }

    protected DataOutputStream out;
    private final byte[] keyValueSeparator;

    public MailRecordWriter(DataOutputStream out, String keyValueSeparator) {
        this.out = out;
        try {
            this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
        } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8 + " encoding");
        }
    }

    public MailRecordWriter(DataOutputStream out) {
        this(out, "\t");
    }

    private void writeObject(Object o) throws IOException {
        if (o instanceof Text) {
            // Write the raw bytes of the Text without an extra copy.
            Text to = (Text) o;
            out.write(to.getBytes(), 0, to.getLength());
        } else {
            out.write(o.toString().getBytes(utf8));
        }
    }

    public synchronized void write(K key, V value) throws IOException {
        boolean nullKey = key == null || key instanceof NullWritable;
        boolean nullValue = value == null || value instanceof NullWritable;
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            writeObject(key);
        }
        if (!(nullKey || nullValue)) {
            out.write(keyValueSeparator);
        }
        if (!nullValue) {
            writeObject(value);
        }
        out.write(newline);
    }

    public synchronized void close(TaskAttemptContext context) throws IOException {
        out.close();
    }
}
Project Results

The processed results were shown in a screenshot in the original post (not reproduced here). Each output record has the format: name, month, phone number, call count. All call records are written out as one file per month, with the number of calls to each person counted.
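For instance, a per-month file such as 12.txt would then hold lines of this shape (values purely illustrative, matching the format above):

老爸	12	13999123786	6
老妈	12	13999123788	4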
That wraps up the main content of this part. It reflects my own learning process, and I hope it offers some guidance. If you found it useful, please leave a like; if not, please bear with me, and do point out any mistakes. Follow me to get updates as soon as they are posted. Thanks!
Copyright notice: this is an original post by the author; please do not repost without permission.