Cleaning Video Site Data with MapReduce and Analyzing It with Hive

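I. Requirements

Use MapReduce to clean the raw data from a video site, then use Hive to compute a set of common Top N metrics:

Top 10 videos by view count

Top 10 video categories by popularity

Categories of the Top 20 most-viewed videos, and how many of the Top 20 each category contains

Popularity ranking of the categories of the videos related to the Top 50 most-viewed videos

Top 10 videos by popularity within each category, using Music as an example

Top 10 videos by traffic within each category, using Music as an example

Top 10 users by number of uploads, and the videos they uploaded

Top 10 videos by view count in each category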

2. Data source structure

Data source 1: user.txt

Sample data:

barelypolitical 151 5106

bonk65 89 144

camelcars 26 674

The three fields in the sample:

Uploader username	string
Number of uploaded videos	int
Number of friends	int

Data source 2: video.txt

Sample data:

fQShwYqGqsw	lonelygirl15	736	People & Blogs	133	151763	3.01	666	765	fQShwYqGqsw	LfAaY1p_2Is	5LELNIVyMqo	vW6ZpqXjCE4	vPUAf43vc-Q	ZllfQZCc2_M	it2d7LaU_TA	KGRx8TgZEeU	aQWdqI1vd6o	kzwa8NBlUeo	X3ctuFCCF5k	Ble9N2kDiGc	R24FONE2CDs	IAY5q60CmYY	mUd0hcEnHiU	6OUcp6UJ2bA	dv0Y_uoHrLc	8YoxhsUMlgA	h59nXANN-oo	113yn3sv0eo

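Fields in the sample:

Video id	Unique 11-character string
Uploader	Username of the user who uploaded the video (string)
Video age	Integer number of days between the upload date and February 15, 2007
Category	Category assigned to the video at upload
Length	Video length as an integer
Views	Number of times the video was viewed
Rating	Video rating, out of 5
Traffic	Video traffic as an integer (the ratings column in the Hive tables)
Comments	Integer number of comments on the video
Related video ids	ids of related videos, at most 20

The samples above show only a record or two to illustrate the structure of the dataset; the full dataset used in the rest of the project can be downloaded separately.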

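II. Data Cleaning

1) Analyzing the data

In video.txt a video can belong to several categories. The categories are separated by an & with a space on each side, while the related video ids are separated by "\t", the same delimiter used between fields. To make these multi-valued fields easy to work with during analysis, we first reorganize and clean the data.

Concretely: keep "&" as the separator between categories but strip the spaces around it, and join the related video ids with "&" as well. It might look simpler to replace the "&" between categories with "\t", but that would split a video's categories across several fields, the number of columns would vary from row to row, and the data could no longer be cleaned reliably.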

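2) Notes

This cleaning step needs no reduce phase, so a map-only job is enough. A record may have no related video ids, but the other fields must be present: the comment count, for example, must always have a value (0 if there are no comments). A record with fewer than 9 fields is therefore dirty data and is dropped.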
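Before running a full job, the rule can be tried on a single record without a cluster. The following is a minimal sketch in plain Java, not the job itself: the class name VideoLineCleaner and the method clean are hypothetical names chosen for illustration, and the sample record is shortened to two related ids.

```java
import java.util.Optional;

// Hypothetical helper for illustration only; not part of the MapReduce job.
class VideoLineCleaner {

    // The first 9 fields (video id .. comment count) are mandatory.
    static final int REQUIRED_FIELDS = 9;

    /** Returns the cleaned line, or an empty Optional if the record is dirty. */
    static Optional<String> clean(String line) {
        String[] fields = line.split("\t");
        if (fields.length < REQUIRED_FIELDS) {
            return Optional.empty();
        }
        // "People & Blogs" becomes "People&Blogs"
        fields[3] = fields[3].replace(" ", "");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            sb.append(fields[i]);
            if (i != fields.length - 1) {
                // Tabs between the mandatory fields, & between related ids
                sb.append(i < REQUIRED_FIELDS ? "\t" : "&");
            }
        }
        return Optional.of(sb.toString());
    }

    public static void main(String[] args) {
        String raw = "fQShwYqGqsw\tlonelygirl15\t736\tPeople & Blogs\t133"
                + "\t151763\t3.01\t666\t765\tLfAaY1p_2Is\t5LELNIVyMqo";
        System.out.println(clean(raw).orElse("<dropped>"));
        System.out.println(clean("too\tfew\tfields").orElse("<dropped>"));
    }
}
```

The first call prints the record with People&Blogs as its category field and LfAaY1p_2Is&5LELNIVyMqo as its last field; the second prints the dropped marker because the record has fewer than 9 fields. The MapReduce job that follows applies the same transformation to every line of video.txt.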

 package mapreduce.videoETL;

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.NullWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class VideoMapReduce extends Configured implements Tool {

     public static class VideoMap extends Mapper<LongWritable, Text, NullWritable, Text> {

         private Text mapOutputValue = new Text();

         @Override
         protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
             String line = value.toString();
             String[] splits = line.split("\t");

             // 1. Drop invalid records: the first 9 fields are mandatory
             if (splits.length < 9) return;

             // 2. Remove the spaces around the & in the category field
             splits[3] = splits[3].replaceAll(" ", "");

             // 3. Keep \t between the first 9 fields; join the related video ids with &
             StringBuilder sb = new StringBuilder();
             for (int i = 0; i < splits.length; i++) {
                 sb.append(splits[i]);
                 if (i != splits.length - 1) {
                     sb.append(i < 9 ? "\t" : "&");
                 }
             }

             String newline = sb.toString();
             mapOutputValue.set(newline);
             context.write(NullWritable.get(), mapOutputValue);
         }
     }

     @Override
     public int run(String[] args) throws Exception {
         Configuration conf = this.getConf();
         Job job = Job.getInstance(conf);
         job.setJarByClass(VideoMapReduce.class);

         // Input directory
         Path inpath = new Path(args[0]);
         FileInputFormat.addInputPath(job, inpath);

         // Output directory for the cleaned data
         Path outpath = new Path(args[1]);
         FileOutputFormat.setOutputPath(job, outpath);

         // Mapper class (this job has no reducer)
         job.setMapperClass(VideoMap.class);

         // Map output key and value types
         job.setMapOutputKeyClass(NullWritable.class);
         job.setMapOutputValueClass(Text.class);

         // Map-only job: zero reduce tasks
         job.setNumReduceTasks(0);

         // Submit the job and wait for it to finish
         boolean isSuccess = job.waitForCompletion(true);
         return isSuccess ? 0 : 1;
     }

     public static void main(String[] args) throws Exception {
         Configuration configuration = new Configuration();

         // Hard-coded HDFS input and output paths (these replace any command-line arguments)
         args = new String[]{
                 "hdfs://beifeng01/user/beifeng01/mapreduce/input/testdata/videoData/video.txt",
                 "hdfs://beifeng01/user/beifeng01/mapreduce/output"
         };

         // Run the job
         int status = ToolRunner.run(configuration, new VideoMapReduce(), args);

         // Exit with the job status
         System.exit(status);
     }
 }

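After the MapReduce job finishes, check the output directory to confirm that the data was cleaned correctly.

Creating the tables

We need four tables in total. Why four, when there are only two data files? Because the analysis tables are stored as ORC rather than in the default textfile format. load data essentially moves files into the table directory without converting them, so text files cannot be loaded into an ORC table directly; instead, the data is inserted with a query that selects from another table. We therefore need two textfile tables, user and video, and two ORC tables, user_orc and video_orc.

1. First create the textfile tables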

 create table video(
     videoId string,
     uploader string,
     age int,
     category array<string>,
     length int,
     views int,
     rate float,
     ratings int,
     comments int,
     relatedId array<string>)
 row format delimited
 fields terminated by "\t"
 collection items terminated by "&"
 stored as textfile;

 create table user(
     uploader string,
     videos int,
     friends int)
 row format delimited
 fields terminated by "\t"
 stored as textfile;

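Load data into the two tables from HDFS. If your Hive version treats user as a reserved word, wrap the table name in backticks wherever it appears.

load data inpath 'HDFS path of the user data' into table user;

load data inpath 'HDFS path of the cleaned video data' into table video;

2. Create the two ORC tables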

 create table video_orc(
     videoId string,
     uploader string,
     age int,
     category array<string>,
     length int,
     views int,
     rate float,
     ratings int,
     comments int,
     relatedId array<string>)
 clustered by (uploader) into 8 buckets
 row format delimited fields terminated by "\t"
 collection items terminated by "&"
 stored as orc;

 create table user_orc(
     uploader string,
     videos int,
     friends int)
 clustered by (uploader) into 24 buckets
 row format delimited
 fields terminated by "\t"
 stored as orc;

Load data into the two ORC tables

 insert into table user_orc select * from user;
 insert into table video_orc select * from video;

The data is now loaded into both tables and can be checked with a simple query:

 select * from user_orc limit 10;
 select * from video_orc limit 10;

III. Implementing the Business Queries

1. Top 10 videos by view count

A global sort with order by is all that is needed:

 select videoId, uploader, views from video_orc order by views desc limit 10;

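2. Top 10 video categories by popularity

Analysis: popularity here means the number of videos in a category, and we want the 10 categories with the most videos. We group by category with group by, count the videos in each group with count(), then sort by that count and output the first 10. Because one video can belong to several categories, the category column must first be expanded so that each category gets its own row before group by can be used.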

 select
     category_name as category,
     count(t.videoId) as hot
 from (
     select
         videoId,
         category_name
     from
         video_orc lateral view explode(category) t_catetory as category_name) t
 group by
     t.category_name
 order by
     hot desc
 limit 10;

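explode expands the category column into rows: for an array it produces one row per element, and for a map it produces one row per entry, with the key and the value as two columns. To select other columns of the table alongside the exploded values, explode must be used together with lateral view. t_catetory is the alias of the virtual table that lateral view generates, and it is required.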
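To make the row expansion concrete, here is a small plain-Java sketch of what lateral view explode followed by group by and count computes. It only illustrates the semantics and is not how Hive executes the query; the class ExplodeSketch, its method names and the video ids v1 to v3 are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of "lateral view explode(category)" semantics.
class ExplodeSketch {

    /** One (videoId, category) row per element of each video's category array. */
    static List<String[]> lateralViewExplode(Map<String, List<String>> categoriesByVideo) {
        List<String[]> rows = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : categoriesByVideo.entrySet()) {
            for (String category : e.getValue()) {
                rows.add(new String[] {e.getKey(), category});
            }
        }
        return rows;
    }

    /** Counterpart of "group by category_name" with count(videoId). */
    static Map<String, Integer> countByCategory(List<String[]> rows) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] row : rows) {
            counts.merge(row[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> videos = new TreeMap<>();
        videos.put("v1", Arrays.asList("People", "Blogs"));
        videos.put("v2", Arrays.asList("Music"));
        videos.put("v3", Arrays.asList("Music", "Blogs"));

        List<String[]> rows = lateralViewExplode(videos);
        System.out.println(rows.size());           // 5 rows from 3 videos
        System.out.println(countByCategory(rows)); // {Blogs=2, Music=2, People=1}
    }
}
```

A video with two categories contributes two rows, which is why it is counted once in each of its categories.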

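3. Categories of the Top 20 most-viewed videos, and how many of the Top 20 each category contains

Analysis: first select all columns of the 20 most-viewed videos, sorted by views in descending order. Then expand the category column of those 20 rows so that each category gets its own row. Finally, query each category name together with the number of Top 20 videos it contains.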

 select
     category_name as category,
     count(t2.videoId) as hot_with_views
 from (
     select
         videoId,
         category_name
     from (
         select
             *
         from
             video_orc
         order by
             views desc
         limit 20) t1 lateral view explode(category) t_catetory as category_name) t2
 group by
     category_name
 order by
     hot_with_views desc;

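4. Popularity ranking of the categories of the videos related to the Top 50 most-viewed videos

Analysis: select all columns of the 50 most-viewed videos, which include each video's related video ids, as temporary table t1. Expand the related video ids of those 50 rows into one id per row as temporary table t2. Inner join the related video ids with the video_orc table to look up their categories, removing duplicates. Then expand the categories, group by category, count the videos in each group, and sort.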

 select
     category_name as category,
     count(t5.videoId) as hot
 from (
     select
         videoId,
         category_name
     from (
         select
             distinct(t2.videoId),
             t3.category
         from (
             select
                 explode(relatedId) as videoId
             from (
                 select
                     *
                 from
                     video_orc
                 order by
                     views desc
                 limit 50) t1) t2
         inner join
             video_orc t3 on t2.videoId = t3.videoId) t4 lateral view explode(category) t_catetory as category_name) t5
 group by
     category_name
 order by
     hot desc;

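5. Top 10 videos by popularity within each category, using Music as an example

Analysis: first expand the category column of the video_orc table. We create a table that holds one row per video and category, insert the expanded data into it, and then rank the videos in the chosen category (Music) by popularity, measured here by views.

Create the table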

 create table test(
     videoId string,
     uploader string,
     age int,
     categoryId string,
     length int,
     views int,
     rate float,
     ratings int,
     comments int,
     relatedId array<string>)
 row format delimited
 fields terminated by "\t"
 collection items terminated by "&"
 stored as orc;

Insert the data

 insert into table test
 select
     videoId,
     uploader,
     age,
     categoryId,
     length,
     views,
     rate,
     ratings,
     comments,
     relatedId
 from
     video_orc lateral view explode(category) catetory as categoryId;

Top 10 videos by popularity in the Music category

 select
     videoId,
     views
 from
     test
 where
     categoryId = "Music"
 order by
     views desc
 limit 10;

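6. Top 10 videos by traffic within each category, using Music as an example

Analysis: query the table created in step 5 directly, sorting by ratings (traffic).

 select
     videoId,
     views,
     ratings
 from
     test
 where
     categoryId = "Music"
 order by
     ratings desc
 limit 10;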

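7. Top 10 users by number of uploads, and the videos they uploaded

Analysis: first find the 10 users who uploaded the most videos, then join them with the video_orc table on the uploader field and sort the result by views. The query below keeps the 20 most-viewed of those videos.

 select
     t2.videoId,
     t2.views,
     t2.ratings,
     t1.videos,
     t1.friends
 from (
     select
         *
     from
         user_orc
     order by
         videos desc
     limit 10) t1
 join
     video_orc t2
 on
     t1.uploader = t2.uploader
 order by
     views desc
 limit 20;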

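8. Top 10 videos by view count in each category

Analysis: start from the table with the expanded categoryId column. A subquery partitions the rows by categoryId, sorts each partition by views in descending order, and assigns an increasing row number within each partition; this column is named rank. The outer query then selects, from the temporary table produced by the subquery, the rows whose rank is less than or equal to 10.

 select
     t1.*
 from (
     select
         videoId,
         categoryId,
         views,
         row_number() over(partition by categoryId order by views desc) rank
     from
         test) t1
 where
     rank <= 10;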
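The partition, sort and number steps described above can also be sketched in plain Java. This is only an illustration of what row_number() over(partition by ... order by ...) with a rank filter computes, not how Hive runs it; the class TopNPerCategory, its Video type and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a per-category Top N.
class TopNPerCategory {

    static final class Video {
        final String videoId;
        final String categoryId;
        final int views;

        Video(String videoId, String categoryId, int views) {
            this.videoId = videoId;
            this.categoryId = categoryId;
            this.views = views;
        }
    }

    /** Partition by categoryId, sort each partition by views desc, keep rows 1..n. */
    static Map<String, List<Video>> topN(List<Video> videos, int n) {
        // "partition by categoryId"
        Map<String, List<Video>> partitions = new TreeMap<>();
        for (Video v : videos) {
            partitions.computeIfAbsent(v.categoryId, k -> new ArrayList<>()).add(v);
        }
        Map<String, List<Video>> result = new TreeMap<>();
        for (Map.Entry<String, List<Video>> e : partitions.entrySet()) {
            // "order by views desc"
            List<Video> sorted = new ArrayList<>(e.getValue());
            sorted.sort(Comparator.comparingInt((Video v) -> v.views).reversed());
            // "where rank <= n": list position + 1 plays the role of row_number()
            int limit = Math.min(n, sorted.size());
            result.put(e.getKey(), new ArrayList<>(sorted.subList(0, limit)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Video> videos = Arrays.asList(
                new Video("a", "Music", 10),
                new Video("b", "Music", 30),
                new Video("c", "Music", 20),
                new Video("d", "Comedy", 5));
        for (Map.Entry<String, List<Video>> e : topN(videos, 2).entrySet()) {
            for (Video v : e.getValue()) {
                System.out.println(e.getKey() + "\t" + v.videoId + "\t" + v.views);
            }
        }
    }
}
```

With n = 2 the Music partition keeps b and c, in that order, and the Comedy partition keeps its single video d.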

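9. Possible problems

JVM heap out-of-memory errors

Fix: add the following properties to yarn-site.xml. Note that mapred.child.java.opts is a MapReduce property and is more commonly set in mapred-site.xml.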

<property>

    <name>yarn.scheduler.maximum-allocation-mb</name>

    <value>2048</value>

</property>

<property>

    <name>yarn.scheduler.minimum-allocation-mb</name>

    <value>2048</value>

</property>

<property>

    <name>yarn.nodemanager.vmem-pmem-ratio</name>

    <value>2.1</value>

</property>

<property>

    <name>mapred.child.java.opts</name>

    <value>-Xmx1024m</value>

</property>
