Log analysis: computing PV and UV for each hour of each day

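Step 1: Requirements analysis

Required fields: the time (the day and the hour of day), id, url, guid, and trackTime

The data should be partitioned by day and hour

PV (page views): the number of records

UV (unique visitors): the number of distinct guid values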
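The two metrics can be illustrated with a minimal Python sketch before moving to Hive. The record layout `(date, hour, url, guid)` and the sample rows are assumptions made for illustration only; they are not taken from the real log files.

```python
from collections import defaultdict


def pv_uv_by_hour(records):
    """Return {(date, hour): (pv, uv)} for (date, hour, url, guid) tuples.

    The tuple layout is a hypothetical simplification of the log schema.
    """
    pv = defaultdict(int)        # PV: number of records per (date, hour)
    visitors = defaultdict(set)  # UV: distinct guid values per (date, hour)
    for date, hour, url, guid in records:
        key = (date, hour)
        pv[key] += 1
        visitors[key].add(guid)
    return {key: (pv[key], len(visitors[key])) for key in pv}


# Made-up sample rows, for illustration only.
sample = [
    ("28", "18", "/a", "g1"),
    ("28", "18", "/b", "g1"),
    ("28", "18", "/a", "g2"),
    ("28", "19", "/a", "g1"),
]
result = pv_uv_by_hour(sample)
print(result)
```

In this sketch every record counts toward PV. Hive's `count(url)` and `count(distinct guid)` skip NULL values, so the two can differ when a field is missing.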

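Step 2: Implementation plan

Create a Hive table whose column delimiter matches the one used in the log files

Load the data into the Hive table

Clean the data, then write HiveQL to compute the statistics and store the result in another Hive table

Export the data in that result table to MySQL with Sqoop

The website project reads this table from MySQL

Expected result

Date		Hour		PV		UV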

Step 3: Implementation

# Create the source table (note: beeline logs in with the Linux username and password)

　　create database if not exists track_log;

　　use track_log;

　　create table if not exists yhd_source(

　　id              string,

　　url             string,

　　referer         string,

　　keyword         string,

　　type            string,

　　guid            string,

　　pageId          string,

　　moduleId        string,

　　linkId          string,

　　attachedInfo    string,

　　sessionId       string,

　　trackerU        string,

　　trackerType     string,

　　ip              string,

　　trackerSrc      string,

　　cookie          string,

　　orderCode       string,

　　trackTime       string,

　　endUserId       string,

　　firstLink       string,

　　sessionViewNo   string,

　　productId       string,

　　curMerchantId   string,

　　provinceId      string,

　　cityId          string,

　　fee             string,

　　edmActivity     string,

　　edmEmail        string,

　　edmJobId        string,

　　ieVersion       string,

　　platform        string,

　　internalKeyword string,

　　resultSum       string,

　　currentPage     string,

　　linkPosition    string,

　　buttonPosition  string

　　)row format delimited fields terminated by '\t'

　　stored as textfile;

　　load data local inpath '/home/liuwl/opt/datas/2015082818' into table yhd_source;

　　load data local inpath '/home/liuwl/opt/datas/2015082819' into table yhd_source;

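# Create the cleansed table (on newer Hive versions, date is a reserved word and may need to be quoted with backticks)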

　　create table if not exists yhd_clean(

　　id string,

　　url string,

　　guid string,

　　date string,

　　hour string)

　　row format delimited fields terminated by '\t';

　　insert into table yhd_clean select id,url,guid,substring(trackTime,9,2) date,substring(trackTime,12,2) hour from yhd_source;
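The `substring` calls above use Hive's 1-based positions. The following Python sketch mimics that indexing to show which characters are extracted. It assumes `trackTime` has the form `yyyy-MM-dd HH:mm:ss`; that format is an assumption, and the sample value is made up.

```python
def hive_substring(text, start, length):
    """Mimic Hive's substring(str, pos, len) for a positive, 1-based pos."""
    begin = start - 1  # convert the 1-based position to a 0-based index
    return text[begin:begin + length]


# Hypothetical sample value in the assumed 'yyyy-MM-dd HH:mm:ss' format.
track_time = "2015-08-28 18:10:00"

day = hive_substring(track_time, 9, 2)    # characters 9-10: day of month
hour = hive_substring(track_time, 12, 2)  # characters 12-13: hour of day
print(day, hour)
```

Note that the `date` column therefore holds only the day of the month, not a full date, so data spanning several months would need a wider substring.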

　　select id,date,hour from yhd_clean limit 5;

# Create a partitioned table (static partitions)

　　create table if not exists yhd_part1(

　　id string,

　　url string,

　　guid string

　　) partitioned by (date string,hour string)

　　row format delimited fields terminated by '\t';

　　insert into table yhd_part1 partition (date='28',hour='18') select id,url,guid from yhd_clean where date='28' and hour='18';

　　insert into table yhd_part1 partition (date='28',hour='19') select id,url,guid from yhd_clean where date='28' and hour='19';

　　select id,date ,hour from yhd_part1 where date ='28' and hour='18' limit 10;

# Dynamic partitioning requires setting a few parameters first

　　set hive.exec.dynamic.partition=true;

　　set hive.exec.dynamic.partition.mode=nonstrict;

　　create table if not exists yhd_part2(

　　id string,

　　url string,

　　guid string

　　) partitioned by (date string,hour string)

　　row format delimited fields terminated by '\t';

# Dynamic partitions are matched by position: the last columns of the select list fill the partition columns in order

　　insert into table yhd_part2 partition (date,hour) select * from yhd_clean;

　　select id,date ,hour from yhd_part2 where date ='28' and hour='18' limit 10;
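As a rough illustration of what the dynamic-partition insert does, the Python sketch below routes each row to a partition based on its trailing columns, using directory-style names such as `date=28/hour=18`. The function name and the sample rows are hypothetical; this only models the idea and is not how Hive is implemented.

```python
from collections import defaultdict


def dynamic_partition(rows, partition_cols):
    """Group rows by their trailing columns, as a dynamic-partition insert would.

    The last len(partition_cols) values of each row become the partition
    values, taken in order; the remaining values are the stored data columns.
    """
    n = len(partition_cols)
    partitions = defaultdict(list)
    for row in rows:
        data, values = row[:-n], row[-n:]
        path = "/".join(f"{col}={val}" for col, val in zip(partition_cols, values))
        partitions[path].append(data)
    return dict(partitions)


# Made-up rows in the column order of yhd_clean: id, url, guid, date, hour.
rows = [
    ("1", "/a", "g1", "28", "18"),
    ("2", "/b", "g2", "28", "19"),
    ("3", "/c", "g1", "28", "18"),
]
parts = dynamic_partition(rows, ["date", "hour"])
for path, data in sorted(parts.items()):
    print(path, data)
```

Because the mapping is positional, `select *` only works here because `yhd_clean` happens to end with `date, hour` in that order. Listing the columns explicitly is the safer choice.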

# Implement the requirements

　　PV: select date,hour,count(url) PV from yhd_part1 group by date,hour;

　　0: jdbc:hive2://hadoop09-linux-01.ibeifeng.co> select date,hour,count(url) PV from yhd_part1 group by date,hour;

　　+-------+-------+--------+--+

　　| date  | hour  |   pv   |

　　+-------+-------+--------+--+

　　| 28    | 18    | 64972  |

　　| 28    | 19    | 61162  |

　　+-------+-------+--------+--+

　　UV: select date,hour,count(distinct(guid)) UV from yhd_part1 group by date,hour;

　　0: jdbc:hive2://hadoop09-linux-01.ibeifeng.co> select date,hour,count(distinct(guid)) UV from yhd_part1 group by date,hour;

　　+-------+-------+--------+--+

　　| date  | hour  |   uv   |

　　+-------+-------+--------+--+

　　| 28    | 18    | 23938  |

　　| 28    | 19    | 22330  |

　　+-------+-------+--------+--+

# Combine both metrics into the log_result table

　　create table if not exists log_result as select date,hour,count(url) PV,count(distinct(guid)) UV from yhd_part1 group by date,hour;

　　select date,hour,pv,uv from log_result;

　　0: jdbc:hive2://hadoop09-linux-01.ibeifeng.co> select date,hour,pv,uv from log_result;

　　+-------+-------+--------+--------+--+

　　| date  | hour  |   pv   |   uv   |

　　+-------+-------+--------+--------+--+

　　| 28    | 18    | 64972  | 23938  |

　　| 28    | 19    | 61162  | 22330  |

　　+-------+-------+--------+--------+--+

# Export the result table to MySQL with Sqoop

# Create the database and table in MySQL

　　create database if not exists track_result;

　　use track_result;

　　create table if not exists log_track_result(

　　date varchar(10) not null,

　　hour varchar(10) not null,

　　pv varchar(10) not null,

　　uv varchar(10) not null,

　　primary key(date,hour)

　　);

# Export to the log_track_result table with sqoop export

　　bin/sqoop export \

　　--connect jdbc:mysql://hadoop09-linux-01.ibeifeng.com:3306/track_result \

　　--username root \

　　--password root \

　　--table log_track_result \

　　--export-dir /user/hive/warehouse/track_log.db/log_result \

　　--num-mappers 1 \

　　--input-fields-terminated-by '\001'
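The `'\001'` delimiter is needed because `log_result` was created with `create table ... as select` and no `row format` clause, so its files use Hive's default field separator, Ctrl-A (`\001`). The Python sketch below parses one such line. The raw line is reconstructed by hand from the query result shown earlier and was not read from the actual warehouse file; the function name is hypothetical.

```python
FIELD_DELIMITER = "\x01"  # Ctrl-A, Hive's default field separator for text tables


def parse_result_line(line):
    """Split one Ctrl-A delimited line into the four log_result columns."""
    date, hour, pv, uv = line.rstrip("\n").split(FIELD_DELIMITER)
    return {"date": date, "hour": hour, "pv": int(pv), "uv": int(uv)}


# Hand-built example line; the values come from the log_result query output.
line = "28\x0118\x0164972\x0123938\n"
row = parse_result_line(line)
print(row)
```

If the delimiter passed to Sqoop does not match the one in the files, each line is treated as a single field and the export is likely to fail on column parsing.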

# Verify with a query in MySQL

　　select * from log_track_result;

　　mysql> select * from log_track_result;

　　+------+------+-------+-------+

　　| date | hour | pv    | uv    |

　　+------+------+-------+-------+

　　| 28   | 18   | 64972 | 23938 |

　　| 28   | 19   | 61162 | 22330 |

　　+------+------+-------+-------+

　　2 rows in set (0.00 sec)
