Hadoop Log Analysis

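I. Project Requirements

The logs discussed in this article are Web logs only. There is no precise definition of the term; it may include, but is not limited to, the user access logs produced by front-end web servers such as Apache, lighttpd, nginx, and Tomcat, as well as the logs written by web applications themselves.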

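II. Requirements Analysis: KPI Design

PV (PageView): page view count

IP: number of distinct IP addresses visiting the pages

Time: page views per hour

Source: visits grouped by referring domain

Browser: visits grouped by client device (browser)

The rest of this article focuses on the browser statistics.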
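Although the article concentrates on the Browser KPI, the Time and Source KPIs listed above mostly come down to deriving a grouping key from one log field. The following is a minimal sketch, not code from the original project: the class name `KpiKeys` and both method names are hypothetical, and it assumes the time field has already had its surrounding brackets removed.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper: derives grouping keys for the Time and Source KPIs.
final class KpiKeys {

    // nginx $time_local without brackets, e.g. 18/Sep/2013:06:49:57 +0000
    private static final DateTimeFormatter NGINX_TIME =
            DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

    private static final DateTimeFormatter HOUR_KEY =
            DateTimeFormatter.ofPattern("yyyyMMddHH");

    /** Returns an hour bucket such as 2013091806 for the Time KPI. */
    static String hourKey(String timeLocal) {
        OffsetDateTime time = OffsetDateTime.parse(timeLocal, NGINX_TIME);
        return time.format(HOUR_KEY);
    }

    /** Returns the referring host for the Source KPI, or "-" if there is none. */
    static String refererDomain(String referer) {
        try {
            String host = new URI(referer).getHost();
            return host == null ? "-" : host;
        } catch (URISyntaxException e) {
            return "-"; // unparseable referer, treated as "no source"
        }
    }

    public static void main(String[] args) {
        System.out.println(hourKey("18/Sep/2013:06:49:57 +0000"));
        System.out.println(refererDomain("http://www.angularjs.cn/A00n"));
    }
}
```

A mapper for either KPI could emit these keys with a value of 1, in the same way the browser job emits the browser family.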

III. Analysis Process

1. A sample nginx log record

222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

2. Breaking down the log record above

remote_addr: the client IP address, 222.68.172.190

remote_user: the client user name, -

time_local: the access time and time zone, [18/Sep/2013:06:49:57 +0000]

request: the requested URL and HTTP protocol, "GET /images/my.jpg HTTP/1.1"

status: the response status (200 on success), 200

body_bytes_sent: the size of the response body sent to the client, 19939

http_referer: the page the request was linked from, "http://www.angularjs.cn/A00n"

http_user_agent: information about the client browser, "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"

3. Parsing the log record in Java (splitting on spaces)

    String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] \"GET /images/my.jpg HTTP/1.1\" 200 19939 \"http://www.angularjs.cn/A00n\" \"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";

    // Split on single spaces and print every token with its index.
    String[] elementList = line.split(" ");
    for (int i = 0; i < elementList.length; i++) {
        System.out.println(i + " : " + elementList[i]);
    }

Test output:

0 : 222.68.172.190
1 : -
2 : -
3 : [18/Sep/2013:06:49:57
4 : +0000]
5 : "GET
6 : /images/my.jpg
7 : HTTP/1.1"
8 : 200
9 : 19939
10 : "http://www.angularjs.cn/A00n"
11 : "Mozilla/5.0
12 : (Windows
13 : NT
14 : 6.1)
15 : AppleWebKit/537.36
16 : (KHTML,
17 : like
18 : Gecko)
19 : Chrome/29.0.1547.66
20 : Safari/537.36"
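Splitting on spaces scatters the quoted fields, most visibly the user agent, across many tokens. One alternative, sketched here as an assumption rather than as the author's method, is a regular expression that treats each bracketed or quoted field as a unit. The class name `AccessLogRegex` and the field-name map are hypothetical; the pattern targets the nginx combined layout shown in the sample record and returns null for anything else.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
final class AccessLogRegex {

    // addr - user [time] "request" status bytes "referer" "user agent"
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)"
            + " \"([^\"]*)\" \"([^\"]*)\"$");

    private static final String[] FIELDS = {
        "remote_addr", "remote_user", "time_local", "request",
        "status", "body_bytes_sent", "http_referer", "http_user_agent"
    };

    /** Returns field name -> value, or null when the line does not match. */
    static Map<String, String> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < FIELDS.length; i++) {
            fields.put(FIELDS[i], m.group(i + 1)); // capture groups are 1-based
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "
                + "\"GET /images/my.jpg HTTP/1.1\" 200 19939 "
                + "\"http://www.angularjs.cn/A00n\" "
                + "\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36\"";
        parse(line).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

With this approach the user agent arrives as a single string without its surrounding quotes, so no token re-joining is needed before handing it to a user-agent parser.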

4. The Kpi entity class:

    package org.hmahout.kpi.entity;

    public class Kpi {

        private String remote_addr;     // client IP address
        private String remote_user;     // client user name; "-" means not set
        private String time_local;      // access time and time zone
        private String request;         // requested URL and HTTP protocol
        private String status;          // response status; 200 on success
        private String body_bytes_sent; // size of the response body sent to the client
        private String http_referer;    // page the request was linked from
        private String http_user_agent; // client browser information
        private String method;          // request method, e.g. GET or POST
        private String http_version;    // HTTP version

        public String getMethod() {
            return method;
        }

        public void setMethod(String method) {
            this.method = method;
        }

        public String getHttp_version() {
            return http_version;
        }

        public void setHttp_version(String http_version) {
            this.http_version = http_version;
        }

        public String getRemote_addr() {
            return remote_addr;
        }

        public void setRemote_addr(String remote_addr) {
            this.remote_addr = remote_addr;
        }

        public String getRemote_user() {
            return remote_user;
        }

        public void setRemote_user(String remote_user) {
            this.remote_user = remote_user;
        }

        public String getTime_local() {
            return time_local;
        }

        public void setTime_local(String time_local) {
            this.time_local = time_local;
        }

        public String getRequest() {
            return request;
        }

        public void setRequest(String request) {
            this.request = request;
        }

        public String getStatus() {
            return status;
        }

        public void setStatus(String status) {
            this.status = status;
        }

        public String getBody_bytes_sent() {
            return body_bytes_sent;
        }

        public void setBody_bytes_sent(String body_bytes_sent) {
            this.body_bytes_sent = body_bytes_sent;
        }

        public String getHttp_referer() {
            return http_referer;
        }

        public void setHttp_referer(String http_referer) {
            this.http_referer = http_referer;
        }

        public String getHttp_user_agent() {
            return http_user_agent;
        }

        public void setHttp_user_agent(String http_user_agent) {
            this.http_user_agent = http_user_agent;
        }

        @Override
        public String toString() {
            return "Kpi [remote_addr=" + remote_addr
                    + ", remote_user=" + remote_user
                    + ", time_local=" + time_local
                    + ", request=" + request
                    + ", status=" + status
                    + ", body_bytes_sent=" + body_bytes_sent
                    + ", http_referer=" + http_referer
                    + ", http_user_agent=" + http_user_agent
                    + ", method=" + method
                    + ", http_version=" + http_version + "]";
        }
    }

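5. The Kpi utility class

    package org.hmahout.kpi.util;

    import org.hmahout.kpi.entity.Kpi;

    public class KpiUtil {

        /**
         * Converts one log line into a Kpi object.
         *
         * @param line one record from the access log
         * @return the parsed Kpi, or null if the line has too few fields
         * @author tianbx
         */
        public static Kpi transformLineKpi(String line) {
            String[] elementList = line.split(" ");
            if (elementList.length < 12) {
                return null; // malformed or truncated record
            }
            Kpi kpi = new Kpi();
            kpi.setRemote_addr(elementList[0]);
            kpi.setRemote_user(elementList[2]);             // token 1 is the literal "-" separator
            kpi.setTime_local(elementList[3].substring(1)); // drop the leading '['
            kpi.setMethod(elementList[5].substring(1));     // drop the leading '"'
            kpi.setRequest(elementList[6]);
            kpi.setHttp_version(elementList[7].replace("\"", "")); // drop the trailing '"'
            kpi.setStatus(elementList[8]);
            kpi.setBody_bytes_sent(elementList[9]);
            kpi.setHttp_referer(elementList[10]);
            // The user agent contains spaces, so it spans token 11 through the last token.
            StringBuilder agent = new StringBuilder(elementList[11]);
            for (int i = 12; i < elementList.length; i++) {
                agent.append(' ').append(elementList[i]);
            }
            kpi.setHttp_user_agent(agent.toString().replace("\"", ""));
            return kpi;
        }
    }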

6. Algorithm model: parallel algorithm

Browser: visits grouped by client device (browser)

- Map: {key: $http_user_agent, value: 1}

- Reduce: {key: $http_user_agent, value: sum of the values}
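The map and reduce steps above can be tried out without a cluster. The sketch below is a plain in-memory simulation, assuming the browser name has already been extracted from each record; the class `BrowserCountSketch` and its method names are hypothetical and no Hadoop API is involved.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory simulation of the Browser KPI job.
final class BrowserCountSketch {

    /** Map step: emit (browser, 1) for every record. */
    static List<Map.Entry<String, Integer>> map(List<String> browsers) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String browser : browsers) {
            pairs.add(new SimpleEntry<>(browser, 1));
        }
        return pairs;
    }

    /** Reduce step: sum the values that share a key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>(); // sorted keys, like the job output
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> browsers = Arrays.asList("Chrome", "IE", "Chrome", "Firefox", "Chrome");
        System.out.println(reduce(map(browsers)));
    }
}
```

Because summation is associative and commutative, the same reduce logic can also run as a combiner on each mapper's partial output, which is how the job in the next section is configured.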

7. MapReduce analysis code

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.hmahout.kpi.entity.Kpi;
    import org.hmahout.kpi.util.KpiUtil;

    import cz.mallat.uasparser.UASparser;
    import cz.mallat.uasparser.UserAgentInfo;

    public class KpiBrowserSimpleV {

        public static class KpiBrowserSimpleMapper extends MapReduceBase
                implements Mapper<Object, Text, Text, IntWritable> {

            UASparser parser = null; // created lazily on the first record

            @Override
            public void map(Object key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
                    throws IOException {
                Kpi kpi = KpiUtil.transformLineKpi(value.toString());
                if (kpi != null && kpi.getHttp_user_agent() != null) {
                    if (parser == null) {
                        parser = new UASparser();
                    }
                    UserAgentInfo info = parser.parseBrowserOnly(kpi.getHttp_user_agent());
                    // Emit (browser, 1): "unknown" when unrecognized, otherwise the browser family.
                    if ("unknown".equals(info.getUaName())) {
                        out.collect(new Text(info.getUaName()), new IntWritable(1));
                    } else {
                        out.collect(new Text(info.getUaFamily()), new IntWritable(1));
                    }
                }
            }
        }

        public static class KpiBrowserSimpleReducer extends MapReduceBase
                implements Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            public void reduce(Text key, Iterator<IntWritable> value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
                    throws IOException {
                // Sum all counts emitted for this browser.
                IntWritable sum = new IntWritable(0);
                while (value.hasNext()) {
                    sum.set(sum.get() + value.next().get());
                }
                out.collect(key, sum);
            }
        }

        public static void main(String[] args) throws IOException {
            String input = "hdfs://127.0.0.1:9000/user/tianbx/log_kpi/input";
            String output = "hdfs://127.0.0.1:9000/user/tianbx/log_kpi/browerSimpleV";

            JobConf conf = new JobConf(KpiBrowserSimpleV.class);
            conf.setJobName("KpiBrowserSimpleV");

            String url = "classpath:";
            conf.addResource(url + "/hadoop/core-site.xml");
            conf.addResource(url + "/hadoop/hdfs-site.xml");
            conf.addResource(url + "/hadoop/mapred-site.xml");

            conf.setMapOutputKeyClass(Text.class);
            conf.setMapOutputValueClass(IntWritable.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(KpiBrowserSimpleMapper.class);
            conf.setCombinerClass(KpiBrowserSimpleReducer.class); // the reducer doubles as a combiner
            conf.setReducerClass(KpiBrowserSimpleReducer.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            FileInputFormat.setInputPaths(conf, new Path(input));
            FileOutputFormat.setOutputPath(conf, new Path(output));

            JobClient.runJob(conf);
            System.exit(0);
        }
    }

8. Contents of the output file log_kpi/browerSimpleV

AOL Explorer 1

Android Webkit 123

Chrome 4867

CoolNovo 23

Firefox 1700

Google App Engine 5

IE 1521

Jakarta Commons-HttpClient 3

Maxthon 27

Mobile Safari 273

Mozilla 130

Openwave Mobile Browser 2

Opera 2

Pale Moon 1

Python-urllib 4

Safari 246

Sogou Explorer 157

unknown 4685
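The job writes its keys in sorted order, so the most common browsers are not at the top. As a rough sketch for ranking the result afterwards, the code below assumes each output line is a browser name and a count separated by a tab, which is the default for `TextOutputFormat`. The class `BrowserRanking` and its nested `Row` type are hypothetical names; the sample lines in `main` are taken from the listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical post-processing helper for the job output.
final class BrowserRanking {

    static final class Row {
        final String browser;
        final int count;

        Row(String browser, int count) {
            this.browser = browser;
            this.count = count;
        }
    }

    /** Parses "name<TAB>count" lines and sorts them by count, largest first. */
    static List<Row> rank(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t'); // browser names may contain spaces
            if (tab < 0) {
                continue; // skip lines that are not key/value pairs
            }
            String browser = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            rows.add(new Row(browser, count));
        }
        rows.sort((a, b) -> Integer.compare(b.count, a.count));
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "Chrome\t4867", "Firefox\t1700", "IE\t1521",
                "Mobile Safari\t273", "unknown\t4685");
        for (Row row : rank(sample)) {
            System.out.println(row.browser + "\t" + row.count);
        }
    }
}
```

Ranking makes it easier to see that the "unknown" bucket is nearly as large as Chrome in this data set, which is relevant to the crawler questions raised at the end of the article.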

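9. Plotting the result with R

    library(ggplot2)

    # Assumes the job output has been saved as comma-separated text in borwer.txt.
    data <- read.table(file="borwer.txt", header=FALSE, sep=",")
    names(data) <- c("borwer", "num")

    # num already holds the totals, so plot the values as they are instead of counting rows.
    ggplot(data, aes(x=borwer, y=num)) + geom_bar(stat="identity")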

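Problems still to solve

1. Excluding crawlers and scripted clicks, to counter fraud

One approach: have the page detect whether the mouse moves.

2. How can image requests be excluded from the page view count?

3. How can fake clicks be excluded from the page view count?

4. Which search engine did the visit come from?

5. Which keyword did the visitor click through on?

6. Where did the visit come from?

7. Which browser was used for the visit?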
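For the first two questions, a server-side heuristic can complement the mouse-movement check. The sketch below is an illustration under stated assumptions, not a complete anti-fraud solution: the class `PvFilter`, the marker list, and the suffix list are all hypothetical choices. It treats a missing user agent or one containing an automation marker as non-human, and it treats common image, stylesheet, and script suffixes as static resources that should not count as page views.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; the marker and suffix lists are illustrative assumptions.
final class PvFilter {

    private static final List<String> BOT_MARKERS = Arrays.asList(
            "bot", "spider", "crawler", "python-urllib", "httpclient");

    private static final List<String> STATIC_SUFFIXES = Arrays.asList(
            ".jpg", ".jpeg", ".png", ".gif", ".ico", ".css", ".js");

    /** True when the user agent is missing or contains a known automation marker. */
    static boolean looksAutomated(String userAgent) {
        if (userAgent == null || userAgent.isEmpty() || "-".equals(userAgent)) {
            return true;
        }
        String lower = userAgent.toLowerCase(Locale.ROOT);
        for (String marker : BOT_MARKERS) {
            if (lower.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    /** True when the request path, ignoring any query string, ends in a static suffix. */
    static boolean isStaticResource(String requestPath) {
        String path = requestPath.toLowerCase(Locale.ROOT);
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        for (String suffix : STATIC_SUFFIXES) {
            if (path.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }

    /** A request counts as a page view only if it is neither static nor automated. */
    static boolean countsAsPageView(String requestPath, String userAgent) {
        return !isStaticResource(requestPath) && !looksAutomated(userAgent);
    }

    public static void main(String[] args) {
        String chrome = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36";
        System.out.println(countsAsPageView("/images/my.jpg", chrome));
        System.out.println(countsAsPageView("/A00n", chrome));
    }
}
```

User-agent strings are easy to forge, so a filter like this only removes well-behaved crawlers and libraries; it would be applied in the mapper before emitting a count.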
