Log Analysis: A Complete Log Analysis Case Using Shell

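I. Requirements Analysis

 1. A log file is generated once a day (it has to be uploaded to HDFS on a schedule)

 2. Fields in each log record: client IP, access time, requested URL, response status, traffic (bytes sent)
 3. Available inputs: "yesterday's" log file and logclean.jar (the MapReduce cleaning program)

 4. Required metrics

      a. PV (page views)

      b. Number of registrations

      c. Number of distinct IPs

      d. Bounce rate

      e. Second-jump rate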

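II. Data Analysis

1. Data collection: a shell script uploads the log on a schedule

2. Data cleaning: filter fields and reformat fields such as the time

3. Data analysis: use a single-level partition (date)

4. Data export: sqoop

5. Frameworks and tools used: shell scripts, hdfs, mapreduce, hive, sqoop, mysql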
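
Step 1 says the upload runs on a schedule, but the post does not show how the script is scheduled. A minimal sketch, assuming cron is the scheduler: the helper below only formats a crontab line. The name `cron_entry`, the script path, and the log path are hypothetical and do not come from the original setup.

```shell
# Sketch: build a crontab line that runs a script once a day.
# cron_entry is a hypothetical helper; the paths in the example are assumptions.
cron_entry() {
    local hour=$1 minute=$2 script=$3 logfile=$4
    # crontab field order: minute hour day-of-month month day-of-week command
    printf '%s %s * * * %s >> %s 2>&1\n' "$minute" "$hour" "$script" "$logfile"
}

# Example: run at 00:30 every day, after the previous day's log is complete.
cron_entry 0 30 /home/liuwl/opt/weblog_daily.sh /home/liuwl/opt/weblog_daily.log
```

The printed line can be pasted into the table opened by `crontab -e`. Running shortly after midnight matches the script's habit of processing "yesterday's" file.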

Expected result

  pv    register  ip    jumpprob    two_jumpprob

III. Implementation

1. Automatic upload to HDFS

     $HADOOP_HOME/bin/hdfs dfs -rm -r $HDFS_INPUT_PATH > /dev/null 2>&1

     $HADOOP_HOME/bin/hdfs dfs -mkdir -p $HDFS_INPUT_PATH/$yesterday > /dev/null 2>&1

     $HADOOP_HOME/bin/hdfs dfs -put $LOG_PATH  $HDFS_INPUT_PATH/$yesterday > /dev/null 2>&1
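
As written, the first command removes the whole input directory, so logs uploaded on earlier days are lost, and all three commands discard their error output. A minimal sketch of a gentler variant, assuming the goal is to upload a day's file only when it is not in HDFS yet. `upload_if_absent` and the `HDFS` variable are hypothetical names; `hdfs dfs -test -e` is the standard existence check.

```shell
# Sketch: upload one day's log only if it is not already in HDFS.
# upload_if_absent and the HDFS variable are hypothetical names; HDFS defaults to
# the hdfs binary on PATH and can be set to $HADOOP_HOME/bin/hdfs.
HDFS=${HDFS:-hdfs}

upload_if_absent() {
    local src=$1 dest_dir=$2
    local target
    target="$dest_dir/$(basename "$src")"
    # "hdfs dfs -test -e" exits 0 when the path exists
    if "$HDFS" dfs -test -e "$target"; then
        echo "exists"
        return 0
    fi
    "$HDFS" dfs -mkdir -p "$dest_dir" || return 1
    "$HDFS" dfs -put "$src" "$dest_dir" || return 1
    echo "upload ok"
}

# Usage (same variables as the commands above):
# upload_if_absent "$LOG_PATH" "$HDFS_INPUT_PATH/$yesterday"
```

Because the function returns non-zero when `-mkdir` or `-put` fails, the caller can stop the pipeline instead of continuing with missing input.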

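2. Data cleaning (a MapReduce job filters out dirty records and unneeded static-resource requests, strips the double quotes, and converts the date)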

     $HADOOP_HOME/bin/hdfs dfs -rm -r $HDFS_OUTPUT_PATH > /dev/null 2>&1

     $HADOOP_HOME/bin/yarn jar $JAR_PATH $ENTRANCE $HDFS_INPUT_PATH/$yesterday $HDFS_OUTPUT_PATH/date=$yesterday

3. Create the log database and the partitioned table in Hive, then add the cleaned files as a partition

     $HIVE_HOME/bin/hive -e "create database if not exists $HIVE_DATABASE" > /dev/null 2>&1

     $HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "create external table if not exists $HIVE_TABLE(

     ip string,day string,url string) partitioned by (date string)

     row format delimited fields terminated by '\t' location '$HDFS_OUTPUT_PATH' "

     $HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "alter table $HIVE_TABLE add partition (date='$yesterday')"
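
The plain `add partition` statement fails when the partition already exists, for example when the script is re-run for the same day. Hive also accepts `add if not exists partition`, which makes the step repeatable. A minimal sketch that builds such a statement; `add_partition_sql` is a hypothetical helper, and the check on the partition value assumes the `yyyy_mm_dd` form used in this post.

```shell
# Sketch: build a repeatable "add partition" statement for one day.
# add_partition_sql is a hypothetical helper, not part of the original script.
add_partition_sql() {
    local table=$1 day=$2
    # Accept only the yyyy_mm_dd form used for partition values in this post
    case "$day" in
        [0-9][0-9][0-9][0-9]_[0-9][0-9]_[0-9][0-9]) ;;
        *) echo "bad partition value: $day" >&2; return 1 ;;
    esac
    printf "alter table %s add if not exists partition (date='%s')\n" "$table" "$day"
}

# Usage:
# $HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "$(add_partition_sql $HIVE_TABLE $yesterday)"
```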

4. Analyze the data and export the result to MySQL with sqoop

     pv:

        create table if not exists pv_tb(pv string) row format delimited fields terminated by '\t';

        insert overwrite table pv_tb select count(1) from weblog_clean where date='2016_11_13';

     register:

        create table if not exists register_tb(register string) row format delimited fields terminated by '\t';

        insert overwrite table register_tb select count(1) from weblog_clean where date='2016_11_13' and instr(url,'member.php?mod=register') > 0;

     ip:

        create table if not exists ip_tb(ip string) row format delimited fields terminated by '\t';

        insert overwrite table ip_tb select count(distinct ip) from weblog_clean where date='2016_11_13';
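
The three counts above are easy to cross-check on a small extract of the cleaned data before running them in Hive. A minimal sketch, assuming tab-separated records of (ip, time, url) on standard input; `basic_metrics` is a hypothetical helper and the sample records are made up for illustration.

```shell
# Sketch: PV, registration hits, and distinct IPs from cleaned records on stdin.
# basic_metrics is a hypothetical helper for spot checks, not part of the pipeline.
basic_metrics() {
    awk -F '\t' '
        {
            pv++                                         # every cleaned record is one page view
            if (!($1 in seen)) { seen[$1] = 1; ips++ }   # count each IP once
            if (index($3, "member.php?mod=register") > 0) reg++
        }
        END { printf "%d\t%d\t%d\n", pv, reg, ips }
    '
}

# Hypothetical sample: 4 records, 3 distinct IPs, 1 registration request.
printf '10.0.0.1\t20161113080000\tforum.php\n10.0.0.1\t20161113080105\tmember.php?mod=register\n10.0.0.2\t20161113090000\tforum.php\n10.0.0.3\t20161113100000\thome.php\n' | basic_metrics
```

For the sample this prints 4, 1 and 3 separated by tabs, in the order pv, register, ip.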

     jumpprob:

        create table if not exists jumpprob_tb(jump double) row format delimited fields terminated by '\t';

        insert overwrite table jumpprob_tb

        select ghip.singleip/aip.ips from (select count(1) singleip from(select count(ip) ips from weblog_clean where date='2016_11_13' group by ip having ips <2) gip) ghip,

        (select count(ip) ips from weblog_clean where date='2016_11_13') aip;

     two_jumpprob:

        create table if not exists two_jumpprob_tb(jump double) row format delimited fields terminated by '\t';

        insert overwrite table two_jumpprob_tb

        select ghip.singleip/aip.ips from (select count(1) singleip from(select count(ip) ips from weblog_clean where date='2016_11_13' group by ip having ips >=2) gip) ghip,

        (select count(ip) ips from weblog_clean where date='2016_11_13') aip;
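
As written, both queries divide by `count(ip)` over all of the day's records, which is the PV value, not the number of distinct IPs, so the two rates need not add up to 1. A minimal sketch of the same arithmetic on a small extract, assuming tab-separated (ip, time, url) records on standard input; `jump_rates` is a hypothetical helper and the sample records are made up.

```shell
# Sketch: bounce rate and second-jump rate as the queries above compute them.
# jump_rates is a hypothetical helper; both rates are divided by the record count.
jump_rates() {
    awk -F '\t' '
        { total++; hits[$1]++ }          # records per IP
        END {
            if (total == 0) { print "0.00\t0.00"; exit }
            for (ip in hits) {
                if (hits[ip] < 2) single++   # IPs with exactly one record
                else multi++                 # IPs with two or more records
            }
            printf "%.2f\t%.2f\n", single / total, multi / total
        }
    '
}

# Hypothetical sample: 10.0.0.1 has two records, the other two IPs have one each.
printf '10.0.0.1\t20161113080000\tforum.php\n10.0.0.1\t20161113080105\thome.php\n10.0.0.2\t20161113090000\tforum.php\n10.0.0.3\t20161113100000\thome.php\n' | jump_rates
```

For the sample this prints 0.50 and 0.25: two single-visit IPs and one repeat IP, each divided by four records.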

     merge table # Note: the separate per-metric tables above run more efficiently than the merged query below, but they use more storage

        create table if not exists log_result(pv string,register string,ip string,jumpprob double,two_jumpprob double ) row format delimited fields terminated by '\t';

        insert overwrite table log_result

        select log_pv.pv,log_register.register,log_ip.ip,log_jumpprob.jumpprob,log_two_jumpprob.two_jumpprob from (select count(1) pv from weblog_clean where date='2016_11_13') log_pv,

        (select count(1) register from weblog_clean where date='2016_11_13' and instr(url,'member.php?mod=register') > 0) log_register,

        (select count(distinct ip) ip from weblog_clean where date='2016_11_13') log_ip,

        (select ghip.singleip/aip.ips jumpprob from (select count(1) singleip from(select count(ip) ips from weblog_clean where date='2016_11_13' group by ip having ips <2) gip) ghip,

        (select count(ip) ips from weblog_clean where date='2016_11_13') aip) log_jumpprob,

        (select ghip.singleip/aip.ips two_jumpprob from (select count(1) singleip from(select count(ip) ips from weblog_clean where date='2016_11_13' group by ip having ips >=2) gip) ghip,

        (select count(ip) ips from weblog_clean where date='2016_11_13') aip) log_two_jumpprob;

IV. Results

mysql> select * from weblog_result;

       +--------+----------+-------+----------+--------------+

       | pv     | register | ip    | jumpprob | two_jumpprob |

       +--------+----------+-------+----------+--------------+

       | 169857 | 28       | 10411 |     0.02 |         0.04 |

       +--------+----------+-------+----------+--------------+

       1 row in set (0.00 sec)

V. logclean.jar (cleans the log fields: converts the date, removes the double quotes, strips the root of the URL)

package org.apache.hadoop.log.project;

import java.net.URI;

import java.text.ParseException;

import java.text.SimpleDateFormat;

import java.util.Date;

import java.util.Locale;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.conf.Configured;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.NullWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapreduce.Job;

import org.apache.hadoop.mapreduce.Mapper;

import org.apache.hadoop.mapreduce.Reducer;

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.util.Tool;

import org.apache.hadoop.util.ToolRunner;

public class LogClean extends Configured implements Tool {

    public static void main(String[] args) {

        Configuration conf = new Configuration();

        try {

            int res = ToolRunner.run(conf, new LogClean(), args);

            System.exit(res);

        } catch (Exception e) {

            e.printStackTrace();

        }

    }

    public int run(String[] args) throws Exception {

        // Reuse the configuration that ToolRunner populated (including -D options)
        Configuration conf = getConf();

        Job job = Job.getInstance(conf, "logclean");

        // Needed so the job can run from a packaged jar

        job.setJarByClass(LogClean.class);

        FileInputFormat.setInputPaths(job, args[0]);

        job.setMapperClass(MyMapper.class);

        job.setMapOutputKeyClass(LongWritable.class);

        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(MyReducer.class);

        job.setOutputKeyClass(Text.class);

        job.setOutputValueClass(NullWritable.class);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Remove the output directory if it already exists

        FileSystem fs = FileSystem.get(new URI(args[0]), getConf());

        Path outPath = new Path(args[1]);

        if (fs.exists(outPath)) {

            fs.delete(outPath, true);

        }

        boolean success = job.waitForCompletion(true);

        if(success){

            System.out.println("Clean process success!");

        }

        else{

            System.out.println("Clean process failed!");

        }

        return success ? 0 : 1;

    }

    static class MyMapper extends

            Mapper<LongWritable, Text, LongWritable, Text> {

        LogParser logParser = new LogParser();

        Text outputValue = new Text();

        protected void map(

                LongWritable key,

                Text value,

                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text>.Context context)

                throws java.io.IOException, InterruptedException {

            final String[] parsed = logParser.parse(value.toString());

            // Step 1: drop requests for static resources

            if (parsed[2].startsWith("GET /static/")

                    || parsed[2].startsWith("GET /uc_server")) {

                return;

            }

            // Step 2: strip the leading HTTP method prefix

            if (parsed[2].startsWith("GET /")) {

                parsed[2] = parsed[2].substring("GET /".length());

            } else if (parsed[2].startsWith("POST /")) {

                parsed[2] = parsed[2].substring("POST /".length());

            }

            // Step 3: strip the trailing protocol suffix

            if (parsed[2].endsWith(" HTTP/1.1")) {

                parsed[2] = parsed[2].substring(0, parsed[2].length()

                        - " HTTP/1.1".length());

            }

            // Step 4: write only the first three fields (ip, time, url)

            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);

            context.write(key, outputValue);

        }

    }

    static class MyReducer extends

            Reducer<LongWritable, Text, Text, NullWritable> {

        protected void reduce(

                LongWritable k2,

                java.lang.Iterable<Text> v2s,

                org.apache.hadoop.mapreduce.Reducer<LongWritable, Text, Text, NullWritable>.Context context)

                throws java.io.IOException, InterruptedException {

            for (Text v2 : v2s) {

                context.write(v2, NullWritable.get());

            }

        };

    }

    /*

     * Log parsing class

     */

    static class LogParser {

        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(

                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);

        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(

                "yyyyMMddHHmmss");

        public static void main(String[] args) throws ParseException {

            final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";

            LogParser parser = new LogParser();

            final String[] array = parser.parse(S1);

            System.out.println("Sample data: " + S1);

            System.out.format(

                    "Parsed result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",

                    array[0], array[1], array[2], array[3], array[4]);

        }

        /**

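         * Parse an English-locale timestamp string.

         *

         * @param string timestamp such as 30/May/2013:17:38:20

         * @return the parsed Date, or null if the string cannot be parsed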

         */

        private Date parseDateFormat(String string) {

            Date parse = null;

            try {

                parse = FORMAT.parse(string);

            } catch (ParseException e) {

                e.printStackTrace();

            }

            return parse;

        }

        /**

         * Parse one line of the access log.

         *

         * @param line a raw log line

         * @return an array of 5 elements: ip, time, url, status, traffic

         */

        public String[] parse(String line) {

            String ip = parseIP(line);

            String time = parseTime(line);

            String url = parseURL(line);

            String status = parseStatus(line);

            String traffic = parseTraffic(line);

            return new String[] { ip, time, url, status, traffic };

        }

        private String parseTraffic(String line) {

            final String trim = line.substring(line.lastIndexOf("\"") + 1)

                    .trim();

            String traffic = trim.split(" ")[1];

            return traffic;

        }

        private String parseStatus(String line) {

            final String trim = line.substring(line.lastIndexOf("\"") + 1)

                    .trim();

            String status = trim.split(" ")[0];

            return status;

        }

        private String parseURL(String line) {

            final int first = line.indexOf("\"");

            final int last = line.lastIndexOf("\"");

            String url = line.substring(first + 1, last);

            return url;

        }

        private String parseTime(String line) {

            final int first = line.indexOf("[");

            final int last = line.indexOf("+0800]");

            String time = line.substring(first + 1, last).trim();

            Date date = parseDateFormat(time);

            return dateformat1.format(date);

        }

        private String parseIP(String line) {

            String ip = line.split("- -")[0].trim();

            return ip;

        }

    }

}
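
Before building and submitting the jar, the mapper's cleaning rules can be tried on a few raw lines locally. A minimal sketch in awk that mirrors the rules above: drop static requests, strip the method prefix and the ` HTTP/1.1` suffix, and rewrite the timestamp as `yyyyMMddHHmmss`. `clean_lines` is a hypothetical helper. Unlike the Java code it takes the timestamp up to the first space rather than up to `+0800]`, and it silently skips lines it cannot parse; both are choices made for this sketch.

```shell
# Sketch: the mapper's cleaning rules applied to raw access-log lines on stdin.
# clean_lines is a hypothetical helper for spot checks, not a replacement for the job.
clean_lines() {
    awk '
        BEGIN {
            OFS = "\t"
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
            for (i = 1; i <= 12; i++) month[names[i]] = sprintf("%02d", i)
        }
        {
            # Request = text between the first and the last double quote
            first = index($0, "\"")
            if (first == 0) next
            req = substr($0, first + 1)
            if (!sub(/"[^"]*$/, "", req)) next

            # Step 1: drop static-resource requests
            if (req ~ /^GET \/static\// || req ~ /^GET \/uc_server/) next
            # Steps 2 and 3: strip the method prefix and the protocol suffix
            sub(/^(GET|POST) \//, "", req)
            sub(/ HTTP\/1\.1$/, "", req)

            # Timestamp = text after "[" up to the first space, e.g. 30/May/2013:17:38:20
            lb = index($0, "[")
            if (lb == 0) next
            ts = substr($0, lb + 1)
            sub(/ .*/, "", ts)
            if (split(ts, t, "[/:]") != 6 || !(t[2] in month)) next

            stamp = t[3] month[t[2]] sprintf("%02d", t[1]) t[4] t[5] t[6]
            print $1, stamp, req
        }
    '
}

# Usage:
# head -n 100 "$LOG_PATH" | clean_lines
```

The sample line from `LogParser.main` requests `/static/image/common/faq.gif`, so it produces no output here, just as the mapper would drop it.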

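VI. Complete shell script. Prepare logclean.jar (the MapReduce log-cleaning program) and "yesterday's" log file beforehand, and check that the file locations match.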

#!/bin/bash

echo -ne | cat <<eot

#############################################################################

##########################  MAY ALL BE DELIVERED  ###########################

                                  _oo0oo_

                                 088888880

                                 88" . "88

                                 (| -_- |)

                                  0\ = /0

                               ___/'---'\___

                             .' \\\\|     |// '.

                            / \\\\|||  :  |||// \\

                           /_ ||||| -:- |||||- \\

                          |   | \\\\\\  -  /// |   |

                          | \_|  ''\---/''  |_/ |

                          \  .-\__  '-'  __/-.  /

                        ___'. .'  /--.--\  '. .'___

                     ."" '<  '.___\_<|>_/___.' >'  "".

                    | | : '-  \'.;'\ _ /';.'/ - ' : | |

                    \  \ '_.   \_ __\ /__ _/   .-' /  /

                ====='-.____'.___ \_____/___.-'____.-'=====

                                  '=---='                                    

              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

                     Buddha bless    iii    never any bugs

eot

##get yesterday date

yesterday=`date -d '-1 day' +'%Y_%m_%d'`

echo $yesterday

############

## define ##

############

HADOOP_HOME=/opt/cdh-5.6.3/hadoop-2.5.0-cdh5.3.6

HIVE_HOME=/opt/cdh-5.6.3/hive-0.13.1-cdh5.3.6

SQOOP_HOME=/opt/cdh-5.6.3/sqoop-1.4.5-cdh5.2.6

HIVE_DATABASE=weblog

HIVE_TABLE=weblog_clean

HIVE_RSTABLE=weblog_result

MYSQL_USERNAME=root

MYSQL_PASSWORD=root

EXPORT_DIR=/user/hive/warehouse/weblog.db/weblog_result

NUM_MAPPERS=1

#########################

##  get logfile path   ##

#########################

LOG_PATH=/home/liuwl/opt/datas/weblog/access_$yesterday.log

JAR_PATH=/home/liuwl/opt/datas/logclean.jar

ENTRANCE=org.apache.hadoop.log.project.LogClean

HDFS_INPUT_PATH=/weblog/source

HDFS_OUTPUT_PATH=/weblog/clean

SQOOP_JDBC=jdbc:mysql://hadoop09-linux-01.ibeifeng.com:3306/$HIVE_DATABASE

############################

## upload logfile to hdfs ##

############################

echo "start to upload logfile"

#$HADOOP_HOME/bin/hdfs dfs -rm -r $HDFS_INPUT_PATH > /dev/null 2>&1

HSFiles=`$HADOOP_HOME/bin/hdfs dfs -ls $HDFS_INPUT_PATH/$yesterday`

if [ -z "$HSFiles" ]; then

$HADOOP_HOME/bin/hdfs dfs -mkdir -p $HDFS_INPUT_PATH/$yesterday > /dev/null 2>&1

$HADOOP_HOME/bin/hdfs dfs -put $LOG_PATH  $HDFS_INPUT_PATH/$yesterday > /dev/null 2>&1

echo "upload ok"

else

echo "exists"

fi

###########################

## clean the source file ##

###########################

echo "start to clean logfile"

HCFiles=`$HADOOP_HOME/bin/hdfs dfs -ls $HDFS_OUTPUT_PATH/date=$yesterday 2> /dev/null`

if [ -z "$HCFiles" ]; then

$HADOOP_HOME/bin/yarn jar $JAR_PATH $ENTRANCE $HDFS_INPUT_PATH/$yesterday $HDFS_OUTPUT_PATH/date=$yesterday

echo "clean ok"

fi

###########################

## create the hive table ##

###########################

echo "start to create the hive table"

$HIVE_HOME/bin/hive -e "create database if not exists $HIVE_DATABASE" > /dev/null 2>&1

$HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "create external table if not exists $HIVE_TABLE(ip string,day string,url string) partitioned by (date string) row format delimited fields terminated by '\t' location '$HDFS_OUTPUT_PATH' "

echo "add partition to hive table"

$HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "alter table $HIVE_TABLE add partition (date='$yesterday')"

##################################

## create the hive result table ##

##################################

echo "start to create the hive result table"

$HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "create table if not exists $HIVE_RSTABLE(pv string,register string,ip string,jumpprob double,two_jumpprob double ) row format delimited fields terminated by '\t';"

#################

## insert data ##

#################

echo "start to insert data"

# insert overwrite replaces the previous result, so the query runs on every execution

$HIVE_HOME/bin/hive --database $HIVE_DATABASE -e "insert overwrite table $HIVE_RSTABLE select log_pv.pv,log_register.register,log_ip.ip,log_jumpprob.jumpprob,log_two_jumpprob.two_jumpprob from (select count(1) pv from $HIVE_TABLE where date='$yesterday') log_pv,(select count(1) register from $HIVE_TABLE where date='$yesterday' and instr(url,'member.php?mod=register') > 0) log_register,(select count(distinct ip) ip from $HIVE_TABLE where date='$yesterday') log_ip,(select ghip.singleip/aip.ips jumpprob from (select count(1) singleip from(select count(ip) ips from $HIVE_TABLE where date='$yesterday' group by ip having ips <2) gip) ghip,(select count(ip) ips from $HIVE_TABLE where date='$yesterday') aip) log_jumpprob,(select ghip.singleip/aip.ips two_jumpprob from (select count(1) singleip from(select count(ip) ips from $HIVE_TABLE where date='$yesterday' group by ip having ips >=2) gip) ghip,(select count(ip) ips from $HIVE_TABLE where date='$yesterday') aip) log_two_jumpprob"

###################################

## create the mysql result table ##

###################################

mysql -u$MYSQL_USERNAME -p$MYSQL_PASSWORD -e "

create database if not exists $HIVE_DATABASE DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

use $HIVE_DATABASE;

create table if not exists $HIVE_RSTABLE(pv varchar(20) not null,register varchar(20) not null,ip varchar(20) not null,jumpprob double(6,2) not null,two_jumpprob double(6,2) not null) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

truncate table $HIVE_RSTABLE;

quit"

#######################################

## export hive result table to mysql ##

#######################################

echo "start to export hive result table to mysql"

$SQOOP_HOME/bin/sqoop export --connect $SQOOP_JDBC --username $MYSQL_USERNAME --password $MYSQL_PASSWORD --table $HIVE_RSTABLE --export-dir $EXPORT_DIR --num-mappers $NUM_MAPPERS --input-fields-terminated-by '\t'

echo "shell finished"
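
The script always processes "yesterday" and prints its "ok" messages whether or not a command succeeded. A minimal sketch of two small helpers that could address both points, assuming GNU date: `target_day` accepts an optional run date so that a missed day can be reprocessed by hand, and `run_step` reports each step's exit status. Both names are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Sketch: date handling and step checking for the daily script.
# target_day and run_step are hypothetical helpers; GNU date is assumed.

# Print the partition value (yyyy_mm_dd) for the day before the given run date,
# or for yesterday when no date is given.
target_day() {
    if [ -n "${1:-}" ]; then
        date -d "$1 -1 day" +'%Y_%m_%d'
    else
        date -d '-1 day' +'%Y_%m_%d'
    fi
}

# Run one pipeline step and report its exit status instead of discarding it.
run_step() {
    local name=$1
    shift
    echo "start: $name"
    if "$@"; then
        echo "ok: $name"
    else
        local rc=$?
        echo "failed: $name (exit $rc)" >&2
        return "$rc"
    fi
}

# Usage inside the script:
# yesterday=$(target_day "$1")
# run_step "upload" $HADOOP_HOME/bin/hdfs dfs -put $LOG_PATH $HDFS_INPUT_PATH/$yesterday || exit 1
```

Calling the script with a run date such as 2016-11-14 would then process the partition 2016_11_13, and a failed step stops the run before later steps work on missing data.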
