五、基于hadoop的nginx访问日志分析--userAgent和spider
useragent:
代码(不包含蜘蛛):
# cat top_10_useragent.py
#!/usr/bin/env python
# coding=utf-8 from mrjob.job import MRJob
from mrjob.step import MRStep
from nginx_accesslog_parser import NginxLineParser import heapq class UserAgent(MRJob): nginx_line_parser = NginxLineParser() def mapper(self, _, line): self.nginx_line_parser.parse(line)
field_item = self.nginx_line_parser.http_user_agent
if field_item is not None:
yield field_item, 1 def reducer_sum(self, key, values): yield None, (sum(values), key) def reducer_top100(self, _, values):
for count, path in heapq.nlargest(10, values):
yield count, path
# for count, path in sorted(values, reverse=True)[:10]:
# yield count, path def steps(self):
return (
MRStep(mapper=self.mapper,
reducer=self.reducer_sum
),
MRStep(reducer=self.reducer_top100)
) def main():
UserAgent.run() if __name__ == '__main__':
main()
结果:
# python3 top_10_useragent.py access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/top_10_useragent.root.20161228.090725.308144
Running step 1 of 2...
Running step 2 of 2...
Streaming final output from /tmp/top_10_useragent.root.20161228.090725.308144/output...
85262 "IE"
79611 "Chrome"
48560 "Other"
10662 "Firefox"
7927 "Mobile Safari UI/WKWebView"
7182 "Sogou Explorer"
6681 "QQ Browser"
1988 "Mobile Safari"
1781 "Maxthon"
1404 "Edge"
Removing temp directory /tmp/top_10_useragent.root.20161228.090725.308144...
蜘蛛:
#!/usr/bin/env python
# coding=utf-8 from mrjob.job import MRJob
from mrjob.step import MRStep
from nginx_accesslog_parser import NginxLineParser import heapq class Spider(MRJob): nginx_line_parser = NginxLineParser() def mapper(self, _, line): self.nginx_line_parser.parse(line)
field_item = self.nginx_line_parser.user_agent_type
if field_item is not None:
yield field_item, 1 def reducer_sum(self, key, values): yield None, (sum(values), key) def reducer_top100(self, _, values):
for count, path in heapq.nlargest(10, values):
yield count, path
# for count, path in sorted(values, reverse=True)[:10]:
# yield count, path def steps(self):
return (
MRStep(mapper=self.mapper,
reducer=self.reducer_sum
),
MRStep(reducer=self.reducer_top100)
) def main():
Spider.run() if __name__ == '__main__':
main()
执行结果:
# python3 top_10_spider.py access_all.log-20161227
No configs found; falling back on auto-configuration
Creating temp directory /tmp/top_10_spider.root.20161228.091326.295972
Running step 1 of 2...
Running step 2 of 2...
Streaming final output from /tmp/top_10_spider.root.20161228.091326.295972/output...
33542 "magpie-crawler"
25880 "Other"
16578 "Sogou web spider"
6383 "bingbot"
3688 "Baiduspider"
1487 "Yahoo! Slurp"
1096 "JikeSpider"
731 "YisouSpider"
648 "Baiduspider-image"
470 "Googlebot"
Removing temp directory /tmp/top_10_spider.root.20161228.091326.295972...
五、基于hadoop的nginx访问日志分析--userAgent和spider的更多相关文章
- 一、基于hadoop的nginx访问日志分析---解析日志篇
前一阵子,搭建了ELK日志分析平台,用着挺爽的,再也不用给开发拉各种日志,节省了很多时间. 这篇博文是介绍用python代码实现日志分析的,用MRJob实现hadoop上的mapreduce,可以直接 ...
- 四、基于hadoop的nginx访问日志分析---top 10 request
代码: # cat top_10_request.py #!/usr/bin/env python # coding=utf-8 from mrjob.job import MRJob from mr ...
- 二、基于hadoop的nginx访问日志分析---计算日pv
代码: # pv_day.py#!/usr/bin/env python # coding=utf-8 from mrjob.job import MRJob from nginx_accesslog ...
- 三、基于hadoop的nginx访问日志分析--计算时刻pv
代码: # cat pv_hour.py #!/usr/bin/env python # coding=utf-8 from mrjob.job import MRJob from nginx_acc ...
- nginx访问日志分析,筛选时间大于1秒的请求
处理nginx访问日志,筛选时间大于1秒的请求 #!/usr/bin/env python ''' 处理访问日志,筛选时间大于1秒的请求 ''' with open('test.log','a+' ...
- Nginx 访问日志分析
0:Nginx日志格式配置 # vim nginx.conf ## # Logging Settings ## log_format access '$remote_addr - $remote_us ...
- Nginx访问日志分析
nginx默认的日志格式 log_format main '$remote_addr - $remote_user [$time_local] "$request" ' '$sta ...
- 13 Nginx访问日志分析
#!/bin/bash export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin # Nginx 日志格式: # ...
- 采集并分析Nginx访问日志
日志服务支持通过数据接入向导配置采集Nginx日志,并自动创建索引和Nginx日志仪表盘,帮助您快速采集并分析Nginx日志. 许多个人站长选取了Nginx作为服务器搭建网站,在对网站访问情况进行分析 ...
随机推荐
- Android 手机卫士--导航界面1的布局编写
本文地址:http://www.cnblogs.com/wuyudong/p/5943005.html,转载请注明出处. 本文实现导航界面1的布局的实现,效果如下图所示: 首先分析所使用的布局样式: ...
- 慎用mutableCopy
因为逻辑需要,我在present到一个页面时,将一个存放uiimage的数组mutablecopy了过去(因为再返回的时候防止对数组做了改动),时间长了也忘了这事儿,后来发现添加多张图片上传时,app ...
- Would Your Work Habits Change if You Were Paid by the Job?
原文地址:http://success-sys.com/2016/09/26/would-your-work-habits-change-if-you-were-paid-by-the-job/ A ...
- WPF 自定义IconButton
自定义一个按钮控件 按钮控件很简单,我们在项目中有时把样式封装起来,添加依赖属性,也是为了统一. 这里举例,单纯的图标控件怎么设置 1.UserControl界面样式 <UserControl ...
- 使用jOrgChart插件实现组织架构图的展示
项目要做组织架构图,要把它做成自上而下的树形结构. 一.说明 (1)通过后台查询数据库,生成树形数组结构,返回到前台. (2)需要引入的js插件和css文件: ①jquery.jOrgChart.cs ...
- 使用NuGet发布自己的类库包(Library Package)
STEP 1:注册并获取API Key 首先,你需要到NuGet上注册一个新的账号,然后在My Account页面,获取一个API Key,这个过程很简单,我就不作说明了. STEP 2:下载NuGe ...
- virtualBox安装Centos7之后
之前用vmware装虚拟机的时候,直接配置好网卡就可以ping通,可以用ssh登录,然后配置yum源,万事大吉. 但是virtualBox配置却有不同,需要按下面的方法配置: 选中虚拟机->设置 ...
- [LeetCode] Number of Segments in a String 字符串中的分段数量
Count the number of segments in a string, where a segment is defined to be a contiguous sequence of ...
- [LeetCode] 3Sum 三数之和
Given an array S of n integers, are there elements a, b, c in S such that a + b + c = 0? Find all un ...
- web兼容学习分析笔记-margin 和padding浏览器解析差异
二.margin 和padding浏览器解析差异 只有默认margin的元素 <body>margin:8px margin:15px 10px 15px 10px(IE7) <b ...