CS100.1x-lab2_apache_log_student
This assignment uses PySpark to analyze a web server log and is split into four parts. The corresponding ipynb file can be found on my github.
Part 1 Apache Web Server Log file format
This part is about understanding the log file format and then processing it. The log we work with follows the Common Log Format (CLF) standard; a single record looks like this:
127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839
The data for this assignment comes from the NASA Kennedy Space Center WWW server. The complete dataset is freely available at this address (http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html).
Parsing Each Log Line
Our first step, naturally, is to parse the data and pull the useful fields out of the raw records. The assignment provides this code for us.
import re
import datetime
from pyspark.sql import Row
month_map = {'Jan': 1, 'Feb': 2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6, 'Jul':7,
'Aug':8, 'Sep': 9, 'Oct':10, 'Nov': 11, 'Dec': 12}
def parse_apache_time(s):
    """ Convert Apache time format into a Python datetime object
    Args:
        s (str): date and time in Apache time format
    Returns:
        datetime: datetime object (ignore timezone for now)
    """
    return datetime.datetime(int(s[7:11]),
                             month_map[s[3:6]],
                             int(s[0:2]),
                             int(s[12:14]),
                             int(s[15:17]),
                             int(s[18:20]))

def parseApacheLogLine(logline):
    """ Parse a line in the Apache Common Log format
    Args:
        logline (str): a line of text in the Apache Common Log format
    Returns:
        tuple: either a dictionary containing the parts of the Apache Access Log and 1,
               or the original invalid log line and 0
    """
    match = re.search(APACHE_ACCESS_LOG_PATTERN, logline)
    if match is None:
        return (logline, 0)
    size_field = match.group(9)
    if size_field == '-':
        size = long(0)
    else:
        size = long(match.group(9))
    return (Row(
        host          = match.group(1),
        client_identd = match.group(2),
        user_id       = match.group(3),
        date_time     = parse_apache_time(match.group(4)),
        method        = match.group(5),
        endpoint      = match.group(6),
        protocol      = match.group(7),
        response_code = int(match.group(8)),
        content_size  = size
    ), 1)
# A regular expression pattern to extract fields from the log line
APACHE_ACCESS_LOG_PATTERN = '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'
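As a quick sanity check (my own addition, not part of the assignment), you can run the parser against the sample CLF line shown earlier; only names already defined above are used here:
# Quick check (my addition): parse the sample CLF line from the start of this post
sample = '127.0.0.1 - - [01/Aug/1995:00:00:01 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1839'
parsed, ok = parseApacheLogLine(sample)
print ok                    # 1 means the line matched the pattern
print parsed.endpoint       # /images/launch-logo.gif
print parsed.response_code  # 200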
Configuration and Initial RDD Creation
import sys
import os
from test_helper import Test
baseDir = os.path.join('data')
inputPath = os.path.join('cs100', 'lab2', 'apache.access.log.PROJECT')
logFile = os.path.join(baseDir, inputPath)
def parseLogs():
    """ Read and parse log file """
    parsed_logs = (sc
                   .textFile(logFile)
                   .map(parseApacheLogLine)
                   .cache())
    access_logs = (parsed_logs
                   .filter(lambda s: s[1] == 1)
                   .map(lambda s: s[0])
                   .cache())
    failed_logs = (parsed_logs
                   .filter(lambda s: s[1] == 0)
                   .map(lambda s: s[0]))
    failed_logs_count = failed_logs.count()
    if failed_logs_count > 0:
        print 'Number of invalid logline: %d' % failed_logs.count()
        for line in failed_logs.take(20):
            print 'Invalid logline: %s' % line
    print 'Read %d lines, successfully parsed %d lines, failed to parse %d lines' % (parsed_logs.count(), access_logs.count(), failed_logs.count())
    return parsed_logs, access_logs, failed_logs
parsed_logs, access_logs, failed_logs = parseLogs()
This code turns the log file into an RDD, parses each line with the function from the previous section, and caches the results, since we will need them again later. From the output below we can see that a number of records fail to parse.
Number of invalid logline: 108
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:43:39 -0400] "GET / HTTP/1.0 " 200 7131
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:43:57 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0 " 200 5866
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:44:07 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0 " 200 786
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:44:11 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0 " 200 363
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:44:13 -0400] "GET /images/USA-logosmall.gif HTTP/1.0 " 200 234
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:44:15 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0 " 200 669
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:44:31 -0400] "GET /shuttle/countdown/ HTTP/1.0 " 200 4673
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:44:41 -0400] "GET /shuttle/missions/sts-69/count69.gif HTTP/1.0 " 200 46053
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:45:34 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0 " 200 1204
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:45:46 -0400] "GET /cgi-bin/imagemap/countdown69?293,287 HTTP/1.0 " 302 85
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:45:48 -0400] "GET /htbin/cdt_main.pl HTTP/1.0 " 200 3714
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:45:52 -0400] "GET /shuttle/countdown/images/countclock.gif HTTP/1.0 " 200 13994
Invalid logline: ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:46:22 -0400] "GET / HTTP/1.0 " 200 7131
Invalid logline: ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:46:29 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0 " 200 5866
Invalid logline: ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:46:35 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0 " 200 786
Invalid logline: ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:46:37 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0 " 200 363
Invalid logline: ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:46:38 -0400] "GET /images/USA-logosmall.gif HTTP/1.0 " 200 234
Invalid logline: ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:46:40 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0 " 200 669
Invalid logline: ix-li1-14.ix.netcom.com - - [08/Aug/1995:14:47:41 -0400] "GET /shuttle/missions/sts-70/mission-sts-70.html HTTP/1.0 " 200 20304
Invalid logline: ix-sac6-20.ix.netcom.com - - [08/Aug/1995:14:47:48 -0400] "GET /shuttle/countdown/count.html HTTP/1.0 " 200 73231
Read 1043177 lines, successfully parsed 1043069 lines, failed to parse 108 lines
Data Cleaning
We found that 108 lines failed to parse, so we need to rewrite the regular expression so that every record parses successfully.
# TODO: Replace <FILL IN> with appropriate code
# This was originally '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*)" (\d{3}) (\S+)'
APACHE_ACCESS_LOG_PATTERN = '^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+)\s*(\S*).*" (\d{3}) (\S+)'
parsed_logs, access_logs, failed_logs = parseLogs()
This pattern parses every line of the log file successfully.
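A simple way to confirm this (a small check I added, not required by the notebook) is to verify that the failed-line RDD is now empty:
# Sanity check (my addition): no lines should fail to parse with the new pattern
assert failed_logs.count() == 0, 'some log lines still fail to parse'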
Part 2 Sample Analyses on the Web Server Log File
Example: Content Size Statistics
These are some common summary statistics; two methods we have not used before, min() and max(), show up here.
# Calculate statistics based on the content size.
content_sizes = access_logs.map(lambda log: log.content_size).cache()
print 'Content Size Avg: %i, Min: %i, Max: %s' % (
content_sizes.reduce(lambda a, b : a + b) / content_sizes.count(),
content_sizes.min(),
content_sizes.max())
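As a side note (my own sketch, not part of the lab), numeric RDDs also offer a stats() action that returns count, mean, stdev, max and min in a single pass, so you do not have to write the reduce/count by hand; the mean it reports is a float rather than the integer division above:
# Alternative sketch (assumption: same content_sizes RDD as above)
stats = content_sizes.stats()
print 'Count: %d, Mean: %.2f, Min: %d, Max: %d' % (stats.count(), stats.mean(), stats.min(), stats.max())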
Example: Response Code Analysis
Here we analyze the response codes: how many distinct codes there are and how many records each one accounts for.
# Response Code to Count
responseCodeToCount = (access_logs
.map(lambda log: (log.response_code, 1))
.reduceByKey(lambda a, b : a + b)
.cache())
responseCodeToCountList = responseCodeToCount.take(100)
print 'Found %d response codes' % len(responseCodeToCountList)
print 'Response Code Counts: %s' % responseCodeToCountList
assert len(responseCodeToCountList) == 7
assert sorted(responseCodeToCountList) == [(200, 940847), (302, 16244), (304, 79824), (403, 58), (404, 6185), (500, 2), (501, 17)]
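For a result this small, an equivalent one-liner (my sketch, not the notebook's approach) is countByValue(), which collects the counts to the driver as a Python dict instead of producing an RDD:
# Alternative sketch: countByValue() returns a dict of {response_code: count}
responseCodeCounts = access_logs.map(lambda log: log.response_code).countByValue()
print responseCodeCounts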
Example: Response Code Graphing with matplotlib
Building on the result above, we compute the percentage of each code and draw a pie chart.
labels = responseCodeToCount.map(lambda (x, y): x).collect()
print labels
count = access_logs.count()
fracs = responseCodeToCount.map(lambda (x, y): (float(y) / count)).collect()
print fracs
import matplotlib.pyplot as plt
def pie_pct_format(value):
    """ Determine the appropriate format string for the pie chart percentage label
    Args:
        value: value of the pie slice
    Returns:
        str: formatted string label; if the slice is too small to fit, returns an empty string for label
    """
    return '' if value < 7 else '%.0f%%' % value

fig = plt.figure(figsize=(4.5, 4.5), facecolor='white', edgecolor='white')
colors = ['yellowgreen', 'lightskyblue', 'gold', 'purple', 'lightcoral', 'yellow', 'black']
explode = (0.05, 0.05, 0.1, 0, 0, 0, 0)
patches, texts, autotexts = plt.pie(fracs, labels=labels, colors=colors,
                                    explode=explode, autopct=pie_pct_format,
                                    shadow=False, startangle=125)
for text, autotext in zip(texts, autotexts):
    if autotext.get_text() == '':
        text.set_text('')  # If the slice is too small to fit, don't show a text label
plt.legend(labels, loc=(0.80, -0.1), shadow=True)
pass
Example: Frequent Hosts
Here we study hosts: count how many times each host appears and pick out hosts that accessed the server more than ten times (essentially the same kind of work as before).
# Any hosts that has accessed the server more than 10 times.
hostCountPairTuple = access_logs.map(lambda log: (log.host, 1))
hostSum = hostCountPairTuple.reduceByKey(lambda a, b : a + b)
hostMoreThan10 = hostSum.filter(lambda s: s[1] > 10)
hostsPick20 = (hostMoreThan10
.map(lambda s: s[0])
.take(20))
print 'Any 20 hosts that have accessed more than 10 times: %s' % hostsPick20
# An example: [u'204.120.34.185', u'204.243.249.9', u'slip1-32.acs.ohio-state.edu', u'lapdog-14.baylor.edu', u'199.77.67.3', u'gs1.cs.ttu.edu', u'haskell.limbex.com', u'alfred.uib.no', u'146.129.66.31', u'manaus.bologna.maraut.it', u'dialup98-110.swipnet.se', u'slip-ppp02.feldspar.com', u'ad03-053.compuserve.com', u'srawlin.opsys.nwa.com', u'199.202.200.52', u'ix-den7-23.ix.netcom.com', u'151.99.247.114', u'w20-575-104.mit.edu', u'205.25.227.20', u'ns.rmc.com']
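If you just want to know how many hosts crossed the threshold (a quick check I added, not part of the lab), count the filtered RDD directly:
# My addition: total number of hosts with more than 10 accesses
print 'Hosts with more than 10 accesses: %d' % hostMoreThan10.count()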
Example: Visualizing Endpoints
This time we study endpoints... almost identical to the above.
endpoints = (access_logs
.map(lambda log: (log.endpoint, 1))
.reduceByKey(lambda a, b : a + b)
.cache())
ends = endpoints.map(lambda (x, y): x).collect()
counts = endpoints.map(lambda (x, y): y).collect()
fig = plt.figure(figsize=(8,4.2), facecolor='white', edgecolor='white')
plt.axis([0, len(ends), 0, max(counts)])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Endpoints')
plt.ylabel('Number of Hits')
plt.plot(counts)
pass
Example: Top Endpoints
Building on the result above, sort and take the top ten.
# Top Endpoints
endpointCounts = (access_logs
.map(lambda log: (log.endpoint, 1))
.reduceByKey(lambda a, b : a + b))
topEndpoints = endpointCounts.takeOrdered(10, lambda s: -1 * s[1])
print 'Top Ten Endpoints: %s' % topEndpoints
assert topEndpoints == [(u'/images/NASA-logosmall.gif', 59737), (u'/images/KSC-logosmall.gif', 50452), (u'/images/MOSAIC-logosmall.gif', 43890), (u'/images/USA-logosmall.gif', 43664), (u'/images/WORLD-logosmall.gif', 43277), (u'/images/ksclogo-medium.gif', 41336), (u'/ksc.html', 28582), (u'/history/apollo/images/apollo-logo1.gif', 26778), (u'/images/launch-logo.gif', 24755), (u'/', 20292)], 'incorrect Top Ten Endpoints'
Part 3 Analyzing Web Server Log File
Part 2 required almost no code of our own. From this part on, we write the code to meet each requirement ourselves.
Top Ten Error Endpoints
The task is to return the top 10 endpoints whose response code is not 200. We need to think through the steps:
- Filter out records whose response code is 200
- Count the records for each endpoint
- Pick the top ten endpoints
# TODO: Replace <FILL IN> with appropriate code
# HINT: Each of these <FILL IN> below could be completed with a single transformation or action.
# You are welcome to structure your solution in a different way, so long as
# you ensure the variables used in the next Test section are defined (ie. endpointSum, topTenErrURLs).
not200 = access_logs.map(lambda log: (log.endpoint, log.response_code)).filter(lambda x:x[1] != 200)
endpointCountPairTuple = not200.map(lambda x: (x[0], 1))
endpointSum = endpointCountPairTuple.reduceByKey(lambda a, b : a + b)
topTenErrURLs = endpointSum.takeOrdered(10, lambda s: -1 * s[1])
print 'Top Ten failed URLs: %s' % topTenErrURLs
Number of Unique Hosts
Count how many distinct hosts there are. This one is easy: extract the host, call distinct(), then count().
# TODO: Replace <FILL IN> with appropriate code
# HINT: Do you recall the tips from (3a)? Each of these <FILL IN> could be a transformation or action.
hosts = access_logs.map(lambda log:log.host)
uniqueHosts = hosts.distinct()
uniqueHostCount = uniqueHosts.count()
print 'Unique hosts: %d' % uniqueHostCount
Number of Unique Daily Hosts
Here we count the number of unique hosts per day. The exercise assumes all records fall within a single month, so extracting the day is enough. Don't forget to cache the result at the end.
# TODO: Replace <FILL IN> with appropriate code
dayToHostPairTuple = access_logs.map(lambda log:(log.date_time.day,log.host)).distinct()
dayGroupedHosts = dayToHostPairTuple.groupByKey()
dayHostCount = dayGroupedHosts.mapValues(len)
dailyHosts = (dayHostCount.sortByKey().cache())
dailyHostsList = dailyHosts.take(30)
print 'Unique hosts per day: %s' % dailyHostsList
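Since dayToHostPairTuple already holds distinct (day, host) pairs, the per-day count can also be done with reduceByKey instead of groupByKey, which avoids shipping whole groups of hosts across the network. A sketch of the equivalent pipeline (my addition, using the same variables as above):
# Equivalent sketch: count distinct hosts per day without groupByKey
dailyHostsAlt = (dayToHostPairTuple
                 .map(lambda pair: (pair[0], 1))
                 .reduceByKey(lambda a, b: a + b)
                 .sortByKey()
                 .cache())
print dailyHostsAlt.take(30)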
Visualizing the Number of Unique Daily Hosts
Plot the result above. One thing worth pointing out: we plot from Python lists, and anyone who paid attention to the earlier code knows that collect() turns an RDD into a list.
# TODO: Replace <FILL IN> with appropriate code
daysWithHosts = dailyHosts.map(lambda x : x[0]).collect()
hosts = dailyHosts.map(lambda x : x[1]).collect()
fig = plt.figure(figsize=(8,4.5), facecolor='white', edgecolor='white')
plt.axis([min(daysWithHosts), max(daysWithHosts), 0, max(hosts)+500])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Day')
plt.ylabel('Hosts')
plt.plot(daysWithHosts, hosts)
pass
Average Number of Daily Requests per Hosts
This time we compute, for each day, the average number of requests per host. We use join() to put the per-day request count and the per-day host count into one tuple, then divide to get the average. The per-day host counts are already computed (dailyHosts); what we still need is the per-day request count, which is simply the total number of host entries for that day. The joined result has the shape (day, (requests, hosts)).
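To make that join() structure concrete, here is a tiny toy example (my own, with made-up numbers), purely to illustrate the (key, (requests, hosts)) shape:
# Toy sketch with made-up numbers, only to show what join() produces
requestsPerDay = sc.parallelize([(1, 300), (2, 450)])
hostsPerDay = sc.parallelize([(1, 100), (2, 150)])
print requestsPerDay.join(hostsPerDay).collect()
# e.g. [(1, (300, 100)), (2, (450, 150))] (order may vary) -> per-day average = requests / hosts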
# TODO: Replace <FILL IN> with appropriate code
dayAndHostTuple = access_logs.map(lambda log:(log.date_time.day,log.host))
groupedByDay = dayAndHostTuple.groupByKey()
sortedByDay = groupedByDay.mapValues(len).join(dailyHosts)
avgDailyReqPerHost = sortedByDay.map(lambda x :(x[0], x[1][0]/x[1][1])).sortByKey().cache()
avgDailyReqPerHostList = avgDailyReqPerHost.take(30)
print 'Average number of daily requests per Hosts is %s' % avgDailyReqPerHostList
Visualizing the Average Daily Requests per Unique Host
Visualization again...
# TODO: Replace <FILL IN> with appropriate code
daysWithAvg = avgDailyReqPerHost.map(lambda x : x[0]).collect()
avgs = avgDailyReqPerHost.map(lambda x : x[1]).collect()
fig = plt.figure(figsize=(8,4.2), facecolor='white', edgecolor='white')
plt.axis([0, max(daysWithAvg), 0, max(avgs)+2])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Day')
plt.ylabel('Average')
plt.plot(daysWithAvg, avgs)
pass
Part 4 Exploring 404 Response Codes
This part focuses on records whose response code is 404.
Counting 404 Response Codes
Count the number of such records.
# TODO: Replace <FILL IN> with appropriate code
badRecords = access_logs.filter(lambda log: log.response_code== 404).cache()
print 'Found %d 404 URLs' % badRecords.count()
Listing 404 Response Code Records
# TODO: Replace <FILL IN> with appropriate code
badEndpoints = badRecords.map(lambda log: log.endpoint)
badUniqueEndpoints = badEndpoints.distinct()
badUniqueEndpointsPick40 = badUniqueEndpoints.take(40)
print '404 URLS: %s' % badUniqueEndpointsPick40
Nothing special here; we just take a look at what the URLs that return 404 look like.
Listing the Top Twenty 404 Response Code Endpoints
# TODO: Replace <FILL IN> with appropriate code
badEndpointsCountPairTuple = badRecords.map(lambda log: (log.endpoint,1))
badEndpointsSum = badEndpointsCountPairTuple.reduceByKey(lambda a,b: a+b)
badEndpointsTop20 = badEndpointsSum.takeOrdered(20,key=lambda x: -x[1])
print 'Top Twenty 404 URLs: %s' % badEndpointsTop20
A top-N sorting problem; we have run into this many times already.
Listing the Top Twenty-five 404 Response Code Hosts
# TODO: Replace <FILL IN> with appropriate code
errHostsCountPairTuple = badRecords.map(lambda log: (log.host,1))
errHostsSum = errHostsCountPairTuple.reduceByKey(lambda a,b: a+b)
errHostsTop25 = errHostsSum.takeOrdered(25,key=lambda x: -x[1])
print 'Top 25 hosts that generated errors: %s' % errHostsTop25
Same as above, only with host in place of endpoint (have you noticed by now that PySpark is really quite simple?).
Listing 404 Response Codes per Day
# TODO: Replace <FILL IN> with appropriate code
errDateCountPairTuple = badRecords.map(lambda log:(log.date_time.day,1))
errDateSum = errDateCountPairTuple.reduceByKey(lambda a,b : a+b)
errDateSorted = errDateSum.sortByKey().cache()
errByDate = errDateSorted.collect()
print '404 Errors by day: %s' % errByDate
Count the number of 404 records per day.
Visualizing the 404 Response Codes by Day
# TODO: Replace <FILL IN> with appropriate code
daysWithErrors404 = errDateSorted.map(lambda x :x[0]).collect()
errors404ByDay = errDateSorted.map(lambda x :x[1]).collect()
fig = plt.figure(figsize=(8,4.2), facecolor='white', edgecolor='white')
plt.axis([0, max(daysWithErrors404), 0, max(errors404ByDay)])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Day')
plt.ylabel('404 Errors')
plt.plot(daysWithErrors404, errors404ByDay)
pass
Top Five Days for 404 Response Codes
# TODO: Replace <FILL IN> with appropriate code
topErrDate = errDateSorted.takeOrdered(5,key=lambda x: -x[1])
print 'Top Five dates for 404 requests: %s' % topErrDate
Another top-N problem...
Hourly 404 Response Codes
# TODO: Replace <FILL IN> with appropriate code
hourCountPairTuple = badRecords.map(lambda log:(log.date_time.hour,log.response_code))
hourRecordsSum = hourCountPairTuple.groupByKey().mapValues(len)
hourRecordsSorted = hourRecordsSum.sortByKey().cache()
errHourList = hourRecordsSorted.collect()
print 'Top hours for 404 requests: %s' % errHourList
Almost the same as before, except by hour instead of by day.
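As with the daily counts, the counting can be done without groupByKey; a reduceByKey sketch (my addition, same badRecords RDD as above):
# Equivalent sketch: count 404s per hour with reduceByKey instead of groupByKey
hourRecordsSortedAlt = (badRecords
                        .map(lambda log: (log.date_time.hour, 1))
                        .reduceByKey(lambda a, b: a + b)
                        .sortByKey()
                        .cache())
print hourRecordsSortedAlt.collect()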
Visualizing the 404 Response Codes by Hour
Our old friend, visualization...
# TODO: Replace <FILL IN> with appropriate code
hoursWithErrors404 = hourRecordsSorted.map(lambda x :x[0]).collect()
errors404ByHours = hourRecordsSorted.map(lambda x :x[1]).collect()
fig = plt.figure(figsize=(8,4.2), facecolor='white', edgecolor='white')
plt.axis([0, max(hoursWithErrors404), 0, max(errors404ByHours)])
plt.grid(b=True, which='major', axis='y')
plt.xlabel('Hour')
plt.ylabel('404 Errors')
plt.plot(hoursWithErrors404, errors404ByHours)
pass
If you run into any problems, you can download the source files from my github and work through them yourself. Feedback and discussion are welcome.