CS100.1x-lab0

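This is the first assignment submitted in CS100.1x, and it exists only to test our setup. The related ipynb file is on my GitHub. There is not much to say about it, so I will go through it briefly here and explain things in more detail in later posts. The lab is divided into 5 parts.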

Part 1: Test Spark functionality

Parallelize, filter, and reduce

# Check that Spark is working
largeRange = sc.parallelize(xrange(100000))
reduceTest = largeRange.reduce(lambda a, b: a + b)
filterReduceTest = largeRange.filter(lambda x: x % 7 == 0).sum()
print reduceTest
print filterReduceTest

# If the Spark jobs don't work properly these will raise an AssertionError
assert reduceTest == 4999950000
assert filterReduceTest == 714264285

The first three lines of code, respectively, turn a Python collection into an RDD, add up all of its values, and add up only the values divisible by 7.
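To see where the two asserted numbers come from, here is a minimal sketch that repeats the same reduce and filter steps on a plain Python range instead of an RDD. It is written in Python 3 and uses no Spark, so the names `large_range`, `reduce_test` and `filter_reduce_test` are stand-ins of mine, not part of the lab.

```python
from functools import reduce

# Local stand-in for the RDD: a plain Python range (no Spark needed).
large_range = range(100000)

# Same fold as largeRange.reduce(lambda a, b: a + b).
reduce_test = reduce(lambda a, b: a + b, large_range)

# Same idea as largeRange.filter(lambda x: x % 7 == 0).sum().
filter_reduce_test = sum(x for x in large_range if x % 7 == 0)

# Closed forms: 0 + 1 + ... + (n - 1) = n(n - 1)/2, and the multiples of 7
# below n add up to 7 * (0 + 1 + ... + k) with k = (n - 1) // 7.
n = 100000
k = (n - 1) // 7
assert reduce_test == n * (n - 1) // 2 == 4999950000
assert filter_reduce_test == 7 * k * (k + 1) // 2 == 714264285
```

The closed forms are only a cross-check: they show that the constants in the lab's asserts are ordinary arithmetic sums, so a failing assert points at the Spark setup rather than at the numbers.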

Loading a text file

# Check loading data with sc.textFile
import os.path
baseDir = os.path.join('data')
inputPath = os.path.join('cs100', 'lab1', 'shakespeare.txt')
fileName = os.path.join(baseDir, inputPath)

rawData = sc.textFile(fileName)
shakespeareCount = rawData.count()
print shakespeareCount

# If the text file didn't load properly an AssertionError will be raised
assert shakespeareCount == 122395

The first part of this code builds the file path; the second part reads the text file and counts its lines.
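The two steps can be tried without Spark or the course data. The sketch below, in Python 3, builds the same path with `os.path.join` and then counts lines in a small temporary file; `count_lines` and the sample file are hypothetical stand-ins for `sc.textFile(...).count()` and `shakespeare.txt`.

```python
import os
import tempfile


def count_lines(path):
    """Hypothetical helper: count the lines of a text file, one record per line."""
    with open(path) as handle:
        return sum(1 for _ in handle)


# Same path construction as in the lab code.
base_dir = os.path.join('data')
input_path = os.path.join('cs100', 'lab1', 'shakespeare.txt')
file_name = os.path.join(base_dir, input_path)

# Stand-in file so the sketch runs without the course data set.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.txt')
    with open(sample, 'w') as handle:
        handle.write('To be, or not to be\nthat is the question\n')
    sample_count = count_lines(sample)
```

Using `os.path.join` rather than a hard-coded string keeps the path separator correct on whichever operating system runs the notebook.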

Part 2: Check class testing library

Compare with hash

# TEST Compare with hash (2a)
# Check our testing library/package
# This should print '1 test passed.' on two lines
from test_helper import Test

twelve = 12
Test.assertEquals(twelve, 12, 'twelve should equal 12')
Test.assertEqualsHashed(twelve, '7b52009b64fd0a2a49e6d8a939753077792b0554',
                        'twelve, once hashed, should equal the hashed value of 12')

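This tests the hashed comparison: the value is hashed and compared against a known hash, so the expected answer never appears in the notebook. There is not much more to say.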
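The internals of `test_helper` are not shown here, so the following is only a sketch of how a hashed comparison could work. It assumes the helper hashes the string form of the value with SHA-1, which is a guess based on the 40-hex-digit hash in the call; `sha1_of` and `equals_hashed` are hypothetical names.

```python
import hashlib


def sha1_of(value):
    """Hypothetical helper: SHA-1 hex digest of the value's string form."""
    return hashlib.sha1(str(value).encode('utf-8')).hexdigest()


def equals_hashed(value, expected_hash):
    """Check a value against a published hash without storing the value itself."""
    return sha1_of(value) == expected_hash


twelve = 12
digest = sha1_of(twelve)

# The right value matches its own digest; a different value does not.
matches_itself = equals_hashed(twelve, digest)
matches_other = equals_hashed(13, digest)
```

The point of the design is that graders can publish the digest in the notebook: students can check their answer against it, but cannot read the answer off it.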

Compare lists

# TEST Compare lists (2b)
# This should print '1 test passed.'
unsortedList = [(5, 'b'), (5, 'a'), (4, 'c'), (3, 'a')]
Test.assertEquals(sorted(unsortedList), [(3, 'a'), (4, 'c'), (5, 'a'), (5, 'b')],
                  'unsortedList does not sort properly')

This checks a sort: the list of tuples is sorted and compared against the expected order.
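The expected order follows from how Python compares tuples: element by element, so ties on the first element are broken by the second. A small sketch in Python 3, where `sorted_list` and `by_letter` are illustrative names of mine:

```python
unsorted_list = [(5, 'b'), (5, 'a'), (4, 'c'), (3, 'a')]

# Default sort: compare the first elements, then the second on ties,
# which is why (5, 'a') comes before (5, 'b').
sorted_list = sorted(unsorted_list)

# Sorting on the letter only; sorted() is stable, so the two 'a' entries
# keep their original relative order.
by_letter = sorted(unsorted_list, key=lambda pair: pair[1])

assert sorted_list == [(3, 'a'), (4, 'c'), (5, 'a'), (5, 'b')]
assert by_letter == [(5, 'a'), (3, 'a'), (5, 'b'), (4, 'c')]
```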

Part 3: Check plotting

Our first plot

# Check matplotlib plotting
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from math import log

# function for generating plot layout
def preparePlot(xticks, yticks, figsize=(10.5, 6), hideLabels=False, gridColor='#999999', gridWidth=1.0):
    plt.close()
    fig, ax = plt.subplots(figsize=figsize, facecolor='white', edgecolor='white')
    ax.axes.tick_params(labelcolor='#999999', labelsize='10')
    for axis, ticks in [(ax.get_xaxis(), xticks), (ax.get_yaxis(), yticks)]:
        axis.set_ticks_position('none')
        axis.set_ticks(ticks)
        axis.label.set_color('#999999')
        if hideLabels: axis.set_ticklabels([])
    plt.grid(color=gridColor, linewidth=gridWidth, linestyle='-')
    # hide the four borders of the plot area
    for position in ['bottom', 'top', 'left', 'right']:
        ax.spines[position].set_visible(False)
    return fig, ax

# generate layout and plot data
x = range(1, 50)
y = [log(x1 ** 2) for x1 in x]
fig, ax = preparePlot(range(5, 60, 10), range(0, 12, 1))
plt.scatter(x, y, s=14**2, c='#d6ebf2', edgecolors='#8cbfd0', alpha=0.75)
ax.set_xlabel(r'$range(1, 50)$')
ax.set_ylabel(r'$\log_e(x^2)$')
pass

Anyone familiar with matplotlib will recognize what this does: it generates some data and then plots it.

Running the code produces the figure: a scatter plot of $\log_e(x^2)$ for x in range(1, 50).
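The plotted data can be inspected without matplotlib. The sketch below, in Python 3, regenerates the same `x` and `y` values and checks that they fit inside the tick ranges passed to `preparePlot`; the names `x_ticks`, `y_ticks`, `y_min` and `y_max` are mine, introduced only for illustration.

```python
from math import isclose, log

# Same data as in the plotting cell.
x = range(1, 50)
y = [log(x1 ** 2) for x1 in x]

# For x > 0, log(x^2) = 2 * log(x), so the curve is a scaled natural log.
assert all(isclose(y1, 2 * log(x1)) for x1, y1 in zip(x, y))

# Tick positions passed to preparePlot in the lab code.
x_ticks = list(range(5, 60, 10))
y_ticks = list(range(0, 12, 1))

# The data range sits inside the outermost y ticks.
y_min, y_max = min(y), max(y)
assert y_ticks[0] <= y_min and y_max <= y_ticks[-1]
```

Since the largest value is log(49^2), a little under 8, the y ticks from 0 to 11 leave some empty space at the top of the figure.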
