Introductory Article Study (1) - Beginner Tutorial
Abstract: Following the learning path laid out in the "Background Knowledge Reference" article, I recorded my study of several articles. These are my notes on the "Beginner Tutorial" article.
文章链接: https://www.datacamp.com/community/tutorials/apache-spark-python
1. Background Knowledge
1.1 Spark:
General engine for big data processing
Modules: Streaming, SQL, machine learning, graph processing
Pros: Speed, ease of use, generality, runs virtually anywhere
1.2 Content
Programming language; Spark with Python; RDD vs DataFrame API vs Dataset API; Spark DataFrames vs Pandas DataFrames; RDD actions and transformations; Cache/Persist RDD, Broadcast variables; Intro to Spark practice with DataFrame and Spark UI; Turn off the logging for PySpark.
1.3 Spark Performance: Scala or Python?
1.3.1 Scala is faster than Python and is recommended for streaming data, although Structured Streaming in Spark already seems to narrow the gap.
1.3.2 For the DataFrame API, the differences between Python and Scala are not obvious.
- Favor built-in expressions when working with Python, because Python User Defined Functions (UDFs) are less efficient than their Scala equivalents; a short comparison sketch follows this list.
- Do not pass data between DataFrame and RDD unnecessarily: the serialization (object -> bytes) and deserialization (bytes -> object) involved in the transfer are expensive.
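As a small illustration of the first point, here is my own sketch (not from the tutorial; the column names and data are made up) comparing a Python UDF with an equivalent built-in expression that stays inside the JVM:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = spark.createDataFrame([("alice", 3), ("bob", 7)], ["name", "score"])

# Python UDF: each row is shipped to a Python worker and back (slow path).
double_udf = F.udf(lambda x: x * 2, IntegerType())
df.withColumn("doubled", double_udf("score")).show()

# Built-in expression: stays inside the JVM and benefits from Catalyst/Tungsten.
df.withColumn("doubled", F.col("score") * 2).show()
```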
1.3.3 Scala
- Play framework -> clean + performant async code;
- Play is fully asynchronous -> have concurrent connections without dealing with threads -> Easier I/O calls in parallel to improve performance + the use of real-time, streaming, and server push technologies;
1.3.4 Type Safety
Python: Good for smaller ad hoc experiments - Dynamically typed language
Each variable name is bound only to an object unless it is null;
Type checking happens at run time;
No need to specify types every time;
e.g., Ruby, Python
Scala: Bigger projects - Statically typed language
Each variable name is bound both to a type and an object;
Type checking at compile time;
Easier and hassle-free when refactoring.
1.3.5 Advanced Features
Tools for machine learning and NLP - Spark MLlib
2. Spark Installation
The tutorial covers a local installation, how to combine a notebook with a local Spark installation, the Notebook + Spark kernel approach, and a Docker Hub image. I did not fully understand any of these.
My own installation steps were:
1. Install the local spark-shell. The prompt shows Scala, so apparently it uses the Scala language. Confusing.
2. In Anaconda, install the pyspark and py4j packages into the target environment, then run code in a Jupyter notebook with the corresponding kernel. I configured two environments; one failed and I could not find the cause, while the other one (with the condaEnv prefix) works. A minimal setup sketch follows.
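A minimal sketch of the notebook setup I ended up with, assuming pyspark and py4j are already installed in the active conda environment (e.g. via `pip install pyspark py4j`); the app name is arbitrary:

```python
from pyspark.sql import SparkSession

# Start a local Spark session from inside the Jupyter kernel.
spark = (SparkSession.builder
         .master("local[*]")           # use all local cores
         .appName("beginner-tutorial")
         .getOrCreate())

print(spark.version)                   # quick sanity check that the kernel works
spark.stop()
```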
3. Spark APIs: RDDs, Datasets and DataFrames
3.1 RDD
- The building blocks of Spark
- A set of Java or Scala objects representing data
- 3 main characteristics: compile-time type safe + lazy (transformations are only computed when an action needs the result, and the computed data can then be cached and reused) + based on the Scala collections API
- Cons: transformation chains can become inefficient and unreadable; slow with non-JVM languages such as Python and cannot be optimized by Spark
3.2 DataFrames
- Enables a higher level of abstraction, allowing users to manipulate the data with a query language.
- "Higher level abstraction": a logical plan that represents data and a schema. It is a data concept that wraps the RDD (Spark implements this machinery, so users can use it directly), and processing of the underlying RDD is planned on top of it.
- Remember! DataFrames are still built on top of RDDs!
- DataFrames can be optimized with
* Custom memory management (project Tungsten) - makes Spark jobs much faster given memory constraints.
* Optimized execution plans (Catalyst optimizer) - the logical plan of the DataFrame is part of this.
- Because Python is dynamically typed, only the untyped DataFrame API is available; the strongly-typed Dataset API is not. (A small sketch follows this subsection.)
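To make the "DataFrame on top of an RDD" idea concrete, here is a sketch of my own (not from the tutorial; the data is made up): it builds a DataFrame from an RDD of Rows, prints the schema, and asks Catalyst for the execution plan.

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("df-on-rdd").getOrCreate()

rdd = spark.sparkContext.parallelize([Row(name="alice", age=23),
                                      Row(name="bob", age=31)])
df = spark.createDataFrame(rdd)       # wraps the RDD with a schema

df.printSchema()                      # the schema part of the abstraction
df.filter(df.age > 25).explain()      # the plan the Catalyst optimizer builds
print(df.rdd.take(1))                 # the underlying RDD is still accessible
```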
3.3 Datasets
DataFrames lost the compile-time type safety, so the code was more prone to errors.
Datasets were introduced to combine the type safety and lambda functions provided by RDDs with the optimizations offered by DataFrames.
- Dataset API
* A strongly-typed API
* An untyped API
* A DataFrame is a synonym for Dataset[Row] in Scala; Row is a generic untyped JVM object.
The Dataset is a collection of strongly-typed JVM objects.
- Dataset API: static typing and runtime type safety
3.4 Summary
The higher the level of abstraction over the data, the better the performance and optimization. Higher-level abstractions also push you toward more structured data and make the APIs easier to use.
3.5 When to use?
- It is advised to use DataFrames when working with PySpark, because they are close to the DataFrame structure from the pandas library.
- Use the Dataset API when you want high-level expressions, SQL queries, columnar access or lambda functions... on semi-structured data. (untyped API)
- Use RDDs when you need low-level transformations and actions on unstructured data, do not care about imposing a schema when accessing attributes by name, do not need the optimization and performance benefits that DataFrames and Datasets offer for (semi-)structured data, and prefer functional programming constructs over domain-specific expressions.
4. Difference between Spark DataFrames and Pandas DataFrames
DataFrames are comparable to tables in a relational database.
Spark DataFrames carry specific optimizations under the hood and can use distributed memory to handle big data.
Pandas DataFrames and R data frames can only run on one computer.
Spark DataFrames and Pandas DataFrames integrate quite well: df.toPandas() converts one into the other, after which a wide range of external libraries and APIs can be used. (See the sketch below.)
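A small round-trip sketch (my own example, not from the tutorial; assumes pandas is installed). Note that toPandas() collects everything onto the driver, so it should only be used on small results:

```python
from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("to-pandas").getOrCreate()
sdf = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

pdf = sdf.toPandas()                  # Spark DataFrame -> pandas DataFrame (on the driver)
sdf2 = spark.createDataFrame(pdf)     # pandas DataFrame -> Spark DataFrame
print(type(pdf), type(sdf2))
```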
5. RDD: Actions and Transformations
5.1 RDDs support two types of operations
- Transformations: create a new dataset from an existing one
e.g. map() - A transformation that passes each dataset element through a function and returns a new RDD representing the results.
Lazy transformations: they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.
- Actions: return a value to the driver program after the computation on the dataset
e.g. reduce() - An action that aggregates all the elements of the RDD and returns the final result to the driver program.
5.2 Advantages
Spark can run more efficiently: a dataset created through a map() operation will be used in a subsequent reduce() operation, and only the result of the last reduce is returned to the driver. That way, the reduced data set rather than the larger mapped data set is returned to the user. This is more efficient, without a doubt! The sketch below walks through this map()/reduce() pattern.
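A minimal sketch of the pattern, with made-up data; only the final scalar travels back to the driver:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-ops").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4])
squared = numbers.map(lambda x: x * x)        # transformation: lazy, nothing runs yet
total = squared.reduce(lambda a, b: a + b)    # action: triggers the computation
print(total)                                  # 30 -- only this value reaches the driver
```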
6. RDD: Cache/Persist & Variables: Persist/Broadcast
6.1 Cache: By default, each transformed RDD may be recomputed each time you run an action on it. But by persisting an RDD in memory, on disk or across multiple nodes, Spark will keep the elements around on the cluster for much faster access the next time you query it.
A couple of use cases for caching or persisting RDDs are the use of iterative algorithms and fast interactive RDD use.
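A caching sketch of my own (assumed example, not from the tutorial): the first action computes and caches the partitions, the second is served from the cache.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

data = sc.parallelize(range(100000)).map(lambda x: x * 2)
data.persist(StorageLevel.MEMORY_ONLY)   # or simply data.cache()

print(data.count())   # first action: computes and caches the partitions
print(data.sum())     # second action: reuses the cached partitions
data.unpersist()
```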
6.2 Persist or Broadcast Variable:
Entire RDD -> Partitions
When executing a Spark program, each partition gets sent to a worker. Each worker can cache the data if the RDD needs to be re-iterated: the partition is stored in memory and reused in other actions.
Variables: when you pass a function to a Spark operation, the variables used inside the function are sent to each cluster node.
Broadcast variables: useful when redistributing intermediate results of operations, such as trained models or a composed static lookup table. Broadcasting a variable sends the immutable state once to each worker and avoids creating a copy of the variable for every task on every machine. A cached read-only variable is kept on each machine and can be used whenever a local copy of the variable is needed.
You can create a broadcast variable with SparkContext.broadcast(variable). This returns a reference to the broadcast variable. (See the sketch below.)
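A sketch using a broadcast variable as a read-only lookup table on the workers (the lookup contents are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

country_lookup = {"CN": "China", "DE": "Germany", "US": "United States"}
bc_lookup = sc.broadcast(country_lookup)   # shipped once to each worker

codes = sc.parallelize(["CN", "US", "CN", "DE"])
names = codes.map(lambda c: bc_lookup.value.get(c, "unknown"))
print(names.collect())
```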
7. Best Practices in Spark
- Spark DataFrames are optimized and faster than RDDs, especially for structured data.
- Better not to call collect() on large RDDs, because it drags the data back to the application from the nodes. Every RDD element is copied onto the single driver program, which can run out of memory and crash.
- Build efficient transformation chains: filter and reduce data before joining, rather than after.
- Avoid groupByKey() on large RDDs: a lot of unnecessary data is transferred over the network. Additionally, if more data is shuffled onto a single machine than fits in memory, the data is spilled to disk, which heavily impacts the performance of your Spark job. (The sketch below contrasts groupByKey() with reduceByKey().)
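A toy word-count sketch of the last point (my own example): reduceByKey() combines values on each partition before the shuffle, whereas groupByKey() shuffles every (key, value) pair first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("group-vs-reduce").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# groupByKey: all pairs cross the network before aggregation.
grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey: partial sums are computed per partition, then shuffled -- usually preferred.
reduced = pairs.reduceByKey(lambda a, b: a + b)

print(sorted(grouped.collect()), sorted(reduced.collect()))
```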
I haven't digested everything yet... I'm running out of time, so I'll look at the code first and get the assignment done.
Spark Cheat Sheet:
https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf
Stray thoughts: I was studying in a coffee shop, and the people around me seemed to work in media. Sorry, but I just feel that this kind of industry is dispensable, so when they discuss work matters their playing-house seriousness leaves me a little speechless. Of course, that only shows my own shallowness. There were also some middle-aged men playing Douyin videos out loud (facepalm). In any case, Keep Learning really is the key to staying flexible and alive; I remind myself again to be like Lao Yu (and my teacher!) and be that kind of person.