Abstract: Following the learning path from the "Background Reading" post, I am keeping notes on several articles. This entry is my study notes on the "Beginner Tutorial" article.

文章链接: https://www.datacamp.com/community/tutorials/apache-spark-python

1. 背景知识

1.1 Spark:

General engine for big data processing

Modules: Streaming, SQL, machine learning, graph processing

Pros: speed, ease of use, generality, runs virtually everywhere

1.2 Content

Programming language; Spark with Python; RDD vs DataFrame API vs Dataset API; Spark DataFrames vs pandas DataFrames; RDD actions and transformations; cache/persist RDD, broadcast variables; intro to Spark practice with DataFrame and the Spark UI; turn off logging for PySpark.

1.3 Spark Performance: Scala or Python?

1.3.1 Scala is faster than Python and recommended for streaming data, though Structured Streaming in Spark seems to narrow the gap already.

1.3.2 For the DataFrame API, the differences between Python and Scala are not obvious.

- Favor built-in expressions when working with Python: User Defined Functions (UDFs) are less efficient than their Scala equivalents.

- Do not pass data between DataFrame and RDD unnecessarily: the serialization (object -> bytes) and deserialization (bytes -> object) involved in the transfer are expensive.

1.3.3 Scala

- Play framework -> clean + performant async code;

- Play is fully asynchronous -> many concurrent connections without dealing with threads -> easier parallel I/O calls to improve performance + the use of real-time, streaming, and server-push technologies;

1.3.4 Type Safety

Python: Good for smaller ad hoc experiments - Dynamically typed language

    Each variable name is bound only to an object unless it is null;

    Type checking happens at run time;

    No need to specify types every time;

    e.g. Ruby, Python

Scala: Good for bigger projects - Statically typed language

   Each variable name is bound both to a type and an object;

   Type checking at compile time;

   Easier and hassle-free when refactoring.
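The dynamic-typing behaviour described above can be shown in a few lines of plain Python (the variable names here are just illustrative):

```python
# Dynamic typing: a name is bound to an object, not to a type,
# and type checking only happens at run time.
x = 42          # x refers to an int object
x = "spark"     # rebinding the same name to a str is fine

try:
    x + 1       # mixing str and int fails, but only when executed
    failed = False
except TypeError:
    failed = True
```

A statically typed language like Scala would reject the equivalent rebinding at compile time instead.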

1.3.5 Advanced Features

Tools for machine learning and NLP - Spark MLlib

2. Spark Installation

The tutorial describes a local installation, as well as how to combine a notebook with a local Spark.

It also covers a Notebook + Spark kernel approach, plus a DockerHub approach. I did not fully understand either.

My own installation steps were:

1. Install the local Spark shell. It shows Scala, so it apparently uses the Scala language. Confusing.

2. Install the pyspark and py4j packages into the corresponding Anaconda environment, then run the code in a Jupyter notebook with that kernel. I configured two environments; one fails for a reason I have not found, while the other one, with the condEnv prefix, works.

3. Spark APIs: RDDs, Datasets, and DataFrames

3.1 RDD

- The building blocks of Spark

- A set of Java or Scala objects representing data

- 3 main characteristics: compile-time type safe + lazy (computed only once when needed, then cached and reused afterwards) + based on the Scala collections API

- Cons: inefficient and unreadable transformation chains; slow with non-JVM languages such as Python, and cannot be optimized by Spark

3.2 DataFrames

- Enable a higher-level abstraction that allows users to use a query language to manipulate the data.

- "Higher-level abstraction": a logical plan that represents data and a schema. It builds a data concept that wraps the RDD (Spark implements this data machinery, so users can use it directly), and on this basis the processing of the RDD can be visualized.

- Remember! DataFrames are still built on top of RDDs!

- DataFrames can be optimized with

  * Custom memory management (Project Tungsten) - makes Spark jobs much faster under memory constraints.

  * Optimized execution plans (Catalyst optimizer) - the logical plan of the DataFrame is a part of it.

- Because Python is dynamically typed, only the untyped DataFrame API (i.e., the untyped Dataset API) is available.

3.3 Datasets

DataFrames lost compile-time type safety, so the code was more prone to errors.

Datasets were introduced to combine the type safety and lambda functions of RDDs with the optimizations offered by DataFrames.

- Dataset API

  * A strongly-typed API

  * An untyped API

  * A DataFrame is a synonym for Dataset[Row] in Scala; Row is a generic untyped JVM object.

    The Dataset is a collection of strongly-typed JVM objects.

- Dataset API: static typing and run-time type safety

3.4 Summary

The higher the level of abstraction over the data, the more performance and optimization. It encourages working with more structured data and makes the APIs easier to use.

3.5 When to use?

- It is advised to use DataFrames when working with PySpark, because they are close to the DataFrame structure from the pandas library.

- Use the Dataset API when you want high-level expressions, SQL queries, columnar access, or lambda functions on semi-structured data. (untyped API)

- Use RDDs for low-level transformations and actions on unstructured data; when you do not care about imposing a schema while accessing attributes by name; when you do not need the optimization and performance benefits that DataFrames and Datasets provide for (semi-)structured data; and when you want functional programming constructs rather than domain-specific expressions.

4. Difference between Spark DataFrames and Pandas DataFrames

DataFrames are conceptually similar to tables in a relational database.

Spark DataFrames carry specific optimizations under the hood and can use distributed memory to handle big data.

Pandas DataFrames and R data frames can only run on one computer.

Spark DataFrames and pandas DataFrames integrate quite well: df.toPandas(). A wide range of external libraries and APIs can then be used.

5. RDD: Actions and Transformations

5.1 RDDs support two types of operations

- Transformations: create a new dataset from an existing one

e.g. map() - a transformation that passes each dataset element through a function and returns a new RDD representing the results.

Lazy transformations: they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.

- Actions: return a value to the driver program after the computation on the dataset

e.g. reduce() - an action that aggregates all the elements of the RDD and returns the final result to the driver program.

5.2 Advantages

Spark can run more efficiently: a dataset created through a map() operation will be used in a subsequent reduce() operation, and only the result of the last reduce is returned to the driver. That way, the reduced dataset rather than the larger mapped dataset is returned to the user. This is more efficient, without a doubt!

6. RDD: Cache/Persist & Variables: Persist/Broadcast

6.1 Cache: By default, each transformed RDD may be recomputed each time you run an action on it. But by persisting an RDD in memory, on disk, or across multiple nodes, Spark will keep the elements around on the cluster for much faster access the next time you query it.

A couple of use cases for caching or persisting RDDs are the use of iterative algorithms and fast interactive RDD use.

6.2 Persist or Broadcast Variable:

Entire RDD -> Partitions

When executing a Spark program, each partition gets sent to a worker. Each worker can cache the data if the RDD needs to be re-iterated: it stores the partition in memory so it can be reused in other actions.

Variables: when you pass a function to a Spark operation, the variables inside the function are sent to each cluster node.

Broadcast variables: useful when redistributing intermediate results of operations, such as trained models or a composed static lookup table. Broadcasting a variable sends immutable state once to each worker, avoiding the creation of a copy of the variable for each task. A cached read-only variable is kept on every machine, and these variables can be used whenever a local copy of a variable is needed.

You can create a broadcast variable with SparkContext.broadcast(variable). This will return the reference of the broadcast variable.

7. Best Practices in Spark

- Spark DataFrames are optimized and faster than RDDs, esp. for structured data.

- Better not to call collect() on large RDDs, as it drags all the data back to the application from the nodes. Every RDD element is copied onto the single driver program, which can result in running out of memory and crashing.

- Build efficient transformation chains: filter and reduce data before joining, rather than after.

- Avoid groupByKey() on large RDDs: A lot of unnecessary data is being transferred over the network. Additionally, this also means that if more data is shuffled onto a single machine than can fit in memory, the data will be spilled to disk. This heavily impacts the performance of your Spark job.

I haven't finished digesting all of this... out of time, so moving on to the code to get the homework done.

Spark Cheat Sheet:

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_Cheat_Sheet_Python.pdf

Stray thought: studying in a café, the people around me seem to work in media. Sorry, but I can't help feeling their industry is dispensable, so when they talk shop their child's-make-believe earnestness leaves me a bit speechless. Of course, that only shows my own shallowness. There are also some middle-aged guys playing Douyin out loud (facepalm). In any case, Keep Learning really is the key to staying flexible and alive; another reminder to myself to be that kind of person, like Lao Yu (and my teacher!).
