Apache Spark 2.0: Faster, Easier, and Smarter

http://blog.madhukaraphatak.com/categories/spark-two/

https://amplab.cs.berkeley.edu/technical-preview-of-apache-spark-2-0-easier-faster-and-smarter/

 

 

Dataset - New Abstraction of Spark

For a long time, the RDD was the standard abstraction in Spark.
Starting with Spark 2.0, Dataset becomes the new abstraction layer for Spark. The RDD API remains available, but it becomes a low-level API used mostly for runtime and library development. User-land code is expected to be written against the Dataset abstraction and its subset, the DataFrame API.

 

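The key change in 2.0 is that a high-level Dataset abstraction layer has been added on top of the low-level RDD abstraction, which makes development more convenient for users.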

 

From the definition: “A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.”

This sounds similar to the RDD definition:

“RDD represents an immutable, partitioned collection of elements that can be operated on in parallel.”

 

Dataset is a superset of the DataFrame API, which was released in Spark 1.3.

A DataFrame is a specialized Dataset: DataFrame = Dataset[Row]

 

SparkSession - New entry point of Spark

In earlier versions of Spark, SparkContext was the entry point. Because the RDD was the main API, it was created and manipulated through the context's APIs. Every other API needed its own context: StreamingContext for streaming, SQLContext for SQL, and HiveContext for Hive. As the Dataset and DataFrame APIs become the new standard, they need an entry point built for them. Spark 2.0 therefore introduces SparkSession as the entry point for the Dataset and DataFrame APIs.

Because there is a new abstraction layer, a new entry point is needed: SparkSession,

just as SparkContext is the entry point that corresponds to RDDs.

 

Stronger SQL Support

On the SQL side, we have significantly expanded the SQL capabilities of Spark, with the introduction of a new ANSI SQL parser and support for subqueries.

Spark 2.0 can run all the 99 TPC-DS queries, which require many of the SQL:2003 features.

 

Tungsten 2.0

Spark 2.0 ships with the second generation Tungsten engine.

This engine builds upon ideas from modern compilers and MPP databases and applies them to data processing.

The main idea is to emit optimized bytecode at runtime that collapses the entire query into a single function, eliminating virtual function calls and leveraging CPU registers for intermediate data. We call this technique “whole-stage code generation.”
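To make the idea concrete, here is a toy sketch in plain Python; it is not Spark's actual generated code. It evaluates one hypothetical query (the sum of `x * 2` over rows where `x > 10`) in two ways: through a tree of iterator-style operators, which costs a generic call per row per operator, and through a single fused function whose source is generated and compiled at runtime, with the intermediate values held in local variables. All class and function names are made up for illustration.

```python
from typing import Callable, Iterable, Iterator


# Iterator-style operators: each one pulls rows from its child through a
# generic interface, so every row passes through several calls.
class Scan:
    def __init__(self, data: Iterable[int]) -> None:
        self.data = data

    def rows(self) -> Iterator[int]:
        yield from self.data


class Filter:
    def __init__(self, child, predicate: Callable[[int], bool]) -> None:
        self.child, self.predicate = child, predicate

    def rows(self) -> Iterator[int]:
        for row in self.child.rows():
            if self.predicate(row):
                yield row


class Project:
    def __init__(self, child, fn: Callable[[int], int]) -> None:
        self.child, self.fn = child, fn

    def rows(self) -> Iterator[int]:
        for row in self.child.rows():
            yield self.fn(row)


def run_operator_tree(data: Iterable[int]) -> int:
    plan = Project(Filter(Scan(data), lambda x: x > 10), lambda x: x * 2)
    return sum(plan.rows())


def generate_fused(predicate_src: str, projection_src: str) -> Callable:
    """Emit source for one fused loop over the whole query and compile it
    at runtime (a stand-in for emitting optimized bytecode)."""
    source = (
        "def fused(data):\n"
        "    total = 0\n"
        "    for x in data:\n"
        f"        if {predicate_src}:\n"
        f"            total += {projection_src}\n"
        "    return total\n"
    )
    namespace: dict = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["fused"]


# The whole filter + project + sum pipeline collapses into one function.
run_fused = generate_fused("x > 10", "x * 2")

data = range(100)
assert run_operator_tree(data) == run_fused(data)
```

Both paths return the same answer; the difference is that the fused version does its work in one loop with no per-row operator calls, which is the effect the engine is after.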

Faster, through memory management and CPU-oriented performance optimizations.

 

Structured Streaming

Spark 2.0’s Structured Streaming APIs are a novel way to approach streaming.
They stem from the realization that the simplest way to compute answers on streams of data is to not have to reason about the fact that it is a stream.

 

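In other words, don't think about the stream at all; a stream is just an unbounded DataFrame.

This is the typical Spark way of thinking: batch is the foundation of everything, which is the exact opposite of Flink.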
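The "stream as an unbounded table" idea can be illustrated with a small toy in plain Python; this is not the Structured Streaming API. The same word-count query is written once as a batch query over a static table and once as an incremental computation fed by micro-batches. After every micro-batch, the incremental result equals the batch query over all rows that have arrived so far. The function names and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def batch_count(rows: Iterable[str]) -> Counter:
    """The query written as if the input were a static table."""
    return Counter(rows)


def incremental_count(micro_batches: Iterable[List[str]]) -> Iterator[Counter]:
    """The same query run incrementally: each micro-batch appends rows to a
    conceptually unbounded table, and the running result is updated."""
    state: Counter = Counter()
    for batch in micro_batches:
        state.update(batch)
        yield Counter(state)  # snapshot of the result table after this batch


micro_batches = [["spark", "flink"], ["spark"], ["storm", "spark"]]

seen: List[str] = []
for batch, result in zip(micro_batches, incremental_count(micro_batches)):
    seen.extend(batch)
    # The streaming answer matches the batch answer over all rows so far.
    assert result == batch_count(seen)

final = batch_count(seen)
```

The user only writes the batch-style query; keeping the running state up to date as new rows arrive is the engine's job.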
