Article Title

Introducing Apache Spark Datasets

Authors

Michael Armbrust, Wenchen Fan, Reynold Xin and Matei Zaharia

Article Body

Developers have always loved Apache Spark for providing APIs that are simple yet powerful, a combination of traits that makes complex analysis possible with minimal programmer effort.  At Databricks, we have continued to push Spark’s usability and performance envelope through the introduction of DataFrames and Spark SQL. These are high-level APIs for working with structured data (e.g. database tables, JSON files), which let Spark automatically optimize both storage and computation. Behind these APIs, the Catalyst optimizer and Tungsten execution engine optimize applications in ways that were not possible with Spark’s object-oriented (RDD) API, such as operating on data in a raw binary form.

Today we’re excited to announce Spark Datasets, an extension of the DataFrame API that provides a type-safe, object-oriented programming interface. Spark 1.6 includes an API preview of Datasets, and they will be a development focus for the next several versions of Spark. Like DataFrames, Datasets take advantage of Spark’s Catalyst optimizer by exposing expressions and data fields to a query planner. Datasets also leverage Tungsten’s fast in-memory encoding. Datasets extend these benefits with compile-time type safety – meaning production applications can be checked for errors before they are run. They also allow direct operations over user-defined classes.

In the long run, we expect Datasets to become a powerful way to write more efficient Spark applications. We have designed them to work alongside the existing RDD API, but improve efficiency when data can be represented in a structured form.  Spark 1.6 offers the first glimpse at Datasets, and we expect to improve them in future releases.

1. Working with Datasets

A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema.  At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and tabular representation. The tabular representation is stored using Spark’s internal Tungsten binary format, allowing for operations on serialized data and improved memory utilization.  Spark 1.6 comes with support for automatically generating encoders for a wide variety of types, including primitive types (e.g. String, Integer, Long), Scala case classes, and Java Beans.
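
To make this concrete, here is a minimal sketch of how an encoder comes into play (it assumes the usual Spark 1.6 shell bindings sc and sqlContext, and the Person case class is made up for illustration):

// Importing the implicits brings automatically generated encoders for
// primitives, tuples and case classes into scope.
import sqlContext.implicits._

case class Person(name: String, age: Long)

// The rows are stored in Tungsten's binary format, not as Person objects on the heap.
val people = Seq(Person("Alice", 29L), Person("Bob", 35L)).toDS()
people.filter(_.age > 30).show()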

Users of RDDs will find the Dataset API quite familiar, as it provides many of the same functional transformations (e.g. map, flatMap, filter).  Consider the following code, which reads lines of a text file and splits them into words:

// RDDs
val lines = sc.textFile("/wikipedia")
val words = lines
  .flatMap(_.split(" "))
  .filter(_ != "")

// Datasets
val lines = sqlContext.read.text("/wikipedia").as[String]
val words = lines
  .flatMap(_.split(" "))
  .filter(_ != "")

Both APIs make it easy to express the transformation using lambda functions. The compiler and your IDE understand the types being used, and can provide helpful tips and error messages while you construct your data pipeline.
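
As a small hedged illustration of that point (continuing the Dataset words example above, and assuming sqlContext.implicits._ is in scope), a typo in a lambda is rejected at compile time rather than surfacing as a runtime failure on the cluster:

// The element type of words is known to be String at compile time.
// words.map(_.lenght)            // does not compile: value lenght is not a member of String
val lengths = words.map(_.length) // compiles; lengths is a Dataset[Int]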

While this high-level code may look similar syntactically, with Datasets you also have access to all the power of a full relational execution engine. For example, if you now want to perform an aggregation (such as counting the number of occurrences of each word), that operation can be expressed simply and efficiently as follows:

// RDDs
val counts = words
  .groupBy(_.toLowerCase)
  .map(w => (w._1, w._2.size))

// Datasets
val counts = words
  .groupBy(_.toLowerCase)
  .count()

Since the Dataset version of word count can take advantage of the built-in aggregate count, this computation can not only be expressed with less code, but it will also execute significantly faster.  As you can see in the graph below, the Dataset implementation runs much faster than the naive RDD implementation.  In contrast, getting the same performance using RDDs would require users to manually consider how to express the computation in a way that parallelizes optimally.

Another benefit of this new Dataset API is the reduction in memory usage. Since Spark understands the structure of data in Datasets, it can create a more optimal layout in memory when caching Datasets. In the following example, we compare caching several million strings in memory using Datasets as opposed to RDDs. In both cases, caching data can lead to significant performance improvements for subsequent queries.  However, since Dataset encoders provide more information to Spark about the data being stored, the cached representation can be optimized to use 4.5x less space.
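
A rough sketch of how such a comparison can be set up (reusing the /wikipedia path from the earlier example; the cached sizes show up on the Storage tab of the Spark UI):

// Cache the lines as an RDD: stored as deserialized Java String objects.
val rddLines = sc.textFile("/wikipedia").cache()
rddLines.count()

// Cache the same lines as a Dataset: stored in the compact Tungsten binary layout.
val dsLines = sqlContext.read.text("/wikipedia").as[String].cache()
dsLines.count()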

To help you get started, we’ve put together some example notebooks: Working with Classes and Word Count.

2. Lightning-fast Serialization with Encoders

Encoders are highly optimized and use runtime code generation to build custom bytecode for serialization and deserialization.  As a result, they can operate significantly faster than Java or Kryo serialization.

In addition to speed, the resulting serialized size of encoded data can also be significantly smaller (up to 2x), reducing the cost of network transfers.  Furthermore, the serialized data is already in the Tungsten binary format, which means that many operations can be done in-place, without needing to materialize an object at all.  Spark has built-in support for automatically generating encoders for primitive types (e.g. String, Integer, Long), Scala case classes, and Java Beans.  We plan to open up this functionality and allow efficient serialization of custom types in a future release.
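
As a sketch of how encoders are obtained in the Spark 1.6 API (the Click and Event classes below are hypothetical, and the Kryo-backed encoder is only a stopgap until custom encoders are opened up):

import org.apache.spark.sql.{Encoder, Encoders}
import sqlContext.implicits._   // implicit encoders for primitives, tuples and case classes

// Case-class encoder resolved implicitly from the import above.
case class Click(url: String, timestamp: Long)
val clicks = Seq(Click("/home", 1449532800L)).toDS()

// Encoders can also be requested explicitly.
val stringEncoder: Encoder[String] = Encoders.STRING

// For a class Spark cannot analyze, Encoders.kryo serializes the whole object
// as an opaque blob, so Catalyst cannot optimize on its fields.
class Event(val payload: String) extends Serializable
implicit val eventEncoder: Encoder[Event] = Encoders.kryo[Event]
val events = Seq(new Event("ping")).toDS()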

3. Seamless Support for Semi-Structured Data

The power of encoders goes beyond performance.  They also serve as a powerful bridge between semi-structured formats (e.g. JSON) and type-safe languages like Java and Scala.

For example, consider the following dataset about universities:

{"name": "UC Berkeley", "yearFounded": 1868, numStudents: 37581}

{"name": "MIT", "yearFounded": 1860, numStudents: 11318}

…

Instead of manually extracting fields and casting them to the desired type, you can simply define a class with the expected structure and map the input data to it.  Columns are automatically lined up by name, and the types are preserved.

case class University(name: String, numStudents: Long, yearFounded: Long)

val schools = sqlContext.read.json("/schools.json").as[University]

schools.map(s => s"${s.name} is ${2015 - s.yearFounded} years old")

Encoders eagerly check that your data matches the expected schema, providing helpful error messages before you attempt to incorrectly process TBs of data. For example, if we try to use a datatype that is too small, such that conversion to an object would result in truncation (i.e. numStudents is larger than a Byte, which holds a maximum value of 127), the Analyzer will emit an AnalysisException.

case class University(numStudents: Byte)

val schools = sqlContext.read.json("/schools.json").as[University]

org.apache.spark.sql.AnalysisException: Cannot upcast `yearFounded` from bigint to smallint as it may truncate

When performing the mapping, encoders will automatically handle complex types, including nested classes, arrays, and maps.
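
A hedged sketch of that behavior (the classes and the /catalogs.json path are made up for illustration, and sqlContext.implicits._ is assumed to be in scope):

// Nested objects and arrays in the JSON are encoded recursively into the case classes.
case class Course(title: String, credits: Long)
case class CourseCatalog(school: String, courses: Seq[Course])

val catalogs = sqlContext.read.json("/catalogs.json").as[CourseCatalog]
catalogs.map(c => s"${c.school} offers ${c.courses.length} courses")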

4. A Single API for Java and Scala

Another goal of the Dataset API is to provide a single interface that is usable in both Scala and Java. This unification is great news for Java users as it ensures that their APIs won’t lag behind the Scala interfaces, code examples can easily be used from either language, and libraries no longer have to deal with two slightly different types of input. The only difference for Java users is that they need to specify the encoder to use, since the compiler does not provide type information. For example, if you wanted to process JSON data using Java you could do it as follows:

public class University implements Serializable {
  private String name;
  private long numStudents;
  private long yearFounded;

  public void setName(String name) {...}
  public String getName() {...}
  public void setNumStudents(long numStudents) {...}
  public long getNumStudents() {...}
  public void setYearFounded(long yearFounded) {...}
  public long getYearFounded() {...}
}

class BuildString implements MapFunction<University, String> {
  public String call(University u) throws Exception {
    return u.getName() + " is " + (2015 - u.getYearFounded()) + " years old.";
  }
}

Dataset<University> schools = context.read().json("/schools.json").as(Encoders.bean(University.class));
Dataset<String> strings = schools.map(new BuildString(), Encoders.STRING());

5. Looking Forward

While Datasets are a new API, we have made them interoperate easily with RDDs and existing Spark programs. Simply calling the rdd() method on a Dataset will give an RDD. In the long run, we hope that Datasets can become a common way to work with structured data, and we may converge the APIs even further.
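
A short sketch of the round trip (in Scala the conversion is exposed as the rdd field, and toDS() comes from sqlContext.implicits._):

val wordsDS = sqlContext.read.text("/wikipedia").as[String]

// Dataset -> RDD: drop down to the RDD API when needed.
val wordsRDD: org.apache.spark.rdd.RDD[String] = wordsDS.rdd

// RDD -> Dataset: come back once the element type is known.
val wordsAgain = wordsRDD.toDS()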

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets, specifically:

  • Performance optimizations – In many cases, the current implementation of the Dataset API does not yet leverage the additional information it has and can be slower than RDDs. Over the next several releases, we will be working on improving the performance of this new API.
  • Custom encoders – While we currently autogenerate encoders for a wide variety of types, we’d like to open up an API for custom objects.
  • Python support.
  • Unification of DataFrames with Datasets – due to compatibility guarantees, DataFrames and Datasets currently cannot share a common parent class. With Spark 2.0, we will be able to unify these abstractions with minor changes to the API, making it easy to build libraries that work with both.

If you’d like to try out Datasets yourself, they are already available in Databricks. We’ve put together a few example notebooks for you to try out: Working with Classes and Word Count.

Spark 1.6 is available on Databricks today; sign up for a free 14-day trial.

References

  • https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html
  • https://blog.csdn.net/sunbow0/article/details/50723233
