See Apache Spark 2.0 API Improvements: RDD, DataFrame, DataSet and SQL here.

Apache Spark is evolving at a rapid pace, including changes and additions to core APIs. One of the most disruptive areas of change is around the representation of data sets. Spark 1.0 used the RDD API but in the past twelve months, two new alternative and incompatible APIs have been introduced. Spark 1.3 introduced the radically different DataFrame API and the recently released Spark 1.6 release introduces a preview of the new Dataset API.

Many existing Spark developers will be wondering whether to jump from RDDs directly to the Dataset API, or whether to first move to the DataFrame API. Newcomers to Spark will have to choose which API to start learning with.

This article provides an overview of each of these APIs, and outlines the strengths and weaknesses of each one. A companion github repository provides working examples that are a good starting point for experimentation with the approaches outlined in this article.

Talk to a Spark expert.Contact Us.

The RDD (Resilient Distributed Dataset) API has been in Spark since the 1.0 release. This interface and its Java equivalent, JavaRDD, will be familiar to any developers who have worked through the standard Spark tutorials. From a developer’s perspective, an RDD is simply a set of Java or Scala objects representing data.

The RDD API provides many transformation methods, such as map()filter(), and reduce() for performing computations on the data. Each of these methods results in a new RDD representing the transformed data. However, these methods are just defining the operations to be performed and the transformations are not performed until an action method is called. Examples of action methods are collect() and saveAsObjectFile().

Example of RDD transformations and actions

Scala:

 
1
2
3
rdd.filter(_.age > 21)              // transformation
   .map(_.last)                     // transformation                    
   .saveAsObjectFile("under21.bin") // action

Java:

 
1
2
3
rdd.filter(p -> p.getAge() < 21)     // transformation
   .map(p -> p.getLast())            // transformation
   .saveAsObjectFile("under21.bin"); // action

The main advantage of RDDs is that they are simple and well understood because they deal with concrete classes, providing a familiar object-oriented programming style with compile-time type-safety. For example, given an RDD containing instances of Person we can filter by age by referencing the age attribute of each Person object:

Example: Filter by attribute with RDD

Scala:

 
1
rdd.filter(_.age > 21)

Java:

 
1
rdd.filter(person -> person.getAge() > 21)

The main disadvantage to RDDs is that they don’t perform particularly well. Whenever Spark needs to distribute the data within the cluster, or write the data to disk, it does so using Java serialization by default (although it is possible to use Kryo as a faster alternative in most cases). The overhead of serializing individual Java and Scala objects is expensive and requires sending both data and structure between nodes (each serialized object contains the class structure as well as the values). There is also the overhead of garbage collection that results from creating and destroying individual objects.

DataFrame API

Spark 1.3 introduced a new DataFrame API as part of the Project Tungsten initiative which seeks to improve the performance and scalability of Spark. The DataFrame API introduces the concept of a schema to describe the data, allowing Spark to manage the schema and only pass data between nodes, in a much more efficient way than using Java serialization. There are also advantages when performing computations in a single process as Spark can serialize the data into off-heap storage in a binary format and then perform many transformations directly on this off-heap memory, avoiding the garbage-collection costs associated with constructing individual objects for each row in the data set. Because Spark understands the schema, there is no need to use Java serialization to encode the data.

The DataFrame API is radically different from the RDD API because it is an API for building a relational query plan that Spark’s Catalyst optimizer can then execute. The API is natural for developers who are familiar with building query plans, but not natural for the majority of developers. The query plan can be built from SQL expressions in strings or from a more functional approach using a fluent-style API.

Example: Filter by attribute with DataFrame

Note that these examples have the same syntax in both Java and Scala

SQL Style

 
1
df.filter("age > 21");

Expression builder style:

 
1
df.filter(df.col("age").gt(21));

Because the code is referring to data attributes by name, it is not possible for the compiler to catch any errors. If attribute names are incorrect then the error will only detected at runtime, when the query plan is created.

Another downside with the DataFrame API is that it is very scala-centric and while it does support Java, the support is limited. For example, when creating a DataFrame from an existing RDD of Java objects, Spark’s Catalyst optimizer cannot infer the schema and assumes that any objects in the DataFrame implement the scala.Product interface. Scala case classes work out the box because they implement this interface.

Dataset API

The Dataset API, released as an API preview in Spark 1.6, aims to provide the best of both worlds; the familiar object-oriented programming style and compile-time type-safety of the RDD API but with the performance benefits of the Catalyst query optimizer. Datasets also use the same efficient off-heap storage mechanism as the DataFrame API.

When it comes to serializing data, the Dataset API has the concept of encoders which translate between JVM representations (objects) and Spark’s internal binary format. Spark has built-in encoders which are very advanced in that they generate byte code to interact with off-heap data and provide on-demand access to individual attributes without having to de-serialize an entire object. Spark does not yet provide an API for implementing custom encoders, but that is planned for a future release.

Additionally, the Dataset API is designed to work equally well with both Java and Scala. When working with Java objects, it is important that they are fully bean-compliant. In writing the examples to accompany this article, we ran into errors when trying to create a Dataset in Java from a list of Java objects that were not fully bean-compliant.

Need help with Spark APIs? Contact Us.

Example: Creating Dataset from a list of objects

Scala

 
1
2
3
4
5
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val sampleData: Seq[ScalaPerson] = ScalaData.sampleData()
val dataset = sqlContext.createDataset(sampleData)

Java

 
1
2
3
4
JavaSparkContext sc = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(sc);
List data = JavaData.sampleData();
Dataset dataset = sqlContext.createDataset(data, Encoders.bean(JavaPerson.class));

Transformations with the Dataset API look very much like the RDD API and deal with the Person class rather than an abstraction of a row.

Example: Filter by attribute with Dataset

Scala

 
1
dataset.filter(_.age < 21);

Java

 
1
dataset.filter(person -> person.getAge() < 21);

Despite the similarity with RDD code, this code is building a query plan, rather than dealing with individual objects, and if age is the only attribute accessed, then the rest of the the object’s data will not be read from off-heap storage.

Get started with your Big Data Strategy. Contact Us.

Conclusions

If you are developing primarily in Java then it is worth considering a move to Scala before adopting the DataFrame or Dataset APIs. Although there is an effort to support Java, Spark is written in Scala and the code often makes assumptions that make it hard (but not impossible) to deal with Java objects.

If you are developing in Scala and need your code to go into production with Spark 1.6.0 then the DataFrame API is clearly the most stable option available and currently offers the best performance.

However, the Dataset API preview looks very promising and provides a more natural way to code. Given the rapid evolution of Spark it is likely that this API will mature very quickly through 2016 and become the de-facto API for developing new applications.

See Apache Spark 2.0 API Improvements: RDD, DataFrame, DataSet and SQL here.

There Are Now 3 Apache Spark APIs. Here’s How to Choose the Right One的更多相关文章

  1. A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets(中英双语)

    文章标题 A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets 且谈Apache Spark的API三剑客:RDD.Dat ...

  2. 且谈 Apache Spark 的 API 三剑客:RDD、DataFrame 和 Dataset

    作者:Jules S. Damji 译者:足下 本文翻译自 A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets ,翻译已 ...

  3. Why Apache Spark is a Crossover Hit for Data Scientists [FWD]

    Spark is a compelling multi-purpose platform for use cases that span investigative, as well as opera ...

  4. Apache Spark 2.2.0 中文文档 - 概述 | ApacheCN

    Spark 概述 Apache Spark 是一个快速的, 多用途的集群计算系统. 它提供了 Java, Scala, Python 和 R 的高级 API,以及一个支持通用的执行图计算的优化过的引擎 ...

  5. Apache Spark 2.2.0 中文文档 - Spark SQL, DataFrames and Datasets Guide | ApacheCN

    Spark SQL, DataFrames and Datasets Guide Overview SQL Datasets and DataFrames 开始入门 起始点: SparkSession ...

  6. 分享一个.NET平台开源免费跨平台的大数据分析框架.NET for Apache Spark

    今天早上六点半左右微信群里就看到张队发的关于.NET Spark大数据的链接https://devblogs.microsoft.com/dotnet/introducing-net-for-apac ...

  7. APACHE SPARK 2.0 API IMPROVEMENTS: RDD, DATAFRAME, DATASET AND SQL

    What’s New, What’s Changed and How to get Started. Are you ready for Apache Spark 2.0? If you are ju ...

  8. Introducing Apache Spark Datasets(中英双语)

    文章标题 Introducing Apache Spark Datasets 作者介绍 Michael Armbrust, Wenchen Fan, Reynold Xin and Matei Zah ...

  9. Introducing DataFrames in Apache Spark for Large Scale Data Science(中英双语)

    文章标题 Introducing DataFrames in Apache Spark for Large Scale Data Science 一个用于大规模数据科学的API——DataFrame ...

随机推荐

  1. C++中int与string的转化

    C++中int与string的转化 int本身也要用一串字符表示,前后没有双引号,告诉编译器把它当作一个数解释.缺省情况下,是当成10进制(dec)来解释,如果想用8进制,16进制,怎么办?加上前缀, ...

  2. 关于HotSpot VM以及Java语言的动态编译 你可能想知道这些

    目录 1 HotSpot VM的历史 2 HotSpot VM 概述 2.1 编译器 2.2 解释器 2.3 解释型语言 VS 编译型语言 3 动态编译 3.1 什么是动态编译 3.2 HotSpot ...

  3. 给vs2015添加EF

    今天做EF的小例子时,发现需要添加实体数据模型,但是不管怎么找在新建项中都找不到这个选项,这是怎么回事,于是就开始百度吧,有的说可能是VS安装时没有全选,也有的人说可能是重装VS时,没有将注册表清除, ...

  4. 解决端口冲突问题(查询端口占用进程并kill)

    1. 查看端口占用 在windows命令行窗口下执行: netstat -aon|findstr "8080" TCP 127.0.0.1:80 0.0.0.0:0 LISTENI ...

  5. [二十六]JavaIO之再回首恍然(如梦? 大悟?)

    流分类回顾 本文是JavaIO告一段落的总结帖 希望对JavaIO做一个基础性的总结(不涉及NIO) 从实现的角度进行简单的介绍 下面的这两个表格,之前出现过 数据源形式 InputStream Ou ...

  6. Eclipse查看JDK源码(非常详细)

    Eclipse查看源码的方式其实很简单,打开项目,然后按着ctrl,然后把鼠标光标移动到你想查看的方法或者对象上,这时会出现一条下划线,然后点击鼠标左键就可以进入那个方法或者对象了.但是有的情况下会出 ...

  7. Oracle最新的Sql笔试题及答案

    部门表(SM_DEPT) 字段名称 数据类型 是否主键 注释 DEPT_ID NUMBER Y 部门ID PARENT_DEPARTMENT_ID NUMBER N 上级部门 DEPARTMENT_N ...

  8. Linux学习笔记之MySql的安装(CentOS)

    一.移除mariadb 由于CentOS默认安装了mariadb,所以在安装MySql之前先移除mariadb,使用命令:yum remove mariadb-libs.x86_64,如下图所示: 二 ...

  9. Android项目刮刮奖详解(二)

    Android项目刮刮奖详解(一) 前言 上期我们简单地实现了一个画板的功能,用户可以在上面乱写乱画,其实,刮刮奖也是如此,用户刮奖的时候也是乱写乱画的. 刮刮奖原理 一共有两层画布,底层画布存放中奖 ...

  10. JAVA设计模式——开篇

    设计模式很重要,重要性我就不再复述了.最主要的是,通常我们在写一定量代码后,常用的方法什么的都熟悉后,想再提高代码能力,我找到的最好的方法还是去学习,理解设计模式.不理解设计模式,看一些开源框架和ja ...