What’s New, What’s Changed and How to Get Started.

Are you ready for Apache Spark 2.0?

If you are just getting started with Apache Spark, the 2.0 release is the one to start with as the APIs have just gone through a major overhaul to improve ease-of-use.

If you are using an older version and want to learn what has changed, this article will give you the lowdown on why you should upgrade and what the impact on your code will be.

What’s new with Apache Spark 2.0?

Let’s start with the good news, and there’s plenty.

  • There are really only two programmatic APIs now: RDD and Dataset. For backwards compatibility, DataFrame still exists, but it is just a synonym for a Dataset.
  • Spark SQL has been improved to support a wider range of queries, including correlated subqueries (see the example below). This was largely driven by an effort to run the TPC-DS benchmarks in Spark.
  • Performance is once again significantly improved thanks to advanced “whole-stage code generation” when compiling query plans.
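
As a quick illustration of the new subquery support, a correlated scalar subquery like the one below now runs directly in Spark SQL. The people view and its columns are hypothetical and only meant to show the shape of such a query (spark here is the SparkSession created later in this article).

// register a DataFrame as a temporary view so it can be queried with SQL
peopleDf.createOrReplaceTempView("people")

// correlated subquery: compare each person against the average age in their own city
val olderThanLocalAverage = spark.sql(
  """SELECT p.name
    |FROM people p
    |WHERE p.age > (SELECT AVG(q.age) FROM people q WHERE q.city = p.city)""".stripMargin)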

CSV support is now built in and based on the Databricks spark-csv project, making it a breeze to create Datasets from CSV data with little coding.

Spark 2.0 is a major release, and there are some breaking changes that mean you may need to rewrite some of your code. Here are some things we ran into when updating our apache-spark-examples.

  • For Scala users, SparkSession replaces SparkContext and SQLContext as the top-level context, but still provides access to both for backwards compatibility.
  • DataFrame is now a synonym for Dataset[Row], and you can use the two types interchangeably, although we recommend using the latter.
  • Performing a map() operation on a Dataset now returns a Dataset rather than an RDD, reducing the need to keep switching between the two APIs, and improving performance.
  • Some Java functional interfaces, such as FlatMapFunction, have been updated to return Iterator<T> rather than Iterable<T>.

Get help upgrading to Apache Spark 2.0 or making the transition from Java to Scala. Contact Us!

RDD vs. Dataset 2.0

Both the RDD API and the Dataset API represent data sets of a specific class. For instance, you can create an RDD[Person] as well as a Dataset[Person], so both can provide compile-time type-safety. Both can also be used with the generic Row structure provided in Spark for cases where no class exists to represent the data being manipulated, such as when reading CSV files.

RDDs can be used with any Java or Scala class and operate by manipulating those objects directly with all of the associated costs of object creation, serialization and garbage collection.

Datasets are limited to classes that implement the Scala Product trait, such as case classes. There is a very good reason for this limitation. Datasets store data in an optimized binary format, often in off-heap memory, to avoid the costs of deserialization and garbage collection. Even though it feels like you are coding against regular objects, Spark is really generating its own optimized byte-code for accessing the data directly.
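
As a concrete example, the Person type used throughout the snippets below could be a simple case class along these lines (a hypothetical definition, shown only so the later examples have something to refer to):

// a case class implements the Product trait automatically,
// so Spark can derive an Encoder for it and store the data in its optimized binary format
case class Person(id: Int, firstName: String, lastName: String)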

RDD

 
// raw object manipulation
val rdd: RDD[Person] = …
val rdd2: RDD[String] = rdd.map(person => person.lastName)

Dataset

 
// optimized direct access to off-heap memory without deserializing objects
val ds: Dataset[Person] = …
val ds2: Dataset[String] = ds.map(person => person.lastName)

Getting Started with Scala

Here are some code samples to help you get started fast with Apache Spark 2.0 and Scala.

Creating SparkSession

SparkSession is now the starting point for a Spark driver program, instead of creating a SparkContext and a SQLContext.

 
val spark = SparkSession.builder
      .master("local[*]")
      .appName("Example")
      .getOrCreate()
 
// accessing legacy SparkContext and SQLContext
spark.sparkContext
spark.sqlContext

Creating a Dataset from a collection

SparkSession provides a createDataset method that accepts a collection. As with the RDD conversion below, this needs an implicit Encoder, so import spark.implicits._ first.

 
val ds: Dataset[String] = spark.createDataset(List("one","two","three"))
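
With spark.implicits._ in scope, a local collection also gains a toDS() extension method, which is a slightly more concise way to achieve the same thing:

// the implicits provide both the Encoder and the toDS() syntax
import spark.implicits._

val ds2: Dataset[String] = List("one", "two", "three").toDS()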

Converting an RDD to a Dataset

SparkSession provides a createDataset method for converting an RDD to a Dataset. This only works if you import spark.implicits._ (where spark is the name of the SparkSession variable).

 
// always import implicits so that Spark can infer types when creating Datasets
import spark.implicits._
 
val rdd: RDD[Person] = ??? // assume this exists
val dataset: Dataset[Person] = spark.createDataset[Person](rdd)

Converting a DataFrame to a Dataset

A DataFrame (which is really a Dataset[Row]) can be converted to a Dataset of a specific class by performing a map() operation.

 
// read a text file into a DataFrame a.k.a. Dataset[Row]
val df: Dataset[Row] = spark.read.text("people.txt")

// use map() to convert the DataFrame to a Dataset of a specific class
val ds: Dataset[Person] = df.map(row => parsePerson(row))

def parsePerson(row: Row) : Person = ??? // fill in parsing logic here

Reading a CSV directly as a Dataset

The built-in CSV support makes it easy to read a CSV and return a Dataset of a specific case class. This only works if the CSV contains a header row and the field names match the case class.

 
val ds: Dataset[Person] = spark.read
    .option("header","true")
    .csv("people.csv")
    .as[Person]
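
One caveat: the CSV reader treats every column as a string by default, so if Person has numeric fields you will probably also want to add .option("inferSchema", "true") (or supply an explicit schema) so that .as[Person] does not fail on the type mismatch.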

Getting Started with Java

Here are some code samples to help you get started fast with Spark 2.0 and Java.

Creating SparkSession

 
SparkSession spark = SparkSession.builder()
  .master("local[*]")
  .appName("Example")
  .getOrCreate();
 
// Java still requires a JavaSparkContext for working with the RDD API
JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

Creating a Dataset from a collection

SparkSession provides a createDataset method that accepts a collection together with an Encoder for the element type.

 
Dataset<Person> ds = spark.createDataset(
    Collections.singletonList(new Person(1, "Joe", "Bloggs")),
    Encoders.bean(Person.class)
);

Converting an RDD to a Dataset

SparkSession provides a createDataset method for converting an RDD to a Dataset. In Java, the Encoder is passed explicitly rather than being picked up implicitly.

 
Dataset<Person> ds = spark.createDataset(
  javaRDD.rdd(), // convert a JavaRDD to an RDD
  Encoders.bean(Person.class)
);

Converting a DataFrame to a Dataset

A DataFrame (which is really a Dataset<Row>) can be converted to a Dataset of a specific class by performing a map() operation.

 
// df is the Dataset<Row> (DataFrame) to convert
Dataset<Person> ds = df.map(new MapFunction<Row, Person>() {
  @Override
  public Person call(Row value) throws Exception {
    return new Person(Integer.parseInt(value.getString(0)),
                      value.getString(1),
                      value.getString(2));
  }
}, Encoders.bean(Person.class));

Reading a CSV directly as a Dataset

The built-in CSV support makes it easy to read a CSV and return a Dataset of a specific class. This only works if the CSV contains a header row and the column names match the bean’s properties.

 
Dataset<Person> ds = spark.read()
  .option("header", "true")
  .csv("testdata/people.csv")
  .as(Encoders.bean(Person.class));

Spark+Scala beats Spark+Java

Using Apache Spark with Java is harder than using it with Scala, and we spent significantly longer upgrading our Java examples than our Scala ones. We also ran into some confusing runtime errors that were hard to track down; for example, we hit a runtime error in Spark’s code generation because one of our Java classes was not declared public.

Also, we weren’t always able to use concise lambda functions, even though we were using Java 8, and had to fall back on anonymous inner classes with their verbose (and confusing) syntax.

Conclusion

Spark 2.0 represents a significant milestone in the evolution of this open source project and provides cleaner APIs and improved performance compared to the 1.6 release.

The Scala API is a joy to code with, but the Java API can often be frustrating. It’s worth biting the bullet and switching to Scala.

Full source code for a number of examples is available from our GitHub repo.

Get help upgrading to Spark 2.0 or making the transition from Java to Scala. Contact Us!
