The versatility of Apache Spark’s API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the real world.

Few things help you concentrate like a last-minute change to a major project.

One time, after working with a customer for three weeks to design and implement a proof-of-concept data ingest pipeline, the customer’s chief architect told us:

“You know, I really like the design. I like how data is validated on arrival. I like how we store the raw data to allow for exploratory analysis while giving the business analysts pre-computed aggregates for faster response times. I like how we automatically handle data that arrives late and changes to the data structure or algorithms.

“But,” he continued, “I really wish there was a real-time component here. There is a one-hour delay between the point when data is collected and the point when it’s available in our dashboards. I understand that this is to improve efficiency and protect us from unclean data. But for some of our use cases, being able to react immediately to new data is more important than being 100% certain of data validity.

“Can we quickly add a real-time component to the POC? It will make the results much more impressive for our users.”

Without directly articulating it, the architect was referring to what we call the lambda architecture – originally proposed by Nathan Marz – which usually combines batch and real-time components. One often needs both because data arriving in real-time has inherent issues: there is no guarantee that each event will arrive exactly once, so there may be duplicates that will add noise to the data. Data that arrives late due to network or server instability also routinely causes problems. The lambda architecture handles these issues by processing the data twice — once in the real-time view, and a second time in the batch process – to give you one view that is fast, and one that is reliable.
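To make the duplicate problem concrete, here is a minimal plain-Scala sketch (no Spark involved) of the two views. The `Event` type, its `id` field, and the `LambdaViews` object are assumptions made up for this illustration; they are not part of the customer's pipeline.

```scala
// Hypothetical event type: `id` is an assumed unique event identifier.
final case class Event(id: String, day: String)

object LambdaViews {
  // Real-time view: counts every delivery as it arrives, duplicates included.
  def realTimeView(arrivals: Seq[Event]): Map[String, Int] =
    arrivals.groupBy(_.day).map { case (day, es) => day -> es.size }

  // Batch view: reprocesses the full raw data and keeps one delivery per id.
  def batchView(arrivals: Seq[Event]): Map[String, Int] =
    arrivals
      .groupBy(_.id).values.map(_.head).toSeq // drop duplicate deliveries
      .groupBy(_.day).map { case (day, es) => day -> es.size }
}
```

If event `e2` is delivered twice on the same day, `realTimeView` counts it twice while `batchView` counts it once: the fast answer is slightly noisy, and the batch answer corrects it later.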

Why Spark?

But this approach comes with a cost: you’ll have to implement and maintain the same business logic in two different systems. For example, if your batch system is implemented with Apache Hive or Apache Pig and your real-time system is implemented with Apache Storm, you need to write and maintain the same aggregates in SQL and in Java. As Jay Kreps noted in his article “Questioning the Lambda Architecture,” this situation very quickly becomes a maintenance nightmare.

Had we implemented the customer’s POC system in Hive, I would have had to tell him: “No, there is not enough time left to re-implement our entire aggregation logic in Storm.” But fortunately, we were using Apache Spark, not Hive, for the customer’s aggregation logic.

Spark is well known as a framework for machine learning, but it is also quite capable of handling ETL tasks. Spark has clean and easy-to-use APIs (far more readable and with less boilerplate code than MapReduce), and its REPL interface allows for fast prototyping of logic with business users. And no one complains when the aggregates execute significantly faster than they would with MapReduce.

But the biggest advantage Spark gave us in this case was Spark Streaming, which allowed us to re-use the same aggregates we wrote for our batch application on a real-time data stream. We didn’t need to re-implement the business logic, nor test and maintain a second code base. As a result, we could rapidly deploy a real-time component in the limited time left — and impress not just the users but also the developers and their management.

DIY

Here’s a quick and simple example of how this was done. (For simplicity, only the most important steps are included.) You can see the complete source code here.

  1. First, we wrote a function to implement business logic. In this example, we want to count the number of errors per day in a collection of log events. The log events comprise date and time, followed by a log level, the logging process, and the actual message:

    yy/MM/dd HH:mm:ss INFO Executor: Finished task ID N

    To count the number of errors per day, we need to filter by the log level and then count the number of messages for each day:

    import org.apache.spark.rdd.RDD

    def countErrors(rdd: RDD[String]): RDD[(String, Int)] = {
      rdd
        .filter(_.contains("ERROR"))     // Keep only lines that contain "ERROR"
        .map(s => (s.split(" ")(0), 1))  // Emit (date, 1); the date is the first word
        .reduceByKey(_ + _)              // Sum the counts for each date
    }

    In the function, we keep only the lines that contain “ERROR”, then use a map function to set the first word in the line (the date) as the key and 1 as the value. Then we run reduceByKey to sum those values, which gives the number of errors for each day.

    As you can see, the function transforms one RDD into another. RDDs are Spark’s main data structure: essentially partitioned, fault-tolerant distributed collections. Spark hides the complexity of handling distributed collections from us, and we can work with them like we would with any other collection.

  2. We can use this function in a Spark ETL process to read data from HDFS to an RDD, count errors, and save the results to HDFS: 
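    val sc = new SparkContext(conf)

    val lines = sc.textFile(...)       // Read the input lines into an RDD
    val errCount = countErrors(lines)  // Reuse the business logic from step 1
    errCount.saveAsTextFile(...)       // Write the (date, count) pairs back out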

    In this example we initialized a SparkContext to execute our code within a Spark cluster. (Note that this is not necessary if you use the Spark REPL, where the SparkContext is initialized automatically.) Once the SparkContext is initialized, we use it to read lines from a file into an RDD and then execute our error count function and save the result back to a file.

    The URLs passed to sc.textFile and errCount.saveAsTextFile can point to HDFS by using hdfs://…, or to files in the local filesystem, Amazon S3, and so on.

  3. Now, suppose we can’t wait an entire day for the error counts, and need to publish updated results every minute during the day. We don’t have to re-implement the aggregation — we can just reuse it in our streaming code: 
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(60)) // 60-second micro-batches

    // Create the DStream from data sent over the network
    val dStream = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)

    // Count the errors in each RDD in the stream
    val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))

    // Print the current error count; take(1) avoids an exception on batches with no errors
    errCountStream.foreachRDD(rdd =>
      rdd.take(1).foreach(c => println("Errors this minute:%d".format(c._2))))

    // Create a stream with the running error count (updateFunc is not shown here)
    val stateStream = errCountStream.updateStateByKey[Int](updateFunc)

    // Print the running error count
    stateStream.foreachRDD(rdd =>
      rdd.take(1).foreach(c => println("Errors today:%d".format(c._2))))

    Once again, we are initializing a context; this time, it’s a StreamingContext. The StreamingContext takes a stream of events (in this case from a network socket; a production architecture would use a reliable service like Apache Kafka instead) and turns them into a stream of RDDs.

    Each RDD represents a micro-batch of the stream. The duration of each micro-batch is configurable (in this case, 60 seconds), and can be tuned to balance throughput (larger batches) against latency (smaller batches).

    We call transform on the DStream, using our countErrors function to turn each RDD of lines from the stream into an RDD of (date, errorCount) pairs.

    For each RDD we output the error count for this specific batch, and use the same RDD to update a stream with running totals of the counts. We use this stream to print the running totals.
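The updateFunc passed to updateStateByKey in step 3 is not defined in this excerpt. As a rough sketch, assuming the running total is simply the sum of all per-batch counts seen so far, it could look like the following. The `RunningTotals` object name is made up for this illustration; the function shape `(Seq[Int], Option[Int]) => Option[Int]` is what updateStateByKey[Int] expects when the stream's values are Ints.

```scala
object RunningTotals {
  // newCounts: the values that arrived for one key (date) in this micro-batch.
  // runningTotal: the previous state for that key, or None the first time.
  // Assumption: the state is a plain running sum of the per-batch counts.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (newCounts, runningTotal) => Some(runningTotal.getOrElse(0) + newCounts.sum)
}
```

Keeping the function free of Spark types means it can be checked without a cluster, for example `RunningTotals.updateFunc(Seq(2, 3), None)` returns `Some(5)`. Note that updateStateByKey also needs a checkpoint directory configured on the StreamingContext before the stream is started.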

For simplicity, this example prints the output to the screen, but you can also save it to HDFS, Apache HBase, or Kafka, where real-time applications and users can use it.
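The reuse pattern in the three steps above can also be tried without a cluster. The sketch below is a plain-Scala stand-in, not Spark code: a local Seq plays the role of an RDD, a Seq of Seqs plays the role of the stream of micro-batches, and the `LocalErrorCount` object and its method names are assumptions made up for this illustration.

```scala
object LocalErrorCount {
  // Same filter / map / reduce steps as countErrors, on a local Seq instead of an RDD.
  def countErrors(lines: Seq[String]): Map[String, Int] =
    lines
      .filter(_.contains("ERROR"))
      .map(line => (line.split(" ")(0), 1))
      .groupBy(_._1)
      .map { case (day, pairs) => day -> pairs.map(_._2).sum }

  // Batch path: one pass over the whole log.
  def batch(lines: Seq[String]): Map[String, Int] = countErrors(lines)

  // Streaming path: apply the same function to each micro-batch, then
  // fold the per-batch counts into running totals per day.
  def streaming(microBatches: Seq[Seq[String]]): Map[String, Int] =
    microBatches.map(countErrors).foldLeft(Map.empty[String, Int]) { (totals, counts) =>
      counts.foldLeft(totals) { case (acc, (day, n)) =>
        acc.updated(day, acc.getOrElse(day, 0) + n)
      }
    }
}
```

However the log is split into micro-batches, `streaming` ends up with the same totals as `batch` once every line has been seen, because both paths call the one `countErrors` function. That is the property the Spark version relies on.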

Conclusion

To recap: Spark Streaming lets you implement your business logic function once, and then reuse the code in a batch ETL process as well as a streaming process. In the customer engagement I described previously, this versatility allowed us to very quickly implement (within hours) a real-time layer to complement the batch-processing one, impress users and management with a snazzy demo, and make our flight home. But it’s not just a short-term POC win. In the long term, our architecture will require less maintenance overhead and have lower risk for errors resulting from duplicate code bases.

Acknowledgements

Thanks to Hari Shreedharan, Ted Malaska, Grant Henke, and Sean Owen for their valuable input and feedback.

Gwen Shapira is a Software Engineer (and former Solutions Architect) at Cloudera. She is also a co-author of the forthcoming book Hadoop Application Architectures from O’Reilly Media.
