How Cigna Tuned Its Spark Streaming App for Real-time Processing with Apache Kafka
Explore the configuration changes that Cigna’s Big Data Analytics team has made to optimize the performance of its real-time architecture.
Real-time stream processing with Apache Kafka as a backbone provides many benefits. For example, this architectural pattern can handle massive, organic data growth via the dynamic addition of streaming sources such as mobile devices, web servers, system logs, and wearable device data (aka, “Internet of Things”). Kafka can also help capture data in real-time and enable the proactive analysis of that data through Spark Streaming.
At Cigna Corporation, we have implemented such a real-time architecture based on Kafka and Apache Spark/Spark Streaming. As a result, we can capture many different types of events and react to them in real-time, and utilize machine-learning algorithms that constantly learn from new data as it arrives. We can also enrich data that arrives in batch with data that was just created in real-time.
This approach required significant tuning of our Spark Streaming application for optimal performance, however. In the remainder of this post, we’ll describe the details of that tuning. (Note: these results are specific to Cigna’s environment; your mileage may vary. But, they provide a good starting point for experimentation.)
Architecture Overview
As you may know, Spark Streaming enables the creation of real-time complex event processing architectures, and Kafka is a real-time, fault-tolerant, highly scalable data pipeline (or pub-sub system) for data streams. Spark Streaming can be configured to consume topics from Kafka and create corresponding Kafka DStreams. Each DStream batches incoming messages into an abstraction called the RDD, an immutable collection of messages. Each RDD is thus a micro-batch of the incoming stream, and the micro-batching window is configurable.
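As a minimal sketch (in Scala, using the receiver-based Kafka integration; the topic name, ZooKeeper quorum, and consumer group below are placeholders, not Cigna's actual values), creating such a Kafka DStream with a one-minute micro-batch window looks roughly like this:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("OurStreamingApplication")
// One-minute micro-batch window, matching the batch interval described below.
val ssc = new StreamingContext(conf, Seconds(60))

// Receiver-based Kafka DStream: ZooKeeper quorum, consumer group, and topic -> consumer-thread count.
val rawEvents = KafkaUtils.createStream(ssc, "zkhost1:2181,zkhost2:2181",
  "our-consumer-group", Map("portal_events" -> 1))
val events = rawEvents.map(_._2) // keep only the message payload; the key is not needed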
As illustrated below, most of the events we track originate from different types of portals (1). When users visit monitored pages or links, those events are captured and sent to Kafka through an Apache Flume agent (2). A Spark Streaming application then reads that data asynchronously and in parallel (3), with the Kafka events arriving in mini-batches (a one-minute window). The streaming application parses the semi-structured events and enriches them with data from a large (125 million records) Apache Hive table (4), which is read via Spark's HiveContext and cached in memory. The application also uses the DataFrame API to impose structure on the rest of each event; once a record is built, it is persisted as a row in a Hive table. The same information can also be written back to a Kafka topic (5). While the physical layout of the files in the table is Apache Parquet, the rows are served up as a JSON response over HTTP to enterprise services. The enriched events thus become available to other applications and systems in real time over Kafka.
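The post does not include the application code itself, but a sketch of this parse-enrich-persist flow, continuing from the `events` stream created above, might look like the following (the case class, parseEvent helper, table names, and join key are all hypothetical):

// All names below (case class, parser, table names, join key) are illustrative assumptions.
case class PortalEvent(memberId: String, page: String, eventTime: String)

def parseEvent(raw: String): PortalEvent = {
  // Hypothetical parser for the semi-structured event payload.
  val fields = raw.split('\t')
  PortalEvent(fields(0), fields(1), fields(2))
}

val hiveContext = new org.apache.spark.sql.hive.HiveContext(ssc.sparkContext)
import hiveContext.implicits._

// Read the large reference table once and cache it so every micro-batch can reuse it.
val reference = hiveContext.table("ref_db.member_reference").cache()

events.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val parsed = rdd.map(parseEvent).toDF()
    // Enrich the parsed events by joining against the cached reference data.
    val enriched = parsed.join(reference, "memberId")
    // Append the enriched rows to the Parquet-backed Hive table.
    enriched.write.mode("append").saveAsTable("events_db.enriched_events")
  }
}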
This data is accessible through a RESTful API or JDBC/ODBC (6). For example, using the Impyla API for Apache Impala (incubating), we can make the data accessible as JSON over HTTP, a simple but effective low-level data service. Using dashboard creation tools like Looker and Tableau over Impala also helps users query and visualize live event data that has been enriched and processed in real time.

Summary of Optimizations
When we first deployed the entire solution, the Kafka and Flume components were performing well. But the Spark Streaming application was taking 4-8 minutes, depending on the resources allocated, to process a single batch. This latency was caused by the DataFrame joins used to enrich the data against the very large Hive table mentioned previously, and by several suboptimal configuration settings.
To optimize our processing time, we started down two paths: First, to cache data and partition it where appropriate, and second to tune the Spark application via configuration changes. (We also packaged this application as a Cloudera Manager service by creating a Custom Service Descriptor and parcel, but that step is out of scope for this post.)
The spark-submit command we use to run the Spark app is shown below. It reflects all the options that, together with coding improvements, resulted in significantly less processing time: from 4-8 minutes to under 25 seconds.
/opt/app/dev/spark-1.5./bin/spark-submit \
--jars \
/opt/cloudera/parcels/CDH/jars/zkclient-0.3.jar,/opt/cloudera/parcels/CDH/jars/kafka_2.-0.8.1.1.jar,/opt/app/dev/jars/datanucleus-core-3.2..jar,/opt/app/dev/jars/datanucleus-api-jdo-3.2..jar,/opt/app/dev/jars/datanucleus-rdbms-3.2..jar \
--files /opt/app/dev/spark-1.5./conf/hive-site.xml,/opt/app/dev/jars/log4j-eir.properties \
--queue spark_service_pool \
--master yarn \
--deploy-mode cluster \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=6G -XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
--conf "spark.sql.tungsten.enabled=false" \
--conf "spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory" \
--conf "spark.eventLog.enabled=true" \
--conf "spark.sql.codegen=false" \
--conf "spark.sql.unsafe.enabled=false" \
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.locality.wait=1s" \
--conf "spark.streaming.blockInterval=1500ms" \
--conf "spark.shuffle.consolidateFiles=true" \
--driver-memory 10G \
--executor-memory 8G \
--executor-cores 20 \
--num-executors 20 \
--class com.bigdata.streaming.OurApp \
/opt/app/dev/jars/OurStreamingApplication.jar external_props.conf
Next, we’ll describe the configuration changes and caching approach in detail.
Driver Options
Note that the driver runs on the cluster, as we are running Spark on YARN. Because Spark Streaming apps are long running, the log files they generate can become very large. To address this, we limited the number of messages written to the logs and used log4j's RollingFileAppender to cap their size. We also suppressed console progress messages by turning off the spark.ui.showConsoleProgress option.
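For reference, a minimal log4j sketch along these lines (the file name, size limits, and pattern layout are illustrative assumptions; our actual log4j-eir.properties differs):

# Illustrative rolling-log configuration; values are assumptions, not our shipped settings.
log4j.rootLogger=WARN, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark-streaming-app.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n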
Also, during testing, our driver frequently ran out of memory because the permanent generation space filled up. (The permanent generation is where the JVM stores classes, methods, interned strings, and similar objects, and it is never de-allocated.) Increasing the permanent generation space to 6GB solved the problem:
spark.driver.extraJavaOptions=-XX:MaxPermSize=6G
Garbage Collection
Because our streaming application is a long-running process, we noticed long GC pauses after a period of processing that we wanted to either minimize or keep in the background. Enabling the concurrent mark-sweep collector via the UseConcMarkSweepGC flag seemed to do the trick:
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
(The G1 collector is being considered for a future release.)
Disabling Tungsten
Tungsten is a major revamp of the Spark execution engine and, as such, could prove problematic in its first release. Disabling Tungsten simplified our Spark SQL DAGs somewhat. We will re-evaluate Tungsten once it is further hardened in version 1.6, especially if it optimizes shuffles.
We disabled the following flags:
spark.sql.tungsten.enabled=false
spark.sql.codegen=false
spark.sql.unsafe.enabled=false
Enabling Backpressure
Spark Streaming runs into trouble when the batch-processing time exceeds the batch interval. In other words, Spark cannot read data from the topic as fast as it arrives, and the Kafka receiver on the executor can't keep up. If this imbalance is sustained long enough, it leads to an unstable situation in which the memory of the receiver's executor overflows.
Setting the following alleviated this issue:
spark.streaming.backpressure.enabled=true
Adjusting Locality and Blocks
These two options are complementary: One determines how long to wait for the data to be local to a task/executor, and the other is used by the Spark Streaming receivers to chunk data into blocks. The larger the data blocks the better, but if the data is not local to the executors, it will have to move over the network to wherever the task will be executed. We had to find a good balance between these two options because we don’t want the data blocks to be large, nor do we want to wait too long for locality. We want all the tasks in our application to finish within seconds.
Thus, we changed the locality wait from the default 3 seconds to 1 second, so that a task waits at most 1 second for a data-local executor before being scheduled on any of the 20 executors. We also changed the block interval to 1.5 seconds.
--conf "spark.locality.wait=1s" \
--conf "spark.streaming.blockInterval=1500ms" \
Consolidating Intermediate Files
Enabling this flag is recommended for ext4 filesystems because it results in fewer intermediate files, thereby improving filesystem performance for shuffles.
--conf "spark.shuffle.consolidateFiles=true" \
Enabling Executor Performance
While configuring a Kafka DStream, you can specify the number of parallel consumer threads. However, the consumer threads of a single DStream all run within one receiver, on a single executor. Thus, to consume a Kafka topic in parallel from multiple machines, you have to instantiate multiple DStreams. Although one approach would be to union the corresponding DStreams before processing (see the sketch below), we found it cleaner to run multiple instances of the application and make them part of the same Kafka consumer group.
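A rough sketch of the union alternative, reusing the `ssc` context from the earlier sketch (the number of streams, hosts, and topic are illustrative assumptions):

// Create several receiver-based streams for the same topic and consumer group,
// so consumption is spread across multiple receivers (and therefore multiple executors).
val numStreams = 4
val kafkaStreams = (1 to numStreams).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost1:2181,zkhost2:2181", "our-consumer-group", Map("portal_events" -> 1))
}
// Union them into a single DStream so the processing logic is applied only once.
val unified = ssc.union(kafkaStreams).map(_._2)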
For that parallel consumption, we enabled 20 executors with 20 cores per executor.
--executor-memory 8G
--executor-cores 20
--num-executors 20
We provided about 8GB memory to each executor to ensure that cached data remains in memory, and that there is enough room for the heap to shrink and grow. Caching reference datasets in memory helps tremendously when we run heavy DataFrame joins. This approach also speeds up processing of batches: 7,000 of them in 17-20 seconds, according to a recent benchmark.
Caching Approach
Cache the RDD before using it, but remember to remove it from cache to make room for the next batched iteration. Also, caching any data that is used multiple times, beyond the foreach loop, helps a lot. In our case, we cached the 125 million records in our Hive table as a DataFrame, partitioned that data, and used it in multiple joins. That change shaved nearly 4 minutes from total batch-processing time.
However, don’t make the number of partitions too large. Keeping the number of partitions low reduces the number of tasks and keeps scheduling delays to a minimum; it also ensures that larger chunks of data are processed with minimal delay. To keep the number of partitions proportional to the executors and cores available, we simply set the partition count to:
# of executors * # of cores = # of partitions
For instance, (20 * 20) = 400 partitions. Once the RDD is no longer needed in memory, rdd.unpersist() is called to release the cached blocks.
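A condensed sketch of that lifecycle, continuing from the `events` stream created earlier and focusing only on the cache/repartition/unpersist pattern (the partition math assumes our 20 executors with 20 cores each):

val numPartitions = 20 * 20 // # of executors * # of cores per executor = 400

events.foreachRDD { rdd =>
  // Repartition and cache once so the multiple joins that follow reuse the same in-memory blocks.
  val batch = rdd.repartition(numPartitions).cache()
  // ... parse, join against the cached reference data, write results ...
  // Release the cached blocks before the next micro-batch arrives.
  batch.unpersist()
}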
(Note: The DataFrame API, although very effective, leaves a lot of processing to the underlying Spark system. In testing, we found that using the RDD API reduced processing time and excessive shuffling.)
Conclusion
Thanks to these changes, we now have a Spark Streaming app that is long running, uses resources responsibly, and can process real-time data within a few seconds. It provides:
- Near real-time access to data + a view of history (batch) = all data
- The ability to handle organic growth of data
- The ability to proactively analyze data
- Access to “continuous” data from many sources
- The ability to detect an event when it actually occurs
- The ability to combine events into patterns of behavior in real-time
As next steps, we are now looking at ways to further optimize our joins, simplify our DAGs, and reduce the number of shuffles. We are also experimenting with Kafka Direct Streams, which may give us the ability to control data flow and implement redundancy and resilience through checkpointing.
While Cigna’s journey with Kafka and Spark Streaming is only beginning, we are excited about the doors it opens in the realm of real-time data analytics, and our quest to help people live healthier, happier lives.
Mohammad Quraishi (@AtifQ) is a Senior Principal Technologist at Cigna Corporation and has 20 years of experience in application architecture, design, and development. He has specific experience in mobile native applications, SOA platform implementation, web development, distributed applications, object-oriented analysis and design, requirements analysis, data modeling, and database design.
Jeff Shmain is a Senior Solutions Architect at Cloudera.