Explore the configuration changes that Cigna’s Big Data Analytics team has made to optimize the performance of its real-time architecture.

Real-time stream processing with Apache Kafka as a backbone provides many benefits. For example, this architectural pattern can handle massive, organic data growth via the dynamic addition of streaming sources such as mobile devices, web servers, system logs, and wearable device data (aka, “Internet of Things”). Kafka can also help capture data in real-time and enable the proactive analysis of that data through Spark Streaming.

At Cigna Corporation, we have implemented such a real-time architecture based on Kafka and Apache Spark/Spark Streaming. As a result, we can capture many different types of events and react to them in real-time, and utilize machine-learning algorithms that constantly learn from new data as it arrives. We can also enrich data that arrives in batch with data that was just created in real-time.

This approach required significant tuning of our Spark Streaming application for optimal performance, however. In the remainder of this post, we’ll describe the details of that tuning. (Note: these results are specific to Cigna’s environment; your mileage may vary. But, they provide a good starting point for experimentation.)

Architecture Overview

As you may know, Spark Streaming enables the creation of real-time complex event processing architectures, and Kafka is a real-time, fault-tolerant, highly scalable data pipeline (or pub-sub system) for data streams. Spark Streaming can be configured to consume topics from Kafka, and create corresponding Kafka DStreams. Each DStream batches incoming messages into an abstraction called the RDD, which is an immutable collection of incoming messages. Each RDD is a micro-batch of the incoming messages, and the micro-batching window is configurable.

As illustrated below, most of the events we track originate from different types of portals (1). When users visit monitored pages or links, those events are captured and sent to Kafka through an Apache Flume agent (2). A Spark Streaming application then reads that data asynchronously and in parallel (3), with the Kafka events arriving in mini-batches (one-minute window).  The streaming application parses the semi-structured events, and then enriches them with other data from a large (125 million records) Apache Hive table (4). This table is read from Hive via Spark’s HiveContext and cached in memory. It also parses the rest of the event using the DataFrame API to build a structure around it. Once the record is built, it persists the DataFrame as a row in a Hive table. This same information can also be written back to a Kafka topic (5); while the physical layout of the files in the table is Apache Parquet, the rows are served up as a JSON response over HTTP to enterprise services. The enriched events can then become available to other applications or other systems in real time over Kafka.

This data is accessible through a RESTful API or JDBC/ODBC (6). For example, using the Impyla API for Apache Impala (incubating), we can make the data accessible as JSON over HTTP7mdash;a simple but effective low-level data service. Using dashboard creation tools like Looker and Tableau over Impala also helps users query and visualize live event data that has been enriched and processed in real time.

Summary of Optimizations

When we first deployed the entire solution, the Kafka and Flume components were performing well. But the Spark Streaming application was taking nearly 4-8 minutes, depending on resources allocated, to process a single batch. This latency was due to the use of DataFrames to enrich the data from the very large Hive table mentioned previously, and due to various undesirable configuration options.

To optimize our processing time, we started down two paths: First, to cache data and partition it where appropriate, and second to tune the Spark application via configuration changes. (We also packaged this application as a Cloudera Manager service by creating a Custom Service Descriptor and parcel, but that step is out of scope for this post.)

The spark-submit command we use to run the Spark app is shown below. It reflects all the options that, together with coding improvements, resulted in significantly less processing time: from 4-8 minutes to under 25 seconds.

/opt/app/dev/spark-1.5./bin/spark-submit \
--jars \
/opt/cloudera/parcels/CDH/jars/zkclient-0.3.jar,/opt/cloudera/parcels/CDH/jars/kafka_2.-0.8.1.1.jar,/opt/app/dev/jars/datanucleus-core-3.2..jar,/opt/app/dev/jars/datanucleus-api-jdo-3.2..jar,/opt/app/dev/jars/datanucleus-rdbms-3.2..jar \
--files /opt/app/dev/spark-1.5./conf/hive-site.xml,/opt/app/dev/jars/log4j-eir.properties \
--queue spark_service_pool \
--master yarn \
--deploy-mode cluster \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=6G -XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
--conf "spark.sql.tungsten.enabled=false" \
--conf "spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory" \
--conf "spark.eventLog.enabled=true" \
--conf "spark.sql.codegen=false" \
--conf "spark.sql.unsafe.enabled=false" \
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.locality.wait=1s" \
--conf "spark.streaming.blockInterval=1500ms" \
--conf "spark.shuffle.consolidateFiles=true" \
--driver-memory 10G \
--executor-memory 8G \
--executor-cores \
--num-executors \
--class com.bigdata.streaming.OurApp \ /opt/app/dev/jars/OurStreamingApplication.jar external_props.conf

Next, we’ll describe the configuration changes and caching approach in detail.

Driver Options

Note that the driver is being run on the cluster and that we are running Spark on YARN. Because Spark Streaming apps are long running, the log files generated can be very large. To solve for this issue, we limited the number of messages written to the logs and used the RollingFileAppender to limit their maximum size. We also disabled console log messages by turning off the spark.ui.showConsoleProgress option.

Also, during testing, our driver frequently ran out of memory due to the permanent generation space filling up. (The permanent space is where the classes, methods, internalized strings, and similar objects used by the VM are stored and never de-allocated.) Increasing the permanent space to 6GB solved the problem:

spark.driver.extraJavaOptions=-XX:MaxPermSize=6G

Garbage Collection

Because our streaming application is a long-running process, after a period of processing time, we noticed long GC pauses that we wanted to either minimize or keep in the background. Adjusting the UseConcMarkSweepGC
parameter seemed to do the trick:

--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \

(The G1 collector is being considered for a future release.)

Disabling Tungsten

Tungsten is a major revamp of the Spark execution engine, and, as such, could prove to be problematic in its first release. Disabling Tungsten simplified our Spark SQL DAGs somewhat. We will re-evaluate Tungsten in the future when it is more hardened for version 1.6, especially if it optimizes shuffles.

We disabled the following flags:

spark.sql.tungsten.enabled=false
spark.sql.codegen=false
spark.sql.unsafe.enabled=false

Enabling Backpressure

Spark Streaming has trouble with situations where the batch-processing time is larger than the batch interval. In other words, Spark will not be able to read data from the topic faster than it arrives—the Kafka receiver for the executor won’t be able to keep up. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the receiver’s executor overflows.

Setting the following alleviated this issue:

spark.streaming.backpressure.enabled=true

Adjusting Locality and Blocks

These two options are complementary: One determines how long to wait for the data to be local to a task/executor, and the other is used by the Spark Streaming receivers to chunk data into blocks. The larger the data blocks the better, but if the data is not local to the executors, it will have to move over the network to wherever the task will be executed. We had to find a good balance between these two options because we don’t want the data blocks to be large, nor do we want to wait too long for locality. We want all the tasks in our application to finish within seconds.

Thus, we changed the locality option to 1 second from the default 3 seconds, thereby enabling one of the 20 executors to start after 1 second has passed.  We also changed the block interval to 1.5 seconds.

--conf "spark.locality.wait=1s" \
--conf "spark.streaming.blockInterval=1500ms" \

Consolidating Intermediate Files

Enabling this flag is recommended for ext4 filesystems because it results in fewer intermediate files, thereby improving filesystem performance for shuffles.

--conf "spark.shuffle.consolidateFiles=true" \

Enabling Executor Performance

While configuring a Kafka DStream, you can specify the number of parallel consumer threads. However, the consumers of a DStream will run on the same Spark Driver node. Thus, to do parallel consumption of a Kafka topic from multiple machines, you have to instantiate multiple DStreams. Although one approach would be to union the corresponding RDDs before processing, we found it cleaner to run multiple instances of the application and make them part of the same Kafka consumer group.

To do that, we enabled 20 executors and 20 cores per executor.

--executor-memory 8G
--executor-cores
--num-executors

We provided about 8GB memory to each executor to ensure that cached data remains in memory, and that there is enough room for the heap to shrink and grow. Caching reference datasets in memory helps tremendously when we run heavy DataFrame joins. This approach also speeds up processing of batches: 7,000 of them in 17-20 seconds, according to a recent benchmark.

Caching Approach

Cache the RDD before using it, but remember to remove it from cache to make room for the next batched iteration. Also, caching any data that is used multiple times, beyond the foreach loop, helps a lot. In our case, we cached the 125 million records in our Hive table as a DataFrame, partitioned that data, and used it in multiple joins. That change shaved nearly 4 minutes from total batch-processing time.

However, don’t make the number of partitions too large. Rather, keeping the number of partitions low will reduce the number of tasks and keep scheduling delays to a minimum. It will also ensure that larger chunks of data are processed with minimal delays. To confirm that the number of executors is proportional to the partitions we have, we simply kept the partitions at:

# of executors * # of cores = # of partitions

For instance, (20 * 20) = 400 partitions. Once the RDD is no longer needed in memory, rdd.unpersist() is called to swap it back out to disk.

(Note: The DataFrame API, although very effective, leaves a lot of processing to the underlying Spark system. In testing, we found that using the RDD API reduced processing time and excessive shuffling.)

Conclusion

Thanks to these changes, we now have a Spark Streaming app that is long running, uses resources responsibly, and can process real-time data within a few seconds. It provides:

  • Near real-time access to data + a view of history (batch) = all data
  • The ability to handle organic growth of data
  • The ability to proactively analyze data
  • Access to “continuous” data from many sources
  • The ability to detect an event when it actually occurs
  • The ability to combine events into patterns of behavior in real-time

As next steps, we are now looking at ways to further optimize our joins, simplify our DAGs, and reduce the number of shuffles that can occur.  We are also experimenting with Kafka Direct Streams, which may give us the ability to control data flow and implement redundancy and resilience through check-pointing.

While Cigna’s journey with Kafka and Spark Streaming is only beginning, we are excited about the doors it opens in the realm of real-time data analytics, and our quest to help people live healthier, happier lives.

Mohammad Quraishi (@AtifQ) is a Senior Principal Technologist at Cigna Corporation and has 20 years of experience in application architecture, design, and development. He has specific experience in mobile native applications, SOA platform implementation, web development, distributed applications, object-oriented analysis and design, requirements analysis, data modeling, and database design.

Jeff Shmain is a Senior Solutions Architect at Cloudera.

How Cigna Tuned Its Spark Streaming App for Real-time Processing with Apache Kafka的更多相关文章

  1. Kafka:ZK+Kafka+Spark Streaming集群环境搭建(十三)kafka+spark streaming打包好的程序提交时提示虚拟内存不足(Container is running beyond virtual memory limits. Current usage: 119.5 MB of 1 GB physical memory used; 2.2 GB of 2.1 G)

    异常问题:Container is running beyond virtual memory limits. Current usage: 119.5 MB of 1 GB physical mem ...

  2. Spark Streaming fileStream实现原理

    fileStream是Spark Streaming Basic Source的一种,用于“近实时”地分析HDFS(或者与HDFS API兼容的文件系统)指定目录(假设:dataDirectory)中 ...

  3. 基于Spark Streaming + Canal + Kafka对Mysql增量数据实时进行监测分析

    Spark Streaming可以用于实时流项目的开发,实时流项目的数据源除了可以来源于日志.文件.网络端口等,常常也有这种需求,那就是实时分析处理MySQL中的增量数据.面对这种需求当然我们可以通过 ...

  4. 整合Kafka到Spark Streaming——代码示例和挑战

    作者Michael G. Noll是瑞士的一位工程师和研究员,效力于Verisign,是Verisign实验室的大规模数据分析基础设施(基础Hadoop)的技术主管.本文,Michael详细的演示了如 ...

  5. Spark学习之路(十五)—— Spark Streaming 整合 Flume

    一.简介 Apache Flume是一个分布式,高可用的数据收集系统,可以从不同的数据源收集数据,经过聚合后发送到分布式计算框架或者存储系统中.Spark Straming提供了以下两种方式用于Flu ...

  6. Spark 系列(十五)—— Spark Streaming 整合 Flume

    一.简介 Apache Flume 是一个分布式,高可用的数据收集系统,可以从不同的数据源收集数据,经过聚合后发送到分布式计算框架或者存储系统中.Spark Straming 提供了以下两种方式用于 ...

  7. 4. Spark Streaming解析

    4.1 初始化StreamingContext import org.apache.spark._ import org.apache.spark.streaming._ val conf = new ...

  8. Spark Streaming实践和优化

    发表于:<程序员>杂志2016年2月刊.链接:http://geek.csdn.net/news/detail/54500 作者:徐鑫,董西成 在流式计算领域,Spark Streamin ...

  9. Spark Streaming 整合 Flume

    Spark Streaming 整合 Flume ​ 一.简介二.推送式方法        2.1 配置日志收集Flume        2.2 项目依赖        2.3 Spark Strea ...

随机推荐

  1. asp.net core系列 33 EF查询数据 (2)

    一. 原生SQL查询 接着上篇讲.通过 Entity Framework Core 可以在使用关系数据库时下降到原始 SQL 查询. 在无法使用 LINQ 表达要执行的查询时,或因使用 LINQ 查询 ...

  2. WebSocket刨根问底(一)

    年初的时候,写过两篇博客介绍在Spring Boot中如何使用WebSocket发送消息[在Spring Boot框架下使用WebSocket实现消息推送][在Spring Boot框架下使用WebS ...

  3. Java开发知识之Java类的高级特性,内部类.以及包使用.跟常量关键字

    Java开发知识之Java类的高级特性,内部类.以及包使用.跟常量关键字 一丶Java中包的机制 首先包其实就是个文件夹.作用就是管理类. Java中每次定义一个类的时候.通过Java编译之后.都会生 ...

  4. SpringMvc 请求中日期类型参数接收一二事儿

    首先说明:以版本为Spring 4.3.0为测试对象: 开启<mvc:annotation-driven /> 测试场景一:请求中含有date属性,该类型为日期类型,SpringMvc采用 ...

  5. 学JAVA第十一天,属性与方法

    今天清明节假期结束第二天,昨天请了一天假去考科目二,还好考过了,O(∩_∩)O哈哈~~~~ 今天老师讲了类的属性与方法的使用 就用代码来说明吧: package pkg3;public class T ...

  6. Java开发笔记(二十五)方法的输入参数

    前面通过main方法介绍了方法的定义形式,对于方法的输入参数来说,还有几个值得注意的地方,接下来分别对输入参数的几种用法进行阐述.一个方法可以有输入参数,也可以没有输入参数,倘若无需输入参数,则方法定 ...

  7. mybatis报错:Caused by: java.lang.IllegalArgumentException: Caches collection already contains value for com.crm.dao.PaperUserMapper

    一.问题 eclipse启动时报下面的错误: Caused by: java.lang.IllegalArgumentException: Caches collection already cont ...

  8. append和appendTo的区别!

    今天在写dome的时候,碰到了一小点问题,就是我们想把一个小效果用jquery的办法添加到HTML页面中.我用的办法就是先在HTML中把代码写完,js和css同样写好并调试完成后.然后只保存外面最大的 ...

  9. 2019-01-23 JavaScript实现ZLOGO: 性能改进

    主攻前文吴烜:JavaScript实现ZLOGO: 界面改进与速度可调的几个性能问题 在线演示: 圈3 源码仍在: program-in-chinese/quan3 之前是在绘制过程中计算每帧需要绘制 ...

  10. 41.Odoo产品分析 (四) – 工具板块(10) – 问卷(1)

    查看Odoo产品分析系列--目录 在该模块下,可以创建问卷,收集答案,打印统计.  安装"问卷"模块,首页显示当前各个阶段中的问卷:  打开"开发者模式",能对 ...