Explore the configuration changes that Cigna’s Big Data Analytics team has made to optimize the performance of its real-time architecture.

Real-time stream processing with Apache Kafka as a backbone provides many benefits. For example, this architectural pattern can handle massive, organic data growth via the dynamic addition of streaming sources such as mobile devices, web servers, system logs, and wearable device data (aka, “Internet of Things”). Kafka can also help capture data in real-time and enable the proactive analysis of that data through Spark Streaming.

At Cigna Corporation, we have implemented such a real-time architecture based on Kafka and Apache Spark/Spark Streaming. As a result, we can capture many different types of events and react to them in real-time, and utilize machine-learning algorithms that constantly learn from new data as it arrives. We can also enrich data that arrives in batch with data that was just created in real-time.

This approach required significant tuning of our Spark Streaming application for optimal performance, however. In the remainder of this post, we’ll describe the details of that tuning. (Note: these results are specific to Cigna’s environment; your mileage may vary. But, they provide a good starting point for experimentation.)

Architecture Overview

As you may know, Spark Streaming enables the creation of real-time complex event processing architectures, and Kafka is a real-time, fault-tolerant, highly scalable data pipeline (or pub-sub system) for data streams. Spark Streaming can be configured to consume topics from Kafka, and create corresponding Kafka DStreams. Each DStream batches incoming messages into an abstraction called the RDD, which is an immutable collection of incoming messages. Each RDD is a micro-batch of the incoming messages, and the micro-batching window is configurable.

As illustrated below, most of the events we track originate from different types of portals (1). When users visit monitored pages or links, those events are captured and sent to Kafka through an Apache Flume agent (2). A Spark Streaming application then reads that data asynchronously and in parallel (3), with the Kafka events arriving in mini-batches (one-minute window).  The streaming application parses the semi-structured events, and then enriches them with other data from a large (125 million records) Apache Hive table (4). This table is read from Hive via Spark’s HiveContext and cached in memory. It also parses the rest of the event using the DataFrame API to build a structure around it. Once the record is built, it persists the DataFrame as a row in a Hive table. This same information can also be written back to a Kafka topic (5); while the physical layout of the files in the table is Apache Parquet, the rows are served up as a JSON response over HTTP to enterprise services. The enriched events can then become available to other applications or other systems in real time over Kafka.

This data is accessible through a RESTful API or JDBC/ODBC (6). For example, using the Impyla API for Apache Impala (incubating), we can make the data accessible as JSON over HTTP7mdash;a simple but effective low-level data service. Using dashboard creation tools like Looker and Tableau over Impala also helps users query and visualize live event data that has been enriched and processed in real time.

Summary of Optimizations

When we first deployed the entire solution, the Kafka and Flume components were performing well. But the Spark Streaming application was taking nearly 4-8 minutes, depending on resources allocated, to process a single batch. This latency was due to the use of DataFrames to enrich the data from the very large Hive table mentioned previously, and due to various undesirable configuration options.

To optimize our processing time, we started down two paths: First, to cache data and partition it where appropriate, and second to tune the Spark application via configuration changes. (We also packaged this application as a Cloudera Manager service by creating a Custom Service Descriptor and parcel, but that step is out of scope for this post.)

The spark-submit command we use to run the Spark app is shown below. It reflects all the options that, together with coding improvements, resulted in significantly less processing time: from 4-8 minutes to under 25 seconds.

/opt/app/dev/spark-1.5./bin/spark-submit \
--jars \
/opt/cloudera/parcels/CDH/jars/zkclient-0.3.jar,/opt/cloudera/parcels/CDH/jars/kafka_2.-0.8.1.1.jar,/opt/app/dev/jars/datanucleus-core-3.2..jar,/opt/app/dev/jars/datanucleus-api-jdo-3.2..jar,/opt/app/dev/jars/datanucleus-rdbms-3.2..jar \
--files /opt/app/dev/spark-1.5./conf/hive-site.xml,/opt/app/dev/jars/log4j-eir.properties \
--queue spark_service_pool \
--master yarn \
--deploy-mode cluster \
--conf "spark.ui.showConsoleProgress=false" \
--conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=6G -XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
--conf "spark.sql.tungsten.enabled=false" \
--conf "spark.eventLog.dir=hdfs://nameservice1/user/spark/applicationHistory" \
--conf "spark.eventLog.enabled=true" \
--conf "spark.sql.codegen=false" \
--conf "spark.sql.unsafe.enabled=false" \
--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \
--conf "spark.streaming.backpressure.enabled=true" \
--conf "spark.locality.wait=1s" \
--conf "spark.streaming.blockInterval=1500ms" \
--conf "spark.shuffle.consolidateFiles=true" \
--driver-memory 10G \
--executor-memory 8G \
--executor-cores \
--num-executors \
--class com.bigdata.streaming.OurApp \ /opt/app/dev/jars/OurStreamingApplication.jar external_props.conf

Next, we’ll describe the configuration changes and caching approach in detail.

Driver Options

Note that the driver is being run on the cluster and that we are running Spark on YARN. Because Spark Streaming apps are long running, the log files generated can be very large. To solve for this issue, we limited the number of messages written to the logs and used the RollingFileAppender to limit their maximum size. We also disabled console log messages by turning off the spark.ui.showConsoleProgress option.

Also, during testing, our driver frequently ran out of memory due to the permanent generation space filling up. (The permanent space is where the classes, methods, internalized strings, and similar objects used by the VM are stored and never de-allocated.) Increasing the permanent space to 6GB solved the problem:

spark.driver.extraJavaOptions=-XX:MaxPermSize=6G

Garbage Collection

Because our streaming application is a long-running process, after a period of processing time, we noticed long GC pauses that we wanted to either minimize or keep in the background. Adjusting the UseConcMarkSweepGC
parameter seemed to do the trick:

--conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -Dlog4j.configuration=log4j-eir.properties" \

(The G1 collector is being considered for a future release.)

Disabling Tungsten

Tungsten is a major revamp of the Spark execution engine, and, as such, could prove to be problematic in its first release. Disabling Tungsten simplified our Spark SQL DAGs somewhat. We will re-evaluate Tungsten in the future when it is more hardened for version 1.6, especially if it optimizes shuffles.

We disabled the following flags:

spark.sql.tungsten.enabled=false
spark.sql.codegen=false
spark.sql.unsafe.enabled=false

Enabling Backpressure

Spark Streaming has trouble with situations where the batch-processing time is larger than the batch interval. In other words, Spark will not be able to read data from the topic faster than it arrives—the Kafka receiver for the executor won’t be able to keep up. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the receiver’s executor overflows.

Setting the following alleviated this issue:

spark.streaming.backpressure.enabled=true

Adjusting Locality and Blocks

These two options are complementary: One determines how long to wait for the data to be local to a task/executor, and the other is used by the Spark Streaming receivers to chunk data into blocks. The larger the data blocks the better, but if the data is not local to the executors, it will have to move over the network to wherever the task will be executed. We had to find a good balance between these two options because we don’t want the data blocks to be large, nor do we want to wait too long for locality. We want all the tasks in our application to finish within seconds.

Thus, we changed the locality option to 1 second from the default 3 seconds, thereby enabling one of the 20 executors to start after 1 second has passed.  We also changed the block interval to 1.5 seconds.

--conf "spark.locality.wait=1s" \
--conf "spark.streaming.blockInterval=1500ms" \

Consolidating Intermediate Files

Enabling this flag is recommended for ext4 filesystems because it results in fewer intermediate files, thereby improving filesystem performance for shuffles.

--conf "spark.shuffle.consolidateFiles=true" \

Enabling Executor Performance

While configuring a Kafka DStream, you can specify the number of parallel consumer threads. However, the consumers of a DStream will run on the same Spark Driver node. Thus, to do parallel consumption of a Kafka topic from multiple machines, you have to instantiate multiple DStreams. Although one approach would be to union the corresponding RDDs before processing, we found it cleaner to run multiple instances of the application and make them part of the same Kafka consumer group.

To do that, we enabled 20 executors and 20 cores per executor.

--executor-memory 8G
--executor-cores
--num-executors

We provided about 8GB memory to each executor to ensure that cached data remains in memory, and that there is enough room for the heap to shrink and grow. Caching reference datasets in memory helps tremendously when we run heavy DataFrame joins. This approach also speeds up processing of batches: 7,000 of them in 17-20 seconds, according to a recent benchmark.

Caching Approach

Cache the RDD before using it, but remember to remove it from cache to make room for the next batched iteration. Also, caching any data that is used multiple times, beyond the foreach loop, helps a lot. In our case, we cached the 125 million records in our Hive table as a DataFrame, partitioned that data, and used it in multiple joins. That change shaved nearly 4 minutes from total batch-processing time.

However, don’t make the number of partitions too large. Rather, keeping the number of partitions low will reduce the number of tasks and keep scheduling delays to a minimum. It will also ensure that larger chunks of data are processed with minimal delays. To confirm that the number of executors is proportional to the partitions we have, we simply kept the partitions at:

# of executors * # of cores = # of partitions

For instance, (20 * 20) = 400 partitions. Once the RDD is no longer needed in memory, rdd.unpersist() is called to swap it back out to disk.

(Note: The DataFrame API, although very effective, leaves a lot of processing to the underlying Spark system. In testing, we found that using the RDD API reduced processing time and excessive shuffling.)

Conclusion

Thanks to these changes, we now have a Spark Streaming app that is long running, uses resources responsibly, and can process real-time data within a few seconds. It provides:

  • Near real-time access to data + a view of history (batch) = all data
  • The ability to handle organic growth of data
  • The ability to proactively analyze data
  • Access to “continuous” data from many sources
  • The ability to detect an event when it actually occurs
  • The ability to combine events into patterns of behavior in real-time

As next steps, we are now looking at ways to further optimize our joins, simplify our DAGs, and reduce the number of shuffles that can occur.  We are also experimenting with Kafka Direct Streams, which may give us the ability to control data flow and implement redundancy and resilience through check-pointing.

While Cigna’s journey with Kafka and Spark Streaming is only beginning, we are excited about the doors it opens in the realm of real-time data analytics, and our quest to help people live healthier, happier lives.

Mohammad Quraishi (@AtifQ) is a Senior Principal Technologist at Cigna Corporation and has 20 years of experience in application architecture, design, and development. He has specific experience in mobile native applications, SOA platform implementation, web development, distributed applications, object-oriented analysis and design, requirements analysis, data modeling, and database design.

Jeff Shmain is a Senior Solutions Architect at Cloudera.

How Cigna Tuned Its Spark Streaming App for Real-time Processing with Apache Kafka的更多相关文章

  1. Kafka:ZK+Kafka+Spark Streaming集群环境搭建(十三)kafka+spark streaming打包好的程序提交时提示虚拟内存不足(Container is running beyond virtual memory limits. Current usage: 119.5 MB of 1 GB physical memory used; 2.2 GB of 2.1 G)

    异常问题:Container is running beyond virtual memory limits. Current usage: 119.5 MB of 1 GB physical mem ...

  2. Spark Streaming fileStream实现原理

    fileStream是Spark Streaming Basic Source的一种,用于“近实时”地分析HDFS(或者与HDFS API兼容的文件系统)指定目录(假设:dataDirectory)中 ...

  3. 基于Spark Streaming + Canal + Kafka对Mysql增量数据实时进行监测分析

    Spark Streaming可以用于实时流项目的开发,实时流项目的数据源除了可以来源于日志.文件.网络端口等,常常也有这种需求,那就是实时分析处理MySQL中的增量数据.面对这种需求当然我们可以通过 ...

  4. 整合Kafka到Spark Streaming——代码示例和挑战

    作者Michael G. Noll是瑞士的一位工程师和研究员,效力于Verisign,是Verisign实验室的大规模数据分析基础设施(基础Hadoop)的技术主管.本文,Michael详细的演示了如 ...

  5. Spark学习之路(十五)—— Spark Streaming 整合 Flume

    一.简介 Apache Flume是一个分布式,高可用的数据收集系统,可以从不同的数据源收集数据,经过聚合后发送到分布式计算框架或者存储系统中.Spark Straming提供了以下两种方式用于Flu ...

  6. Spark 系列(十五)—— Spark Streaming 整合 Flume

    一.简介 Apache Flume 是一个分布式,高可用的数据收集系统,可以从不同的数据源收集数据,经过聚合后发送到分布式计算框架或者存储系统中.Spark Straming 提供了以下两种方式用于 ...

  7. 4. Spark Streaming解析

    4.1 初始化StreamingContext import org.apache.spark._ import org.apache.spark.streaming._ val conf = new ...

  8. Spark Streaming实践和优化

    发表于:<程序员>杂志2016年2月刊.链接:http://geek.csdn.net/news/detail/54500 作者:徐鑫,董西成 在流式计算领域,Spark Streamin ...

  9. Spark Streaming 整合 Flume

    Spark Streaming 整合 Flume ​ 一.简介二.推送式方法        2.1 配置日志收集Flume        2.2 项目依赖        2.3 Spark Strea ...

随机推荐

  1. Spring Boot2.0:使用Docker部署Spring Boot

    一.Spring Boot项目添加 Docker 支持1.在pom.xml中添加 Docker 构建插件 <plugins> <!-- Docker maven plugin --& ...

  2. log4j2 使用纪要

    简介 Apache Log4j 2 是Log4j 的升级版,在该版本实现中,日志的处理流程及效率有了显著提升. 此外新版本也合入了一些logback日志框架体系的一些改进点. 关键特性 API分离,接 ...

  3. 设计模式总结篇系列:抽象工厂模式(Abstract Factory)

    在上一篇的工厂方法模式中,通过一个公用的类对其他具有相同特性(实现相同接口或继承同一父类)的类的对象进行创建.随之带来的问题在于:当新定义了一个具有相同特性的类时,需要修改工厂类.这与设计模式中的开闭 ...

  4. spring原理案例-基本项目搭建 02 spring jar包详解 spring jar包的用途

    Spring4 Jar包详解 SpringJava Spring AOP: Spring的面向切面编程,提供AOP(面向切面编程)的实现 Spring Aspects: Spring提供的对Aspec ...

  5. .Net Framework 4.5.1 ASP.NET MVC 5 下新建视图报“错误 运行所选代码生成器时出错 无法检索元数据 没有为该对象定义无参数构造函数”

    当在控制器中新建视图的时候,选择的视图界面如下: 执行添加后报如下错误: 错误的内容为: 错误运行所选代码生成器时出错 无法检索"XXX"的元数据没有为该对象定义无参数构造函数 U ...

  6. EF 的 CURD 操作

    EF 的 CURD 操作 这里采用了数据库 Northwind,下载地址:https://northwinddatabase.codeplex.com/ 增 /// <summary> / ...

  7. VisualStudio2017集成GitHub

    1 概述 通过使用VisualStudio2017来编写C语言版本的helloworld程序,然后上传至GitHub 2 VisualStudio2017安装GitHub插件 前提条件: 必须完成Vi ...

  8. c#调用com组件,程序 发生意外<hr=0x80020009>

    引用dll,确认dll没有问题,版本正确,可是一直报发生意外,没有任何其他提示. 解决方案: 看dll引用选项配置 复制到本地:设为true,我的就是false; 嵌入互操作类型:false,如果是t ...

  9. Android项目刮刮奖详解(三)

    Android项目刮刮奖详解(二) 前言 上一期我们已经实现了一个简易的刮刮卡功能,这一期我们来将其完善一下 目标 将刮刮奖的宽高改为合适高度 将刮刮奖位置居中 将信息层的图片换成文字(重点) 实现 ...

  10. C# System.Runtime.Caching使用

    System.Runtime.Caching命名空间是.NET 4.0新增的,目的是将以前的.NET 版本中的System.Web.Caching单独提取出来,独立使用,这样web和其他.NET程序如 ...