The Big Data Team at Tencent

   

In recent years, the increasing need for timeliness, together with advances in software and hardware technologies, drive the emergence of real-time stream processing. Real-time stream processing allows public and private organizations to monitor collected information, make rapid decisions, tweak production processes, and ultimately gain competitive advantages. To satisfy the need for accessible real-time stream processing, we built Oceanus, a one-stop platform for real-time stream processing. Oceanus deploys Apache FlinkⓇ as its execution engine, hence realizing the benefits brought by Flink. Additionally, Oceanus provides efficient management for all stages in a real-time application’s lifecycle, namely development, testing, deployment, and operating. These functionalities significantly improve the efficiency of real-time applications in production.

Introduction

Tencent users generate a lot of data every day, which is a huge asset for us. In return, Tencent leverages the data every day in both strategic and operational decisions to better serve users. Recent efforts in Big Data allow us to process data at large scale, but that is far from satisfactory. As the value of data vanishes over time, it’s also critical to obtain timely results from the evolving world.

The Big Data team (http://data.qq.com) is responsible for building an efficient and reliable infrastructure for data collection, analytics, and serving at Tencent. Every day we process 17 trillion messages, whose size is approximately 3PB at disk after compression. In extreme cases, the peak throughput reaches 210 million per second. With such a huge  scale and a large number of serving business units the team faces some great challenges that can be summarized below:

  • With diverse application scenarios in different business units, the requirements for real-time applications vary significantly.

  • Given the large amount of data and the number of jobs, we have to allocate computing resources efficiently and reasonably;

  • We have to provide high-throughput and low-latency computation while tolerating failures in production.

In order to meet the above challenges and help our customers gain timely insight in a fast-evolving manner, we built a one-stop platform for real-time stream processing called Oceanus. Oceanus deploys Apache FlinkⓇ  as its execution engine and provides efficient management for the life cycle of Flink jobs.

We chose Flink as the execution engine because of its excellent performance and powerful programming interfaces. Prior to that, our real-time applications were running on a real-time computing platform based on Apache Storm. While working with Storm we were struggling with some drawbacks such as:

  • Apache Storm’s API is very low-level and requires a lot of effort to develop real-time applications. Users must have knowledge of the computing framework to ensure the correctness of their programs.

  • Due to lack of support for windows, users have to deal with out-of-order records by themselves, which is a tedious and error-prone task.

  • There is no built-in support for state. Users have to manually take care of the storage of state and its distribution under distributed settings. Very often, users store their state in external distributed storage, e.g. HBase and MySQL. The performance is naturally degraded by remote accesses in these cases.

  • It’s difficult to achieve EXACTLY-ONCE message transmission and computation in cases of failure.

Currently, we have migrated most of our Storm jobs to Flink. The number of real-time computations performed at Oceanus reaches 20 trillion times per day!

Introduction to Oceanus

Oceanus is built to improve the efficiency of the development and operations of real-time applications, e.g., online ML, ETL, and real-time BI. The architecture of Oceanus is illustrated in the following figure.

In typical cases, users read events from external storage like Tube, MySQL, and HBase, process these events with Oceanus and finally write results back to external storages. Oceanus deploys Apache FlinkⓇ  as its execution engine and facilitates the development of Flink jobs with flexible programming interfaces. To achieve good performance with high resource utility, Oceanus executes all Flink jobs on Gaia, a resource management system built on top of Yarn. With efficient logging, metrics and visualization tools, users can effectively operate and monitor their jobs with Oceanus.

Flexible programming interfaces

Oceanus provides a variety of methods to develop real-time applications, namely Canvas, SQL and JAR.

Most users can easily develop their real-time applications with canvas. Oceanus provisions a set of common operators. Users can develop their applications by dragging and connecting these operators. With Canvas, users can focus on their application logic, without the need to understand any underlying implementation details.

Users familiar with SQL can write their applications with Flink SQL. Because Flink SQL follows SQL standards, users can easily adapt their applications, originally written for batch processing, to data streams. To further improve the efficiency of developing SQL programs, Oceanus provides a set of methods such as:

  • Syntax highlight and automatic completion

  • Fuzzy matching of the names of tables, fields and functions.

  • One-click code formatting

  • One-click code verification

Given the limited expressive power of Canvas and SQL, Oceanus allows users to develop their applications with the imperative DataStream interfaces. Users can use DataStream to deal with complex logic and perform aggressive optimizations. They can then run their DataStream programs on Oceanus by simply uploading the JARs of their applications.

Visualization of computing results

Oceanus provides two methods to visualize the computing results. Firstly, Oceanus samples computing results and demonstrates them in web pages. This method is very suitable for users who are testing their programs. They can validate the correctness by simply comparing these results with the expected ones.

Oceanus also allows the visualization of computing results with Xiaoma Dashboard (http://xiaoma.qq.com) , a visualization tool developed by Tencent. Users can easily build their dashboards to make the results understandable.

Quick verification of programs

With the visualization of computing results, users can quickly verify their programs by comparing the expected with the actual results. Users can populate the testing data by randomly generated data by Oceanus or by uploading their own data. To achieve more realistic results, they can also test their programs with the data sampled in production.

Easy deployment of applications

Oceanus frees users from the burdens of resource allocation and application deployment. Oceanus employs Gaia, a resource management system built on top of Yarn, to manage resources and schedule jobs. Users can configure their jobs in Oceanus and submit them with one click. Oceanus will then calculate the required resources and deploy the jobs in production with Gaia. With the checkpointing mechanism provided by Flink and the scheduling mechanism provisioned by Gaia, users can change the parallelism of jobs dynamically.

Rich runtime metrics

Oceanus collects a lot of runtime metrics of Flink jobs and writes collected metrics into Tube, a message queue service at Tencent. These metrics are then aggregated and displayed on web pages. With these metrics, users can better operate their applications and easily find the causes of failures or exceptions.

Improvements to Flink

The Big Data team at Tencent makes a lot of efforts to enhance the functionality and reliability of Flink, some of which are described below:

  • Provide more than 30 functions for Table API and SQL

  • Allow the handling of time-out events in AsyncIO operators

  • Allow flexible processing of late records in window operators

  • Enhance the reliability with the reconciling of Job Masters

The Tencent team is very active in the Flink community, having contributed approximately 100 pull requests with this number definitely set to increase.

In the near future, we will continue our work on Flink to enhance its functionality and reliability. We will also provide more functionality to Oceanus to facilitate the development of real-time applications, by building an effective framework for online learning on top of Oceanus as an example.

腾讯大数据平台Oceanus: A one-stop platform for real time stream processing powered by Apache Flink的更多相关文章

  1. 从0到N建立高性价比的大数据平台(转载)

    2016-07-29 14:13:23 钱曙光 阅读数 794 原文链接:https://blog.csdn.net/qiansg123/article/details/80124521 声明:本文为 ...

  2. 从 Hadoop 到云原生, 大数据平台如何做存算分离

    Hadoop 的诞生改变了企业对数据的存储.处理和分析的过程,加速了大数据的发展,受到广泛的应用,给整个行业带来了变革意义的改变:随着云计算时代的到来, 存算分离的架构受到青睐,企业开开始对 Hado ...

  3. 大数据平台R语言web UI应用架构 设计与开发

    1. 系统拓扑图 在日常业务分析中,R是非常常用的分析工具,而当数据量较大时,用R语言需要需用更多的时间来完成训练模型,spark作为大规模数据处理框架,采用内存计算,可以短时间内完成大量的数据的处理 ...

  4. 朝花夕拾之--大数据平台CDH集群离线搭建

    body { border: 1px solid #ddd; outline: 1300px solid #fff; margin: 16px auto; } body .markdown-body ...

  5. 大数据平台搭建(hadoop+spark)

    大数据平台搭建(hadoop+spark) 一.基本信息 1. 服务器基本信息 主机名 ip地址 安装服务 spark-master 172.16.200.81 jdk.hadoop.spark.sc ...

  6. 基于Hadoop的大数据平台实施记——整体架构设计[转]

    http://blog.csdn.net/jacktan/article/details/9200979 大数据的热度在持续的升温,继云计算之后大数据成为又一大众所追捧的新星.我们暂不去讨论大数据到底 ...

  7. 基于Hadoop的大数据平台实施记——整体架构设计

    大数据的热度在持续的升温,继云计算之后大数据成为又一大众所追捧的新星.我们暂不去讨论大数据到底是否适用于您的组织,至少在互联网上已经被吹嘘成无所不能的超级战舰.好像一夜之间我们就从互联网时代跳跃进了大 ...

  8. Spark大型项目实战:电商用户行为分析大数据平台

    本项目主要讲解了一套应用于互联网电商企业中,使用Java.Spark等技术开发的大数据统计分析平台,对电商网站的各种用户行为(访问行为.页面跳转行为.购物行为.广告点击行为等)进行复杂的分析.用统计分 ...

  9. 部署开启了Kerberos身份验证的大数据平台集群外客户端

    转载请注明出处 :http://www.cnblogs.com/xiaodf/ 本文档主要用于说明,如何在集群外节点上,部署大数据平台的客户端,此大数据平台已经开启了Kerberos身份验证.通过客户 ...

随机推荐

  1. 带着萌新看springboot源码8(spring ioc源码上)

    emmm.....这次先不说springboot原理,先好好回顾一下以前的注解版spring原理,先把spring原理了解清晰了,再看springboot原理更容易. 要说起spring,最重要的就是 ...

  2. 磊哥测评之数据库SaaS篇:腾讯云控制台、DMC和小程序

    本文由云+社区发表 作者:腾讯云数据库 随着云计算和数据库技术的发展,数据库正在变得越来越强大.数据库的性能如处理速度.对高并发的支持在节节攀升,同时分布式.实时的数据分析.兼容主流数据库等强大的性能 ...

  3. [六]JavaIO之 ByteArrayInputStream与ByteArrayOutputStream

      功能简介   ByteArrayInputStream 和 ByteArrayOutputStream 提供了针对于字符数组 byte [] 的标准的IO操作方式     ByteArrayInp ...

  4. 痞子衡嵌入式:飞思卡尔i.MX RT系列MCU启动那些事(6)- Bootable image格式与加载(elftosb/.bd)

    大家好,我是痞子衡,是正经搞技术的痞子.今天痞子衡给大家介绍的是飞思卡尔i.MX RT系列MCU的Bootable image格式与加载过程. 在i.MXRT启动系列第三篇文章 Serial Down ...

  5. xamarin.forms之实现ListView列表倒计时

    做商城类APP时经常会遇到抢购倒计时的功能,之前做小区宝iOS的时候也有类似的功能,想着参考iOS做的思路,自定义一个Cell,在Cell中每秒刷新一下控件的文本值,但使用xamarin.forms实 ...

  6. 第48章 UserInfo端点(UserInfo Endpoint) - Identity Server 4 中文文档(v1.0.0)

    UserInfo端点可用于检索有关用户的身份信息(请参阅规范). 调用者需要发送代表用户的有效访问令牌.根据授予的范围,UserInfo端点将返回映射的声明(至少需要openid作用域). 示例 GE ...

  7. C#反射与特性使用简介

    本文是学习特性与反射的学习笔记,在介绍完特性和反射之后,会使用特性与反射实现一个简单的将DataTable转换为List的功能,水平有限,如有错误,还请大神不吝赐教. 1.      反射:什么是反射 ...

  8. [MySQL] 联合索引与using index condition

    1.测试联合索引的最左原则的时候, 发现了5.6版本后的新特性Index Condition Pushdown 2.含义就是存储引擎层根据索引尽可能的过滤数据,然后在返回给服务器层根据where其他条 ...

  9. 配置多个 git 账号的 ssh密钥

    背景 在工作中,我们通常会以 ssh 的方式配置公司的 git 账号,但是平时也会使用 github 管理自己的项目.因此,我们需要为自己的 github 创建一个新的 git 账号,这就需要生成新的 ...

  10. 用php输出心形曲线

    <?php for($t=0;$t<360;$t++) { $y=2*cos($t)-cos(2*$t); //笛卡尔心形曲线函数 $x=2*sin($t)-sin(2*$t); $x+= ...