This is a guest post from Xiaowei Jiang, Senior Director of Alibaba’s search infrastructure team. The post is adapted from Alibaba’s presentation at Flink Forward 2016, and you can see the original talk from the conference here.

Alibaba is the largest e-commerce retailer in the world. Our annual sales in 2015 totalled $394 billion--more than eBay and Amazon combined. Alibaba Search, our personalized search and recommendation platform, is a critical entry point for our customers and is responsible for much of our online revenue, and so the search infrastructure team is constantly exploring ways to improve the product. What makes for a great search engine on an e-commerce site? Results that, in real-time, are as relevant and accurate as possible for each user.

At Alibaba’s scale, this is a non-trivial problem, and it’s difficult to find technologies that are capable of handling our use cases. Apache Flink® is one such technology, and Alibaba is using Blink, a system based on Flink, to power critical aspects of its search infrastructure and to deliver relevance and accuracy to end users.

In this post, I’ll walk through Flink’s role in Alibaba search and outline the reasons we chose to work with Flink on the search infrastructure team. I’ll also discuss how we adapted Flink to meet our unique requirements with Blink and how we are working together with data Artisans and the Flink community to contribute these changes back to Flink. We are actively transitioning our system from Blink to vanilla Apache Flink once we have successfully merged our modifications into the open source project. 

Part 1: Flink in Alibaba Search

Document Creation

The first step in providing users with a world-class search engine is building the documents that will be available for search. In Alibaba’s case, the document is made up of millions of product listings and related product data. Search document creation is a challenge because product data is stored in many different places, and it’s up to the search infrastructure team to bring together all relevant information to create a complete search document. Generally speaking, this is a 3-stage process:

  1. Synchronize all product data from disparate sources (e.g. MySQL, distributed file systems) into one HBase cluster.
  2. Join data from different tables together using business logic to create a final, searchable document. This is an HBase table that we call our ‘Result’ table.
  3. Export this HBase table as a file or as a set of updates.

All 3 of these stages actually run on 2 different pipelines in a classical ‘lambda architecture’: a full-build pipeline and an incremental build pipeline.

  • In the full-build pipeline, we process all data sources, and this is traditionally a batch job.
  • In the incremental pipeline, we process updates that occur after the batch job is finished. For instance, sellers can modify price or description or inventory availability might change. This information must be reflected in search results as quickly as possible. The incremental-build pipeline is traditionally a streaming job.

Real-time A/B testing of search algorithms

Our engineers test different search algorithms on a regular basis and need to be able to evaluate performance as quickly as possible. Right now, this evaluation happens once a day, but we’d like to do the analysis in real-time, and so we used Blink to build a real-time A/B testing framework. Online logs (impressions, clicks, transactions) are collected and processed by a parser and filter then later joined together using some business logic. Next, the data is aggregated, and the aggregated result is pushed to Druid; inside Druid, it’s possible to write a query to perform complex OLAP analysis on the data and see how well different algorithms are performing.

Online machine learning

There are a couple of applications here, and first, we’ll discuss real-time feature updates. Some of the features used in Alibaba’s search ranking are product CTR, product inventory, and total number of clicks. These data change over time, and if we can use the most recent data available, we’ll be able to offer a more relevant search ranking to our users. Our Flink pipeline provides us with online feature updates and has given a significant boost on conversion rate.

Second, there are specific days of the year (such as Singles Day) where products are heavily discounted--sometimes up to 50%--and therefore, user behavior changes dramatically. Transaction volume is huge, often many times higher than what we see in a normal day. Our previously-trained models are useless in this scenario, and so we use our logs and a Flink streaming job to power online machine learning, building models that take into account the real-time data. The result is a much higher conversion rate on these uncommon, but very important, sale days.

Part 2: Choosing a framework to solve the problem

When we chose Flink to power our search infrastructure, our evaluation included the following four categories. Flink met our requirements in all four.

  • Agility: It was our goal to be able to maintain one codebase for our entire (2-pipeline) search infrastructure process. And we wanted an API that was high-level enough for us to express our business logic.
  • Consistency: Changes to the seller or product databases must be reflected in final search results, and so the search infrastructure team requires at-least-once semantics (and for some other Flink use cases in the company, we have exactly-once requirements).
  • Low latency: When inventory availability changes, this must be reflected in search results very quickly; for example, we don’t want to give a high search ranking to a sold-out product.
  • Cost: Alibaba processes lots of data, and at our scale, an efficiency improvement results in significant cost savings. We need a framework that handles high-throughput efficiently.

More broadly speaking, there are 2 ways to think about unified batch and stream processing. The first approach is to use batch as a starting point then try to build streaming on top of batch. This likely won’t meet strict latency requirements, though, because micro-batching to simulate streaming requires some fixed overhead--hence the proportion of the overhead increases as you try to reduce latency. At our scale, 1000s of tasks would need to be scheduled for each microbatch, the connection re-established, and state reloaded. So at some point, the micro-batch approach becomes too costly to make sense.

Flink, on the other hand, uses streaming as a fundamental starting point and builds a batch solution on top of streaming, where a batch is basically a special case of a stream. With this approach, we don’t lose the benefit of our optimizations in batch mode--when a stream is finite, you can still do whatever optimization you’d like to do for batch processing.

Part 3: What is Blink?

Blink is a forked version of Flink that we have been maintaining to fit some of the unique requirements we have at Alibaba. At this point, Blink is running on a few different clusters, and each cluster has about 1000 machines, so large-scale performance is very important to us. Blink’s improvements generally cover two areas:

  • Making the Table API more complete so that we can have the same SQL for batch and streaming
  • A more robust YARN mode that’s still 100% compatible with Flink’s API and broader ecosystem

Table API

We first added support for user-defined functions to make it easy to bring our unique business logic into Flink. We also added a stream-to-stream join, which is a non-trivial task but relatively straightforward in Flink due to Flink’s first-class support for state. Next, we added a few different aggregations, the most interesting one probably being distinct_count, as well as windowing support. (Editor’s note: FLIP-11 covers a range of Table API and SQL improvements for Flink related to the features listed above and is recommended reading for anyone interested in the topic.)Next, we’ll cover runtime improvements, which we can break into four separate categories.

Blink on Yarn

When we started our project, Flink supported 2 cluster modes: standalone mode and Flink on YARN. In YARN mode, a job couldn’t request and release resources dynamically and instead needed to grab all required resources up front. And different jobs might share the same JVM process, which favored resource utilization over resource isolation. Blink includes an architecture where every job has its own JobMaster to request and release resources as the job requires. And different jobs can’t run in the same Java process, which yields the best isolation between jobs and tasks. The Alibaba team is currently working with the Flink community to contribute this work back to the open source, and the improvements are captured in FLIP-6 (which extends to other cluster managers in addition to YARN).

Operator Rescale

In production, our clients might need to change the parallelism of operators, but at the same time, they don’t want to lose state. When we started working on Blink, Flink did not support changing the parallelism of operators while maintaining state. Blink introduced the concept of “buckets” as the unit of state management. There are many more buckets than tasks, and each task will be assigned multiple buckets. When the parallelism changes, we’ll reassign buckets to tasks. Using this method, it’s possible to change the parallelism of operators and maintain state.

(Editor’s note: the Flink community has concurrently solved this issue for Flink 1.2 - the feature is available in the latest version of the master branch. Flink’s notion of “key groups” is largely equivalent with “buckets” mentioned above, but the implementation differs slightly in how the data structures back these buckets. For more information, check out FLINK-3755 in Jira.)

Incremental Checkpointing

In Flink, checkpointing happens in 2 stages: taking a snapshot of state locally, then persisting the snapshot of state to HDFS (or another storage system), and the entire snapshot of state is stored in HDFS with each snapshot. Our state was too large for this approach to be acceptable, and so Blink only stores the modified state in HDFS, and we’ve been able to improve checkpointing efficiency greatly. This modification enabled us to use large state in production.

Asynchronous I/O

The production bottleneck for many of our jobs is accessing external storage like HBase. To solve this problem, we introduced Asynchronous I/O, which we’ll be working to contribute to the community and is described in detail in FLIP-12. (Editor’s note: data Artisans thinks that FLIP-12 is substantial enough to have its own, separate writeup at some point in the near future. So we’ll only briefly introduce the idea here, and for the time being, you should check out the FLIP writeup if you’d like to learn more. At the time of publishing, the code has already been contributed to Flink.)

Part 4: What’s next for Flink at Alibaba?

We’ll continue to optimize our streaming jobs, specifically, better handling of temporary skew and slow machines without negating the positive aspects of backpressure and faster recovery from failure. As was discussed by a number of different speakers at Flink Forward, we believe that Flink has great potential as a batch processor as well as a stream processor. We’re working to fully leverage Flink’s batch processing capabilities and hope to have a Flink batch mode in production in a couple months.

Another popular conversation topic from the conference is streaming SQL, and we’re continuing to add further SQL support and Table API support in Flink. And Alibaba’s business continues to grow, meaning that our jobs get larger and larger--it becomes increasingly important to make sure we can scale to even larger clusters. Very importantly, we look forward to continued collaboration with the community in order to contribute our work back to the open source so that all Flink users can benefit from the work we’ve put into Blink. We look forward to updating you on our progress at Flink Forward 2017.

Blink: How Alibaba Uses Apache Flink的更多相关文章

  1. 修改代码150万行!与 Blink 合并后的 Apache Flink 1.9.0 究竟有哪些重大变更?

    8月22日,Apache Flink 1.9.0 正式发布,早在今年1月,阿里便宣布将内部过去几年打磨的大数据处理引擎Blink进行开源并向 Apache Flink 贡献代码.当前 Flink 1. ...

  2. Apache Flink 1.9.0版本新功能介绍

    摘要:Apache Flink是一个面向分布式数据流处理和批量数据处理的开源计算平台,它能够基于同一个Flink运行时,提供支持流处理和批处理两种类型应用的功能.目前,Apache Flink 1.9 ...

  3. 终于等到你!阿里正式向 Apache Flink 贡献 Blink 源码

    摘要: 如同我们去年12月在 Flink Forward China 峰会所约,阿里巴巴内部 Flink 版本 Blink 将于 2019 年 1 月底正式开源.今天,我们终于等到了这一刻. 阿里妹导 ...

  4. Apache Flink 1.9重磅发布!首次合并阿里内部版本Blink重要功能

    8月22日,Apache Flink 1.9.0 版本正式发布,这也是阿里内部版本 Blink 合并入 Flink 后的首次版本发布.此次版本更新带来的重大功能包括批处理作业的批式恢复,以及 Tabl ...

  5. Apache Flink系列(1)-概述

    一.设计思想及介绍 基本思想:“一切数据都是流,批是流的特例” 1.Micro Batching 模式 在Micro-Batching模式的架构实现上就有一个自然流数据流入系统进行攒批的过程,这在一定 ...

  6. Apache Flink 漫谈系列 - JOIN 算子

    聊什么 在<Apache Flink 漫谈系列 - SQL概览>中我们介绍了JOIN算子的语义和基本的使用方式,介绍过程中大家发现Apache Flink在语法语义上是遵循ANSI-SQL ...

  7. Apache Flink 1.5.0 Release Announcement

    Apache Flink: Apache Flink 1.5.0 Release Announcement https://flink.apache.org/news/2018/05/25/relea ...

  8. Apache Flink 开发环境搭建和应用的配置、部署及运行

    https://mp.weixin.qq.com/s/noD2Jv6m-somEMtjWTJh3w 本文是根据 Apache Flink 系列直播课程整理而成,由阿里巴巴高级开发工程师沙晟阳分享,主要 ...

  9. 如何在 Apache Flink 中使用 Python API?

    本文根据 Apache Flink 系列直播课程整理而成,由 Apache Flink PMC,阿里巴巴高级技术专家 孙金城 分享.重点为大家介绍 Flink Python API 的现状及未来规划, ...

随机推荐

  1. asp.net core 系列 12 选项 TOptions

    一.概述 本章讲的选项模式是对Configuration配置的功能扩展. 讲这篇时有个专用名词叫“选项类(TOptions)” .该选项类作用是指:把选项类中的属性与配置来源中的键关联起来.举个例,假 ...

  2. TypeError: unorderable types: str() >= int()

    1.问题描述 age=input('please enter your age') if age >=18: print('your age is',age) print('adult') el ...

  3. 从jvm角度看懂类初始化、方法重写、重载。

    类初始化 在讲类的初始化之前,我们先来大概了解一下类的声明周期.如下图 类的声明周期可以分为7个阶段,但今天我们只讲初始化阶段.我们我觉得出来使用和卸载阶段外,初始化阶段是最贴近我们平时学的,也是笔试 ...

  4. Chapter 4 Invitations——12

    "I don't know what you mean," I said, my voice guarded. “我不知道你什么意思”我声音谨慎地说道. "It's be ...

  5. iOS逆向开发(8):微信自动添加好友

    这一次,小程演示怎么让一个APP自动地运行,从而代替手工的操作.同样以"微信"以例,实现在一个微信群里面,对所有的成员,自动地一个一个地发出添加好友的请求. 知识点还是之前介绍的东 ...

  6. Java开发知识之Java的正则表达式

    目录 正则表达式 一丶什么是正则表达式 1.正则表达式简介 2.无正则表达式判断代码 3.使用正则表达式代码. 二丶正则表达式API 三丶正则表达式语法格式 1.正则表达式语法 正则表达式 一丶什么是 ...

  7. 使用MaxCompute Java SDK运行安全相关命令

    使用MaxCompute Console的同学,可能都使用过MaxCompute安全相关的命令.官方文档上有详细的MaxCompute安全指南,并给出了安全相关语句汇总.   简而言之,权限管理.列级 ...

  8. 从0到1,了解NLP中的文本相似度

    本文由云+社区发表 作者:netkiddy 导语 AI在2018年应该是互联网界最火的名词,没有之一.时间来到了9102年,也是项目相关,涉及到了一些AI写作相关的功能,为客户生成一些素材文章.但是, ...

  9. kubernetes进阶之五:Replication Controller&Replica Sets&Deployments

    一:Replication Controller RC是kubernetes的核心概念之一.它定义了一个期望的场景即声明某种Pod的副本数量在任意时候都要符合某个预期值. 它由以下几个部分组成: 1. ...

  10. .Net语言 APP开发平台——Smobiler学习日志:SmoOne新增考勤功能

    大家好!SmoOne这次新增了考勤功能,大家打开SmoOne应用便可体验,无需重新下载更新.如果没有下载SmoOne客户端,可以在apps.smobiler.com进行下载安装. 另外,SmoOne开 ...