Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark (Abstract: Original Text + Annotated Translation)

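Title in Chinese: 结构化流：Apache Spark 中处理实时数据的声明式 API (Structured Streaming: a declarative API for processing real-time data in Apache Spark)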

Abstract

With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2× and Apache Kafka Streams by 90×. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.

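Translation with notes

As real-time data becomes ubiquitous, organizations increasingly need stream-processing systems that scale well, are easy to use, and are easy to integrate into business systems. Structured Streaming is a high-level streaming API built on our experience developing Spark Streaming. It differs from other recent streaming APIs (for example, Google Dataflow) in two main respects.

First, it is a purely declarative API. It is based on automatically incrementalizing a static relational query (typically expressed in SQL or DataFrames). In this respect it differs from APIs that require the user to build a DAG of physical operators.

Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis.

In practice, we found that this integration was often the key challenge. Structured Streaming achieves high performance through Spark SQL's code generation engine. In our tests it outperformed Apache Flink by up to 2× and Apache Kafka Streams by 90× (translator's note: presumably this refers mainly to throughput; see Section 9 of the paper for details). It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution.

We describe the system's design and use through several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.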
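The first difference named in the abstract, automatically incrementalizing a static relational query, can be illustrated with a small plain-Python model. This is only a sketch of the idea and not Spark's implementation or API: the record shape `Event`, the function `static_count_by_key`, and the class `IncrementalCountByKey` are hypothetical names introduced here. The "static" query is a group-by count over a whole table; the incremental version keeps running counts as state and updates them per arriving batch, and the loop checks that both give the same answer after every batch.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape for illustration: (key, value).
Event = Tuple[str, int]


def static_count_by_key(table: Iterable[Event]) -> Dict[str, int]:
    """The static query, roughly: SELECT key, COUNT(*) FROM table GROUP BY key."""
    return dict(Counter(key for key, _ in table))


class IncrementalCountByKey:
    """Incremental form of the same query.

    Keeps the running counts as state and folds in each new batch,
    instead of rescanning all data seen so far.
    """

    def __init__(self) -> None:
        self.state: Counter = Counter()

    def process_batch(self, batch: Iterable[Event]) -> Dict[str, int]:
        self.state.update(key for key, _ in batch)
        return dict(self.state)


# Three small batches standing in for data arriving over time.
batches: List[List[Event]] = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("c", 4), ("a", 5), ("b", 6)],
]

query = IncrementalCountByKey()
seen: List[Event] = []
final_result: Dict[str, int] = {}

for batch in batches:
    seen.extend(batch)
    final_result = query.process_batch(batch)
    # The incremental result must match the static query over all data so far.
    assert final_result == static_count_by_key(seen)

print(final_result)  # {'a': 3, 'b': 2, 'c': 1}
```

In a declarative API the user writes only the static query, and the engine is responsible for deriving something like the incremental class. In the operator-DAG style that the abstract contrasts it with, the user would write the stateful update logic by hand.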
