Reposted from: https://blog.minio.io/stream-processing-with-apache-flink-and-minio-10da85590787

Modern technology trends like Machine Learning, Deep Learning, Artificial Intelligence, and IoT have created the need for a reliable, scalable storage platform that is versatile enough to handle the high-volume data streams these applications generate.

In this post, we’ll introduce Apache Flink, one of the most popular stream processing engines today, and look at what makes it so widely adopted by enterprises across the world. Later, we’ll explore how Minio works with Flink to build a private cloud data pipeline for a variety of use cases.

What is Stream processing?

Stream processing is the analysis of continuous data streams. In this approach, data is treated as a continuous stream that processing engines ingest and analyze, returning a response within a small time frame, typically anywhere from a few milliseconds to minutes.

The acceptable response time generally depends on the use case and how critical a fast answer is. For example, you’d expect IoT sensor data from a nuclear reactor to be processed in a much smaller time frame than data from a user’s website visit.

There are several situations where the streaming approach to data analysis is better suited than batch analysis:

  • With modern technologies (IoT, transaction logs, application logs, activity logs, visit logs) generating continuous data streams, processing that data in a similar, continuous manner is the natural approach.
  • Batch processing takes a large chunk of data and processes it at once, while stream processing handles data as it arrives, spreading the work over time. This allows stream processing to run with fewer compute resources than batch processing.
  • Sometimes the data volume is so large that it is not economically sensible to store it all. Stream processing lets you handle fire-hose style data and retain only the useful bits.
  • Streaming makes it easy to detect patterns, inspect results, and look at data from multiple streams simultaneously, so you get approximate results in a shorter time frame. In contrast, with batch processing you need to process multiple batches and aggregate results across them to get comparable results, which takes longer.

Stream processing use-cases

As discussed above, stream processing is beneficial in situations where a quick (sometimes approximate) answer is what matters. Let us now take a look at common real-world applications of the stream processing approach:

Anomaly detection: Streaming analysis can be applied to continuous streams of data to detect anomalies in near real time. For example, in a stream of financial transaction data, fraudulent transactions can be thought of as anomalies; stream processing can detect these, protecting banks and customers from financial damage.

Business process monitoring: A business process involves several events within a specific domain. For example, in an e-commerce business, all the events from CHECK_OUT_FROM_CART to ITEM_RECEIVED_BY_CUSTOMER may be thought of as one business process, and a critical one at that. Stream processing can be used to monitor such processes for anomalies, like a process not completing within a time frame or items being mishandled by delivery partners.

Rule-based alerting: Stream processing can be used to trigger alerts based on predefined rules: as soon as a criterion is met, alerts can be sent out to different targets.
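
To make the alerting case concrete, below is a minimal sketch of such a job using Flink's DataStream API. The Transaction type, the hard-coded sample data and the 10,000 threshold are illustrative placeholders, not part of the original post; a real job would read from Kafka or another connector and push alerts to a notification system rather than printing them.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuleBasedAlerting {

    // Simple event type used only for this sketch.
    public static class Transaction {
        public String account;
        public double amount;
        public Transaction() {}
        public Transaction(String account, double amount) {
            this.account = account;
            this.amount = amount;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; in practice this stream would come from a connector such as Kafka.
        DataStream<Transaction> transactions = env.fromElements(
                new Transaction("acc-1", 120.0),
                new Transaction("acc-2", 25000.0));

        // Rule: flag any single transaction above a fixed threshold.
        DataStream<String> alerts = transactions
                .filter(t -> t.amount > 10_000)
                .map(t -> "ALERT: account " + t.account + " moved " + t.amount);

        // A production job would send these to an alerting target instead of printing.
        alerts.print();

        env.execute("Rule based alerting (sketch)");
    }
}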

Read more about stream processing use cases on the Apache Flink website.


Apache Flink

Apache Flink is a distributed processing engine for stateful computations over data streams. Flink excels at processing unbounded and bounded data sets.

Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.

While Apache Spark is well known for offering stream processing as one of its features, stream processing was an afterthought in Spark: under the hood, Spark uses micro-batches to emulate stream processing.

Apache Flink, on the other hand, has been designed from the ground up as a stream processing engine. This means Flink:

  • Does better memory management and avoids occasional spikes in memory usage.
  • Achieves faster speeds by allowing iterative processing to take place on the same node rather than having the cluster run each iteration independently.

Minio with Apache Flink

Apache Flink works with three different data targets in its typical processing flow: the data source, the sink, and the checkpoint target. While data source and sink are fairly obvious, the checkpoint target is used to persist state at certain intervals during processing, to guard against data loss and to recover consistently from node failures.

With AWS S3 API support being a first-class citizen in Apache Flink, all three data targets can be configured to work with any AWS S3 API compatible object store, including, of course, Minio.

Minio can be configured with Flink in four broad ways; let’s take a look at each of them below:

1. Minio event notifications: Minio bucket event notifications can be sent to Flink via Kafka as event streams. Such event data is beneficial in cases where object access logs matter to the business, for example to understand user behaviour trends or data access trends.
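
As a rough sketch (not from the original post), reading those Minio bucket notifications from Kafka in Flink could look like the following. The broker address, consumer group id and topic name minio-events are placeholders for whatever you configure as the notification target on the Minio side, and the Kafka connector class may differ depending on your Flink and Kafka versions.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class MinioEventStream {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder Kafka settings; point these at the broker Minio publishes notifications to.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "minio-event-consumers");

        // Each record is the raw JSON bucket notification emitted by Minio.
        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("minio-events", new SimpleStringSchema(), props));

        // Downstream operators would parse the JSON to extract bucket, object key, operation, etc.
        events.print();

        env.execute("Minio bucket notifications via Kafka (sketch)");
    }
}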
 

2. Minio object data: The response of a Minio S3 Select query is itself a stream of data, which can be fed directly to Flink for further analysis and processing.

 

3. Minio as the checkpoint target for Flink: Flink supports checkpointing to ensure it can recover from node failures and resume right where it left off. Flink can be configured to store these checkpoints on a Minio server.
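
For illustration, a job can also point its checkpoints at Minio programmatically. The sketch below uses the classic FsStateBackend API with a placeholder bucket name, and assumes the s3:// scheme is backed by the endpoint and credential settings shown later in this post; newer Flink releases expose the same idea through state.checkpoints.dir and checkpoint storage settings.

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToMinio {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 10 seconds (interval chosen arbitrarily for the example).
        env.enableCheckpointing(10_000);

        // Persist checkpoint data to a Minio bucket; the bucket/path is a placeholder.
        env.setStateBackend(new FsStateBackend("s3://flink-checkpoints/example-job"));

        // A trivial pipeline so the sketch runs end to end.
        env.fromElements("a", "b", "c").print();

        env.execute("Checkpointing to Minio (sketch)");
    }
}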

 

4. Minio as the sink for Flink: Since Flink can write output data to S3 targets, Minio can be used as the sink for data processed by Flink.
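
As a simple illustration (placeholder bucket and path, not from the original post), a stream's results can be written straight to a Minio bucket via the s3:// scheme. For production, exactly-once file output, Flink's StreamingFileSink/FileSink is a more robust choice than the basic writeAsText used here.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WriteToMinio {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; in practice this would be the output of real transformations.
        DataStream<String> results = env.fromElements("alpha", "beta", "gamma");

        // Write results to a Minio bucket; endpoint and credentials come from flink-conf.yaml.
        results.writeAsText("s3://testbucket/output");

        env.execute("Writing results to Minio (sketch)");
    }
}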

 

Why it is a good idea to use Minio with Flink:

  • A remote object storage target like Minio decouples state from Flink’s compute nodes. This means the compute layer becomes effectively stateless, i.e. free to grow and shrink as needed (saving cost), while state is safely stored on Minio.
  • Minio’s performance (up to 10 GBps per node) ensures that even though state is decoupled, it is readily available and adds negligible latency to Flink processing.
  • With configurable erasure coding, a scalable design, and server-side encryption, Minio ensures safe, scalable, and reliable storage of data in a cost-efficient manner.
  • Native AWS S3 API support in Flink means out-of-the-box integration with Minio, reducing configuration and maintenance costs.

Configure Minio with Flink

Let us now take a look at how to configure Apache Flink with Minio as the remote storage backend. In this example, we’ll use Minio as both the source and sink.

To start with, you’ll need a Minio server deployed; refer to this document for details. Next, download the Flink binary as explained in the quick start document.

Then update $FLINK_DIR/conf/flink-conf.yaml and add the entries below:

state.backend: filesystem
s3.endpoint: http://127.0.0.1:9000
s3.path-style: true
s3.access-key: minio
s3.secret-key: minio123

$FLINK_DIR here is the directory where you untarred the Flink tarball. Also, don’t forget to update the s3.* fields to match your actual Minio server deployment.

Now, start Flink. The setup is ready to use Minio as the default storage system. To test this, I used the WordCount example from the Flink documentation:

./bin/flink run examples/batch/WordCount.jar --input s3://input/test.txt --output s3://testbucket/output

Here test.txt is a sample text file uploaded to the input bucket on Minio (use any file with lots of text data). Once the job finishes, you can see the word counts in the output object under testbucket.

Conclusion

In this post we learnt about stream processing and how it can help enterprises speed up their data processing. We learnt why stream processing is gaining popularity and saw some of its popular use cases. Finally, we saw how Minio combined with Flink can help create a private cloud based streaming data infrastructure.

As streaming becomes one of the most popular ways to consume and process events, we hope this post helped you understand why Flink is well suited to this approach and why it makes sense to use Minio as the storage engine for such a streaming data infrastructure.

 
 
 
 
