应用一:kafka数据同步到kudu

1 准备kafka topic

# bin/kafka-topics.sh --zookeeper $zk:2181/kafka -create --topic test_sync --partitions 2 --replication-factor 2
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "test_sync".
# bin/kafka-topics.sh --zookeeper $zk:2181/kafka -describe --topic test_sync
Topic:test_sync PartitionCount:2 ReplicationFactor:2 Configs:
Topic: test_sync Partition: 0 Leader: 112 Replicas: 112,111 Isr: 112,111
Topic: test_sync Partition: 1 Leader: 110 Replicas: 110,112 Isr: 110,112

2 准备kudu表

impala-shell

CREATE TABLE test.test_sync (
id int,
name string,
description string,
create_time timestamp,
update_time timestamp,
primary key (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='$kudu_master:7051');

3 准备flume kudu支持

3.1 下载jar

# wget https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/kudu/kudu-flume-sink/1.7.0-cdh5.16.1/kudu-flume-sink-1.7.0-cdh5.16.1.jar
# mv kudu-flume-sink-1.7.0-cdh5.16.1.jar $FLUME_HOME/lib/ # wget http://central.maven.org/maven2/org/json/json/20160810/json-20160810.jar
# mv json-20160810.jar $FLUME_HOME/lib/

3.2 开发

代码库:https://github.com/apache/kudu/tree/master/java/kudu-flume-sink

kudu-flume-sink默认使用的producer是

org.apache.kudu.flume.sink.SimpleKuduOperationsProducer

  public List<Operation> getOperations(Event event) throws FlumeException {
try {
Insert insert = table.newInsert();
PartialRow row = insert.getRow();
row.addBinary(payloadColumn, event.getBody()); return Collections.singletonList((Operation) insert);
} catch (Exception e) {
throw new FlumeException("Failed to create Kudu Insert object", e);
}
}

是将消息直接存放到一个payload列中

如果想要支持json格式数据,需要二次开发

package com.cloudera.kudu;
public class JsonKuduOperationsProducer implements KuduOperationsProducer {

代码详见:https://www.cnblogs.com/barneywill/p/10573221.html

打包放到$FLUME_HOME/lib下

4 准备flume conf

a1.sources = r1
a1.sinks = k1
a1.channels = c1 # Describe/configure the source a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.r1.batchSize = 5000
a1.sources.r1.batchDurationMillis = 2000
a1.sources.r1.kafka.bootstrap.servers = 192.168.0.1:9092
a1.sources.r1.kafka.topics = test_sync
a1.sources.r1.kafka.consumer.group.id = flume-consumer # Describe the sink
a1.sinks.k1.type = logger # Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000 # Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1 a1.sinks.k1.type = org.apache.kudu.flume.sink.KuduSink
a1.sinks.k1.producer = com.cloudera.kudu.JsonKuduOperationsProducer
a1.sinks.k1.masterAddresses = 192.168.0.1:7051
a1.sinks.k1.tableName = impala::test.test_sync
a1.sinks.k1.batchSize = 50

5 启动flume

bin/flume-ng agent --conf conf --conf-file conf/order.properties --name a1

6 kudu确认

impala-shell

select * from test_sync limit 10;

参考:https://kudu.apache.org/2016/08/31/intro-flume-kudu-sink.html

【原创】大数据基础之Flume(2)应用之kafka-kudu的更多相关文章

  1. 【原创】大数据基础之Flume(2)kudu sink

    kudu中的flume sink代码路径: https://github.com/apache/kudu/tree/master/java/kudu-flume-sink kudu-flume-sin ...

  2. 【原创】大数据基础之Flume(2)Sink代码解析

    flume sink核心类结构 1 核心接口Sink org.apache.flume.Sink /** * <p>Requests the sink to attempt to cons ...

  3. 【原创】大数据基础之Zookeeper(2)源代码解析

    核心枚举 public enum ServerState { LOOKING, FOLLOWING, LEADING, OBSERVING; } zookeeper服务器状态:刚启动LOOKING,f ...

  4. 大数据系列之Flume+kafka 整合

    相关文章: 大数据系列之Kafka安装 大数据系列之Flume--几种不同的Sources 大数据系列之Flume+HDFS 关于Flume 的 一些核心概念: 组件名称     功能介绍 Agent ...

  5. 【原创】大数据基础之词频统计Word Count

    对文件进行词频统计,是一个大数据领域的hello word级别的应用,来看下实现有多简单: 1 Linux单机处理 egrep -o "\b[[:alpha:]]+\b" test ...

  6. 【原创】大数据基础之Impala(1)简介、安装、使用

    impala2.12 官方:http://impala.apache.org/ 一 简介 Apache Impala is the open source, native analytic datab ...

  7. 【原创】大数据基础之Benchmark(2)TPC-DS

    tpc 官方:http://www.tpc.org/ 一 简介 The TPC is a non-profit corporation founded to define transaction pr ...

  8. 大数据基础知识问答----spark篇,大数据生态圈

    Spark相关知识点 1.Spark基础知识 1.Spark是什么? UCBerkeley AMPlab所开源的类HadoopMapReduce的通用的并行计算框架 dfsSpark基于mapredu ...

  9. 低调、奢华、有内涵的敏捷式大数据方案:Flume+Cassandra+Presto+SpagoBI

    基于FacebookPresto+Cassandra的敏捷式大数据 文件夹 1 1.1 1.1.1 1.1.2 1.2 1.2.1 1.2.2 2 2.1 2.2 2.3 2.4 2.5 2.6 3 ...

随机推荐

  1. vee-validate表单验证组件

    vee-validate是VUE的基于模板的验证框架,允许您验证输入并显示错误 安装 npm i vee-validate --save 引入 import Vue from 'vue'; impor ...

  2. java.io.OutputStream & java.io.FileOutputStream

    java.io.OutputStream & java.io.FileOutputStream 1.Java.io.OutputStream(字节输出流) 字节输出流,这是一个抽象类,是表示输 ...

  3. error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

    解决方案 1. http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted 下载twisted对应版本的whl文件(我的Twisted‑17.5.0‑cp36 ...

  4. es6写法

    我们在日常开发中,如果我们使用es5则可以直接在浏览器里面写JavaScript脚本.一点问题也没有. 但是在写es6语法的JavaScript代码的时候,我们就需要引入babel翻译器了. 例如: ...

  5. [C++]猜数字游戏的提示(Master-Mind Hints,UVa340)

    [本博文非博主原创,思路与题目均摘自 刘汝佳<算法竞赛与入门经典(第2版)>] Question 例题3-4 猜数字游戏的提示(Master-Mind Hints,UVa340) 实现一个 ...

  6. Coursera, Machine Learning, notes

      Basic theory (i) Supervised learning (parametric/non-parametric algorithms, support vector machine ...

  7. kettle中的合并记录使用记录

    注意:合并记录的使用前提是2个数据源都按比较关键字排过序,否则合并之后的数据不准确,可能会多出很多. 该步骤用于将两个不同来源的数据合并,这两个来源的数据分别为旧数据和新数据,该步骤将旧数据和新数据按 ...

  8. Java基础_0305:简单Java类

    简单Java类 简单Java类是一种在实际开发之中使用最多的类的定义形式,在简单Java类中包含有类.对象.构造方法.private封装等核心概念的使用,而对于简单Java类首先给出如下的基本开发要求 ...

  9. Linux 01 计算机系统硬件组成简介

    PC服务器 1U = 4.445cm 企业1-2U比较多 互联网公司,品牌 DELL,HP, IBM. Dell品牌 2010年之前:1850,1950(1u),2850,2950(2u) 2010年 ...

  10. 源码学习之mybatis

    1.先看看俩种调用方式 public static void main(String[] args) { SqlSessionFactory sqlSessionFactory; SqlSession ...