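Consuming Kafka with Spark and Writing to HBase in a Kerberos Environment
I. Prepare the environment: create the Kafka topic and the HBase table
1. Create the Kafka topic in a Kerberos environment
1.1 Kafka uses the PLAINTEXT protocol by default, so in a Kerberos environment its communication protocol has to be changed. Add the following line to ${KAFKA_HOME}/config/producer.properties and ${KAFKA_HOME}/config/consumer.properties: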
security.protocol=SASL_PLAINTEXT
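The same setting can also be supplied from code when a client is built programmatically rather than through the console scripts. The following is a minimal sketch, not part of the original project: the class name `KafkaSecurityProps` and its method are hypothetical, and it only uses `java.util.Properties`, so the resulting object would still have to be handed to a real Kafka client.

```java
import java.util.Properties;

// Sketch only: KafkaSecurityProps and saslPlaintext are hypothetical names,
// not taken from the original project.
final class KafkaSecurityProps {

    // Builds the client settings that step 1.1 adds to
    // producer.properties / consumer.properties.
    static Properties saslPlaintext(String bootstrapServers) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Same value as in the properties files: SASL (Kerberos) authentication
        // over an unencrypted connection.
        props.setProperty("security.protocol", "SASL_PLAINTEXT");
        // Assumption: the brokers run under the "kafka" service principal.
        // If the JAAS file also sets serviceName, both values must agree.
        props.setProperty("sasl.kerberos.service.name", "kafka");
        return props;
    }

    public static void main(String[] args) {
        Properties props = saslPlaintext("hzadg-mammut-platform2.server.163.org:6667");
        props.list(System.out);
    }
}
```

Keeping these settings in one helper makes it harder for the producer side and the consumer side to drift apart.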
1.2 Before running the Kafka commands, add the KAFKA_OPTS option to the environment; otherwise Kafka cannot use the keytab:
export KAFKA_OPTS="$KAFKA_OPTS -Djava.security.auth.login.config=/usr/ndp/current/kafka_broker/conf/kafka_jaas.conf"
The content of kafka_jaas.conf is as follows:
cat /usr/ndp/current/kafka_broker/conf/kafka_jaas.conf
KafkaServer {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/etc/security/keytabs/kafka.service.keytab"
storeKey=true
useTicketCache=false
serviceName="kafka"
principal="kafka/hzadg-mammut-platform3.server.163.org@BDMS.163.COM";
};
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useTicketCache=true
renewTicket=true
serviceName="kafka";
};
Client {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/etc/security/keytabs/kafka.service.keytab"
storeKey=true
useTicketCache=false
serviceName="zookeeper"
principal="kafka/hzadg-mammut-platform3.server.163.org@BDMS.163.COM";
};
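A typo in a JAAS file usually only shows up as an authentication failure at connection time, so it can help to parse the file up front. The sketch below uses the JDK's own JAAS parser (`Configuration.getInstance("JavaLoginConfig", ...)`) to read one section and print its options. The class name `JaasSectionCheck` and the sample file it writes are assumptions made for illustration; the sample mirrors the `KafkaClient` section shown above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.NoSuchAlgorithmException;
import java.security.URIParameter;
import java.util.Map;
import javax.security.auth.login.AppConfigurationEntry;
import javax.security.auth.login.Configuration;

// Sketch only: JaasSectionCheck is a hypothetical helper, not part of the original setup.
final class JaasSectionCheck {

    // Parses jaasFile with the JDK's JAAS parser and returns the options
    // of the first login module declared in the given section.
    static Map<String, ?> options(Path jaasFile, String section) {
        try {
            Configuration config =
                Configuration.getInstance("JavaLoginConfig", new URIParameter(jaasFile.toUri()));
            AppConfigurationEntry[] entries = config.getAppConfigurationEntry(section);
            if (entries == null || entries.length == 0) {
                throw new IllegalStateException("No section " + section + " in " + jaasFile);
            }
            return entries[0].getOptions();
        } catch (NoSuchAlgorithmException e) {
            // Raised when the file cannot be read or parsed.
            throw new IllegalStateException("Cannot parse " + jaasFile, e);
        }
    }

    // Writes a small sample file so the sketch runs without a real cluster.
    static Path writeSample() {
        try {
            Path file = Files.createTempFile("kafka_jaas", ".conf");
            String text = String.join("\n",
                "KafkaClient {",
                "  com.sun.security.auth.module.Krb5LoginModule required",
                "  useTicketCache=true",
                "  renewTicket=true",
                "  serviceName=\"kafka\";",
                "};",
                "");
            Files.write(file, text.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass a real JAAS file as the first argument, or fall back to the sample.
        Path file = args.length > 0 ? Paths.get(args[0]) : writeSample();
        System.out.println(options(file, "KafkaClient"));
    }
}
```

Parsing with an explicit `URIParameter` avoids touching the JVM-wide `java.security.auth.login.config` property, so the check does not interfere with a configuration that is already loaded.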
1.3 Create a new topic:
bin/kafka-topics.sh --create --zookeeper hzadg-mammut-platform2.server.163.org:2181,hzadg-mammut-platform3.server.163.org:2181 --replication-factor 1 --partitions 1 --topic spark-test
1.4 Start a producer:
bin/kafka-console-producer.sh --broker-list hzadg-mammut-platform2.server.163.org:6667,hzadg-mammut-platform3.server.163.org:6667,hzadg-mammut-platform4.server.163.org:6667 --topic spark-test --producer.config ./config/producer.properties
1.5 Test with a consumer:
bin/kafka-console-consumer.sh --zookeeper hzadg-mammut-platform2.server.163.org:2181,hzadg-mammut-platform3.server.163.org:2181 --bootstrap-server hzadg-mammut-platform2.server.163.org:6667 --topic spark-test --from-beginning --new-consumer --consumer.config ./config/consumer.properties
2. Create the HBase table
2.1 Run kinit as the hbase account first; otherwise the HBase table cannot be created:
kinit -kt /etc/security/keytabs/hbase.service.keytab hbase/hzadg-mammut-platform2.server.163.org@BDMS.163.COM
./bin/hbase shell
> create 'recsys_logs', 'f'
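II. Write the Spark code
Write a simple Spark Java program that consumes messages from Kafka and writes the number of messages counted in each batch to HBase. For details, refer to the project: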
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package com.netease.spark.streaming.hbase;

import com.netease.spark.utils.Consts;
import com.netease.spark.utils.JConfig;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class JavaKafkaToHBaseKerberos {
  private final static Logger LOGGER = LoggerFactory.getLogger(JavaKafkaToHBaseKerberos.class);

  // Shared HBase connection and table handle, opened once on the driver
  private static HConnection connection = null;
  private static HTableInterface table = null;

  public static void openHBase(String tablename) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    synchronized (HConnection.class) {
      if (connection == null)
        connection = HConnectionManager.createConnection(conf);
    }

    synchronized (HTableInterface.class) {
      if (table == null) {
        // Use the configured table name rather than a hard-coded "recsys_logs"
        table = connection.getTable(tablename);
      }
    }
  }

  public static void closeHBase() {
    if (table != null)
      try {
        table.close();
      } catch (IOException e) {
        LOGGER.error("Failed to close table", e);
      }
    if (connection != null)
      try {
        connection.close();
      } catch (IOException e) {
        LOGGER.error("Failed to close connection", e);
      }
  }

  public static void main(String[] args) throws Exception {
    String hbaseTable = JConfig.getInstance().getProperty(Consts.HBASE_TABLE);
    String kafkaBrokers = JConfig.getInstance().getProperty(Consts.KAFKA_BROKERS);
    String kafkaTopics = JConfig.getInstance().getProperty(Consts.KAFKA_TOPICS);
    String kafkaGroup = JConfig.getInstance().getProperty(Consts.KAFKA_GROUP);

    // open hbase
    try {
      openHBase(hbaseTable);
    } catch (IOException e) {
      LOGGER.error("Failed to connect to HBase", e);
      System.exit(-1);
    }

    SparkConf conf = new SparkConf().setAppName("JavaKafakaToHBase");
    // One-second batch interval
    JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));

    Set<String> topicsSet = new HashSet<>(Arrays.asList(kafkaTopics.split(",")));
    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", kafkaBrokers);
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", kafkaGroup);
    kafkaParams.put("auto.offset.reset", "earliest");
    kafkaParams.put("enable.auto.commit", false);
    // This setting must be added in a Kerberos environment
    kafkaParams.put("security.protocol", "SASL_PLAINTEXT");

    // Create direct kafka stream with brokers and topics
    final JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
            ssc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, String>Subscribe(Arrays.asList(topicsSet.toArray(new String[0])), kafkaParams)
        );

    // Keep only the non-empty message values
    JavaDStream<String> lines = stream.map(new Function<ConsumerRecord<String, String>, String>() {
      private static final long serialVersionUID = -1801798365843350169L;

      @Override
      public String call(ConsumerRecord<String, String> record) {
        return record.value();
      }
    }).filter(new Function<String, Boolean>() {
      private static final long serialVersionUID = 7786877762996470593L;

      @Override
      public Boolean call(String msg) throws Exception {
        return msg.length() > 0;
      }
    });

    // Number of messages in each batch
    JavaDStream<Long> nums = lines.count();

    // Runs on the driver once per batch: write the count to HBase,
    // using the current timestamp as the row key
    nums.foreachRDD(new VoidFunction<JavaRDD<Long>>() {
      private SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd HH:mm:ss");

      @Override
      public void call(JavaRDD<Long> rdd) throws Exception {
        Long num = rdd.take(1).get(0);
        String ts = sdf.format(new Date());
        Put put = new Put(Bytes.toBytes(ts));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("nums"), Bytes.toBytes(num));
        table.put(put);
      }
    });

    ssc.start();
    ssc.awaitTermination();
    closeHBase();
  }
}
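The job stores one cell per batch: the row key is a second-resolution timestamp string and the value is the batch count encoded by `Bytes.toBytes(long)`, which produces 8 big-endian bytes. The sketch below reproduces both encodings with the JDK only, which can help when reading the values back (the HBase shell shows them as raw bytes). `BatchCountCell` and its methods are hypothetical names, and `DateTimeFormatter` is swapped in for the program's `SimpleDateFormat` purely because it is thread-safe.

```java
import java.nio.ByteBuffer;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Sketch only: BatchCountCell is a hypothetical helper, not part of the original project.
final class BatchCountCell {

    // Same pattern as the SimpleDateFormat used in the streaming job.
    private static final DateTimeFormatter ROW_KEY_FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMdd HH:mm:ss");

    // Row key for a batch processed at the given wall-clock time.
    static String rowKey(LocalDateTime processedAt) {
        return ROW_KEY_FORMAT.format(processedAt);
    }

    // 8-byte big-endian encoding, the same layout Bytes.toBytes(long) uses.
    static byte[] encodeCount(long count) {
        return ByteBuffer.allocate(Long.BYTES).putLong(count).array();
    }

    // Inverse of encodeCount, for decoding a cell value read back from HBase.
    static long decodeCount(byte[] value) {
        return ByteBuffer.wrap(value).getLong();
    }

    public static void main(String[] args) {
        System.out.println(rowKey(LocalDateTime.now()));
        System.out.println(decodeCount(encodeCount(42L)));
    }
}
```

Because the key only has second resolution and is taken when the batch is processed, two batches handled within the same second (for example while catching up on a backlog) may land on the same row key.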
III. Build and run on YARN
3.1 Change to the project directory and build the project:
mvn clean package
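3.2 Run the Spark job
- The executors need to access Kafka, so the Kerberos credentials of a user authorized on Kafka must be shipped to the executors;
- HDFS in this cluster is also secured with Kerberos, so spark.yarn.keytab/spark.yarn.principal must point to a keytab and principal that can access HDFS and HBase.
Run the following from the project directory: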
/usr/ndp/current/spark2_client/bin/spark-submit \
--files ./kafka_client_jaas.conf,./kafka.service.keytab \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
--driver-java-options "-Djava.security.auth.login.config=./kafka_client_jaas.conf" \
--conf spark.yarn.keytab=/etc/security/keytabs/hbase.service.keytab \
--conf spark.yarn.principal=hbase/hzadg-mammut-platform1.server.163.org@BDMS.163.COM \
--class com.netease.spark.streaming.hbase.JavaKafkaToHBaseKerberos \
--master yarn \
--deploy-mode client \
./target/spark-demo-0.1.0-jar-with-dependencies.jar
The content of kafka_client_jaas.conf is as follows:
cat kafka_client_jaas.conf
KafkaClient {
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
renewTicket=true
keyTab="./kafka.service.keytab"
storeKey=true
useTicketCache=false
serviceName="kafka"
principal="kafka/hzadg-mammut-platform1.server.163.org@BDMS.163.COM";
};
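The keytab path in this file is relative because `--files` places the shipped files in each YARN container's working directory, and in client mode the driver is started from the project directory where the same files sit. When the principal or keytab differs per environment, the file can be generated instead of edited by hand. This is a minimal sketch under that assumption; `ClientJaasTemplate` and `render` are hypothetical names, and the emitted options simply mirror the file shown above.

```java
// Sketch only: ClientJaasTemplate is a hypothetical helper, not part of the original project.
final class ClientJaasTemplate {

    // Renders a KafkaClient section with the same options as the file above.
    // keytabFileName is the bare file name shipped with --files, so the
    // resulting path is relative to the working directory.
    static String render(String principal, String keytabFileName) {
        return String.join("\n",
            "KafkaClient {",
            "com.sun.security.auth.module.Krb5LoginModule required",
            "useKeyTab=true",
            "renewTicket=true",
            "keyTab=\"./" + keytabFileName + "\"",
            "storeKey=true",
            "useTicketCache=false",
            "serviceName=\"kafka\"",
            "principal=\"" + principal + "\";",
            "};",
            "");
    }

    public static void main(String[] args) {
        // Values taken from the example in this article.
        System.out.print(render(
            "kafka/hzadg-mammut-platform1.server.163.org@BDMS.163.COM",
            "kafka.service.keytab"));
    }
}
```

The output can be redirected to `kafka_client_jaas.conf` before calling `spark-submit`.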
3.3 Execution result


Hello world 经典例子: package main import "fmt" func main(){ fmt.Println("hello world&quo ...