Integrating Spark with Kafka requires the spark-streaming-kafka jar, which comes in two branches depending on the Kafka version: spark-streaming-kafka-0-8 and spark-streaming-kafka-0-10.

Branch selection rule: for Kafka versions >= 0.8.2.1 and < 0.10.0, use spark-streaming-kafka-0-8; for Kafka versions >= 0.10.0, use spark-streaming-kafka-0-10.

Kafka releases from 0.8.2.1 onward are, in order: 0.8.2.1 (released March 11, 2015), 0.8.2.2 (October 2, 2015), 0.9.x, 0.10.x (0.10.0.0 released May 22, 2016), 0.11.x, 1.0.x (1.0.0 released November 1, 2017), 1.1.x, and 2.0.x (2.0.0 released July 30, 2018).

This walkthrough uses Kafka 1.0.0, so the spark-streaming-kafka-0-10 jar is required, for example:

<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.2.1</version>
</dependency>

PS: as the groupId shows, this jar is developed and maintained by the Spark project, not by Kafka.

Simple example 1 (tested against Spark 2.4.0 with Scala 2.12 and Kafka 2.2.0 with Scala 2.12):

import org.apache.commons.collections.MapUtils;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.spark.SparkConf;
import org.apache.spark.sql.*;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.*;
import org.apache.spark.streaming.kafka010.*;
import com.alibaba.fastjson.JSON;

import java.util.*;

public class SparkConsumerTest {

    public static void main(String[] args) throws Exception {
        // winutils is only needed to run Spark locally on Windows
        System.setProperty("hadoop.home.dir", "C:/Users/lenovo/Downloads/winutils-master/winutils-master/hadoop-2.7.1");
        SparkConf conf = new SparkConf().setAppName("heihei").setMaster("local[*]");
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // one micro-batch every 10 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        // standard Kafka consumer configuration
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "192.168.56.100:9092");
        props.setProperty("group.id", "my-test-consumer-group");
        props.setProperty("enable.auto.commit", "true");
        props.setProperty("auto.commit.interval.ms", "1000");
        props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        Map kafkaParams = new HashMap(8);
        kafkaParams.putAll(props);
        // direct stream subscribed to the "test" topic
        JavaInputDStream<ConsumerRecord<String, String>> javaInputDStream = KafkaUtils.createDirectStream(jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.Subscribe(Arrays.asList("test"), kafkaParams));
        javaInputDStream.persist(StorageLevel.MEMORY_AND_DISK_SER());
        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        javaInputDStream.foreachRDD(rdd -> {
            // each message value is a JSON object; map it onto a DemoBean
            Dataset<Row> df = spark.createDataFrame(rdd.map(consumerRecord -> {
                Map testMap = JSON.parseObject(consumerRecord.value(), Map.class);
                return new DemoBean(MapUtils.getString(testMap, "id"),
                        MapUtils.getString(testMap, "name"),
                        MapUtils.getIntValue(testMap, "age"));
            }), DemoBean.class);
            // append each micro-batch to PostgreSQL via JDBC
            DataFrameWriter writer = df.write();
            String url = "jdbc:postgresql://192.168.56.100/postgres";
            String table = "test";
            Properties connectionProperties = new Properties();
            connectionProperties.put("user", "postgres");
            connectionProperties.put("password", "abc123");
            connectionProperties.put("driver", "org.postgresql.Driver");
            connectionProperties.put("batchsize", "3000");
            writer.mode(SaveMode.Append).jdbc(url, table, connectionProperties);
        });
        jssc.start();
        jssc.awaitTermination();
    }
}

DemoBean is a separate entity (JavaBean) class.
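The post does not show DemoBean itself. As a minimal sketch, and judging only from how it is used above (a (String, String, int) constructor and fields id, name, age), it is presumably a serializable JavaBean along these lines, with getters and setters so that createDataFrame can infer the schema by reflection; the exact original class is an assumption:

// Hypothetical reconstruction of DemoBean; the original class is not shown in the post.
// Spark's bean-based createDataFrame requires getters/setters, and the bean should be serializable.
import java.io.Serializable;

public class DemoBean implements Serializable {
    private String id;
    private String name;
    private int age;

    public DemoBean() {
    }

    public DemoBean(String id, String name, int age) {
        this.id = id;
        this.name = name;
        this.age = age;
    }

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}

With a bean like this, a message such as {"id":"1","name":"tom","age":20} on the test topic would become one row appended to the test table in PostgreSQL.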

The corresponding pom.xml:

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.12</artifactId>
            <version>2.2.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-streams -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-streams</artifactId>
            <version>2.2.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.9</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/commons-io/commons-io -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka-0-10 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
            <version>2.4.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.12</artifactId>
            <version>2.4.0</version>
            <!-- <scope>provided</scope> -->
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.12</artifactId>
            <version>2.4.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.fasterxml.jackson.core/jackson-databind -->
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.6.7.2</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.alibaba/fastjson -->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.58</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.thoughtworks.paranamer/paranamer -->
        <dependency>
            <groupId>com.thoughtworks.paranamer</groupId>
            <artifactId>paranamer</artifactId>
            <version>2.8</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
        <dependency>
            <groupId>org.postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <version>42.2.5</version>
        </dependency>
    </dependencies>

Simple example 2 (tested against Spark 1.6.0 with Scala 2.11 and Kafka 0.10.2.0 with Scala 2.11):

import com.alibaba.fastjson.JSON;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import kafka.utils.ZKGroupTopicDirs;
import org.apache.commons.lang3.StringUtils;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkException;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaCluster;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import scala.collection.JavaConversions;
import scala.collection.Map$;
import scala.collection.immutable.Set;
import scala.collection.mutable.ArrayBuffer;
import scala.util.Either;

import java.nio.charset.StandardCharsets;
import java.util.*;

public class SparkConsumerTest {

    public static CuratorFramework curatorFramework;

    static {
        curatorFramework = CuratorFrameworkFactory.builder().connectString("192.168.56.103:2181")
                .connectionTimeoutMs(30000)
                .sessionTimeoutMs(30000)
                .retryPolicy(new RetryUntilElapsed(1000, 1000))
                .build();
        curatorFramework.start();
    }

    public static void main(String[] args) throws Exception {
        String topic = "test";
        String groupId = "spark-test-consumer-group";
        System.setProperty("hadoop.home.dir", "C:/Users/lenovo/Downloads/winutils-master/winutils-master/hadoop-2.7.1");
        SparkConf conf = new SparkConf().setAppName("heihei").setMaster("local[*]");
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        // one micro-batch every 30 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(30));
        Map<String, String> kafkaParams = new HashMap<>(4);
        kafkaParams.put("bootstrap.servers", "192.168.56.103:9092");
        // build fromOffsets, which KafkaUtils.createDirectStream needs
        Map<TopicAndPartition, Long> fromOffsets = getFromOffsets(kafkaParams, topic, groupId);
        SQLContext sqlContext = new SQLContext(jssc.sparkContext());
        // Function is Spark's org.apache.spark.api.java.function.Function, not a JDK class
        Function<MessageAndMetadata<String, String>, String> function = MessageAndMetadata::message;
        kafkaParams.put("group.id", groupId);
        JavaInputDStream<String> messages = KafkaUtils.createDirectStream(
                jssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                String.class,
                kafkaParams,
                fromOffsets,
                function
        );
        messages.foreachRDD(rdd -> {
            if (!rdd.isEmpty()) {
                String firstValue = rdd.first();
                System.out.println("message:" + firstValue);
                JavaRDD<Person> personJavaRDD = rdd.mapPartitions(it -> {
                    List<String> list = new ArrayList<>();
                    while (it.hasNext()) {
                        list.add(it.next());
                    }
                    return list;
                }).map(p -> {
                    try {
                        return JSON.parseObject(StringUtils.deleteWhitespace(p), Person.class);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    return new Person();
                }).filter(p -> StringUtils.isNotBlank(p.getId())
                        || StringUtils.isNotBlank(p.getName())
                        || p.getAge() != 0
                );
                if (!personJavaRDD.isEmpty()) {
                    DataFrame df = sqlContext.createDataFrame(personJavaRDD, Person.class).select("id", "name", "age");
                    df.show(5);
                }
                // write the consumed offsets back to ZooKeeper
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                Arrays.asList(offsetRanges).forEach(offsetRange -> {
                    String consumerOffsetDir = new ZKGroupTopicDirs(groupId, topic).consumerOffsetDir()
                            + "/" + offsetRange.partition();
                    try {
                        curatorFramework.setData().forPath(consumerOffsetDir, String.valueOf(offsetRange.untilOffset()).getBytes(StandardCharsets.UTF_8));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        });
        jssc.start();
        jssc.awaitTermination();
    }

    public static scala.collection.immutable.Map jMap2sMap(Map<String, String> map) {
        scala.collection.mutable.Map mapTest = JavaConversions.mapAsScalaMap(map);
        Object objTest = Map$.MODULE$.newBuilder().$plus$plus$eq(mapTest.toSeq());
        Object resultTest = ((scala.collection.mutable.Builder) objTest).result();
        return (scala.collection.immutable.Map) resultTest;
    }

    public static Map<TopicAndPartition, Long> getFromOffsets(Map kafkaParams, String topic, String groupId) throws Exception {
        // kafkaParams only contains bootstrap.servers -> broker list
        KafkaCluster kc = new KafkaCluster(jMap2sMap(kafkaParams));
        ArrayBuffer<String> arrayBuffer = new ArrayBuffer();
        arrayBuffer.$plus$eq(topic);
        Either<ArrayBuffer<Throwable>, Set<TopicAndPartition>> either = kc.getPartitions(arrayBuffer.toSet());
        if (either.isLeft()) {
            throw new SparkException("get partitions failed", either.left().toOption().get().last());
        }
        scala.collection.immutable.Set<TopicAndPartition> topicAndPartitions = either.right().get();
        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> either2 = kc.getEarliestLeaderOffsets(topicAndPartitions);
        if (either2.isLeft()) {
            throw new SparkException("get earliestLeaderOffsets failed", either2.left().toOption().get().last());
        }
        scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> earliestLeaderOffsets = either2.right().get();
        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> either3 = kc.getLatestLeaderOffsets(topicAndPartitions);
        if (either3.isLeft()) {
            throw new SparkException("get latestLeaderOffsets failed", either3.left().toOption().get().last());
        }
        scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> latestLeaderOffsets = either3.right().get();
        Map<TopicAndPartition, Long> fromOffsets = new HashMap<>();
        ZKGroupTopicDirs zKGroupTopicDirs = new ZKGroupTopicDirs(groupId, topic);
        // partitions are numbered from 0
        for (int i = 0; i < topicAndPartitions.size(); i++) {
            TopicAndPartition topicAndPartition = new TopicAndPartition(topic, i);
            // consumerOffsetDir() is /consumers/$group/offsets/$topic
            String consumerOffsetDir = zKGroupTopicDirs.consumerOffsetDir() + "/" + i;
            long zookeeperConsumerOffset = 0;
            // no offset node for this consumer group yet means consumption has not started
            if (curatorFramework.checkExists().forPath(consumerOffsetDir) == null) {
                System.out.println(consumerOffsetDir + " does not exist");
                // create the node and initialize the offset to 0
                curatorFramework.create().creatingParentsIfNeeded().forPath(consumerOffsetDir, "0".getBytes(StandardCharsets.UTF_8));
            } else {
                // read the offset stored in the ZooKeeper node
                byte[] zookeeperConsumerOffsetBytes = curatorFramework.getData().forPath(consumerOffsetDir);
                if (zookeeperConsumerOffsetBytes != null) {
                    zookeeperConsumerOffset = Long.parseLong(new String(zookeeperConsumerOffsetBytes, StandardCharsets.UTF_8));
                }
            }
            long earliestLeaderOffset = earliestLeaderOffsets.get(topicAndPartition).get().offset();
            long latestLeaderOffset = latestLeaderOffsets.get(topicAndPartition).get().offset();
            long fromOffset;
            // clamp the ZooKeeper offset into the [earliest, latest] range still held by the brokers
            if (zookeeperConsumerOffset < earliestLeaderOffset) {
                fromOffset = earliestLeaderOffset;
            } else if (zookeeperConsumerOffset > latestLeaderOffset) {
                fromOffset = latestLeaderOffset;
            } else {
                fromOffset = zookeeperConsumerOffset;
            }
            fromOffsets.put(topicAndPartition, fromOffset);
        }
        return fromOffsets;
    }
}
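The Person class used here (and in example 3 below) is also not shown in the post. Judging from the getId/getName/getAge calls, the no-arg constructor, and the select("id", "name", "age") projection, it can be assumed to be a plain serializable bean along these lines (a sketch, not the original class):

// Hypothetical reconstruction of Person, inferred from how it is used above.
// fastjson needs a no-arg constructor and setters; Spark needs getters for schema inference.
import java.io.Serializable;

public class Person implements Serializable {
    private String id;
    private String name;
    private int age;

    public Person() {
    }

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}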

pom.xml:

    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.58</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-kafka -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.11</artifactId>
            <version>1.6.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>1.6.0</version>
            <scope>provided</scope>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>1.6.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>

Here spark-streaming-kafka_2.11-1.6.0.jar is used rather than spark-streaming-kafka-assembly_2.11-1.6.0.jar. The two jars expose the same integration classes, but no source jar could be found for the assembly artifact. Note that although the Kafka broker is 0.10.2.0, kafka_2.11-0.10.2.0.jar is deliberately not added as a dependency, because in testing it throws java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker. The root cause: in kafka_2.11-0.8.2.1.jar, kafka.api.PartitionMetadata is declared as case class PartitionMetadata(partitionId: Int, val leader: Option[Broker], replicas: Seq[Broker], isr: Seq[Broker] = Seq.empty, errorCode: Short = ErrorMapping.NoError), but from kafka_2.11-0.10.0.0.jar onward it becomes case class PartitionMetadata(partitionId: Int, leader: Option[BrokerEndPoint], replicas: Seq[BrokerEndPoint], isr: Seq[BrokerEndPoint] = Seq.empty, errorCode: Short = Errors.NONE.code), i.e. the member types changed. spark-streaming-kafka_2.11-1.6.0.jar was built against kafka_2.11-0.8.2.1.jar, so running it together with kafka_2.11-0.10.2.0.jar produces the cast error.

One thing deserves special mention: in spark-streaming-kafka_2.11-1.6.0.jar, the inner classes of KafkaCluster are private, so referencing KafkaCluster.LeaderOffset keeps failing to compile. After much searching, it turns out the inner classes only became public starting with spark-streaming-kafka 2.0. The workaround is to create a package named org.apache.spark.streaming.kafka in your own project, copy the KafkaCluster class from spark-streaming-kafka_2.11-1.6.0.jar into that package, and remove the private[spark] modifier from LeaderOffset in the copied source. The KafkaCluster class being referenced is then your own copy rather than the one inside spark-streaming-kafka_2.11-1.6.0.jar, and KafkaCluster.LeaderOffset can be used normally.

In all of the scenarios above, one Kafka message maps to one database record. What if a single Kafka message maps to multiple records?

Simple example 3 (same environment and pom.xml as example 2):

import com.alibaba.fastjson.JSON;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import kafka.utils.ZKGroupTopicDirs;
import org.apache.commons.lang3.StringUtils;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkException;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaCluster;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import scala.collection.JavaConversions;
import scala.collection.Map$;
import scala.collection.immutable.Set;
import scala.collection.mutable.ArrayBuffer;
import scala.util.Either;

import java.nio.charset.StandardCharsets;
import java.util.*;

public class SparkConsumerTest {

    public static CuratorFramework curatorFramework;

    static {
        curatorFramework = CuratorFrameworkFactory.builder().connectString("192.168.56.103:2181")
                .connectionTimeoutMs(30000)
                .sessionTimeoutMs(30000)
                .retryPolicy(new RetryUntilElapsed(1000, 1000))
                .build();
        curatorFramework.start();
    }

    public static void main(String[] args) throws Exception {
        String topic = "test";
        String groupId = "spark-test-consumer-group";
        System.setProperty("hadoop.home.dir", "C:/Users/lenovo/Downloads/winutils-master/winutils-master/hadoop-2.7.1");
        SparkConf conf = new SparkConf().setAppName("heihei").setMaster("local[*]");
        conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
        conf.set("spark.streaming.kafka.maxRatePerPartition", "10000");
        // one micro-batch every 10 seconds
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
        Map<String, String> kafkaParams = new HashMap<>(4);
        kafkaParams.put("bootstrap.servers", "192.168.56.103:9092");
        // build fromOffsets, which KafkaUtils.createDirectStream needs
        Map<TopicAndPartition, Long> fromOffsets = getFromOffsets(kafkaParams, topic, groupId);
        JavaSparkContext jsc = jssc.sparkContext();
        // Function is Spark's org.apache.spark.api.java.function.Function, not a JDK class
        Function<MessageAndMetadata<String, String>, String> messageHandler = MessageAndMetadata::message;
        kafkaParams.put("group.id", groupId);
        // schema used for messages that cannot be parsed into Person objects
        StructType schema = new StructType(new StructField[]{
                new StructField("id", DataTypes.StringType, false, Metadata.empty()),
                new StructField("msg", DataTypes.StringType, false, Metadata.empty()),
                new StructField("createdDate", DataTypes.StringType, false, Metadata.empty()),
                new StructField("updatedDate", DataTypes.StringType, false, Metadata.empty())
        });
        SQLContext sqlContext = new SQLContext(jssc.sparkContext());
        JavaInputDStream<String> messages = KafkaUtils.createDirectStream(
                jssc,
                String.class,
                String.class,
                StringDecoder.class,
                StringDecoder.class,
                String.class,
                kafkaParams,
                fromOffsets,
                messageHandler
        );
        messages.foreachRDD(rdd -> {
            if (!rdd.isEmpty()) {
                JavaRDD<Object> javaRDD = rdd.mapPartitions(it -> {
                    List<Object> list = new ArrayList<>();
                    while (it.hasNext()) {
                        String str = it.next();
                        if (StringUtils.isNotBlank(str)) {
                            str = StringUtils.deleteWhitespace(str);
                            try {
                                // one message may carry a JSON array, i.e. several records
                                List<Person> personList = JSON.parseArray(str, Person.class);
                                list.addAll(personList);
                            } catch (Exception e) {
                                // not a Person array: keep the raw string
                                list.add(str);
                            }
                        }
                    }
                    return list;
                });
                JavaRDD<Person> personRDD = javaRDD.filter(p -> p instanceof Person).map(p -> (Person) p);
                DataFrame df;
                DataFrameWriter writer;
                if (!personRDD.isEmpty()) {
                    df = sqlContext.createDataFrame(personRDD, Person.class)
                            .select("id", "name", "age");
                    df.show(2);
                }
                JavaRDD<Row> rowRDD = javaRDD.filter(p -> p instanceof String)
                        .map(p -> RowFactory.create(UUID.randomUUID().toString(), p, String.valueOf(System.currentTimeMillis()), String.valueOf(System.currentTimeMillis())));
                if (!rowRDD.isEmpty()) {
                    df = sqlContext.createDataFrame(rowRDD, schema);
                    df.show(2);
                }
                // write the consumed offsets back to ZooKeeper
                OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
                Arrays.asList(offsetRanges).forEach(offsetRange -> {
                    String consumerOffsetDir = new ZKGroupTopicDirs(groupId, topic).consumerOffsetDir()
                            + "/" + offsetRange.partition();
                    try {
                        curatorFramework.setData().forPath(consumerOffsetDir, String.valueOf(offsetRange.untilOffset()).getBytes(StandardCharsets.UTF_8));
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        });
        jssc.start();
        jssc.awaitTermination();
    }

    public static scala.collection.immutable.Map jMap2sMap(Map<String, String> map) {
        scala.collection.mutable.Map mapTest = JavaConversions.mapAsScalaMap(map);
        Object objTest = Map$.MODULE$.newBuilder().$plus$plus$eq(mapTest.toSeq());
        Object resultTest = ((scala.collection.mutable.Builder) objTest).result();
        return (scala.collection.immutable.Map) resultTest;
    }

    public static Map<TopicAndPartition, Long> getFromOffsets(Map kafkaParams, String topic, String groupId) throws Exception {
        // kafkaParams only contains bootstrap.servers -> broker list
        KafkaCluster kc = new KafkaCluster(jMap2sMap(kafkaParams));
        ArrayBuffer<String> arrayBuffer = new ArrayBuffer();
        arrayBuffer.$plus$eq(topic);
        Either<ArrayBuffer<Throwable>, Set<TopicAndPartition>> either = kc.getPartitions(arrayBuffer.toSet());
        if (either.isLeft()) {
            throw new SparkException("get partitions failed", either.left().toOption().get().last());
        }
        scala.collection.immutable.Set<TopicAndPartition> topicAndPartitions = either.right().get();
        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> either2 = kc.getEarliestLeaderOffsets(topicAndPartitions);
        if (either2.isLeft()) {
            throw new SparkException("get earliestLeaderOffsets failed", either2.left().toOption().get().last());
        }
        scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> earliestLeaderOffsets = either2.right().get();
        Either<ArrayBuffer<Throwable>, scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset>> either3 = kc.getLatestLeaderOffsets(topicAndPartitions);
        if (either3.isLeft()) {
            throw new SparkException("get latestLeaderOffsets failed", either3.left().toOption().get().last());
        }
        scala.collection.immutable.Map<TopicAndPartition, KafkaCluster.LeaderOffset> latestLeaderOffsets = either3.right().get();
        Map<TopicAndPartition, Long> fromOffsets = new HashMap<>();
        ZKGroupTopicDirs zKGroupTopicDirs = new ZKGroupTopicDirs(groupId, topic);
        // partitions are numbered from 0
        for (int i = 0; i < topicAndPartitions.size(); i++) {
            TopicAndPartition topicAndPartition = new TopicAndPartition(topic, i);
            // consumerOffsetDir() is /consumers/$group/offsets/$topic
            String consumerOffsetDir = zKGroupTopicDirs.consumerOffsetDir() + "/" + i;
            long zookeeperConsumerOffset = 0;
            // no offset node for this consumer group yet means consumption has not started
            if (curatorFramework.checkExists().forPath(consumerOffsetDir) == null) {
                System.out.println(consumerOffsetDir + " does not exist");
                // create the node and initialize the offset to 0
                curatorFramework.create().creatingParentsIfNeeded().forPath(consumerOffsetDir, "0".getBytes(StandardCharsets.UTF_8));
            } else {
                // read the offset stored in the ZooKeeper node
                byte[] zookeeperConsumerOffsetBytes = curatorFramework.getData().forPath(consumerOffsetDir);
                if (zookeeperConsumerOffsetBytes != null) {
                    String zookeeperConsumerOffsetStr = new String(zookeeperConsumerOffsetBytes, StandardCharsets.UTF_8);
                    zookeeperConsumerOffset = Long.parseLong(zookeeperConsumerOffsetStr);
                }
            }
            long earliestLeaderOffset = earliestLeaderOffsets.get(topicAndPartition).get().offset();
            long latestLeaderOffset = latestLeaderOffsets.get(topicAndPartition).get().offset();
            long fromOffset;
            // clamp the ZooKeeper offset into the [earliest, latest] range still held by the brokers
            if (zookeeperConsumerOffset < earliestLeaderOffset) {
                fromOffset = earliestLeaderOffset;
            } else if (zookeeperConsumerOffset > latestLeaderOffset) {
                fromOffset = latestLeaderOffset;
            } else {
                fromOffset = zookeeperConsumerOffset;
            }
            fromOffsets.put(topicAndPartition, fromOffset);
        }
        System.out.println("fromOffsets= " + fromOffsets);
        return fromOffsets;
    }
}

So what exactly are a Receiver DStream and a Direct DStream, and in how many ways can Spark pull data from Kafka? (This came up in an interview at OPPO.)
