Kafka
Kafka is a high-throughput distributed message queue system. Its characteristics: a producer/consumer model, FIFO ordering, no data loss of its own, and data cleaned up every 7 days by default. Common message queue scenarios: decoupling systems, buffering peak load, and asynchronous communication.
- producer : message producer
- consumer : message consumer
- broker : a server in the Kafka cluster that handles message read/write requests and stores messages; a Kafka cluster is made up of many brokers
- topic : a message queue/category; it behaves like a queue with producers and consumers attached
- zookeeper : stores Kafka's metadata, including consumer offsets, topic information, and partition information
- 1. A topic is split into multiple partitions
- 2. Messages inside a partition are strictly ordered; each message has a sequence number called the offset
- 3. A partition belongs to exactly one broker; a broker can manage many partitions
- 4. Messages are written directly to files and are not held in memory
- 5. Data is deleted by a time-based policy (one week by default), not as soon as a message has been consumed
- 6. The producer decides which partition a message is written to, either round-robin for load balancing or via a hash-based partitioning strategy (a sketch follows this list)
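The choice between round-robin and hash-based partitioning can be made explicit with a custom partitioner. Below is a minimal sketch, assuming a kafka-clients version that exposes the pluggable org.apache.kafka.clients.producer.Partitioner interface (0.9 or later); the class name and the fallback counter are illustrative only:

import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class HashOrRoundRobinPartitioner implements Partitioner {
    // counter used for round-robin when the record has no key
    private final AtomicInteger counter = new AtomicInteger(0);
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // no key: spread records across partitions round-robin
            return (counter.getAndIncrement() & Integer.MAX_VALUE) % numPartitions;
        }
        // with a key: hash it so the same key always lands in the same partition
        return (java.util.Arrays.hashCode(keyBytes) & Integer.MAX_VALUE) % numPartitions;
    }
    @Override
    public void close() {}
    @Override
    public void configure(Map<String, ?> configs) {}
}

It would be registered on the producer with pro.put("partitioner.class", "HashOrRoundRobinPartitioner").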
Kafka's message consumption model
- Each consumer keeps track of the offset it has consumed up to
- Every consumer belongs to a consumer group
- Within a group, consumption follows a queue model (see the consumer sketch below):
– each consumer in the group consumes different partitions
– a given message is consumed only once within a group
- Different groups consume independently and do not affect one another
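As a concrete illustration of the group/queue model, here is a minimal consumer sketch; it assumes the new consumer API from kafka-clients 0.9+ is on the classpath, and the group id wordcountGroup is just an example. Starting several copies with the same group.id splits the topic's partitions among them; starting them with different group ids makes each copy receive all messages independently:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "node1:9092,node2:9092,node3:9092");
        // all instances sharing this group.id share the topic's partitions (queue model)
        props.put("group.id", "wordcountGroup");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
        consumer.subscribe(Arrays.asList("t0315"));
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.partition() + "\t" + record.offset() + "\t" + record.value());
                }
            }
        } finally {
            consumer.close();
        }
    }
}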
Kafka characteristics
- Producer/consumer model: FIFO within a partition; ordering is not guaranteed across partitions
- High performance: a single node supports thousands of clients and throughput in the hundreds of MB/s
- Durability: data is persisted directly to ordinary disks with good performance; messages are appended straight to disk, so data is not lost
- Distributed: data replicas, traffic load balancing, and scalability; replication means the same data exists on several brokers, so a broken disk does not lose the data
- Flexible: long message retention plus client-maintained consumption state; 1. retention is configurable (a week, a day, ...), 2. the consumer can set its own offset (a sketch follows this list)
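The last point, that the client can pick its own offset, can be sketched with the new consumer API (again assuming kafka-clients 0.9+); the partition number and offset 100 below are arbitrary examples:

import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekToOffsetDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "node1:9092,node2:9092,node3:9092");
        props.put("group.id", "seek-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
        // assign partition 0 of t0315 manually and start reading from offset 100
        TopicPartition tp = new TopicPartition("t0315", 0);
        consumer.assign(Arrays.asList(tp));
        consumer.seek(tp, 100L);
        System.out.println(consumer.poll(1000).count() + " records read starting at offset 100");
        consumer.close();
    }
}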
Kafka installation
- Download: https://www.apache.org/dyn/closer.cgi?path=/kafka/2.0.1/kafka_2.11-2.0.1.tgz
- Extract the archive and edit server.properties under the config folder:
// Broker id (use a distinct integer per node: 0, 1, 2, 3, ...)
broker.id=0
// Data directory
log.dirs=/log
// ZooKeeper cluster
zookeeper.connect=node1:2181,node2:2181,node3:2181
- Start:
bin/kafka-server-start.sh config/server.properties
You can also put the start command into a separate script:
vim start-kafka.sh
nohup bin/kafka-server-start.sh config/server.properties > kafka.log 2>&1 &
Make the script executable:
chmod 755 start-kafka.sh
Basic Kafka commands
Create a topic:
./kafka-topics.sh --zookeeper node1:2181,node2:2181,node3:2181 --create --topic t0315 --partitions 3 --replication-factor 3
List topics:
./kafka-topics.sh --zookeeper node1:2181,node2:2181,node3:2181 --list
Console producer:
./kafka-console-producer.sh --topic t0315 --broker-list node1:9092,node2:9092,node3:9092
Console consumer:
./kafka-console-consumer.sh --bootstrap-server node1:9092,node2:9092,node3:9092 --topic t0315
Describe a topic:
./kafka-topics.sh --describe --zookeeper node1:2181,node2:2181,node3:2181 --topic t0315
Kafka has a concept called preferred replicas. If a partition has 3 replicas with priorities 0, 1, and 2, then replica 0 is the preferred one and serves as the leader. If the broker holding replica 0 goes down, the broker holding replica 1 takes over as leader. When broker 0 comes back up, it automatically becomes the leader of this partition again. This avoids load imbalance and wasted resources, and is Kafka's leader balancing mechanism.
It is enabled in config/server.properties (on by default):
auto.leader.rebalance.enable=true
Code
Spark Streaming: the direct approach
Maven dependencies (pom.xml):
<properties>
<spark.version>2.2.0</spark.version>
</properties>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>${spark.version}</version>
<!-- <exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
<exclusion>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</exclusion>
</exclusions>-->
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
Producer:
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;
/**
*@Author PL
*@Date 2018/12/27 10:59
*@Description TODO
**/
public class KafkaProducer {
public static void main(String[] args) throws InterruptedException {
Properties pro = new Properties();
pro.put("bootstrap.servers","node1:9092,node2:9092,node3:9092");
pro.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
pro.put("value.serializer","org.apache.kafka.common.serialization.StringSerializer");
org.apache.kafka.clients.producer.KafkaProducer<String,String> producer = new org.apache.kafka.clients.producer.KafkaProducer<String, String>(pro);
System.out.println("11");
String topic = "t0315";
String msg = "hello world";
for (int i =0 ;i <100;i++) {
producer.send(new ProducerRecord<String, String>(topic, "hello", msg));
System.out.println(msg);
}
producer.close();
}
}
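The loop above is fire-and-forget: send() returns a Future and any failure goes unnoticed. Below is a hedged variant that checks delivery via a Callback; the class name is arbitrary, and acks=all assumes a client version that accepts the "all" value (older clients use "-1"):

import java.util.Properties;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class AckProducer {
    public static void main(String[] args) {
        Properties pro = new Properties();
        pro.put("bootstrap.servers", "node1:9092,node2:9092,node3:9092");
        pro.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pro.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all: the leader waits for all in-sync replicas before acknowledging
        pro.put("acks", "all");
        KafkaProducer<String, String> producer = new KafkaProducer<String, String>(pro);
        producer.send(new ProducerRecord<String, String>("t0315", "hello", "hello world"),
                new Callback() {
                    @Override
                    public void onCompletion(RecordMetadata metadata, Exception exception) {
                        if (exception != null) {
                            exception.printStackTrace();   // delivery failed after retries
                        } else {
                            System.out.println("partition=" + metadata.partition()
                                    + ", offset=" + metadata.offset());
                        }
                    }
                });
        producer.close();
    }
}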
Consumer (Spark Streaming, direct approach):
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;
import java.util.*;
/**
*@Author PL
*@Date 2018/12/26 13:28
*@Description TODO
**/
public class SparkStreamingForkafka {
public static void main(String[] args) throws InterruptedException {
SparkConf sc = new SparkConf().setMaster("local[2]").setAppName("test");
JavaStreamingContext jsc = new JavaStreamingContext(sc, Durations.seconds(5));
Map<String,String> kafkaParam = new HashMap<>();
kafkaParam.put("metadata.broker.list","node1:9092,node2:9092,node3:9092");
//kafkaParam.put("t0315",1);
HashSet<String> topic = new HashSet<>();
topic.add("t0315");
//JavaPairInputDStream<String, String> line = KafkaUtils.createStream(jsc,"node1:9092,node2:9092,node3:9092","wordcountGrop",kafkaParam);
JavaPairInputDStream<String, String> line = KafkaUtils.createDirectStream(jsc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParam, topic);
JavaDStream<String> flatLine = line.flatMap(new FlatMapFunction<Tuple2<String, String>, String>() {
@Override
public Iterator<String> call(Tuple2<String, String> tuple2) throws Exception {
return Arrays.asList(tuple2._2.split(" ")).iterator();
}
});
JavaPairDStream<String, Integer> pair = flatLine.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) throws Exception {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairDStream<String, Integer> count = pair.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer integer, Integer integer2) throws Exception {
return integer + integer2;
}
});
count.print();
jsc.start();
jsc.awaitTermination();
jsc.close();
}
}
The class above is a Spark Streaming consumer. With the direct approach, Kafka is treated as a plain data store and Spark maintains the offsets itself. But suppose the driver crashes and is restarted: from which offset does it resume reading?
To handle this we need to persist the Kafka offsets to files (a checkpoint), so that after a crash the restarted application can recover the offsets from the files and continue reading from there.
import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function0;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;
import java.util.*;
/**
*@Author PL
*@Date 2018/12/26 13:28
*@Description TODO
**/
public class KafkaCheckPoint {
public static void main(String[] args) throws InterruptedException {
final String checkPoint = "./checkPoint";
Function0<JavaStreamingContext> scFunction = new Function0<JavaStreamingContext>() {
@Override
public JavaStreamingContext call() throws Exception {
return createJavaStreamingContext();
}
};
// If a checkpoint exists, recover the context from it; otherwise create a new one
JavaStreamingContext jsc = JavaStreamingContext.getOrCreate(checkPoint, scFunction);
jsc.start();
jsc.awaitTermination();
jsc.close();
}
public static JavaStreamingContext createJavaStreamingContext(){
System.out.println("初始化"); // 第一次會執行,當機之後重新開機執行資料恢複時不執行
final SparkConf sc = new SparkConf().setMaster("local").setAppName("test");
JavaStreamingContext jsc = new JavaStreamingContext(sc, Durations.seconds(5));
/**
 * The checkpoint stores:
 * 1. the configuration
 * 2. the DStream operations (execution logic)
 * 3. job progress
 * 4. the offsets
 */
jsc.checkpoint("./checkPoint");
Map<String,String> kafkaParam = new HashMap<>();
kafkaParam.put("metadata.broker.list","node1:9092,node2:9092,node3:9092");
HashSet<String> topic = new HashSet<>();
topic.add("t0315");
JavaPairInputDStream<String, String> line = KafkaUtils.createDirectStream(jsc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParam, topic);
JavaDStream<String> flatLine = line.flatMap(new FlatMapFunction<Tuple2<String, String>, String>() {
@Override
public Iterator<String> call(Tuple2<String, String> tuple2) throws Exception {
return Arrays.asList(tuple2._2.split(" ")).iterator();
}
});
JavaPairDStream<String, Integer> pair = flatLine.mapToPair(new PairFunction<String, String, Integer>() {
@Override
public Tuple2<String, Integer> call(String s) throws Exception {
return new Tuple2<String, Integer>(s, 1);
}
});
JavaPairDStream<String, Integer> count = pair.reduceByKey(new Function2<Integer, Integer, Integer>() {
@Override
public Integer call(Integer integer, Integer integer2) throws Exception {
return integer + integer2;
}
});
count.print();
return jsc;
}
}
This time, on startup the application first recovers from the checkpoint and resumes reading and processing from where it crashed. However, if we change the program's logic, the modified code is not executed after recovery; the job keeps running the code that was serialized into the old checkpoint. In that case we can store the offsets in ZooKeeper instead.
Main method:
import com.pl.data.offset.getoffset.GetTopicOffsetFromKafkaBroker;
import com.pl.data.offset.getoffset.GetTopicOffsetFromZookeeper;
import kafka.common.TopicAndPartition;
import org.apache.log4j.Logger;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import java.util.Map;
public class UseZookeeperManageOffset {
/**
 * Logging via log4j; UseZookeeperManageOffset.class identifies the class producing the log entries
 */
static final Logger logger = Logger.getLogger(UseZookeeperManageOffset.class);
public static void main(String[] args) throws InterruptedException {
/**
 * Get, from the Kafka cluster, the latest produced offset of each partition of the topic
 */
Map<TopicAndPartition, Long> topicOffsets = GetTopicOffsetFromKafkaBroker.getTopicOffsets("node1:9092,node2:9092,node3:9092", "t0315");
/**
 * Get, from ZooKeeper, the offset each partition's consumer has consumed up to
 */
Map<TopicAndPartition, Long> consumerOffsets =
GetTopicOffsetFromZookeeper.getConsumerOffsets("node1:2181,node2:2181,node3:2181","pl","t0315");
/**
 * Merge the two offset maps.
 * The idea: if consumer offsets could be read from ZooKeeper, those take priority.
 * Otherwise (i.e. this consumer group is consuming the topic for the first time),
 * the offsets are left at the latest positions in the topic.
 */
if(null!=consumerOffsets && consumerOffsets.size()>0){
topicOffsets.putAll(consumerOffsets);
}
/**
 * Uncommenting the block below resets every partition's offset to 0, i.e. consumption restarts from the beginning.
 */
/*for(Map.Entry<TopicAndPartition, Long> item:topicOffsets.entrySet()){
item.setValue(0l);
}*/
/**
 * Build the Spark Streaming job and start consuming from the resolved offsets
 */
JavaStreamingContext jsc = SparkStreamingDirect.getStreamingContext(topicOffsets,"pl");
jsc.start();
jsc.awaitTermination();
jsc.close();
}
}
Get the current offsets of the topic from the Kafka cluster (Kafka API)
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
// ImmutableMap is assumed to be Guava's, which Spark pulls in transitively
import com.google.common.collect.ImmutableMap;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.cluster.Broker;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetRequest;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;
/**
 * Kafka must be running before executing this test
 * @author root
 *
 */
public class GetTopicOffsetFromKafkaBroker {
public static void main(String[] args) {
Map<TopicAndPartition, Long> topicOffsets = getTopicOffsets("node1:9092,node2:9092,node3:9092", "t0315");
Set<Entry<TopicAndPartition, Long>> entrySet = topicOffsets.entrySet();
for(Entry<TopicAndPartition, Long> entry : entrySet) {
TopicAndPartition topicAndPartition = entry.getKey();
Long offset = entry.getValue();
String topic = topicAndPartition.topic();
int partition = topicAndPartition.partition();
System.out.println("topic = "+topic+",partition = "+partition+",offset = "+offset);
}
}
/**
 * Get, from the Kafka cluster, the offset up to which producers have written in each partition of the given topic
 * @param KafkaBrokerServer
 * @param topic
 * @return
 */
public static Map<TopicAndPartition,Long> getTopicOffsets(String KafkaBrokerServer, String topic){
Map<TopicAndPartition,Long> retVals = new HashMap<TopicAndPartition,Long>();
// split the broker list and query each broker
for(String broker:KafkaBrokerServer.split(",")){
SimpleConsumer simpleConsumer = new SimpleConsumer(broker.split(":")[0],Integer.valueOf(broker.split(":")[1]), 64*10000,1024,"consumer");
TopicMetadataRequest topicMetadataRequest = new TopicMetadataRequest(Arrays.asList(topic));
TopicMetadataResponse topicMetadataResponse = simpleConsumer.send(topicMetadataRequest);
List<TopicMetadata> topicMetadataList = topicMetadataResponse.topicsMetadata();
// iterate over the metadata of each topic
for (TopicMetadata metadata : topicMetadataList) {
// iterate over the partitions in the topic metadata
for (PartitionMetadata part : metadata.partitionsMetadata()) {
Broker leader = part.leader();
if (leader != null) {
TopicAndPartition topicAndPartition = new TopicAndPartition(topic, part.partitionId());
PartitionOffsetRequestInfo partitionOffsetRequestInfo = new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), 10000);
OffsetRequest offsetRequest = new OffsetRequest(ImmutableMap.of(topicAndPartition, partitionOffsetRequestInfo), kafka.api.OffsetRequest.CurrentVersion(), simpleConsumer.clientId());
OffsetResponse offsetResponse = simpleConsumer.getOffsetsBefore(offsetRequest);
if (!offsetResponse.hasError()) {
long[] offsets = offsetResponse.offsets(topic, part.partitionId());
retVals.put(topicAndPartition, offsets[0]);
}
}
}
}
simpleConsumer.close();
}
return retVals;
}
}
Get the offsets consumed so far from ZooKeeper
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import com.fasterxml.jackson.databind.ObjectMapper;
import kafka.common.TopicAndPartition;
public class GetTopicOffsetFromZookeeper {
public static Map<TopicAndPartition,Long> getConsumerOffsets(String zkServers,String groupID, String topic) {
Map<TopicAndPartition,Long> retVals = new HashMap<TopicAndPartition,Long>();
// connect to ZooKeeper via Curator
ObjectMapper objectMapper = new ObjectMapper();
CuratorFramework curatorFramework = CuratorFrameworkFactory.builder()
.connectString(zkServers).connectionTimeoutMs(1000)
.sessionTimeoutMs(10000).retryPolicy(new RetryUntilElapsed(1000, 1000)).build();
curatorFramework.start();
try{
String nodePath = "/consumers/"+groupID+"/offsets/" + topic;
if(curatorFramework.checkExists().forPath(nodePath)!=null){
List<String> partitions=curatorFramework.getChildren().forPath(nodePath);
for(String partiton:partitions){
int partitionL=Integer.valueOf(partiton);
Long offset=objectMapper.readValue(curatorFramework.getData().forPath(nodePath+"/"+partiton),Long.class);
TopicAndPartition topicAndPartition=new TopicAndPartition(topic,partitionL);
retVals.put(topicAndPartition, offset);
}
}
}catch(Exception e){
e.printStackTrace();
}
curatorFramework.close();
return retVals;
}
public static void main(String[] args) {
Map<TopicAndPartition, Long> consumerOffsets = getConsumerOffsets("node1:2181,node2:2181,node3:2181","pl","t0315");
Set<Entry<TopicAndPartition, Long>> entrySet = consumerOffsets.entrySet();
for(Entry<TopicAndPartition, Long> entry : entrySet) {
TopicAndPartition topicAndPartition = entry.getKey();
String topic = topicAndPartition.topic();
int partition = topicAndPartition.partition();
Long offset = entry.getValue();
System.out.println("topic = "+topic+",partition = "+partition+",offset = "+offset);
}
}
}
Read messages from Kafka starting at the given offsets
import com.fasterxml.jackson.databind.ObjectMapper;
import kafka.common.TopicAndPartition;
import kafka.message.MessageAndMetadata;
import kafka.serializer.StringDecoder;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryUntilElapsed;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
public class SparkStreamingDirect {
public static JavaStreamingContext getStreamingContext(Map<TopicAndPartition, Long> topicOffsets,final String groupID){
SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SparkStreamingOnKafkaDirect");
conf.set("spark.streaming.kafka.maxRatePerPartition", "10");
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(5));
// jsc.checkpoint("/checkpoint");
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list","node1:9092,node2:9092,node3:9092");
// kafkaParams.put("group.id","MyFirstConsumerGroup");
for(Map.Entry<TopicAndPartition,Long> entry:topicOffsets.entrySet()){
System.out.println(entry.getKey().topic()+"\t"+entry.getKey().partition()+"\t"+entry.getValue());
}
JavaInputDStream<String> message = KafkaUtils.createDirectStream(
jsc,
String.class,
String.class,
StringDecoder.class,
StringDecoder.class,
String.class,
kafkaParams,
topicOffsets,
new Function<MessageAndMetadata<String,String>,String>() {
private static final long serialVersionUID = 1L;
public String call(MessageAndMetadata<String, String> v1)throws Exception {
return v1.message();
}
}
);
final AtomicReference<OffsetRange[]> offsetRanges = new AtomicReference<>();
JavaDStream<String> lines = message.transform(new Function<JavaRDD<String>, JavaRDD<String>>() {
private static final long serialVersionUID = 1L;
@Override
public JavaRDD<String> call(JavaRDD<String> rdd) throws Exception {
OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
offsetRanges.set(offsets);
return rdd;
}
}
);
message.foreachRDD(new VoidFunction<JavaRDD<String>>(){
/**
*
*/
private static final long serialVersionUID = 1L;
@Override
public void call(JavaRDD<String> t) throws Exception {
ObjectMapper objectMapper = new ObjectMapper();
CuratorFramework curatorFramework = CuratorFrameworkFactory.builder()
.connectString("node1:2181,node2:2181,node3:2181").connectionTimeoutMs(1000)
.sessionTimeoutMs(10000).retryPolicy(new RetryUntilElapsed(1000, 1000)).build();
curatorFramework.start();
for (OffsetRange offsetRange : offsetRanges.get()) {
long fromOffset = offsetRange.fromOffset();
long untilOffset = offsetRange.untilOffset();
final byte[] offsetBytes = objectMapper.writeValueAsBytes(offsetRange.untilOffset());
String nodePath = "/consumers/"+groupID+"/offsets/" + offsetRange.topic()+ "/" + offsetRange.partition();
System.out.println("nodePath = "+nodePath);
System.out.println("fromOffset = "+fromOffset+",untilOffset="+untilOffset);
if(curatorFramework.checkExists().forPath(nodePath)!=null){
curatorFramework.setData().forPath(nodePath,offsetBytes);
}else{
curatorFramework.create().creatingParentsIfNeeded().forPath(nodePath, offsetBytes);
}
}
curatorFramework.close();
}
});
lines.print();
return jsc;
}
}