A high-throughput distributed messaging system

Presentation on theme: "A high-throughput distributed messaging system"— Presentation transcript:

1 Apache Kafka: A high-throughput distributed messaging system
Johan Lundahl

2 Agenda
Kafka overview
  Main concepts and comparisons to other messaging systems
  Features, strengths and tradeoffs
Message format and broker concepts
  Partitioning, keyed messages, replication
Producer / Consumer APIs
Operation considerations
Kafka ecosystem
If time permits: Kafka as a real-time processing backbone
  Brief intro to Storm
  Kafka-Storm wordcount demo

3 What is Apache Kafka?
Distributed, high-throughput, pub-sub messaging system
Fast, scalable, durable
Main use cases: log aggregation, real-time processing, monitoring, queueing
Originally developed by LinkedIn
Implemented in Scala/Java
Top-level Apache project since 2012

4 Comparison to other messaging systems
Traditional: JMS, xxxMQ/AMQP
New gen: Kestrel, Scribe, Flume, Kafka
[Diagram, items in slide order]
Message queues (low throughput, low latency): JMS, ActiveMQ, Qpid, RabbitMQ
Log aggregators (high throughput, high latency): Kestrel, Scribe, Flume, Hedwig, batch jobs
Kafka

5 Kafka concepts
[Diagram: Producers (Frontend, Frontend, Service) push messages for Topic1, Topic2 and Topic3 to the Kafka broker; Consumers (Monitoring, Stream processing, Batch processing, Data warehouse) pull the topics they need.]

6 Distributed model
[Diagram: several producers, with producer persistence (KAFKA-156), publish partitioned data to a cluster of brokers with intra-cluster replication; the Topic1 and Topic2 consumer groups receive an ordered subscription. Zookeeper is shown on both the producer and the consumer side.]

7 Agenda (repeat of slide 2)

8 Performance factors
Broker doesn't track consumer state
Like a distributed commit log
Everything is distributed
Low overhead protocol
Zero-copy (sendfile) reads/writes
Message batching (producer and consumer)
Usage of page cache backed by sequential disk allocation
Compression (end to end)
Configurable ack levels

9 Kafka features and strengths
Simple model, focused on high throughput and durability
O(1) time persistence on disk
Horizontally scalable by design (brokers and consumers)
Push - pull => consumer burst tolerance
Replay messages
Multiple independent subscribers per topic
Configurable batching, compression, serialization
Online upgrades

10 Tradeoffs
Not optimized for millisecond latencies
Have not beaten CAP
Simple messaging system, no processing
Zookeeper becomes a bottleneck when using too many topics/partitions (>>10000)
Not designed for very large payloads (full HD movie etc.)
Helps to know your data in advance

11 Agenda (repeat of slide 2)

12 Message/Log Format
Message fields: Length, Version, Checksum, Payload
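The four fields on this slide can be illustrated with a small framing sketch. The field widths (4-byte length, 1-byte version, 4-byte CRC32) and the function names are assumptions made for illustration; the slide only names the fields, so this is not a byte-exact description of Kafka's wire format.

```python
import struct
import zlib
from typing import Tuple

# Assumed layout: 4-byte length, 1-byte version, 4-byte CRC32 of the payload,
# followed by the payload bytes. Big-endian, no padding.
HEADER = struct.Struct(">IBI")


def encode_message(payload: bytes, version: int = 0) -> bytes:
    """Frame a payload as length + version + checksum + payload."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    # The length field counts everything that follows it.
    length = 1 + 4 + len(payload)
    return HEADER.pack(length, version, checksum) + payload


def decode_message(buf: bytes) -> Tuple[int, bytes]:
    """Return (version, payload), raising ValueError on a corrupt payload."""
    length, version, checksum = HEADER.unpack_from(buf, 0)
    payload = buf[HEADER.size:4 + length]
    if zlib.crc32(payload) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch")
    return version, payload
```

Storing the length first lets a reader skip from message to message in a log file without parsing payloads, and the checksum lets it detect a truncated or corrupted entry.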

13 Log based queue (Simplified model)
Producer API used directly by application or through one of the contributed implementations, e.g. log4j/logback appender
[Diagram: Producer1 and Producer2 append messages (Message1 ... Message10) to the Topic1 and Topic2 logs on the broker, with batching, compression and serialization on the producer side; the consumers shown are Consumer1, Consumer2, Consumer3 and ConsumerGroup1.]
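The log-based model, together with the earlier points that the broker does not track consumer state and that messages can be replayed, can be sketched in a few lines. The class and method names below are hypothetical and the log lives in memory; it is a toy model of the idea, not Kafka's API.

```python
class TopicLog:
    """Append-only message log. It keeps no per-consumer state."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return its offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at the given offset."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer remembers its own position in the log."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def rewind(self, offset):
        """Replay: move back to an earlier offset."""
        self.offset = offset
```

Because reading never removes anything, two consumers can read the same log at different speeds, and either one can rewind and re-read old messages.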

14 Partitioning
[Diagram: producers write to Topic1 and Topic2, whose partitions are spread over two brokers; Consumer Group1, Consumer Group2 and Consumer Group3 read from the partitions. Two consumers are labeled "No partition for this guy", i.e. they have no partition assigned.]
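The "No partition for this guy" label can be reproduced with a toy assignment function. The round-robin rule and the name `assign_partitions` are assumptions for illustration and not Kafka's actual rebalancing algorithm; the point is only that, within one consumer group, each partition goes to exactly one consumer.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer of a group (round-robin)."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

With two partitions and three consumers, the third consumer receives an empty list, so adding consumers beyond the partition count does not add read parallelism.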

15 Keyed messages
#partitions = 3
partition = hash(key) % #partitions
[Diagram: a producer sends Message1 ... Message18 to Topic1, which has one partition on each of BrokerId=1, BrokerId=2 and BrokerId=3.]
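The formula hash(key) % #partitions can be sketched as follows. CRC32 stands in for the hash function here because Python's built-in `hash()` is randomized per process for strings; the hash used by an actual Kafka partitioner may differ, and `partition_for` is a hypothetical name.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index with hash(key) % #partitions."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# All messages with the same key land in the same partition,
# so their relative order is preserved within that partition.
events = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
by_partition = {}
for key, value in events:
    by_partition.setdefault(partition_for(key, 3), []).append((key, value))
```

Note that changing the number of partitions changes the result of the modulo, so keys may map to different partitions afterwards.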

16 Intra cluster replication
Replication factor = 3
[Diagram: Broker1 holds the Topic1 leader, Broker2 and Broker3 hold Topic1 followers; together they form the InSyncReplicas (ISR). The producer sends Message1 ... Message10, each replica stores them, and acks flow back.]
Follower fails: follower dropped from ISR
When follower comes online again: fetch data from leader, then ISR gets updated
Leader fails: detected via Zookeeper from ISR; new leader gets elected
3 commit modes:
Commit mode        Latency        Durability
Fire & Forget      "none"         Weak
Leader ack         1 roundtrip    Medium
Full replication   2 roundtrips   Strong
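The failure handling described on this slide can be modeled with a short in-memory sketch. The class `ReplicatedPartition` and its methods are hypothetical, replication is instantaneous, and the new leader is simply the first remaining ISR member; real leader election and failure detection go through Zookeeper and are more involved.

```python
class ReplicatedPartition:
    """Toy model of one partition with a leader and in-sync replicas (ISR)."""

    def __init__(self, brokers):
        self.leader = brokers[0]
        self.isr = list(brokers)
        self.logs = {broker: [] for broker in brokers}

    def append(self, message):
        """The leader writes; every in-sync follower copies the message."""
        for broker in self.isr:
            self.logs[broker].append(message)

    def fail(self, broker):
        """A failed replica leaves the ISR; a failed leader is replaced."""
        if broker not in self.isr:
            return
        self.isr.remove(broker)
        if broker == self.leader:
            if not self.isr:
                raise RuntimeError("no in-sync replica left to elect")
            self.leader = self.isr[0]

    def recover(self, broker):
        """A returning follower fetches from the leader, then rejoins the ISR."""
        self.logs[broker] = list(self.logs[self.leader])
        if broker not in self.isr:
            self.isr.append(broker)
```

Electing only from the ISR is what keeps acknowledged messages safe: every ISR member holds all messages the leader has replicated so far.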

17 Agenda (repeat of slide 2)

18 Producer API
…or for log aggregation:
Configuration parameters:
ProducerType (sync/async)
CompressionCodec (none/snappy/gzip)
BatchSize
EnqueueSize/Time
Encoder/Serializer
Partitioner
#Retries
MaxMessageSize
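The effect of the BatchSize parameter in async mode can be sketched with a small buffering wrapper. `BatchingProducer` and the `send_batch` callback are hypothetical names, not part of the Kafka producer API, and the time-based flush implied by EnqueueSize/Time is left out to keep the sketch short.

```python
class BatchingProducer:
    """Buffer messages and hand them to send_batch once batch_size is reached."""

    def __init__(self, send_batch, batch_size=3):
        self.send_batch = send_batch  # callable that ships one list of messages
        self.batch_size = batch_size
        self.buffer = []

    def send(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, e.g. on shutdown."""
        if self.buffer:
            self.send_batch(list(self.buffer))
            self.buffer.clear()
```

Larger batches mean fewer network requests and better compression, at the cost of messages waiting longer in the buffer, and of losing buffered messages if the process dies before a flush.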

19 Consumer API(s) High-level (consumer group, auto-commit)
Low-level (simple consumer, manual commit)

20 Agenda (repeat of slide 2)

21 Broker Protips
Reasonable number of partitions: will affect performance
Reasonable number of topics: will affect performance
Performance decreases with larger Zookeeper ensembles
Disk flush rate settings
message.max.bytes: max accept size, should be smaller than the heap
socket.request.max.bytes: max fetch size, should be smaller than the heap
log.retention.bytes: don't want to run out of disk space…
Keep Zookeeper logs under control for the same reason as above
Kafka brokers have been tested on Linux and Solaris

22 Operating Kafka
Zookeeper usage: Producer loadbalancing, Broker ISR, Consumer tracking
Monitoring: JMX, Audit trail/console in the making
Distribution tools:
Controlled shutdown tool
Preferred replica leader election tool
List topic tool
Create topic tool
Add partition tool
Reassign partitions tool
MirrorMaker

23 Multi-datacenter replication

24 Agenda (repeat of slide 2)

25 Ecosystem
Producers:
Java (in standard dist), Scala (in standard dist), Log4j (in standard dist)
Logback: logback-kafka
Udp-kafka-bridge
Python: kafka-python, pykafka, samsa, pykafkap, brod
Go: Sarama, kafka.go
C: librdkafka
C/C++: libkafka
Clojure: clj-kafka, kafka-clj
Ruby: Poseidon, kafka-rb, em-kafka
PHP: kafka-php(1), kafka-php(2), log4php
Node.js: Prozess, node-kafka, franz-kafka
Erlang: erlkafka
Consumers:
Java (in standard dist), Scala (in standard dist)
Python: kafka-python, samsa, brod
Go: Sarama, nuance, kafka.go
C/C++: libkafka
Clojure: clj-kafka, kafka-clj
Ruby: Poseidon, kafka-rb, Kafkaesque
Jruby::Kafka
PHP: kafka-php(1), kafka-php(2)
Node.js: Prozess, node-kafka, franz-kafka
Erlang: erlkafka, kafka-erlang
Common integration points:
Stream Processing
  Storm - A stream-processing framework.
  Samza - A YARN-based stream processing framework.
Hadoop Integration
  Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
  Kafka Hadoop Loader - A different take on Hadoop loading functionality from what is included in the main distribution.
AWS Integration
  Automated AWS deployment
  Kafka->S3 Mirroring
Logging
  klogd - A python syslog publisher
  klogd2 - A java syslog publisher
  Tail2Kafka - A simple log tailing utility
  Fluentd plugin - Integration with Fluentd
  Flume Kafka Plugin - Integration with Flume
  Remote log viewer
  LogStash integration - Integration with LogStash and Fluentd
  Official logstash integration
Metrics
  Mozilla Metrics Service - A Kafka and Protocol Buffers based metrics and logging system
  Ganglia Integration
Packaging and Deployment
  RPM packaging
  Debian packaging: https://github.com/tomdz/kafka-deb-packaging
  Puppet integration
  Dropwizard packaging
Misc.
  Kafka Mirror - An alternative to the built-in mirroring tool
  Ruby Demo App
  Apache Camel Integration
  Infobright integration

26 What's in the future?
Topic and transient consumer garbage collection (KAFKA-560/KAFKA-559)
Producer side persistence (KAFKA-156/KAFKA-789)
Exact mirroring (KAFKA-658)
Quotas (KAFKA-656)
YARN integration (KAFKA-949)
RESTful proxy (KAFKA-639)
New build system? (KAFKA-855)
More tooling (Console, Audit trail) (KAFKA-266/KAFKA-260)
Client API rewrite (Proposal)
Application level security (Proposal)

27 Agenda (repeat of slide 2)

28 Kafka as a processing pipeline backbone
Stream processing
[Diagram: Producer, Process1 and Process2 instances connected through Kafka topic1 and Kafka topic2, spanning System1 and System2.]

29 What is Storm?
Distributed real-time computation system with design goals:
  Guaranteed processing
  No orphaned tasks
  Horizontally scalable
  Fault tolerant
  Fast
Use cases: stream processing, DRPC, continuous computation
4 basic concepts: streams, spouts, bolts, topologies
In Apache incubator
Implemented in Clojure

30 Streams, Spouts
Streams: an [infinite] sequence of tuples, e.g. (timestamp, sessionid, exception stacktrace): (t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1)
Spouts: a source of streams. Connects to queues, logs, API calls, event data.
Some features, like transactional topologies (which give exactly-once messaging semantics), are only possible using the Kafka-TransactionalSpout-consumer.

31 Bolts
[Diagram: tuples such as (t1,s1,h1), (t2,s1,h2), (t3,s3), (t4,s2,e2), (t5,s4) passing through a bolt.]
Filters
Transformations
Apply functions
Aggregations
Access DB, APIs etc.
Emitting new streams
Trident = a high level abstraction on top of Storm

32 Topologies
[Diagram: tuples (t1,s1,h1), (t2,s1,h2), (t3,s3), (t4,s2,e2), (t5,s4), (t6,s6) flowing through a topology.]
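How spouts, bolts and topologies fit together can be shown with a wordcount written as chained Python generators. This is plain Python, not the Storm API: the function names are hypothetical, everything runs in one process, and none of Storm's distribution or guaranteed-processing features are modeled.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: a source of tuples, here one-field tuples holding a sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: transform each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def count_bolt(stream):
    """Bolt: aggregate, emitting (word, running count) for every input tuple."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences):
    """Topology: wire spout -> split -> count and keep the latest counts."""
    latest = {}
    for word, count in count_bolt(split_bolt(sentence_spout(sentences))):
        latest[word] = count
    return latest
```

In a Kafka-backed pipeline the spout would read sentences from a topic instead of a Python list, which is the role the Kafka spout plays in the demo listed in the agenda.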

33 Storm cluster
[Diagram: a topology is deployed to Nimbus; Nimbus, Zookeeper and five Supervisors form the cluster, with Mesos/YARN shown alongside.]
Compare with Hadoop: Nimbus (JobTracker), Supervisors (TaskTrackers)

34 Links
Apache Kafka:
  Papers and presentations
  Main project page
  Small Mediawiki case study
Storm:
  Introductory article
  Realtime discussing blog post
Kafka+Storm for realtime BigData:
  Trifecta blog post: Kafka+Storm+Cassandra
  IBM developer article
  BigData Quadfecta blog post

