Apache Kafka A high-throughput distributed messaging system Johan Lundahl.


1 Apache Kafka A high-throughput distributed messaging system Johan Lundahl

2 Agenda
Kafka overview
– Main concepts and comparisons to other messaging systems
Features, strengths and tradeoffs
Message format and broker concepts
– Partitioning, Keyed messages, Replication
Producer / Consumer APIs
Operation considerations
Kafka ecosystem
If time permits:
Kafka as a real-time processing backbone
Brief intro to Storm
Kafka-Storm wordcount demo

3 What is Apache Kafka?
Distributed, high-throughput, pub-sub messaging system
– Fast, Scalable, Durable
Main use cases:
– log aggregation, real-time processing, monitoring, queueing
Originally developed by LinkedIn
Implemented in Scala/Java
Top level Apache project since 2012

4 Comparison to other messaging systems
– Traditional: JMS, xxxMQ/AMQP
– New gen: Kestrel, Scribe, Flume, Kafka
[Diagram positioning Kafka relative to the following groups]
Message queues (low throughput, low latency): JMS, ActiveMQ, Qpid, RabbitMQ
Log aggregators (high throughput, high latency): Kestrel, Scribe, Flume, Hedwig
Batch jobs

5 Kafka concepts
[Diagram: Producers (Frontend, Service, Frontend) push messages to Topic1, Topic2 and Topic3 on the Kafka Broker; Consumers (Monitoring, Stream processing, Batch processing, Data warehouse) pull from the topics they read.]

6 Distributed model
[Diagram: Producer, Broker, Topic1 consumer group, Topic2 consumer group, Zookeeper]
Partitioned Data Publication
Ordered subscription
Intra cluster replication
Producer persistence (KAFKA-156)

8 Performance factors
Broker doesn’t track consumer state
Everything is distributed
Zero-copy (sendfile) reads/writes
Usage of page cache backed by sequential disk allocation
Like a distributed commit log
Low overhead protocol
Message batching (Producer & Consumer)
Compression (End to end)
Configurable ack levels

9 Kafka features and strengths
Simple model, focused on high throughput and durability
O(1) time persistence on disk
Horizontally scalable by design (broker and consumers)
Push - pull => consumer burst tolerance
Replay messages
Multiple independent subscribers per topic
Configurable batching, compression, serialization
Online upgrades

10 Tradeoffs
Not optimized for millisecond latencies
Have not beaten CAP
Simple messaging system, no processing
Zookeeper becomes a bottleneck when using too many topics/partitions (>>10000)
Not designed for very large payloads (full HD movie etc.)
Helps to know your data in advance

12 Message/Log Format
[Diagram: each Message in the log consists of the fields Length, Version, Checksum, Payload]
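The four fields on this slide can be illustrated with a small framing sketch. The field widths (4-byte length, 1-byte version, 4-byte CRC32 checksum) and the function names are assumptions made for illustration; they are not taken from the slide and should not be read as Kafka's actual wire format.

```python
import struct
import zlib
from typing import Tuple

# Assumed layout: length (4 bytes), version (1 byte), checksum (4 bytes), big-endian.
HEADER = struct.Struct(">IBI")


def encode_message(payload: bytes, version: int = 0) -> bytes:
    """Frame a payload as Length | Version | Checksum | Payload."""
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    return HEADER.pack(len(payload), version, checksum) + payload


def decode_message(frame: bytes) -> Tuple[int, bytes]:
    """Return (version, payload); raise ValueError if the frame is damaged."""
    length, version, checksum = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    if zlib.crc32(payload) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch")
    return version, payload
```

The length prefix lets a reader skip from one message to the next in a log file, and the checksum lets it detect a corrupted payload before handing it to a consumer.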

13 Log based queue (Simplified model)
[Diagram: Producer1 and Producer2 append to the logs on the Broker. Topic1 holds Message1 to Message10 and Topic2 holds Message1 to Message7. Consumer1, Consumer2 and Consumer3 read from the logs; the diagram also labels ConsumerGroup1.]
Batching, Compression, Serialization
Producer API used directly by application or through one of the contributed implementations, e.g. log4j/logback appender
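A minimal sketch of the simplified model on this slide, assuming an in-memory list stands in for the on-disk log. The class and method names (`TopicLog`, `Consumer`, `poll`, `seek`) are hypothetical and do not correspond to Kafka's client API.

```python
class TopicLog:
    """Append-only log: messages are never removed when they are read."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=100):
        """Return up to max_messages starting at offset; the log is unchanged."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer tracks its own position, so the log keeps no consumer state."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_messages=100):
        """Pull the next batch and advance this consumer's own offset."""
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        """Move to a chosen offset, for example to replay earlier messages."""
        self.offset = offset
```

Because reading only moves the consumer's offset, several consumers can read the same topic independently, and any of them can replay by seeking back.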

14 Partitioning
[Diagram: a Producer writes to the Partitions of Topic1 and Topic2 on the Broker. Consumer Group1, Consumer Group2 and Group3 read from the partitions. One consumer is labelled "No partition for this guy".]
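The "No partition for this guy" label reflects that, within one consumer group, a partition is read by a single consumer, so a group with more consumers than partitions leaves some consumers idle. The sketch below uses a plain round-robin assignment as an illustrative assumption; it is not Kafka's rebalancing algorithm, and `assign_partitions` is a hypothetical name.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment
```

With three partitions and four consumers, the fourth consumer ends up with an empty list, which is the idle consumer shown in the diagram.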

15 Keyed messages
Partition selection: hash(key) % #partitions, with #partitions=3
[Diagram: a Producer sends keyed messages for Topic1, which are spread over three brokers]
Topic1, BrokerId=1: Message1, Message5, Message9, Message13, Message17
Topic1, BrokerId=2: Message2, Message4, Message6, Message8, Message10, Message12, Message14, Message16, Message18
Topic1, BrokerId=3: Message3, Message7, Message11, Message15
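The rule hash(key) % #partitions can be sketched in a few lines. CRC32 is used here only as an assumption, because it gives the same value on every run; the slide does not say which hash function Kafka's partitioner uses, and `partition_for` is a hypothetical helper.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Pick a partition as hash(key) % #partitions.

    CRC32 is an illustrative choice: Python's built-in hash() is salted per
    process for bytes and str, so it would not be stable across restarts.
    """
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return zlib.crc32(key) % num_partitions
```

All messages with the same key land in the same partition, which keeps them in order relative to each other. Since the result depends on #partitions, changing the partition count changes where existing keys map.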

16 Intra cluster replication
[Diagram: Topic1 with Replication factor = 3. The leader on Broker1 and the follower on Broker2 hold Message1 to Message10; the follower on Broker3 holds Message1 to Message9, with Message10 still in flight. The diagram labels the InSyncReplicas (ISR) and the ack returned to the Producer.]
3 commit modes:
Commit mode        Latency        Durability
Fire & Forget      “none”         Weak
Leader ack         1 roundtrip    Medium
Full replication   2 roundtrips   Strong
Follower fails:
– Follower dropped from ISR
– When follower comes online again: fetch data from leader, then ISR gets updated
Leader fails:
– Detected via Zookeeper
– New leader gets elected from ISR
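The failure handling described on this slide can be walked through with a toy model. Everything here is an assumption for illustration: replication is synchronous, the new leader is chosen by sorting broker names, and `ReplicatedPartition` is a hypothetical class rather than anything in Kafka.

```python
class ReplicatedPartition:
    """Toy model of one partition with a leader, followers and an ISR set."""

    def __init__(self, brokers):
        self.logs = {broker: [] for broker in brokers}
        self.leader = brokers[0]
        self.isr = set(brokers)

    def append(self, message):
        """Write to the leader and replicate to every in-sync follower."""
        for broker in self.isr:
            self.logs[broker].append(message)

    def follower_fails(self, broker):
        """A failed follower is dropped from the ISR and misses later writes."""
        self.isr.discard(broker)

    def follower_recovers(self, broker):
        """Fetch the missing data from the leader first, then rejoin the ISR."""
        self.logs[broker] = list(self.logs[self.leader])
        self.isr.add(broker)

    def leader_fails(self):
        """Drop the leader and elect a new one from the remaining ISR."""
        self.isr.discard(self.leader)
        if not self.isr:
            raise RuntimeError("no in-sync replica available")
        self.leader = min(self.isr)  # deterministic pick, only for this sketch
        return self.leader
```

Electing only from the ISR means the new leader already holds every message that was fully replicated, which is why the "Full replication" commit mode gives the strongest durability in the table.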

18 Producer API
…or for log aggregation:
Configuration parameters:
– ProducerType (sync/async)
– CompressionCodec (none/snappy/gzip)
– BatchSize
– EnqueueSize/Time
– Encoder/Serializer
– Partitioner
– #Retries
– MaxMessageSize
– …
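How BatchSize, Encoder/Serializer and CompressionCodec interact in an async producer can be sketched as below. This is a simplified stand-in, not the Kafka producer API: `BatchingProducer` and its `send` callback are hypothetical, JSON and gzip are arbitrary choices for the encoder and codec, and the time-based flush (EnqueueSize/Time) is left out.

```python
import gzip
import json


class BatchingProducer:
    """Buffer messages and hand them to `send` one encoded batch at a time."""

    def __init__(self, send, batch_size=3, compress=True):
        self.send = send              # callable receiving one encoded batch (bytes)
        self.batch_size = batch_size  # corresponds to BatchSize on the slide
        self.compress = compress
        self.buffer = []

    def publish(self, message):
        """Queue a message; flush automatically once the batch is full."""
        self.buffer.append(message)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Serialize, optionally compress, and send whatever is buffered."""
        if not self.buffer:
            return
        encoded = json.dumps(self.buffer).encode("utf-8")  # Encoder/Serializer step
        if self.compress:
            encoded = gzip.compress(encoded)               # CompressionCodec step
        self.send(encoded)
        self.buffer = []
```

Compressing the whole batch rather than each message is what makes batching and compression work well together: similar messages share redundancy that a per-message codec cannot exploit.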

19 Consumer API(s)
High-level (consumer group, auto-commit)
Low-level (simple consumer, manual commit)
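The practical difference between auto-commit and manual commit is what a consumer sees after a restart. The sketch below is a toy model under stated assumptions: a Python list stands in for one partition, auto-commit is modelled as committing on every poll (a simplification), and `OffsetTrackingConsumer` is a hypothetical name, not either of the Kafka consumer APIs.

```python
class OffsetTrackingConsumer:
    """Toy consumer that separates its read position from its committed offset."""

    def __init__(self, messages, auto_commit=True):
        self.messages = messages    # stand-in for one partition's log
        self.auto_commit = auto_commit
        self.position = 0           # next offset to fetch
        self.committed = 0          # offset a restarted consumer resumes from

    def poll(self):
        """Return the next message, or None when the log is exhausted."""
        if self.position >= len(self.messages):
            return None
        message = self.messages[self.position]
        self.position += 1
        if self.auto_commit:
            self.committed = self.position
        return message

    def commit(self):
        """Manual commit: record that everything polled so far is processed."""
        self.committed = self.position

    def restart(self):
        """Simulate a crash and restart: resume from the last committed offset."""
        self.position = self.committed
```

With manual commit, a message that was polled but not committed is delivered again after a restart, so processing can be made at-least-once. With auto-commit in this model, a message that was polled but not yet processed is skipped after a restart.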

21 Broker Protips
Reasonable number of partitions – will affect performance
Reasonable number of topics – will affect performance
Performance decreases with larger Zookeeper ensembles
Disk flush rate settings
message.max.bytes – max accept size, should be smaller than the heap
socket.request.max.bytes – max fetch size, should be smaller than the heap
log.retention.bytes – don’t want to run out of disk space…
Keep Zookeeper logs under control for same reason as above
Kafka brokers have been tested on Linux and Solaris
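The three size settings named on this slide lend themselves to a simple sanity check. The helper below is a hypothetical sketch: the setting names come from the slide, but the function, the single disk budget it compares against, and the example numbers are assumptions chosen for illustration.

```python
def check_broker_settings(settings, heap_bytes, disk_bytes):
    """Return warnings for size settings that conflict with the slide's advice."""
    warnings = []
    for name in ("message.max.bytes", "socket.request.max.bytes"):
        if settings[name] >= heap_bytes:
            warnings.append(name + " should be smaller than the heap")
    if settings["log.retention.bytes"] >= disk_bytes:
        warnings.append("log.retention.bytes exceeds available disk space")
    return warnings


# Illustrative values only; they are not recommendations from the slide.
example = {
    "message.max.bytes": 1000000,
    "socket.request.max.bytes": 100 * 1024 * 1024,
    "log.retention.bytes": 50 * 1024 ** 3,
}
problems = check_broker_settings(example, heap_bytes=1024 ** 3, disk_bytes=500 * 1024 ** 3)
```

A check like this could run before a broker is restarted with new settings, so that an oversized limit is caught before it causes memory or disk pressure.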

22 Operating Kafka
Zookeeper usage
– Producer loadbalancing
– Broker ISR
– Consumer tracking
Monitoring
– JMX
– Audit trail/console in the making
Distribution Tools:
– Controlled shutdown tool
– Preferred replica leader election tool
– List topic tool
– Create topic tool
– Add partition tool
– Reassign partitions tool
– MirrorMaker

23 Multi-datacenter replication

25 Ecosystem
Producers:
– Java (in standard dist)
– Scala (in standard dist)
– Log4j (in standard dist)
– Logback: logback-kafka
– Udp-kafka-bridge
– Python: kafka-python, pykafka, samsa, pykafkap, brod
– Go: Sarama, kafka.go
– C: librdkafka
– C/C++: libkafka
– Clojure: clj-kafka, kafka-clj
– Ruby: Poseidon, kafka-rb, em-kafka
– PHP: kafka-php(1), kafka-php(2), log4php
– Node.js: Prozess, node-kafka, franz-kafka
– Erlang: erlkafka
Consumers:
– Java (in standard dist)
– Scala (in standard dist)
– Python: kafka-python, samsa, brod
– Go: Sarama, nuance, kafka.go
– C/C++: libkafka
– Clojure: clj-kafka, kafka-clj
– Ruby: Poseidon, kafka-rb, Kafkaesque, Jruby::Kafka
– PHP: kafka-php(1), kafka-php(2)
– Node.js: Prozess, node-kafka, franz-kafka
– Erlang: erlkafka, kafka-erlang
Common integration points:
Stream Processing
– Storm - A stream-processing framework.
– Samza - A YARN-based stream processing framework.
Hadoop Integration
– Camus - LinkedIn's Kafka=>HDFS pipeline. This one is used for all data at LinkedIn, and works great.
– Kafka Hadoop Loader - A different take on Hadoop loading functionality from what is included in the main distribution.
AWS Integration
– Automated AWS deployment
– Kafka->S3 Mirroring
Logging
– klogd - A python syslog publisher
– klogd2 - A java syslog publisher
– Tail2Kafka - A simple log tailing utility
– Fluentd plugin - Integration with Fluentd
– Flume Kafka Plugin - Integration with Flume
– Remote log viewer
– LogStash integration - Integration with LogStash and Fluentd
– Official logstash integration
Metrics
– Mozilla Metrics Service - A Kafka and Protocol Buffers based metrics and logging system
– Ganglia Integration
Packing and Deployment
– RPM packaging
– Debian packaging (https://github.com/tomdz/kafka-deb-packaging)
– Puppet integration
– Dropwizard packaging
Misc.
– Kafka Mirror - An alternative to the built-in mirroring tool
– Ruby Demo App
– Apache Camel Integration
– Infobright integration

26 What’s in the future?
Topic and transient consumer garbage collection (KAFKA-560/KAFKA-559)
Producer side persistence (KAFKA-156/KAFKA-789)
Exact mirroring (KAFKA-658)
Quotas (KAFKA-656)
YARN integration (KAFKA-949)
RESTful proxy (KAFKA-639)
New build system? (KAFKA-855)
More tooling (Console, Audit trail) (KAFKA-266/KAFKA-260)
Client API rewrite (Proposal)
Application level security (Proposal)

28 Stream processing
Kafka as a processing pipeline backbone
[Diagram labels: Producer, Kafka topic1, Process1, Kafka topic2, Process2, System1, System2]

29 What is Storm?
Distributed real-time computation system with design goals:
– Guaranteed processing
– No orphaned tasks
– Horizontally scalable
– Fault tolerant
– Fast
Use cases: Stream processing, DRPC, Continuous computation
4 basic concepts: streams, spouts, bolts, topologies
In Apache incubator
Implemented in Clojure

30 Streams
an [infinite] sequence (of tuples)
Example tuples of the form (timestamp,sessionid,exception stacktrace): (t4,s2,e2) (t3,s3) (t2,s1,e2) (t1,s1,e1)
Spouts
a source of streams
Connects to queues, logs, API calls, event data.
Some features, like transactional topologies (which give exactly-once messaging semantics), are only possible using the Kafka-TransactionalSpout-consumer

31 Bolts
Example tuples: (t2,s1,h2) (t1,s1,h1) (t3,s3) (t4,s2,e2) (t5,s4)
– Filters
– Transformations
– Apply functions
– Aggregations
– Access DB, APIs etc.
– Emitting new streams
Trident = a high level abstraction on top of Storm

32 Topologies
[Diagram: tuples flowing through a network of spouts and bolts]
Example tuples: (t2,s1,h2) (t1,s1,h1) (t3,s3) (t4,s2,e2) (t5,s4) (t6,s6) (t7,s7) (t8,s8)
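The way streams, spouts, bolts and topologies fit together can be sketched with plain Python generators, using the word count from the agenda as the example. This is an analogy only and does not use Storm's API: the function names are hypothetical, everything runs in one process, and a list of sentences stands in for the Kafka topic that a real spout would read from.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: a source of tuples (here, one-field tuples holding a sentence)."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: transform each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt: aggregate word tuples into counts."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
    return counts


def run_topology(sentences):
    """Topology: wire spout -> split bolt -> count bolt."""
    return count_bolt(split_bolt(sentence_spout(sentences)))
```

In a real cluster each stage would run as many parallel tasks and the stream would be unbounded, so the counting stage would emit running totals instead of returning once at the end.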

33 Storm cluster
[Diagram: a Topology is deployed to Nimbus, which coordinates the Supervisors through Zookeeper; Mesos/YARN also appears as a label]
Compare with Hadoop: Nimbus (JobTracker), Supervisor (TaskTrackers)

34 Links
Apache Kafka:
– Papers and presentations
– Main project page
– Small Mediawiki case study
Storm:
– Introductory article
– Realtime discussing blog post
– Kafka+Storm for realtime BigData
– Trifecta blog post: Kafka+Storm+Cassandra
– IBM developer article
– BigData Quadfecta blog post

