Published by Grace Lee. Modified over 8 years ago.
1
Very Large Scale Stream Processing inside Alibaba
longda@alibaba, Alibaba
2
1. Current
3
Team Introduction
- Apache Storm PMC
- The first Storm team in China
- Storm versions: 0.5.1, 0.5.4, 0.6.0, 0.6.2, 0.7.0, 0.7.1
- JStorm versions: 0.7.1, 0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.9.3.1, 0.9.4, 0.9.4.1, 0.9.5, 0.9.5.1, 0.9.6, 0.9.6.1, 0.9.6.2, 0.9.6.3, 0.9.7, 0.9.7.1, 0.9.7.2, 0.9.8, 2.0.4, 2.1.0
- Our job is to do everything:
  - Application development
  - JStorm platform evolution
  - JStorm/Storm technology support
  - Maintaining all clusters
4
In Alibaba: Everywhere
- 1,600 machines, with deployment planned on 70 K machines
- More than 1,000 applications, 1,500 topologies
- 1.5 PB
- 2 trillion messages
5
Large Scale Application
- Tlog/eagleeye: 1000 billion messages, 700 TB of logs; monitors the logs of 200K machines
- RDS Monitor: 200 TB of logs
- CTU Security: 200 billion messages; monitors all trade/user actions; 500w
- DB Monitor: 200 billion messages; 500w
- BI Realtime Monitor: 200 billion messages; more than 2000 KPIs
- Alimama Anti Cheat: 100 billion messages
- Living Room: 11.11 Living Room, 12.12 Living Room, Spring Festival Living Room
- Others: all kinds of monitoring systems
6
Advanced Features
- User-side functionality
- Stability enhancement
- Performance improvement
7
Stable
- Customer feedback: not a single accident since the Alimama cluster switched to JStorm
8
Advanced Features: Improve Stability
- Redesign the metric system
- Backpressure
- Resource isolation
- Nimbus HA
- TopologyMaster
- Redesign ZK usage
- Modify OS settings in the RPM
9
Redesign Metric System
- Key point: record the response time (RT) of every stage a tuple passes through, including the wait time between stages and the network cost
- Avoid noise
- Pluggable: provides an API to fetch all metrics
- Koala: simple, directly displays all metrics
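
The slide does not show how stage timings are captured, so the following is only a minimal sketch of the idea of splitting a tuple's latency into per-stage processing time and the wait (queueing or network) before each stage. `TupleTrace` and its methods are hypothetical names, not JStorm APIs, and timestamps are passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class TupleTrace:
    """Timestamps (ms) recorded as one tuple moves through named stages."""
    events: list = field(default_factory=list)  # (stage, enter_ms, leave_ms)

    def record(self, stage, enter_ms, leave_ms):
        self.events.append((stage, enter_ms, leave_ms))

    def breakdown(self):
        """Return processing time per stage plus the wait before each stage."""
        rows = []
        previous_leave = None
        for stage, enter_ms, leave_ms in self.events:
            # Wait = gap between leaving the previous stage and entering this one.
            wait = 0 if previous_leave is None else enter_ms - previous_leave
            rows.append({"stage": stage,
                         "wait_ms": wait,
                         "process_ms": leave_ms - enter_ms})
            previous_leave = leave_ms
        return rows


trace = TupleTrace()
trace.record("spout", 0, 2)
trace.record("bolt-parse", 5, 9)    # 3 ms spent waiting before this stage
trace.record("bolt-store", 10, 16)  # 1 ms spent waiting before this stage
for row in trace.breakdown():
    print(row)
```

Keeping wait time separate from processing time is what lets a slow stage be told apart from a congested queue or link.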
10
New UI
11
Backpressure
- The description in the Heron paper is too simple to use as is
- The design is complicated
- Works well on our online system, at 6 times the normal level
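
The slides do not describe the backpressure design itself. As a generic illustration only (not JStorm's algorithm), a common building block is a queue with a high and a low watermark: the throttle flag is raised when the queue fills to the high mark and is released only after it drains to the low mark, which avoids rapid on/off flapping. `WatermarkQueue` is a hypothetical name.

```python
from collections import deque


class WatermarkQueue:
    """Buffer that raises a throttle flag between two watermarks."""

    def __init__(self, high, low):
        if not 0 <= low < high:
            raise ValueError("need 0 <= low < high")
        self.high = high
        self.low = low
        self.items = deque()
        self.throttled = False  # upstream should slow down while this is True

    def put(self, item):
        self.items.append(item)
        if len(self.items) >= self.high:
            self.throttled = True

    def get(self):
        item = self.items.popleft()
        # Release the throttle only once the queue has drained to the low mark.
        if self.throttled and len(self.items) <= self.low:
            self.throttled = False
        return item


q = WatermarkQueue(high=4, low=1)
for i in range(4):
    q.put(i)
print(q.throttled)  # throttle is on once the high watermark is reached
```

In a topology, a coordinator would propagate the flag to the upstream spouts; that part is left out here.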
12
Resource Isolation
- Cluster isolation, controlled through one unified portal: Koala
- Within one cluster: cgroups, with share + limit on CPU
- User-defined scheduler: force a topology to run on specific nodes
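
One plausible reading of "share + limit" is the cgroup v1 CPU controller, where `cpu.shares` gives a relative weight under contention and `cpu.cfs_quota_us` over `cpu.cfs_period_us` gives a hard cap. The sketch below only computes the values that would be written to those files; the helper name and the choice of 1024 shares per requested core are assumptions, and the slides do not say which values JStorm uses.

```python
CFS_PERIOD_US = 100_000  # default CFS period in cgroup v1 (100 ms)
SHARES_PER_CORE = 1024   # assumption: one requested core = the default share weight


def cpu_cgroup_settings(requested_cores, limit_cores):
    """Return cgroup v1 CPU file values for a worker (hypothetical helper)."""
    if requested_cores <= 0 or limit_cores < requested_cores:
        raise ValueError("need 0 < requested_cores <= limit_cores")
    return {
        # Relative weight: matters only when CPUs are contended.
        "cpu.shares": int(requested_cores * SHARES_PER_CORE),
        "cpu.cfs_period_us": CFS_PERIOD_US,
        # Hard cap: CPU time allowed per period, summed over all cores.
        "cpu.cfs_quota_us": int(limit_cores * CFS_PERIOD_US),
    }


print(cpu_cgroup_settings(requested_cores=2, limit_cores=4))
```

The share keeps neighbours fair when the machine is busy, while the quota stops one worker from consuming idle cores without bound.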
13
Nimbus HA
- Nimbus HA has run stably for more than 20 months
14
TopologyMaster
- The topology's central controller; takes over some jobs from Nimbus
- Backpressure coordinator
- Metrics collector/calculator
- Heartbeat collector
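
To make the heartbeat-collector role concrete, here is a minimal sketch, assuming tasks report to a per-topology collector that keeps the last-seen time in memory and hands Nimbus only a compact summary. The class and method names are hypothetical and the real TopologyMaster protocol is not shown in the slides.

```python
class HeartbeatCollector:
    """Keeps the last heartbeat time per task and reports a compact summary."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # task_id -> time of last heartbeat (seconds)

    def beat(self, task_id, now_s):
        self.last_seen[task_id] = now_s

    def dead_tasks(self, now_s):
        """Tasks whose last heartbeat is older than the timeout."""
        return sorted(task for task, seen in self.last_seen.items()
                      if now_s - seen > self.timeout_s)

    def summary(self, now_s):
        """What would be passed upstream: one small record per topology."""
        return {"tasks": len(self.last_seen), "dead": self.dead_tasks(now_s)}


collector = HeartbeatCollector(timeout_s=30)
collector.beat(1, now_s=0)
collector.beat(2, now_s=0)
collector.beat(1, now_s=25)
print(collector.summary(now_s=40))
```

Aggregating per topology means the number of records reaching the central service grows with topologies rather than with tasks.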
15
Redesign ZK Usage
- No dynamic data is stored on ZK, especially metrics and heartbeats
- With Storm, ZK can't support more than 400 nodes
- With JStorm, ZK can support 2000 nodes; currently in Alibaba, many JStorm ZK clusters support 800 nodes
16
RPM Setting
- Easy JStorm installation
- Modifies OS settings:
  - Local temporary port range
  - Ulimit
  - Cron job
  - Environment variables
17
Advanced Features: From the User Side
User-side functionality:
- User-defined scheduler
- User-defined log
- User-defined metrics
- Graceful shutdown
- Dynamic expand/reload/restart
- Customized memory usage
- Different Netty policies
- Classloader
18
User-Defined Scheduler
Just by using the API:
- Customize every worker's CPU/memory usage
- Customized topology assignment
- Assign topology by user
- Bind several components (such as a spout and a bolt) into one worker
- Bind upstream/downstream components into one worker
- Force one component to run on specific machines
- Force one component's tasks to run on different machines
- Force a topology to run on specific machines
- Force use of the old assignment
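
Two of the constraints listed above, placing one component's tasks on different machines and pinning a component to specific machines, can be sketched as plain assignment functions. This is an illustration of the constraints only; the function names are hypothetical and do not reflect the JStorm scheduler API.

```python
def spread_tasks(task_ids, hosts):
    """Place each task of one component on a distinct host."""
    if len(task_ids) > len(hosts):
        raise ValueError("not enough hosts to keep tasks on different machines")
    return dict(zip(task_ids, hosts))


def pin_component(task_ids, allowed_hosts):
    """Restrict a component's tasks to the allowed hosts, round-robin."""
    if not allowed_hosts:
        raise ValueError("at least one allowed host is required")
    return {task: allowed_hosts[i % len(allowed_hosts)]
            for i, task in enumerate(task_ids)}


print(spread_tasks([1, 2, 3], ["host-a", "host-b", "host-c", "host-d"]))
print(pin_component([4, 5, 6], ["host-x", "host-y"]))
```

Failing loudly when a constraint cannot be met is a design choice of this sketch; a real scheduler might instead fall back to a best-effort placement.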
19
User-Defined Log
- Switch to the user's log configuration
- Switch between logback and log4j
- Redirect System.out to any file
- Add tags (clustername/hostname/topologyname/workerid/taskid)
- Change log settings dynamically: enable/disable debug, set the debug log sample rate
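
Tagging and debug sampling can be illustrated with Python's standard `logging` module as a stand-in for logback/log4j. The filter below attaches identifying tags to every record and keeps only every Nth DEBUG record; the class names, tag values, and the counter-based sampling rule are assumptions made for this sketch.

```python
import logging


class TagAndSampleFilter(logging.Filter):
    """Attach tags to each record and keep every Nth DEBUG record."""

    def __init__(self, tags, debug_every=1):
        super().__init__()
        self.tags = tags
        self.debug_every = debug_every
        self._debug_seen = 0

    def filter(self, record):
        record.tags = "/".join(f"{k}={v}" for k, v in self.tags.items())
        if record.levelno == logging.DEBUG:
            self._debug_seen += 1
            return (self._debug_seen - 1) % self.debug_every == 0
        return True  # never sample away INFO and above


class ListHandler(logging.Handler):
    """Collects formatted lines in memory so the result is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


logger = logging.getLogger("worker-log-demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
handler = ListHandler()
handler.setFormatter(logging.Formatter("[%(tags)s] %(levelname)s %(message)s"))
handler.addFilter(TagAndSampleFilter({"topology": "demo", "taskid": 7}, debug_every=3))
logger.addHandler(handler)

for i in range(6):
    logger.debug("tuple %d", i)
logger.info("done")
print("\n".join(handler.lines))
```

Changing `debug_every` on the live filter object is the in-process analogue of adjusting the sample rate without a restart.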
20
User-Defined Metrics
- Using Java metrics
  - User-defined metrics
  - Web UI display
- Using Alimonitor
  - All metrics are sent to Alimonitor
  - User-defined alarms
  - History display
- Koala System (JStorm portal)
  - All metrics are sent to the Koala System
  - History display
  - User-defined alarms
21
Graceful Shutdown
Problems resolved:
- No data loss during shutdown
- All workers must be killed
- ZK is left clean
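
The "no data loss during shutdown" goal usually comes down to an ordering: stop accepting new input first, drain what is already queued, then stop the worker. The sketch below shows that ordering with a single worker thread; `DrainingWorker` is a hypothetical name and this is not how JStorm itself is implemented.

```python
import queue
import threading


class DrainingWorker:
    """Worker thread that processes everything queued before it stops."""

    _STOP = object()  # sentinel placed behind all accepted items

    def __init__(self, handler):
        self._handler = handler
        self._queue = queue.Queue()
        self._accepting = True
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def submit(self, item):
        """Queue an item; returns False once shutdown has started."""
        with self._lock:
            if not self._accepting:
                return False
            self._queue.put(item)
            return True

    def _run(self):
        while True:
            item = self._queue.get()
            if item is self._STOP:
                return
            self._handler(item)

    def shutdown(self):
        # 1) refuse new input, 2) let queued items drain, 3) wait for the thread.
        with self._lock:
            self._accepting = False
            self._queue.put(self._STOP)
        self._thread.join()


processed = []
worker = DrainingWorker(processed.append)
for i in range(100):
    worker.submit(i)
worker.shutdown()
print(len(processed))
```

Because the sentinel is enqueued under the same lock that guards `submit`, no accepted item can end up behind it.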
22
Dynamic Expand/Reload/Restart
- Expand
  - Doesn't kill current workers and doesn't impact the current data flow
- Restart
  - Resets all configuration
  - Modifies worker/component parallelism
- Reload
  - Reloads the binary
  - Reloads the configuration
23
Customized Memory Usage
- Customize worker memory: worker.memory.size
- Modify GC options: worker.gc.childopts
- Use the user-defined scheduler API
- Queue mode: capacity limited/unlimited
24
Advanced Netty Features
- Sync/async mode
- Async mode blocking policy
- Async cache policy
25
Classloader
- Resolves class conflicts between the application and JStorm
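
The slide does not say how the conflict is resolved. One common approach in Java platforms is child-first (application-first) delegation, so the application's copy of a library wins over the platform's copy. The toy model below imitates only the lookup order with dictionaries; `Loader` and the class names in it are made up for illustration.

```python
class Loader:
    """Toy class loader: a name -> implementation table with an optional parent."""

    def __init__(self, classes, parent=None, child_first=False):
        self.classes = classes
        self.parent = parent
        self.child_first = child_first

    def load(self, name):
        # Child-first: prefer our own copy before asking the parent.
        if self.child_first and name in self.classes:
            return self.classes[name]
        if self.parent is not None:
            try:
                return self.parent.load(name)
            except KeyError:
                pass
        if name in self.classes:
            return self.classes[name]
        raise KeyError(name)


platform = Loader({"com.lib.Json": "lib-1.0", "platform.Worker": "core"})
app_default = Loader({"com.lib.Json": "lib-2.0"}, parent=platform)
app_isolated = Loader({"com.lib.Json": "lib-2.0"}, parent=platform, child_first=True)

print(app_default.load("com.lib.Json"))   # parent-first picks the platform copy
print(app_isolated.load("com.lib.Json"))  # child-first picks the application copy
```

Classes the application does not bundle still resolve through the parent, so platform classes remain shared.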
26
Test setup
- 6 servers (24 cores / 98 GB)
- 18 spouts / 18 bolts / 18 ackers
27
Performance Improvement
1. Smart batch policy
2. Add one thread in every task to deserialize tuples
3. Remove the total send/receive stage
4. Separate the send and receive operations in the spout
5. Fix several bugs that led to the CPU spinning idle
6. Reduce the performance impact of the metrics system
7. Tune the acker code
8. Tune GC
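
The slide only names the "smart batch policy". As commonly described, smart batching means the sender takes whatever is already queued, up to a cap, instead of waiting for a fixed batch size or a timer. The sketch below assumes that meaning; `drain_batch` is a hypothetical helper, not JStorm code.

```python
from collections import deque


def drain_batch(pending, max_batch):
    """Take whatever is queued right now, up to max_batch items."""
    batch = []
    while pending and len(batch) < max_batch:
        batch.append(pending.popleft())
    return batch


pending = deque(range(7))
print(drain_batch(pending, max_batch=5))  # a full batch while a backlog exists
print(drain_batch(pending, max_batch=5))  # the remainder, sent without waiting
```

Under light load batches stay small, which keeps latency low; under heavy load they grow toward the cap, which amortizes the per-send cost.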
28
Architecture
(diagram components: ZooKeeper, UI, Nimbus, Supervisor, Worker, Task)
29
Merge into Storm
- Replace the Clojure core
30
Redesign Our SQL Engine
- The current SQL engine is customized, not general-purpose
31
Where Should Storm/JStorm Go?
1. A more powerful SQL engine
2. A more powerful high-level programming framework
   1. Easier to learn and to debug
   2. Provides higher throughput
3. A high-level scheduler
   1. I don't prefer offline systems such as Hadoop/Spark/YARN
   2. I prefer online systems: elastic online scheduler/Docker/virtual machines
   3. More lightweight
32
Thanks! Welcome to join us. QQ/WeChat: 32147704