Very Large Scale Stream Processing inside Alibaba

Team Introduction
- Apache Storm PMC
- The first Storm team in China
- Storm 0.5.1/0.5.4/0.6.0/0.6.2/0.7.0/0.7.1
- JStorm 0.7.1/0.9.0/0.9.1/0.9.2/0.9.3/ /0.9.4/ /0.9.5/ /0.9.6/ / / /0.9.7/ / /0.9.8/2.0.4/2.1.0
- Our job: do everything
  - Application development
  - JStorm platform evolution
  - JStorm/Storm technical support
  - Maintaining all clusters

In Alibaba: Everywhere
- 1,600 machines; deployment planned on 70K machines
- More than 1,000 applications, 1,500 topologies
- 1.5 PB
- 2 trillion messages

Large Scale Applications
- Tlog/EagleEye: 1,000 billion messages, 700 TB of logs, monitors the logs of 200K machines
- RDS Monitor: 200 TB of logs
- CTU Security: 200 billion messages, monitors all trade/user actions, 500w (5 million)
- DB Monitor: 200 billion messages, 500w (5 million)
- BI Realtime Monitor: 200 billion messages, more than 2,000 KPIs
- Alimama Anti-Cheat: 100 billion messages
- Living Room: 11.11 Living Room, Living Room, Spring Festival Living Room
- Others: all kinds of monitoring systems

Advanced Features
- User-side functionality
- Stability enhancement
- Performance improvement

Stable
- Customer feedback: not a single incident since the Alimama cluster switched to JStorm

Advanced Features: Improve Stability
- Redesigned metric system
- Backpressure
- Resource isolation
- Nimbus HA
- TopologyMaster
- Redesigned ZK usage
- Modified OS settings in the RPM

Redesigned Metric System
- Key point: the response time (RT) of every tuple stage, including wait time between stages and network cost
- Avoids noise
- Pluggable
- Provides an API to fetch all metrics
- Koala: simple, directly displays all metrics
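The slide does not show how per-stage RT is computed, so the following is only a minimal sketch of the idea: each tuple carries ordered timestamps, and the gap between consecutive events is attributed to a stage (network transfer, queue wait, execution). The class `StageTimer` and the event names are hypothetical and are not part of JStorm's metrics API.

```python
from collections import defaultdict


class StageTimer:
    """Accumulates per-stage latency, including wait time between stages."""

    def __init__(self):
        # "from->to" stage label -> list of observed durations in milliseconds
        self.samples = defaultdict(list)

    def record(self, timestamps):
        """timestamps: ordered list of (event_name, time_ms) for one tuple."""
        for (prev_name, prev_t), (name, t) in zip(timestamps, timestamps[1:]):
            self.samples[f"{prev_name}->{name}"].append(t - prev_t)

    def mean_rt(self):
        """Mean duration of each stage over all recorded tuples."""
        return {stage: sum(v) / len(v) for stage, v in self.samples.items()}


timer = StageTimer()
# Hypothetical events: spout emit, network receive, dequeue in the bolt, execute done.
timer.record([("spout_emit", 0), ("bolt_recv", 4), ("bolt_dequeue", 9), ("bolt_done", 12)])
timer.record([("spout_emit", 100), ("bolt_recv", 106), ("bolt_dequeue", 107), ("bolt_done", 112)])
report = timer.mean_rt()
```

Here `report["bolt_recv->bolt_dequeue"]` is the mean queue wait time, which a metric that only times the execute call would miss.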

New UI

Backpressure
- The description in the Heron paper is too simple to use directly
- The actual design is complicated
- Works well on our online system, at 6 times the normal level
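The slide gives no detail on the mechanism. As a generic illustration only, backpressure is often built on a receive queue with high and low watermarks: upstream is asked to pause when the queue fills past the high mark and to resume once it drains to the low mark. The sketch below shows that pattern; the class name and thresholds are assumptions, and it is not JStorm's actual implementation.

```python
from collections import deque


class BackpressureQueue:
    """Receive queue that signals upstream to pause or resume (generic sketch)."""

    def __init__(self, high_watermark, low_watermark):
        if low_watermark >= high_watermark:
            raise ValueError("low_watermark must be below high_watermark")
        self.high = high_watermark
        self.low = low_watermark
        self.items = deque()
        self.throttled = False  # True while upstream should pause

    def offer(self, item):
        """Enqueue an item and return the current throttle signal."""
        self.items.append(item)
        if len(self.items) >= self.high:
            self.throttled = True
        return self.throttled

    def poll(self):
        """Dequeue an item (None if empty) and release the throttle when drained."""
        item = self.items.popleft() if self.items else None
        if self.throttled and len(self.items) <= self.low:
            self.throttled = False
        return item


q = BackpressureQueue(high_watermark=4, low_watermark=1)
signals = [q.offer(i) for i in range(4)]  # throttle turns on at the 4th item
drained = []
while q.throttled:  # consumer keeps draining until the low watermark is reached
    drained.append(q.poll())
```

Using two watermarks instead of one threshold avoids flapping between paused and running when the queue length hovers around a single limit.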

Resource Isolation
- Cluster isolation, controlled through one unified portal: Koala
- Within one cluster:
  - Cgroups: share + limit CPU
  - User-defined scheduler: force a topology to run on specific nodes
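"Share + limit" maps naturally onto the two knobs of the Linux cgroup v1 CPU controller: `cpu.shares` is a relative weight under contention, and `cpu.cfs_quota_us` over `cpu.cfs_period_us` is a hard cap. The sketch below computes both for a worker that should get a given number of cores. The function name and the choice of 1024 shares per core are assumptions; the slides do not say how JStorm derives these values.

```python
def cpu_cgroup_settings(cores, period_us=100_000, shares_per_core=1024):
    """Return cgroup v1 CPU controller values for a worker entitled to `cores` CPUs.

    cpu.shares       : relative weight when CPUs are contended (the "share" part)
    cpu.cfs_quota_us : run time allowed per period, a hard cap (the "limit" part)
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return {
        "cpu.shares": int(cores * shares_per_core),
        "cpu.cfs_period_us": period_us,
        "cpu.cfs_quota_us": int(cores * period_us),
    }


two_cores = cpu_cgroup_settings(2)
half_core = cpu_cgroup_settings(0.5)
```

With these values a two-core worker may use at most 200 ms of CPU time per 100 ms period, that is, two full cores, even when the machine is otherwise idle.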

Nimbus HA
- Nimbus HA has been running for more than 20 months
- Stable

TopologyMaster
- Central control for a topology; takes over some jobs from Nimbus
- Backpressure coordinator
- Metrics collector/calculator
- Heartbeat collector

Redesigned ZK Usage
- No dynamic data is stored on ZK, especially metrics and heartbeats
- With Storm, ZK can't support more than 400 nodes
- With JStorm, ZK can support 2,000 nodes; currently in Alibaba, many JStorm ZK clusters support 800 nodes

RPM Settings
- Easy JStorm installation
- The RPM modifies:
  - Local temporary port range
  - ulimit
  - Cron job
  - Environment variables

Advanced Features: From the User Side
User-side functionality:
- User-defined scheduler
- User-defined log
- User-defined metrics
- Graceful shutdown
- Dynamic expand/reload/restart
- Customized memory usage
- Different Netty policies
- Classloader

User-Defined Scheduler
- Using just the API:
  - Customize every worker's CPU/memory usage
  - Customize topology assignment
- Topology assignment defined by the user:
  - Bind several components (such as a spout and a bolt) into one worker
  - Bind upstream/downstream components into one worker
  - Force a component to run on specific machines
  - Force a component's tasks to run on different machines
  - Force a topology to run on specific machines
  - Force use of the old assignment

User-Defined Log
- Switch to the user's log configuration
- Switch between logback and log4j
- Redirect System.out to any file
- Add tags (clustername/hostname/topologyname/workerid/taskid)
- Change log settings dynamically: enable/disable debug, debug log sample rate

User-Defined Metrics
- Using Java metrics:
  - User-defined metrics
  - Web UI display
- Using Alimonitor:
  - All metrics are sent to Alimonitor
  - User-defined alarms
  - History display
- Koala system (JStorm portal):
  - All metrics are sent to the Koala system
  - History display
  - User-defined alarms

Graceful Shutdown
Problems resolved:
- No data loss during shutdown
- All workers are killed
- ZK is left clean

Dynamic Expand/Reload/Restart
- Expand:
  - Doesn't kill current workers and doesn't impact the current data flow
- Restart:
  - Reset all configuration
  - Modify worker/component parallelism
- Reload:
  - Reload the binary
  - Reload the configuration

Customized Memory Usage
- Customize worker memory: worker.memory.size
- Modify GC options: worker.gc.childopts
- Use the user-defined scheduler API
- Queue mode: capacity limited/unlimited

Advanced Netty Features
- Sync/async mode
- Async mode blocking policy
- Async cache policy

Classloader
- Resolves class conflicts between the application and JStorm

6 servers (24 cores/98 GB)
18 spouts/18 bolts/18 ackers

Performance Improvement
1. Smart batch policy
2. Add one thread in every task to deserialize tuples
3. Remove the total send/receive stage
4. Separate send and receive operations in the spout
5. Fix several bugs that caused the CPU to spin idle
6. Reduce the performance impact of the metrics system
7. Tune the acker code
8. Tune GC

Architecture (diagram): ZooKeeper, UI, Nimbus, Supervisor, worker, task

Next: Merge into Storm
- Replace the Clojure core

Next: Redesign Our SQL Engine
- The current SQL engine is customized, not general-purpose

Future: Where Should Storm/JStorm Go?
1. A more powerful SQL engine
2. A more powerful high-level programming framework
   1. Easier to learn and debug
   2. Provides higher throughput
3. A high-level scheduler
   1. I prefer not to rely on offline systems such as Hadoop/Spark/YARN
   2. I prefer online systems: elastic online scheduler/Docker/virtual machines
   3. More lightweight

Thanks! You are welcome to join us: QQ/WeChat