Beijing Institute of Technology December 2015


Presentation on theme: "Beijing Institute of Technology December 2015"— Presentation transcript:

1 Beijing Institute of Technology December 2015
Applications on Spark Prof. Harold Liu Beijing Institute of Technology December 2015

2 Who Is Using Spark These Days?

3 As the figure above shows, more than 1,000 companies have put Spark into production, including traditional manufacturers such as Toyota and O2O companies such as Uber and Airbnb. This indicates that Spark's user base has expanded beyond the Internet industry into traditional industries. Many big data platform vendors, including established Hadoop distributors such as Hortonworks and Cloudera, have begun to deploy Spark, which should further accelerate its adoption.

4 Open Source Spark Community
The figure shows that the number of open source contributors to Spark increased steadily from 2010 to 2014. Many Chinese companies and developers are among these contributors. For example, the largest Spark cluster in the world, with as many as 8,000 nodes, is at Tencent, and the largest amount of data processed by a single job is 1 PB, a record held jointly by Alibaba and Databricks.

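5 Architecture of Spark
Unlike Hadoop, whose architecture consists of MapReduce and HDFS, Spark mainly consists of Spark Core and the application frameworks built on top of it: Spark SQL, Spark Streaming, MLlib and GraphX. These handle interactive queries, stream computing, machine learning and graph computing, respectively. The enterprise applications of Spark described below focus on practical uses in these areas.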

6 Entertainment: Tencent
Company Background: The largest social service provider in China.
Data Background: By the end of 2015, QQ had more than 800 million monthly active users and WeChat had more than 600 million. Together they generate over 200 TB of data every day.
Business Requirement: Over 90% of the data needs to be processed online.

7 Tencent Distributed Data Warehouse
TDW collects data from all products and provides data storage and analysis services. It supports PB-scale data storage and computing. It has two parts: offline computing with MapReduce (M/R) and online computing with Storm.

8 Hadoop vs. Spark on M/R
Running Mode   Compute Resource       Running Time (min)   Cost (Slot*s)
MapReduce      200 Map + 100 Reduce   120                  693872
Spark          200 Executor           33                   396000
Spark          400 Executor           21                   504000
Spark runs the same MapReduce algorithm much faster than Hadoop: with 200 executors, the running time is only about a quarter of Hadoop's, and adding more executors shortens it further. Overall, data mining jobs usually involve complex processing logic, and the traditional Hadoop M/R framework has serious performance problems with such workloads. Spark's iterative, in-memory computing can greatly reduce both running time and compute cost.
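
The ratios behind this comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures from the table on this slide. The `slot_seconds` helper is an assumption about how the cost column was derived (one slot per executor, held for the whole run); it happens to reproduce both Spark rows, but the slide does not state the formula.

```python
SECONDS_PER_MINUTE = 60

# Figures copied from the table on this slide.
mapreduce_minutes = 120
mapreduce_cost = 693872  # reported cost in Slot*s

# name -> (executors, running time in minutes, reported cost in Slot*s)
spark_runs = {
    "Spark, 200 executors": (200, 33, 396000),
    "Spark, 400 executors": (400, 21, 504000),
}


def slot_seconds(executors, minutes):
    """Cost if every executor holds one slot for the whole run (an assumption)."""
    return executors * minutes * SECONDS_PER_MINUTE


results = {}
for name, (executors, minutes, reported_cost) in spark_runs.items():
    results[name] = {
        "computed_cost": slot_seconds(executors, minutes),
        "reported_cost": reported_cost,
        "time_ratio": minutes / mapreduce_minutes,
        "cost_ratio": reported_cost / mapreduce_cost,
    }

for name, r in results.items():
    print(
        f"{name}: time is {r['time_ratio']:.1%} of MapReduce, "
        f"cost is {r['cost_ratio']:.1%} of MapReduce"
    )
```

With these figures, 200 executors need 27.5% of the MapReduce running time. Doubling to 400 executors brings that down to 17.5%, but raises the slot cost from 396000 to 504000, so the faster run is also the more expensive of the two Spark runs.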

9 E-commerce: Taobao
Company Background: The largest C2C e-commerce company in China and a pioneering Spark user (since 2012).
Data Background: As of 2014, Taobao had over 500 million registered members and 120 million active members. It recorded a turnover of over 90 billion on November 11, 2014. Its various businesses generate TB-level data every day.
Business Requirement: For the past few years, Taobao has been using Yun Ti, a platform based on Hadoop. However, Hadoop runs into many problems with iterative computing, which is why Taobao turned to Spark.

10 Spark in Taobao
The figure shows the history of Spark at Taobao (figure labels: 10-node cluster; YARN version 0.23.7; 200-node YARN cluster). Taobao has been using Spark since 2012, when Spark was still very young.

11 Spark Development Process in Taobao
Before a job is put on production servers, it is tested on test servers. The code is then merged into the local repository or pushed to the open source community.

12 Recommender System in Taobao
The recommender system combines Spark, Spark MLlib and Spark Streaming. It can perform both offline and online analysis, which covers most of Taobao's business requirements.

13 Test of K-Means Algorithm
Increasing each worker's memory cuts the running time, and increasing the number of workers also improves performance.
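
For context on why memory and worker count matter here: K-means is iterative, and every pass reads the whole data set again to reassign points and recompute centers. The sketch below is a minimal single-machine version of the standard algorithm in plain Python. It is not MLlib and not Taobao's test code; the sample points and the simple evenly spaced initialization are assumptions made for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=10):
    """Plain K-means (Lloyd's algorithm) on a list of tuples."""
    # Simplistic deterministic initialization: k evenly spaced input points.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: every pass scans all points, which is why keeping
        # the data set in memory helps an iterative job like this one.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers


# Hypothetical sample data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]
centers = sorted(kmeans(points, k=2))
print(centers)
```

In a distributed setting the assignment step is the part that parallelizes across workers, since each point can be assigned independently of the others.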

14 Telecom: Telefonica
Company Background: Telefonica is a Spanish telecommunications company that provides comprehensive services, including mobile phone, Internet, data and wired television services.
Data Background: Telefonica is the largest multinational enterprise in Spain and serves customers in over 40 countries. Its various businesses generate huge volumes of data.
Business Requirement: As data volume grows rapidly, network security problems such as DDoS attacks, SQL injection attacks and account theft have come into focus. Using big data analysis to prevent cybercrime has become urgent for the company.

15 Why Spark?
Spark provides a full stack of components (SQL, Streaming, MLlib, GraphX).
It is easy to use Spark to analyze both historical data and streaming data.
It supports various applications and data sources, so it can handle complex application scenarios.
SQL can be used to leverage the power of Spark.
Spark has far fewer components than Hadoop.

16 Components of Spark and Hadoop
As the figure above shows, Spark has about half as many components as Hadoop. With fewer components, using Spark can potentially lead to far fewer errors.

17 Spark Production Architecture in Telefonica
Data collection: Kafka. Data pre-processing: Storm. Batch processing: Cassandra + Spark. Telefonica uses Kafka, a distributed message queue system, to collect data from various sources. The data is then consumed by Storm for pre-processing. Finally, the data is processed by Spark or saved in Cassandra.
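
The three stages described above can be illustrated with a toy, single-process sketch. It uses no Kafka, Storm, Spark or Cassandra APIs: an in-memory queue, a generator and a counter stand in for the real systems. The log format and the per-source request count are hypothetical, chosen only because request counting is a common first step in spotting DDoS-like traffic.

```python
from collections import Counter, deque

# Stage 1 (stand-in for Kafka): a queue of raw log lines from various sources.
# The "source_ip method path" format is a made-up example.
raw_events = deque([
    "10.0.0.1 GET /login",
    "malformed",
    "10.0.0.2 GET /index",
    "10.0.0.1 GET /login",
    "10.0.0.1 POST /login",
])


def preprocess(queue):
    """Stage 2 (stand-in for Storm): parse each record, drop malformed ones."""
    while queue:
        parts = queue.popleft().split()
        if len(parts) == 3:
            source_ip, method, path = parts
            yield {"source_ip": source_ip, "method": method, "path": path}


def batch_count_by_source(events):
    """Stage 3 (stand-in for a Spark batch job): count requests per source IP."""
    return Counter(event["source_ip"] for event in events)


requests_per_ip = batch_count_by_source(preprocess(raw_events))
print(requests_per_ip.most_common())
```

Keeping collection, pre-processing and batch analysis as separate stages means each one can be scaled or replaced independently, which is the main appeal of this kind of layout.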

18 Retail: Euclid
Company Background: Euclid Analytics is a geo-data analysis company that provides solutions to customers based on offline location information.
Data Background: Euclid mainly relies on WiFi devices to collect data from the physical world.
Business Requirement: Euclid's main business is providing location-based analysis services to its customers. By collecting customer behavior data, it tries to understand customers' behavior and shopping characteristics and to predict their future behavior.

19 Retail Customer Features
Using the data collected from WiFi devices, customers can be divided into three groups: frequent customers, pass-by customers and quick-leave customers. Some of them like to buy products, some spend a lot of time in the store and some like to wander around a zone.

20 Analysis Procedure with Spark
First, WiFi devices collect mobile data from the ping signals of mobile devices, which include the device MAC address, signal strength and other information. The data is then sent to the cloud and processed on a Spark cluster. Finally, customers can view the analysis results on the web.
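
As a rough illustration of how pings could be turned into the customer groups mentioned earlier, the sketch below groups pings by MAC address and applies simple rules. The ping records, the signal threshold and the visit and dwell-time cutoffs are all hypothetical; the slides do not describe Euclid's actual rules, and a real job would run this grouping on a Spark cluster rather than in a single process.

```python
from collections import defaultdict

# Hypothetical pings: (device MAC, visit day, minute of day, signal strength in dBm)
pings = [
    ("aa:aa", 1, 10, -50), ("aa:aa", 1, 40, -48),
    ("aa:aa", 2, 15, -52), ("aa:aa", 2, 50, -47),
    ("aa:aa", 3, 5, -49), ("aa:aa", 3, 35, -51),
    ("bb:bb", 1, 20, -85),
    ("cc:cc", 2, 30, -55), ("cc:cc", 2, 33, -53),
]

# Assumed thresholds, for illustration only.
IN_STORE_DBM = -70        # pings at least this strong count as inside the store
FREQUENT_VISITS = 3       # in-store visits on this many days means "frequent"
QUICK_LEAVE_MINUTES = 5   # longest stay at most this long means "quick-leave"


def classify(pings):
    """Label each device as frequent, pass-by, quick-leave or other."""
    in_store = defaultdict(lambda: defaultdict(list))  # mac -> day -> minutes
    seen = set()
    for mac, day, minute, dbm in pings:
        seen.add(mac)
        if dbm >= IN_STORE_DBM:
            in_store[mac][day].append(minute)

    labels = {}
    for mac in sorted(seen):
        days = in_store.get(mac, {})
        if not days:
            # Detected only with a weak signal: walked past without entering.
            labels[mac] = "pass-by"
            continue
        longest_stay = max(max(m) - min(m) for m in days.values())
        if len(days) >= FREQUENT_VISITS:
            labels[mac] = "frequent"
        elif longest_stay <= QUICK_LEAVE_MINUTES:
            labels[mac] = "quick-leave"
        else:
            labels[mac] = "other"
    return labels


labels = classify(pings)
print(labels)
```

Here `aa:aa` is seen in the store on three days, `bb:bb` only ever produces a weak signal, and `cc:cc` stays for three minutes on a single day.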

21 Other Area: PubMatic
Company Background: PubMatic is an advertising company. It developed the first real-time advertisement analysis system in the marketing field.
Data Background: PubMatic has 6 geographically distributed data centers managing 6 PB of data. Every day it serves 12 billion ads and handles 1,000 billion bids. Its systems now produce 22 TB of data.
Business Requirement: Because its ad data is complex and varied, PubMatic needs to process the data in real time.

22 System Architecture in PubMatic
As the figure above shows, various streaming data flows are fed into memory and processed by Spark. Finally, the data is saved in HDFS and Amazon S3.

23 Spark vs. Hive on Query Performance
When the data volume is 192 GB, the query takes 550 seconds on Spark, while Hive needs 850 seconds for the same query. As the data volume increases, Spark's running time is on average 40% less than Hive's.
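
The single data point quoted here can be turned into a relative saving with simple arithmetic. The sketch below uses only the two timings from this slide; the helper name `percent_less` is made up for the example.

```python
def percent_less(baseline_seconds, candidate_seconds):
    """How much shorter the candidate run is, as a percentage of the baseline."""
    return (baseline_seconds - candidate_seconds) / baseline_seconds * 100


# Timings for the 192 GB query, taken from this slide.
hive_seconds = 850
spark_seconds = 550

reduction = percent_less(hive_seconds, spark_seconds)
speedup = hive_seconds / spark_seconds

print(f"Spark takes {reduction:.1f}% less time than Hive ({speedup:.2f}x speedup)")
```

At 192 GB this works out to roughly 35% less time, slightly below the 40% average the slide reports across data volumes.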

24 Effect of Using Spark in PubMatic
Spark supports both offline and online data processing. It has active community support and is compatible with the Hadoop ecosystem. By using Spark Streaming, Spark SQL and Spark MLlib together, PubMatic can deliver real-time ad services and business analysis reports to customers faster than ever before.

