Presentation on theme: "Shark: SQL and Rich Analytics at Scale". Presentation transcript:
1 Shark: SQL and Rich Analytics at Scale. Presented by Kirti Dighe and Drushti Gawade.
2 What is Shark? A new data analysis system. Built on top of RDDs and Spark. Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc.). Speedups of up to 100x over Hive. Supports low-latency, interactive queries through in-memory computation. Supports both SQL and complex analytics such as machine learning.
3 Shark Architecture. Can be used to query an existing Hive warehouse without modification, and returns results much faster. Figure: diagram of the architecture.
4 Spark. Supports partial DAG execution. Optimization of join algorithms. Features of Spark: supports general computation; provides an in-memory storage abstraction (the RDD); the engine is optimized for low latency.
5 RDD. Spark's main abstraction is the RDD: a collection stored in an external storage system, or a derived dataset. Contains arbitrary data types. Benefits of RDDs: writes return at the speed of DRAM; use of lineage; speedy recovery; immutable, which gives a foundation for relational processing.
6 Fault tolerance guarantees. Shark can tolerate the loss of any set of worker nodes. Recovery is parallelized across the cluster. The deterministic nature of RDDs also enables straggler mitigation. Recovery works even in queries that combine SQL and machine learning UDFs.
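The lineage idea behind slides 5 and 6 can be illustrated with a small Python sketch. This is not Spark's API: `LineageDataset` and its methods are hypothetical names, and the sketch assumes deterministic transformations over a stable source. Each derived dataset records how to compute a partition from its parent, so a lost partition is rebuilt by re-running that recipe for the one partition rather than by restoring a replica.

```python
class LineageDataset:
    """Toy stand-in for an RDD (hypothetical class, for illustration only)."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute                   # lineage: partition index -> list of records
        self._cache = [None] * num_partitions     # in-memory copies that may be lost

    @classmethod
    def from_source(cls, partitions):
        stable = [tuple(p) for p in partitions]   # stands in for external storage
        return cls(len(stable), lambda i: list(stable[i]))

    def map(self, fn):
        # The child only remembers the parent and the function, not the data.
        return LineageDataset(self.num_partitions,
                              lambda i: [fn(x) for x in self.partition(i)])

    def filter(self, keep):
        return LineageDataset(self.num_partitions,
                              lambda i: [x for x in self.partition(i) if keep(x)])

    def partition(self, i):
        if self._cache[i] is None:                # missing: recompute from lineage
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        self._cache[i] = None                     # simulate a failed worker

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = LineageDataset.from_source([[1, 2, 3], [4, 5, 6]])
even_squares = source.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
before = even_squares.collect()
even_squares.lose_partition(1)                    # drop one cached partition
after = even_squares.collect()                    # rebuilt transparently
```

Because each partition is recomputed independently, different lost partitions could be rebuilt on different workers at once, which is the sense in which recovery is parallel.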
7 Executing SQL over RDDs. Executing a SQL query involves three steps: query parsing, logical plan generation, and physical plan generation.
8 Engine extensions. Partial DAG execution (PDE). Static query optimization. Dynamic query optimization. Modification of statistics. Examples of statistics: partition sizes and record counts; lists of “heavy hitters”; approximate histograms.
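As a rough illustration of the statistics listed on slide 8, the Python sketch below gathers them for one partition of integer keys. The names `PartitionStats` and `collect_stats`, the fixed `bytes_per_record` size estimate, and the fixed-width histogram buckets are all assumptions made for this example; they are not taken from Shark's implementation.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PartitionStats:
    record_count: int
    size_bytes: int
    heavy_hitters: List[Tuple[int, int]]   # (key, occurrences), most frequent first
    histogram: Dict[int, int]              # bucket lower bound -> number of keys


def collect_stats(keys, top_k=2, bucket_width=10, bytes_per_record=8):
    """Summarize one partition while its records are being materialized."""
    counts = Counter(keys)
    buckets = Counter((k // bucket_width) * bucket_width for k in keys)
    return PartitionStats(
        record_count=len(keys),
        size_bytes=len(keys) * bytes_per_record,   # assumed fixed-width records
        heavy_hitters=counts.most_common(top_k),
        histogram=dict(buckets),
    )


stats = collect_stats([1, 1, 1, 2, 15, 15, 42])
```

A scheduler could combine such per-partition summaries to decide, for example, how many reducers to use or which join strategy to run in the next stage.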
10 Columnar Memory Store. Simply caching records as JVM objects is insufficient. Shark employs column-oriented storage; a partition of columns is one MapReduce “record”. Benefits: compact representation, CPU-efficient compression, cache locality.
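The difference between row and column layouts can be sketched in Python. The table contents and the helper names `to_columns` and `run_length_encode` are made up for this example, and Python's `array` module only stands in for the primitive arrays a JVM implementation would use. Numeric columns become densely packed typed arrays, and a column with repeated values compresses well with a cheap scheme such as run-length encoding.

```python
from array import array
from itertools import groupby

# Row layout: one object per record (toy data, for illustration only).
rows = [(1, 10.5, "US"), (2, 3.25, "US"), (3, 7.0, "DE")]


def to_columns(rows):
    """Pivot rows into one container per column."""
    ids, revenue, country = zip(*rows)
    return {
        "id": array("q", ids),            # packed 64-bit integers
        "revenue": array("d", revenue),   # packed 64-bit floats
        "country": list(country),
    }


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


columns = to_columns(rows)
total_revenue = sum(columns["revenue"])               # scans a single column
encoded_country = run_length_encode(columns["country"])
```

An aggregate over one column touches only that column's array, which is where the cache-locality benefit comes from.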
11 Machine learning support. Shark supports machine learning as a first-class citizen. The programming model is designed to express machine learning algorithms: 1. Language integration. Shark allows queries to perform logistic regression over a user database. Example: a data analysis pipeline that performs logistic regression over a user database.
12 2. Execution engine integration. A common abstraction allows machine learning computations and SQL queries to share workers and cached data. Enables end-to-end fault tolerance.
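To make the logistic regression step of such a pipeline concrete, here is a minimal pure-Python sketch using batch gradient descent. It is not Shark's code: the function names, the learning rate, the iteration count, and the four training points are assumptions chosen for illustration. In a real pipeline the `(features, label)` pairs would come from a SQL query followed by feature extraction.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(points, iterations=500, learning_rate=0.5):
    """points: list of (features, label) with label in {0, 1}."""
    dims = len(points[0][0])
    weights = [0.0] * dims
    for _ in range(iterations):
        gradient = [0.0] * dims
        for features, label in points:       # the part that parallelizes over cached data
            score = sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            for j in range(dims):
                gradient[j] += error * features[j]
        weights = [w - learning_rate * g / len(points)
                   for w, g in zip(weights, gradient)]
    return weights


def predict(weights, features):
    score = sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy feature vectors: (bias term, one numeric feature); labels are made up.
training = [((1.0, 0.0), 0), ((1.0, 1.0), 0), ((1.0, 3.0), 1), ((1.0, 4.0), 1)]
weights = train_logistic_regression(training)
predictions = [predict(weights, features) for features, _ in training]
```

Every iteration rescans the same training set, which is why keeping that data cached in memory between iterations matters for this kind of algorithm.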
13 Implementation. How to improve query processing speed: minimize tail latency and the CPU cost of processing each row. Memory-based shuffle. Temporary object creation. Bytecode compilation of expression evaluation.
14 Experiments. Evaluation of Shark using four datasets. Pavlo et al. benchmark: 2.1 TB of data reproducing Pavlo et al.’s comparison of MapReduce vs. analytical DBMSs. TPC-H dataset: 100 GB and 1 TB datasets generated by the DBGEN program. Real Hive warehouse: 1.7 TB of sampled Hive warehouse data from an early industrial user of Shark. Machine learning dataset: 100 GB synthetic dataset to measure the performance of machine learning algorithms. Shark performs up to 100x faster than Hive.
15 Methodology and cluster setup. Amazon EC2 with 100 m2.4xlarge nodes, each with 8 virtual cores, 68 GB of memory, and 1.6 TB of local storage. Pavlo et al. benchmarks: 1 GB/node rankings table; 20 GB/node uservisits table. Selection query (clustered index): SELECT pageURL, pageRank FROM rankings WHERE pageRank > X;
16 Aggregation queries. SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP; SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 7);
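To see what the two aggregation queries compute, they can be run on a tiny table with Python's built-in `sqlite3` module. SQLite only stands in for Hive or Shark here, the four rows are invented, and an `ORDER BY` clause has been added to each query so the result order is deterministic; none of this reflects the benchmark's actual data or engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uservisits (sourceIP TEXT, destURL TEXT, adRevenue REAL)")
conn.executemany(
    "INSERT INTO uservisits VALUES (?, ?, ?)",
    [
        ("10.0.0.1", "a.com", 1.5),      # toy rows, not benchmark data
        ("10.0.0.1", "b.com", 2.5),
        ("10.0.0.2", "a.com", 4.0),
        ("192.168.1.9", "c.com", 8.0),
    ],
)

# First query: one group per distinct source IP.
revenue_by_ip = conn.execute(
    "SELECT sourceIP, SUM(adRevenue) FROM uservisits "
    "GROUP BY sourceIP ORDER BY sourceIP"
).fetchall()

# Second query: far fewer groups, keyed by the first 7 characters of the IP.
revenue_by_prefix = conn.execute(
    "SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM uservisits "
    "GROUP BY SUBSTR(sourceIP, 1, 7) ORDER BY 1"
).fetchall()

conn.close()
```

The two queries differ mainly in the number of groups they produce, which changes how much data has to be shuffled between the map and reduce sides of the aggregation.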
17 Join query. SELECT INTO Temp sourceIP, AVG(pageRank), SUM(adRevenue) as totalRevenue FROM rankings AS R, uservisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date(’ ’) AND Date(’ ’) GROUP BY UV.sourceIP; Figures: join query runtime from the Pavlo benchmark; join strategies chosen by optimizers.
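The idea of choosing a join strategy from sizes observed at run time can be sketched in Python. Everything here is an assumption made for illustration: the function names, the `broadcast_threshold` parameter, the use of record counts as the size measure, and the toy tables. A map join builds a hash table on the small input and streams the large one past it; a shuffle join repartitions both inputs by key first.

```python
from collections import defaultdict


def hash_join(build, probe):
    """Join (key, value) lists; returns (key, build_value, probe_value) tuples."""
    table = defaultdict(list)
    for key, value in build:
        table[key].append(value)
    return [(key, b, p) for key, p in probe for b in table.get(key, [])]


def map_join(small, large):
    # The small side is sent to every task; the large side is never shuffled.
    return hash_join(small, large)


def shuffle_join(left, right, num_partitions=4):
    # Both sides are repartitioned so equal keys land in the same partition.
    left_parts, right_parts = defaultdict(list), defaultdict(list)
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))
    joined = []
    for p in range(num_partitions):
        joined.extend(hash_join(left_parts[p], right_parts[p]))
    return joined


def join_with_runtime_selection(left, right, broadcast_threshold=1000):
    """Pick the strategy from sizes seen after the inputs were produced."""
    if len(left) <= broadcast_threshold:
        return "map join", map_join(left, right)
    if len(right) <= broadcast_threshold:
        swapped = map_join(right, left)            # tuples are (key, right, left)
        return "map join", [(k, l, r) for k, r, l in swapped]
    return "shuffle join", shuffle_join(left, right)


rankings = [("a.com", 10), ("b.com", 20)]
visits = [("a.com", 1.5), ("a.com", 2.5), ("c.com", 9.0)]
small_strategy, small_result = join_with_runtime_selection(rankings, visits, 2)
large_strategy, large_result = join_with_runtime_selection(rankings, visits, 0)
```

Both strategies return the same rows; only the amount of data moved differs, which is why deferring the choice until sizes are known can pay off.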
18 Data loading. Shark can query data in HDFS directly, which means its data ingress rate is at least as fast as Hadoop’s. Micro-benchmarks. Aggregation performance: SELECT [GROUP_BY_COLUMN], COUNT(*) FROM lineitem GROUP BY [GROUP_BY_COLUMN]
19 Join selection at runtime. Fault tolerance: measuring Shark’s performance in the presence of node failures. Simulate failures and measure query performance before, during, and after failure recovery.
20 Real Hive warehouse. 1. Query 1 computes summary statistics in 12 dimensions for users of a specific customer on a specific day. 2. Query 2 counts the number of sessions and distinct customer/client combinations grouped by countries, with filter predicates on eight columns. 3. Query 3 counts the number of sessions and distinct users for all but 2 countries. 4. Query 4 computes summary statistics in 7 dimensions grouping by a column, and shows the top groups sorted in descending order.
21 Machine learning algorithms. Compare the performance of Shark with the same workflow running in Hive and Hadoop. The workflow consisted of three steps: 1) selecting the data of interest from the warehouse using SQL; 2) extracting features; 3) applying iterative algorithms. Algorithms: logistic regression and k-means clustering.
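For the k-means step of that workflow, a minimal Python sketch of the standard Lloyd iteration is given below. The function name, the fixed starting centroids, the iteration count, and the four points are assumptions for illustration and are unrelated to the 100 GB benchmark dataset.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on tuples of floats, starting from given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final_centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Like logistic regression, each pass rereads the full set of points, so an engine that keeps them in memory avoids reloading the data on every iteration.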