
1 Lightning Fast Big Data Analytics using Apache Spark
Manish Gupta, Solutions Architect – Product Engineering and Development
30th Jan 2014, Delhi

2 Agenda Of The Talk:
 Hadoop – A Quick Introduction
 An Introduction To Spark & Shark
 Spark – Architecture & Programming Model
 Example & Demo
 Spark Current Users & Roadmap


4 What is Hadoop?
It's open-source software for distributed storage of large datasets on commodity-class hardware (HDFS) in a highly fault-tolerant, scalable and flexible way.
It also provides a programming model/framework (MapReduce) for processing these large datasets in a massively parallel, fault-tolerant and data-locality-aware fashion.
[Diagram: Input -> Map -> Reduce -> Output over HDFS]

5 Limitations of MapReduce
 Slow due to replication, serialization, and disk IO
 Inefficient for:
   Iterative algorithms (machine learning, graph & network analysis)
   Interactive data mining (R, Excel, ad-hoc reporting, searching)
[Diagram: an iterative job as a chain of MapReduce passes, with an HDFS read and an HDFS write around every iteration]

6 Approach: Leverage Memory?
 Memory bus >> disk & SSDs in bandwidth
 Many datasets fit into memory: 1 TB = 1 billion records @ 1 KB each
 Memory capacity also follows Moore's Law: a single 8 GB stick of RAM is about $80 right now; by 2021 you'd be able to buy a 64 GB stick for the same price.

7 Agenda Of The Talk:
 Hadoop – A Quick Introduction
 An Introduction To Spark & Shark
 Spark – Architecture & Programming Model
 Example & Demo
 Spark Current Users & Roadmap


9 Spark
"A big data analytics cluster-computing framework written in Scala."
 Open source, originally developed in the AMPLab at UC Berkeley.
 Provides in-memory analytics that are faster than Hadoop/Hive (up to 100x).
 Designed for running iterative algorithms & interactive analytics.
 Highly compatible with Hadoop's storage APIs – can run on your existing Hadoop cluster setup.
 Developers can write driver programs in multiple programming languages.

10 Spark
[Architecture diagram: the Spark Driver (Master) talks to a Cluster Manager; Spark Workers with in-memory caches run alongside HDFS DataNodes/blocks]

11 Spark
[Diagram: an iterative job in Hadoop MapReduce – every iteration reads its input from HDFS and writes its output back to HDFS]

12 Spark
[Diagram: the same iterative job in Spark – one HDFS read up front, then all iterations run against in-memory data]
Not tied to the two-stage MapReduce paradigm:
1. Extract a working set
2. Cache it
3. Query it repeatedly
[Chart: logistic regression running time in Hadoop vs. Spark]

13 Spark
A simple analytical operation:

val pagecount = sc.textFile("/wiki/pagecounts")
pagecount.count()                                             // 1: select count(*) from pagecounts

val englishPages = pagecount.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
val englishTuples = englishPages.map(line => line.split(" "))
val englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
englishKeyValues.reduceByKey(_ + _, 1).collect()              // 2: select Col1, sum(Col4) from pagecounts where Col2 = "en" group by Col1

14 Shark
 HIVE on SPARK = SHARK
 A large-scale data warehouse system, just like Apache Hive.
 Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs).
 Built on top of Spark (thus a faster execution engine).
 Provides in-memory materialized tables (cached tables).
 Cached tables use columnar storage instead of row storage.
[Illustration: row storage keeps whole records together – (1, ABC, 4.1), (2, XYZ, 3.5), (3, PPP, 6.4) – while column storage keeps each column together – (1, 2, 3), (ABC, XYZ, PPP), (4.1, 3.5, 6.4)]

15 Shark
[Diagram: Hive architecture – a client (CLI, JDBC) talks to the Driver (SQL Parser, Query Optimizer, Physical Plan Execution), backed by the Metastore; queries execute as MapReduce jobs over HDFS]

16 Shark
[Diagram: Shark architecture – the same Hive front end (CLI, JDBC, SQL Parser, Query Optimizer, Physical Plan Execution, Metastore) plus a Cache Manager, with Spark instead of MapReduce as the execution engine over HDFS]

17 Agenda Of The Talk:
 Hadoop – A Quick Introduction
 An Introduction To Spark & Shark
 Spark – Architecture & Programming Model
 Example & Demo
 Spark Current Users & Roadmap


19 Spark Programming Model
The user (developer) writes a driver program (sketched here in pseudocode):
sc = new SparkContext
rDD = sc.textFile("hdfs://...")
rDD.filter(...)
rDD.cache()
rDD.count()
rDD.map(...)
[Diagram: the Driver Program (SparkContext) talks to a Cluster Manager, which schedules work onto Worker Nodes; each worker runs an Executor with a cache and tasks, alongside HDFS DataNodes]

20 Spark Programming Model
The driver program (as on the previous slide) builds and manipulates RDDs; a runnable sketch follows below.
RDD (Resilient Distributed Dataset):
 Immutable data structure
 In-memory (explicitly)
 Fault tolerant
 Parallel data structure
 Controlled partitioning to optimize data placement
 Can be manipulated using a rich set of operators
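A runnable version of the driver sketched above, using the Spark Scala API. This is a minimal sketch only: the master URL, application name, and HDFS path are illustrative assumptions, not part of the original deck.

import org.apache.spark.SparkContext

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // Hypothetical master URL and app name; point the master at your own cluster manager.
    val sc = new SparkContext("local[2]", "MinimalDriver")

    val lines  = sc.textFile("hdfs://namenode:9000/data/input.txt") // base RDD from HDFS (path is made up)
    val errors = lines.filter(_.contains("ERROR"))                  // transformation: lazy, nothing runs yet
    errors.cache()                                                  // keep the filtered working set in memory
    println(errors.count())                                         // action: triggers the actual computation
    val lengths = errors.map(_.length)                              // further transformation on the cached RDD
    println(lengths.count())
    sc.stop()
  }
}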

21 RDD
Programming interface: the programmer can perform 3 types of operations (a short sketch follows below):
 Transformations – create a new dataset from an existing one. Lazy in nature: they are executed only when some action is performed. Examples: map(func), filter(func), distinct().
 Actions – return a value to the driver program, or export data to a storage system, after performing a computation. Examples: count(), reduce(func), collect(), take().
 Persistence – cache datasets in memory for future operations, with the option to store on disk, in RAM, or mixed (storage level). Examples: persist(), cache().
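A small sketch of the three operation types in spark-shell style Scala. It assumes an existing SparkContext named sc; the HDFS path is purely illustrative.

import org.apache.spark.storage.StorageLevel

val nums  = sc.textFile("hdfs:///data/numbers.txt").map(_.trim.toInt) // transformation: lazy
val evens = nums.filter(_ % 2 == 0)                                   // still lazy, no job has run yet
evens.persist(StorageLevel.MEMORY_AND_DISK)                           // persistence: spill to disk if memory is short
val total = evens.count()                                             // action: the whole chain executes now
val first = evens.take(5)                                             // a second action reuses the persisted data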

22 Spark
How Spark works:
 RDD: a parallel collection with partitions
 User applications create RDDs, transform them, and run actions
 This results in a DAG (Directed Acyclic Graph) of operators; the recorded lineage can be inspected as shown below
 The DAG is compiled into stages
 Each stage is executed as a series of tasks (one task for each partition)
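One way to see the chain of operators Spark has recorded is RDD.toDebugString, which prints an RDD's lineage; shuffle operations such as reduceByKey are where one stage ends and the next begins. The input path here is illustrative.

val counts = sc.textFile("hdfs:///wiki/pagecounts")
  .map(_.split(" "))                 // narrow transformations: stay within one stage
  .map(r => (r(0), r(3).toInt))
  .reduceByKey(_ + _)                // wide transformation: introduces a shuffle, hence a new stage
println(counts.toDebugString)        // prints the lineage of this RDD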

23 Spark
Example:
sc.textFile("/wiki/pagecounts")
[Lineage: textFile -> RDD[String]]

24 Spark
Example:
sc.textFile("/wiki/pagecounts").map(line => line.split("\t"))
[Lineage: textFile -> RDD[String] -> map -> RDD[List[String]]]

25 Spark
Example:
sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt))
[Lineage: textFile -> RDD[String] -> map -> RDD[List[String]] -> map -> RDD[(String, Int)]]

26 Spark
Example:
sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)).reduceByKey(_ + _, 3)
[Lineage: textFile -> RDD[String] -> map -> RDD[List[String]] -> map -> RDD[(String, Int)] -> reduceByKey -> RDD[(String, Int)]]

27 Spark
Example:
sc.textFile("/wiki/pagecounts").map(line => line.split("\t")).map(r => (r(0), r(1).toInt)).reduceByKey(_ + _, 3).collect()
[Lineage: textFile -> RDD[String] -> map -> RDD[List[String]] -> map -> RDD[(String, Int)] -> reduceByKey -> RDD[(String, Int)] -> collect -> Array[(String, Int)]]

28 Spark
Execution plan: the logical plan above (textFile -> map -> reduceByKey -> collect) is compiled by the DAG scheduler into a physical plan made up of stages, as follows.

29 Spark
Execution plan: Stage 1 covers textFile and the maps (including the map-side partial reduce); Stage 2 covers the final reduce and collect.
Stages are sequences of RDDs that don't have a shuffle in between.

30 Spark
Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data
Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to the driver program

31 Spark
Stage execution (see the partition sketch below):
 Create a task for each partition in the new RDD
 Serialize each task
 Schedule and ship the tasks to the slaves
All of this happens internally (you don't need to do anything).
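The number of tasks in a stage equals the number of partitions of the RDD being computed, which you can inspect directly. The path and partition counts here are illustrative assumptions.

val pages = sc.textFile("hdfs:///wiki/pagecounts", 8)    // ask for at least 8 input partitions
println(pages.partitions.length)                         // number of tasks in the map stage

val summed = pages.map(_.split(" "))
  .map(r => (r(0), 1))
  .reduceByKey(_ + _, 3)                                 // request 3 reduce partitions
println(summed.partitions.length)                        // 3 tasks in the reduce stage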

32 Spark
Task execution: the task is the fundamental unit of execution in Spark.
Each task, over time: fetch input (from HDFS or a cached RDD) -> execute the task -> write output (to HDFS, an RDD, or intermediate shuffle output).

33 Spark
Spark Executor (slaves)
[Diagram: an executor runs tasks concurrently, one per core (Core 1, Core 2, Core 3); each task repeatedly fetches input, executes, and writes output]

34 Spark
Summary of components:
 Task: the fundamental unit of execution in Spark
 Stage: a set of tasks that run in parallel
 DAG: the logical graph of RDD operations
 RDD: a parallel dataset with partitions

35 Agenda Of The Talk:
 Hadoop – A Quick Introduction
 An Introduction To Spark & Shark
 Spark – Architecture & Programming Model
 Example & Demo
 Spark Current Users & Roadmap


37 Example & Demo
Cluster details:
 6 m1.xlarge EC2 nodes: 1 master node and 5 worker nodes
 64-bit, 4 vCPUs, 15 GB RAM each

38 Example & Demo
Dataset: Wikipedia page view stats
 20 GB of webpage view counts
 3 days' worth of data

Base RDD over all wiki pages:
val allPages = sc.textFile("/wiki/pagecounts")
allPages.take(10).foreach(println)
allPages.count()

Transformed RDD for all English pages (cached):
val englishPages = allPages.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()

39 Example & Demo
Dataset: Wikipedia page view stats (20 GB of webpage view counts, 3 days' worth of data)

Select date, sum(pageviews) from pagecounts group by date
englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(3).toInt)).reduceByKey(_+_, 1).collect.foreach(println)

Select date, count(distinct pageURL) from pagecounts group by date
englishPages.map(line => line.split(" ")).map(line => (line(0).substring(0, 8), line(2))).distinct().countByKey().foreach(println)

Select distinct(datetime) from pagecounts order by datetime
englishPages.map(line => line.split(" ")).map(line => (line(0), 1)).distinct().sortByKey().collect().foreach(println)

40 Example & Demo
Dataset: network datasets (directed and bi-directed graphs)
 A small Facebook social network: 127 nodes (friends), 1668 edges (friendships), bi-directed graph
 Google's internal site network: 15713 nodes (web pages), 170845 edges (hyperlinks), directed graph

41 Example & Demo
PageRank calculation: estimate node importance.
 Each directed link from A -> B is a vote for B from A.
 The more links point to a page, the more important it is.
 When a page with a higher PageRank points to something, its vote weighs more.
Algorithm:
1. Start each page at a rank of 1.
2. On each iteration, have page p contribute rank(p) / (number of neighbors of p) to its neighbors.
3. Set each page's rank to 0.15 + 0.85 × contribs.

42 Example & Demo
Scala code:

var iters = 100
val lines = sc.textFile("/dataset/google/edges.csv", 1)
val links = lines.map { s =>
  val parts = s.split("\t")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()

var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))

43

44 2 seconds

45 38 seconds
Page Rank      Page URL
761.1985177    google
455.7028756    google/about.html
259.6052388    google/privacy.html
192.7257649    google/jobs/
144.0349154    google/support
134.1566312    google/terms_of_service.html
130.3546324    google/intl/en/about.html
123.4014613    google/imghp
120.0661165    google/accounts/Login
118.6884515    google/intl/en/options/
112.2309539    google/preferences
108.8375347    google/sitemap.html
106.9724799    google/press/
105.822426     google/language_tools
105.1554798    google/support/toolbar/
99.97741309    google/maps
97.90651416    google/advanced_search
90.7910291     google/intl/en/services/
90.70522689    google/intl/en/ads/
87.4353413     google/adsense/

46 Agenda Of The Talk:
 Hadoop – A Quick Introduction
 An Introduction To Spark & Shark
 Spark – Architecture & Programming Model
 Example & Demo
 Spark Current Users & Roadmap

47 Spark Current Users & Roadmap
Source: Apache – Powered By Spark

48 Roadmap

49 Conclusion
 Because of in-memory processing, computations are very fast. Developers can write iterative algorithms without writing out a result set after each pass through the data.
 Suitable when sufficient memory is available in your cluster.
 It provides an integrated framework for advanced analytics like graph processing, stream processing, and machine learning, which simplifies integration.
 Its community is expanding and development is happening very aggressively.
 It's comparatively newer than Hadoop and has only a few users so far.

50 Thank You
Organized by UNICOM Trainings & Seminars Pvt. Ltd. (contact@unicomlearning.com)
Speaker: MANISH GUPTA
Email: manish.gupta@globallogic.com

51

52 Backup Slides

53 Spark Internal Components
[Diagram: Spark core (Operators, Block manager, Scheduler, Networking, Accumulators, Broadcast) with an Interpreter on top, Hadoop I/O for storage, and Mesos and Standalone scheduler backends]

54 In-Memory
But what if I run out of memory?

55 Benchmarks
 AMPLab performed quantitative and qualitative comparisons of 4 systems: Hive, Impala, Redshift and Shark.
 Done on the Common Crawl Corpus dataset, 81 TB in size, consisting of 3 tables: page rankings, user visits, and documents.
 Data was partitioned so that each node had: 25 GB of user visits, 1 GB of rankings, and 30 GB of web crawl (documents).
Source: https://amplab.cs.berkeley.edu/benchmark/#

56 Benchmarks

57 Benchmarks
Hardware configuration

58 Benchmarks
Redshift outperforms for on-disk data. Shark and Impala outperform Hive by 3-4x. For larger result sets, Shark outperforms Impala.

59 Benchmarks
Redshift columnar storage outperforms every time. Shark in-memory is 2nd best in all cases.

60 Benchmarks
Redshift's bigger cluster gives it an advantage. Shark and Impala are close competitors.

61 Benchmarks
Impala & Redshift don't have UDFs. Shark outperforms Hive.

62 Roadmap

63 Spark in the last 6 months of 2013

