RDDs and Spark.

The paper itself
Great model for a systems paper:
- Talk about something that is useful and used by many real users
- Argue not just that your techniques are good, but also that your limitations are not fundamentally bad
- Extensive experiments to back it up; awesome performance numbers always help
Won the best paper award at NSDI '12

Memory vs. Disk (borrowed)
L1 cache reference                               0.5 ns
Branch mispredict                                  5 ns
L2 cache reference                                 7 ns
Mutex lock/unlock                                100 ns
Main memory reference                            100 ns
Compress 1K bytes with Zippy                  10,000 ns
Send 2K bytes over 1 Gbps network             20,000 ns
Read 1 MB sequentially from memory           250,000 ns
Round trip within same datacenter            500,000 ns
Disk seek                                 10,000,000 ns
Read 1 MB sequentially from network       10,000,000 ns
Read 1 MB sequentially from disk          30,000,000 ns
Send packet CA->Netherlands->CA          150,000,000 ns
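To make the memory vs. disk gap concrete, the short sketch below computes a few ratios directly from the figures above. The dictionary keys and the helper name `slowdown` are made up for this illustration; only the nanosecond values come from the slide.

```python
# Latencies in nanoseconds, copied from the figures above.
LATENCY_NS = {
    "main_memory_reference": 100,
    "read_1mb_memory": 250_000,
    "disk_seek": 10_000_000,
    "read_1mb_network": 10_000_000,
    "read_1mb_disk": 30_000_000,
}


def slowdown(slow, fast):
    """Return how many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


# Sequential 1 MB read: disk vs. memory.
disk_vs_memory = slowdown("read_1mb_disk", "read_1mb_memory")
# Sequential 1 MB read: network vs. memory.
network_vs_memory = slowdown("read_1mb_network", "read_1mb_memory")
# Random access: one disk seek vs. one main memory reference.
seek_vs_reference = slowdown("disk_seek", "main_memory_reference")

print(disk_vs_memory, network_vs_memory, seek_vs_reference)
```

By these numbers a sequential 1 MB read is 120 times slower from disk than from memory, and a single disk seek costs as much as 100,000 main memory references, which is the motivation for keeping working sets in memory.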

Spark Primitives vs. MapReduce

Disadvantages of MapReduce
1. Extremely rigid data flow (M -> R)
   Other flows constantly hacked in: join, union, split, chains (M M R M)
2. Common operations must be coded by hand
   Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
   Difficult to maintain, extend, and optimize
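As an illustration of points 2 and 3, here is a minimal sketch of a join written by hand in the map/reduce style. It uses a toy single-process `map_reduce` helper, not Hadoop's API, and the table names and rows are hypothetical. Note that the join logic lives entirely inside the user's mapper and reducer, so the framework cannot see or optimize it.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy single-process MapReduce: map, group by key (the shuffle), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical inputs; each record is tagged with the table it came from.
visits = [
    ("visits", ("alice", "a.com")),
    ("visits", ("bob", "b.com")),
    ("visits", ("carol", "a.com")),
]
url_info = [
    ("urlInfo", ("a.com", "news")),
    ("urlInfo", ("b.com", "sports")),
]


def join_mapper(record):
    """Emit the join key (url) and a value tagged with its source table."""
    table, row = record
    if table == "visits":
        user, url = row
        yield url, ("visits", user)
    else:
        url, category = row
        yield url, ("urlInfo", category)


def join_reducer(url, tagged_values):
    """Separate the two sides by tag and emit their cross product."""
    users = [v for tag, v in tagged_values if tag == "visits"]
    categories = [v for tag, v in tagged_values if tag == "urlInfo"]
    for user in users:
        for category in categories:
            yield (url, user, category)


joined = map_reduce(visits + url_info, join_mapper, join_reducer)
print(joined)
```

The tagging trick (marking each value with its source table) is exactly the kind of boilerplate that a built-in join operator removes.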

Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
- PIG: Imperative style, like Spark. From Yahoo!

Another Example: PIG
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(urlVisits);
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Another Example: DryadLINQ
    string uri;   // value not shown on the slide
    PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
    string separator = ",";
    var words = input.SelectMany(x => SplitLineRecord(separator));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x[2]);
    var top = ordered.Take(k);
    top.ToDryadPartitionedTable("matching.pt");
[Figure: Execution Plan Graph, with stages Get -> SM -> G -> S -> O -> Take]

Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
Unlike Spark, most of them cannot have datasets persist across queries.
- PIG: Imperative style, like Spark. From Yahoo!
- DryadLINQ: Imperative programming interface. From Microsoft.
- HIVE: SQL-like. From Facebook.
- HadoopDB: SQL-like (hybrid of MR + databases). From Yale.

Spark: Control
Spark leaves control of data, algorithms, persistence to the user. Is this a good idea?
- Good idea: User may know which datasets need to be used and how
- Bad idea: System may be able to optimize and schedule computation across nodes
Standard argument of declarative vs. imperative
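A minimal sketch of what "persistence left to the user" means in practice. `ToyRDD` is a hypothetical stand-in written for this note, not Spark's implementation or API: it is lazy, it rebuilds its contents from lineage on every action, and it keeps results in memory only if the user calls `persist()`.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: lazy, rebuilt from lineage unless persisted."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that rebuilds the data
        self._persist = False
        self._cache = None

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def persist(self):
        """Mark this dataset to be kept in memory after it is first computed."""
        self._persist = True
        return self

    def collect(self):
        if self._persist:
            if self._cache is None:
                self._cache = self._compute()
            return list(self._cache)
        return self._compute()


calls = {"parse": 0}


def parse(line):
    """Count invocations so recomputation is visible."""
    calls["parse"] += 1
    return int(line)


lines = ToyRDD.from_list(["1", "2", "3", "4"])

# Without persist: each action replays the lineage, so parse runs again.
parsed = lines.map(parse)
parsed.filter(lambda x: x % 2 == 0).collect()
parsed.filter(lambda x: x > 2).collect()
without_persist = calls["parse"]

# With persist: the parsed dataset is computed once and reused.
calls["parse"] = 0
cached = lines.map(parse).persist()
evens = cached.filter(lambda x: x % 2 == 0).collect()
large = cached.filter(lambda x: x > 2).collect()
with_persist = calls["parse"]

print(without_persist, with_persist)
```

In this toy, two queries over the unpersisted dataset parse every line twice, while the persisted one parses each line once. The user had to know that `parsed` would be reused; a more declarative system could in principle detect the shared subexpression itself, which is the trade-off the slide raises.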

What are other ways Spark can be optimized?
- More declarative than imperative
- Relational query optimization
- Reordering predicates
- Caching and fault tolerance only when needed
- Careful scheduling
- Careful partitioning, co-location, and persistence
- Indexes
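To illustrate one item from this list, reordering predicates, the sketch below counts how often each predicate runs under two orderings of the same conjunctive filter. The data, the predicates, and the helper `count_evaluations` are hypothetical; a real optimizer would choose the order from cost and selectivity estimates rather than by hand.

```python
def count_evaluations(rows, predicates):
    """Apply predicates in order with short-circuiting.

    Returns the matching rows and the number of calls made to each predicate.
    """
    calls = [0] * len(predicates)
    matches = []
    for row in rows:
        for i, pred in enumerate(predicates):
            calls[i] += 1
            if not pred(row):
                break
        else:
            # No predicate rejected the row.
            matches.append(row)
    return matches, calls


def expensive(x):
    """Stand-in for a costly predicate: digit sum divisible by 7."""
    return sum(int(d) for d in str(x)) % 7 == 0


def selective(x):
    """Cheap predicate that rejects almost every row."""
    return x < 10


rows = list(range(1000))

# Order as written by the user: the costly predicate runs on every row.
matches_a, calls_a = count_evaluations(rows, [expensive, selective])

# Reordered: the cheap, selective predicate runs first.
matches_b, calls_b = count_evaluations(rows, [selective, expensive])

print(calls_a[0], calls_b[1])
```

Both orderings return the same rows, but the costly predicate runs 1000 times in the first and only 10 times in the second. This rewrite is safe only when the system can see that the predicates are side-effect free, which is harder when they are opaque user functions.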
