
1 Solving the Scalability Dilemma with Clouds, Crowds, and Algorithms. Michael Franklin, UC Berkeley. Joint work with: Michael Armbrust, Peter Bodik, Kristal Curtis, Armando Fox, Randy Katz, Mike Jordan, Nick Lanham, David Patterson, Scott Shenker, Ion Stoica, Beth Trushkowsky, Stephen Tu, and Matei Zaharia. Image: John Curley http://www.flickr.com/photos/jay_que/1834540/

2 Save the Date(s): CIDR 2011, the 5th Biennial Conference on Innovative Data Systems Research, Jan 9-12, Asilomar, CA. Abstracts due: Sept 24, 2010. Papers due: October 1, 2010. Focus: innovative and visionary approaches to data systems architecture and use. Regular CIDR track plus a CCC-sponsored "outrageous ideas" track. Website coming soon!

3 Continuous Improvement of Client Devices

4 Computing as a Commodity

5 Ubiquitous Connectivity

6 AMP: Algorithms, Machines, People. Adaptive/Active Machine Learning and Analytics; Cloud Computing; CrowdSourcing; Massive and Diverse Data.

7 The Scalability Dilemma. State-of-the-art machine learning techniques do not scale to large data sets. Data analytics frameworks can't handle lots of incomplete, heterogeneous, dirty data. Processing architectures struggle with increasing diversity of programming models and job types. Adding people to a late project makes it later. Exactly the opposite of what we expect and need.

8 RAD Lab 5-year mission: enable 1 person to develop, deploy, and operate a next-generation Internet application at scale. Initial technical bet: machine learning to make large-scale systems self-managing. Multi-area faculty, postdocs, & students (Systems, Networks, Databases, Security, Statistical Machine Learning) all in a single, open, collaborative space. Corporate sponsorship and intensive industry interaction, including biannual 2.5-day offsite research retreats with sponsors.

9 PIQL + SCADS. SCADS: distributed key-value store. PIQL: query interface & executor. Flexible consistency management. "Active PIQL" (don't ask).

10 SCADS: Scale-Independent Storage

11 Scale Independence
As a site's user base grows and workload volatility increases:
–No changes to application required
–Cost per user remains constant
–Request latency SLA is unchanged
Key techniques:
–Model-Driven Scale Up and Scale Down
–Performance Insightful Query Language
–Declarative Performance/Consistency Tradeoffs
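The last technique, declarative performance/consistency tradeoffs, can be pictured as a per-namespace declaration that the storage tier must keep true as it scales. Below is a minimal sketch of what such a declaration might look like; all type and field names here are illustrative assumptions, not SCADS code.

// Illustrative sketch only: these types are not the SCADS API.
// They show the idea of declaring, rather than hand-implementing,
// per-namespace performance and consistency requirements.
case class SloRequirement(percentile: Double, latencyMs: Int)   // e.g. 99th percentile under 100 ms

sealed trait ConsistencySpec
case object ReadMyWrites extends ConsistencySpec
case class Eventual(maxStalenessMs: Int) extends ConsistencySpec

case class NamespacePolicy(
  namespace: String,
  getSlo: SloRequirement,
  putSlo: SloRequirement,
  consistency: ConsistencySpec)

// The application states the target; the scale-up/scale-down policy is
// responsible for keeping it true as load grows or shrinks.
val userNamespace = NamespacePolicy(
  namespace   = "user",
  getSlo      = SloRequirement(0.99, 100),
  putSlo      = SloRequirement(0.99, 150),
  consistency = ReadMyWrites)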

12 Over-provisioning a stateless system (Wikipedia example). [Workload plot: traffic spikes when Michael Jackson dies; overprovision by 25% to handle the spike.]

13 Over-provisioning a stateful system (Wikipedia example). [Same workload: overprovision by 300% to handle the spike, assuming data is stored on ten servers.]

14 Data storage configuration
Shared-nothing storage cluster:
–(key, value) pairs in a namespace, e.g. (user, email)
–Each node stores a set of data ranges
–Data ranges can be split, down to a minimum size promised by PIQL, to ensure range queries don't touch more than one node
[Diagram: nodes holding replicated key ranges, e.g. A-C, A-C, D-E, F-G, G]
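The routing implied by this layout can be sketched as a lookup from a key to the nodes whose ranges cover it. This is only an illustration of range-partitioned placement under the assumptions above; the RangeAssignment and nodesFor names are hypothetical, not SCADS code.

// Hypothetical sketch of range-partitioned placement with replication.
case class KeyRange(start: String, end: String) {              // half-open [start, end)
  def contains(k: String): Boolean =
    k.compareTo(start) >= 0 && k.compareTo(end) < 0
  def covers(from: String, to: String): Boolean =              // whole query range inside this range
    contains(from) && to.compareTo(end) <= 0
}

case class RangeAssignment(assignments: Map[String, Seq[KeyRange]]) {
  // All nodes holding a replica of a range that covers this key.
  def nodesFor(key: String): Seq[String] =
    assignments.collect { case (node, ranges) if ranges.exists(_.contains(key)) => node }.toSeq

  // A range query over [from, to) touches one node only if some node's single
  // range covers it; this is the invariant the minimum split size protects.
  def singleNodeFor(from: String, to: String): Option[String] =
    assignments.collectFirst { case (node, ranges) if ranges.exists(_.covers(from, to)) => node }
}

// Example loosely based on the diagram: A-C replicated on two nodes, plus D-E and F-G.
val layout = RangeAssignment(Map(
  "node1" -> Seq(KeyRange("A", "D")),
  "node2" -> Seq(KeyRange("A", "D")),
  "node3" -> Seq(KeyRange("D", "F")),
  "node4" -> Seq(KeyRange("F", "H"))))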

15 Workload-based policy stages. Stage 1: Replicate. [Histogram: workload per bin across storage nodes, with a workload threshold.]

16 Workload-based policy stages. Stage 2: Data Movement. [Histogram: workload per bin across storage nodes, with a workload threshold and a destination node for moved bins.]

17 Workload-based policy stages. Stage 3: Server Allocation. [Histogram: workload per bin across storage nodes, with a workload threshold.]

18 Workload-based policy
Policy input:
–Workload per histogram bin
–Cluster configuration
Policy output:
–Short actions (per bin)
Considerations:
–Performance model
–Overprovision buffer
–Action Executor: limit actions to X kb/s
[Control loop diagram: the SCADS namespace's sampled workload is smoothed into a workload histogram; the policy combines it with the cluster config and the performance model to produce actions, which the Action Executor applies back to the SCADS namespace.]
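A rough sketch of the per-bin decision logic follows; the threshold formula and action names are assumptions for illustration, not the actual SCADS policy code.

// Rough sketch of a per-bin, workload-driven policy.
sealed trait Action
case class AddReplica(bin: Int)                   extends Action
case class MoveBin(bin: Int, destination: String) extends Action   // Stage 2
case object AllocateServer                        extends Action   // Stage 3

case class BinStats(bin: Int, reqPerSec: Double, replicas: Int, server: String)

def policy(bins: Seq[BinStats],
           maxReqPerServer: Double,       // from the performance model
           overprovisionBuffer: Double    // e.g. 0.3 leaves 30% headroom
          ): Seq[Action] = {
  val threshold = maxReqPerServer * (1.0 - overprovisionBuffer)
  bins.collect {
    // Stage 1: a bin whose per-replica load exceeds the threshold gets another replica.
    // Stages 2 and 3 (moving bins between servers, allocating or releasing servers)
    // would be decided here as well, from the same histogram.
    case b if b.reqPerSec / b.replicas > threshold => AddReplica(b.bin)
  }
}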

19 Example Experiment
–Workload: Ebates.com + Wikipedia's MJ spike (see Bodik et al., SOCC 2010 for workload generation)
–One million (key, value) pairs, each ~256 bytes
–Model: max sustainable workload per server
–Cost: machine cost of 1 unit per 10 minutes
–SLA: 99th percentile of get/put latency
–Deployment: m1.small instances on EC2, 1 GB of RAM; server boot-up time: 48 seconds
–Delay server removal until 2 minutes left
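For reference, the experiment parameters above can be gathered into a single configuration record; the field names in this sketch are illustrative, the values are those stated on the slide.

// Experiment parameters as stated on the slide; field names are illustrative.
case class ExperimentConfig(
  numPairs: Int,              // (key, value) pairs loaded into the store
  valueSizeBytes: Int,
  costUnitsPer10Min: Int,     // machine cost
  slaPercentile: Double,      // SLA measured at this percentile of get/put latency
  ec2InstanceType: String,
  serverBootSec: Int,
  removalDelayMin: Int)       // "delay server removal until 2 minutes left"

val experiment = ExperimentConfig(
  numPairs          = 1000000,
  valueSizeBytes    = 256,
  costUnitsPer10Min = 1,
  slaPercentile     = 0.99,
  ec2InstanceType   = "m1.small",
  serverBootSec     = 48,
  removalDelayMin   = 2)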

20 Goal: selectively absorb hotspot. [Workload plot; y-axis in thousands of requests per second.]

21 Actions during the spike. [Timeline from 10:00 to 10:14 showing data movement (Kb/s) and actions during the spike: add replica; move data, partition; move data, coalesce.]

22 Configuration at end of spike. [Plot: per-server workload and # keys after added replicas.]

23 Cost comparison to fixed and optimal
Fixed allocation policy: 648 server units. Optimal policy: 310 server units.

Overprovision factor | get/put SLA (ms) | # server units | % savings (vs fixed alloc)
0.5                  | 180/250          | 358            | 48
0.6                  | 140/225          | 389            | 40
0.7                  | 120/200          | 422            | 35

24 PIQL [Armbrust et al., SIGMOD 2010 (demo) and SOCC 2010 (design paper)]
"Performance Insightful" language subset. The compiler reasons about operation bounds:
–Unbounded queries are disallowed
–Queries above a specified threshold generate a warning
–Predeclare query templates: the optimizer decides what indexes (i.e., materialized views) are needed
Provides: bounded number of operations + strong SLAs = predictable performance?
[Spectrum diagram: RDBMS ... NoSQL]
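This compile-time reasoning can be pictured as computing a worst-case operation count over the query plan from the schema's cardinality limits and the query's LIMITs, then rejecting or warning. The sketch below is a simplified illustration of that idea; the plan node types, maxOps, and check are assumptions, not the PIQL optimizer.

// Simplified illustration of reasoning about operation bounds for a query plan.
sealed trait Plan
case class IndexLookup(ops: Int)          extends Plan   // e.g. a single get
case class BoundedScan(maxRecords: Int)   extends Plan   // bounded by a schema MAX or a LIMIT
case object UnboundedScan                 extends Plan   // no bound known
case class Join(outer: Plan, inner: Plan) extends Plan   // inner evaluated once per outer record

// None means "unbounded"; otherwise the worst-case number of operations.
def maxOps(p: Plan): Option[Int] = p match {
  case IndexLookup(n)     => Some(n)
  case BoundedScan(m)     => Some(m)
  case UnboundedScan      => None
  case Join(outer, inner) => for (o <- maxOps(outer); i <- maxOps(inner)) yield o + o * i
}

def check(p: Plan, warnThreshold: Int): Unit = maxOps(p) match {
  case None                         => sys.error("rejected: unbounded query")
  case Some(n) if n > warnThreshold => println(s"warning: query may perform up to $n operations")
  case Some(n)                      => println(s"ok: at most $n operations")
}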

25 PIQL DDL

ENTITY User {
  string username,
  string password,
  PRIMARY KEY(username)}

ENTITY Subscription {
  boolean approved,
  string owner,
  string target,
  FOREIGN KEY owner REF User,
  FOREIGN KEY target REF User MAX 5000,
  PRIMARY KEY(owner, target)}

ENTITY Thought {
  int timestamp,
  string owner,
  string text,
  FOREIGN KEY owner REFERENCES User,
  PRIMARY KEY(owner, timestamp)}

Foreign keys are required for joins; cardinality limits (MAX) are required for un-paginated joins.

26 More Queries. "Return the most recent thoughts from all of my approved subscriptions." Operations are bounded via the schema's cardinality limits (MAX) and the query's LIMIT.
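One way to see the bound: assuming the MAX 5000 limit in the DDL caps how many subscriptions the query must consider, and assuming a per-page LIMIT on returned thoughts (10 here is just an example value), the worst-case operation count is known before the query ever runs. A back-of-the-envelope sketch:

// Back-of-the-envelope bound for the thoughtstream query.
val maxSubscriptions  = 5000                                // from "... REF User MAX 5000"
val thoughtsPerPage   = 10                                  // assumed LIMIT in the query template
val subscriptionReads = maxSubscriptions                    // read each subscription once
val thoughtReads      = maxSubscriptions * thoughtsPerPage  // worst case: newest thoughts per subscription
val worstCaseOps      = subscriptionReads + thoughtReads
println(s"worst-case operations: $worstCaseOps")            // 55000, known at compile time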

27 PIQL: Help Fix "Bad" Queries
Interactive Query Visualizer:
–Shows record counts and # ops
–Highlights unbounded parts of the query
–SIGMOD'10 demo: piql.knowsql.org
[Spectrum diagram: RDBMS ... NoSQL]

28 PIQL + SCADS. Goals are "Scale Independence" and "Performance Insightfulness." SCADS provides a scalable foundation with SLA adherence; PIQL uses language restrictions, schema limits, and precomputed views to bound the number of SCADS operations per query. Together they bridge the gap between the "SQL" and "NoSQL" worlds.

29 Spark: Support for Iterative Data-Intensive Computing (M. Zaharia et al., HotCloud Workshop 2010)

30 Analytics: Logistic Regression. Goal: find the best line separating 2 datasets. [Scatter plot of + and – points, with the target separating line and a random initial line.]

31 Serial Version

val data = readData(...)
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- data) {
    val scale = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient
}
println("Final w: " + w)

32 Spark Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  for (p <- data) {
    val scale = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  }
  w -= gradient.value
}
println("Final w: " + w)

34 Spark Version

val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  var gradient = spark.accumulator(Vector.zeros(D))
  data.foreach(p => {
    val scale = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += scale * p.x
  })
  w -= gradient.value
}
println("Final w: " + w)

35 Spark Version with FP

val data = spark.hdfsTextFile(...).map(readPoint).cache()
var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  w -= data.map(p => {
    val scale = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
    scale * p.x
  }).reduce(0.0, _+_)
}
println("Final w: " + w)

36 Iterative Processing Dataflow. [Diagram comparing Hadoop / Dryad with Spark: each iteration applies f(x, w) to the data x to update the parameter w; with Spark, x can stay cached across iterations.]

37 Performance. [Chart: 40 s per iteration for the Hadoop/Dryad baseline; with Spark, 60 s for the first iteration and 2 s for further iterations.]

38 What about the People?

39 Participatory Culture - "Indirect." John Murrell (GM SV, 9/17/09): "…every time we use a Google app or service, we are working on behalf of the search sovereign, creating more content for it to index and monetize or teaching it something potentially useful about our desires, intentions and behavior."

40 Participatory Culture - Direct

41 Crowdsourcing Example. From: Yan, Kumar, Ganesan, "CrowdSearch: Exploiting Crowds for Accurate Real-time Image Search on Mobile Phones," MobiSys 2010.

42 Mechanical Turk vs. Cluster Computing What challenges are similar? What challenges are new? Allocation, Cost, Reliability, Quality, Bias, Making jobs appealing, ….

43 AMP: Algorithms, Machines, People. Adaptive/Active Machine Learning and Analytics; Cloud Computing; CrowdSourcing; Massive and Diverse Data.

44 Clouds and Crowds

                 | Interactive Cloud                 | Analytic Cloud                               | People Cloud
Data Acquisition | Transactional systems, data entry | … + Sensors (physical & software)            | … + Web 2.0
Computation      | Get and Put                       | MapReduce, Parallel DBMS, Stream Processing  | … + Collaborative Structures (e.g., Mechanical Turk, Intelligence Markets)
Data Model       | Records                           | Numbers, Media                               | … + Text, Media, Natural Language
Response Time    | Seconds                           | Hours/Days                                   | … + Continuous

The Future Cloud will be a Hybrid of These.

45 AMPLab Technical Plan
Machine Learning & Analytics (Jordan, Fox, Franklin)
–Error bars on all answers
–Active learning, continuous/adaptive improvement
Data Management (Franklin, Joseph)
–Pay-as-you-go integration and structure
–Privacy
Infrastructure (Stoica, Shenker, Patterson, Katz)
–Nexus cloud OS and analytics languages
Hybrid Crowd/Cloud Systems (Bayen, Waddell)
–Incentive structures, systems aspects

46 Guiding Use Cases: Crowdsourced Sensing, Work, Policy, Journalism; Urban Micro-Simulation.

47 Algorithms, Machines & People
A holistic view of the entire stack. Highly interdisciplinary faculty & students. Developing a five-year plan; will dovetail with RAD Lab completion. For more information: franklin@cs.berkeley.edu
Goal: enable many people to collaborate to collect, generate, clean, make sense of, and utilize lots of data.
[Stack diagram: Data Visualization, Collaboration, HCI, Policies; Text Analytics; Machine Learning and Stats; Database, OLAP, MapReduce; Security and Privacy; MPP, Data Centers, Networks; Multi-Core Parallelism]

48 RAD Lab Prototype: System Architecture. [Architecture diagram: WebApp + PIQL instances on the interactive query path, over SCADS namespaces NS 1–NS 3 with directory services and SLA/policies; XTrace + Chukwa monitoring; Hadoop + HDFS, Spark, SEJITS, and MPI for batch/analytics, with M/R scheduling and log mining on Nexus; query and feedback paths connect the interactive and batch/analytics sides.]

