1 Muppet: Scalable MapUpdate data-stream processing
Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan


2 Road Map
Motivation
The MapUpdate framework
An example data-stream computation
Muppet implementation

3 The challenge
Growing numbers of large, fast data streams
  – 300+ million Twitter status updates daily
  – 5+ million Foursquare checkins daily
  – 3+ billion Facebook Likes and comments daily
Streams never stop
Growing numbers of applications for data streams
  – Computations need to scale with the data
  – Applications need to stay up-to-date (“What’s going on now?”)
Machines fail

4 The wish list
Deliver low-latency processing
  – Application stays near real-time with its input stream
  – Computed data can be queried live
Scale up on commodity hardware with computation and stream rate
Easy to program
  – Simple model to enable rapid development of many applications
  – Ideally resemble widely adopted MapReduce

5 Data-stream computation
Big data: MapReduce (Hadoop)
  – Map and Reduce steps
  – Batch process large input (e.g., from HDFS)
  – Hadoop distributes computation
Fast data: MapUpdate (Muppet)
  – Map and Update steps
  – Continuously process streaming input (e.g., from network)
  – Muppet maintains computation and manages memory/storage

6 The MapReduce framework (Hadoop)
Event
  – A key-value pair of data
Map
  – A function that performs (stateless) computation on incoming events
Reduce
  – A function that combines all input for a particular key
Application
  – Map -> Reduce

7 The MapUpdate framework (Muppet)
Event
  – A key-value pair of data
Map
  – A function that performs (stateless) computation on incoming events
Update
  – A function that updates a slate using incoming events
Application
  – A directed graph of Mappers and Updaters
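The model on this slide can be sketched in a few lines of Python. This is an illustration only, not Muppet's API (the slides show a Perl interface): the driver `run`, the functions `mapper` and `updater`, and the in-memory `slates` dictionary are all assumptions made for the example.

```python
from collections import defaultdict

def run(events, mapper, updater):
    """Feed (key, value) events through one mapper and one updater.

    Hypothetical single-process driver; Muppet itself distributes this
    work across machines and persists the slates.
    """
    slates = defaultdict(dict)  # one slate per updater key
    for key, value in events:
        # Map: stateless, may emit zero or more new events.
        for out_key, out_value in mapper(key, value):
            # Update: folds the event into the slate for its key.
            updater(out_key, out_value, slates[out_key])
    return dict(slates)

def mapper(key, value):
    # Emit an event keyed by retailer only when one was identified.
    if value.get("retailer"):
        yield value["retailer"], value

def updater(key, value, slate):
    # The slate is the only state the updater keeps between events.
    slate["count"] = slate.get("count", 0) + 1

events = [
    ("v1", {"retailer": "Walmart"}),
    ("v2", {"retailer": None}),
    ("v3", {"retailer": "Walmart"}),
]
slates = run(events, mapper, updater)
```

The only difference from a MapReduce-style reducer is that the updater sees one event at a time and carries its running state in the slate, so the result is available while the stream is still flowing.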

8 A MapUpdate application [figure]

9 An example Muppet application
Checkin counts on Foursquare
Identify Foursquare checkins at various retailers
Maintain a live count of retailer checkins
Enable a display of the current counts at any time

10 An example Muppet application
Checkin counts on Foursquare
Source: Read Foursquare stream and create key-value-pair events.
Map: For each checkin event, identify a retailer and publish if found.
Update: For each retailer checkin, increment appropriate count.
Updater slates hold live retailer checkin counts.

11 An example Muppet application
Source: Read Foursquare stream and create key-value-pair events.
Input (excerpt):
    { "checkin": {
        "created": ,
        "venue": { "id": , "name": "Walmart Neighborhood Market" }

12 An example Muppet application
Source: Read Foursquare stream and create key-value-pair events.
Output:
    , { "checkin": {
        "created": ,
        "venue": { "id": , "name": "Walmart Neighborhood Market" }

13 An example Muppet application
Map: For each checkin event, identify a retailer and publish if found.
Input:
    , { "checkin": {
        "created": ,
        "venue": { "id": , "name": "Walmart Neighborhood Market" }

14 An example Muppet application
Map: For each checkin event, identify a retailer and publish if found.
Output:
    Walmart , { "checkin": {
        "created": ,
        "venue": { "id": , "name": "Walmart Neighborhood Market" } },
      "kosmix": { "timeslot": , "interval": 900, "retailer": "Walmart" }

15 An example Muppet application
Update: For each retailer checkin, increment appropriate count.
Input:
    Walmart , { "checkin": {
        "created": ,
        "venue": { "id": , "name": "Walmart Neighborhood Market" } },
      "kosmix": { "timeslot": , "interval": 900, "retailer": "Walmart" }

16 An example Muppet application
Update: For each retailer checkin, increment appropriate count.
Slate:
    Walmart , { "retailer": "Walmart", "timeslot": , "interval": 900, "count": 1 }

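17 The Source (stream receiver)
    # The input handle was lost in this transcript; STDIN is an assumption.
    while ($checkin = <STDIN>) {
        $checkin =~ s/^[^{]*//;        # drop anything before the JSON object
        next if ($checkin eq "");
        $checkin_count++;
        my $event;
        eval { $event = decode_json($checkin); };
        # "$@" is reconstructed; the transcript shows only "if or (...)".
        if ($@ or (!defined($event->{checkin}))) {
            $invalid_count++;
        } else {
            $event = $event->{checkin};
            my $checkin_time = $event->{created};
            my $venue = $event->{venue}->{id};
            # Publish the event to the FoursquareCheckin stream, keyed by venue id.
            $self->publish("FoursquareCheckin", $event, $venue);
        }
    }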

18 The Map (Foursquare::CheckinMapper)
    sub map {
        my $self = shift;
        my $event = shift;
        my $checkin = $event->{checkin};

        # Round the checkin time down to the start of its 900-second slot.
        my $timeslot = int($checkin->{created} / 900) * 900;
        $event->{kosmix}->{timeslot} = $timeslot;
        $event->{kosmix}->{interval} = 900;

        # Identify the retailer from the venue name; 0 means no match.
        my $venue_name = $checkin->{venue}->{name};
        my $retailer = 0;
        $retailer = 'ToysRUs'  if ($venue_name =~ /toys.*r.*us/i);
        $retailer = 'Walmart'  if ($venue_name =~ /wal.*mart/i);
        $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

        if ($retailer) {
            $event->{kosmix}->{retailer} = $retailer;
            # Key the outgoing event by retailer and timeslot.
            $self->publish("FoursquareRetailerCheckin", $event,
                           $retailer.".".$timeslot);
        }
    }
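The keying logic in this mapper (round the timestamp down to a 900-second slot, then key by retailer and slot) can be checked in isolation. The following Python sketch mirrors the Perl above; the function names and the sample timestamp are assumptions for illustration, and the regular expressions are copied from the slide.

```python
import re

INTERVAL = 900  # seconds per timeslot, as on the slide

# Patterns copied from the slide's mapper, checked in the same order.
RETAILERS = [
    ("ToysRUs", re.compile(r"toys.*r.*us", re.I)),
    ("Walmart", re.compile(r"wal.*mart", re.I)),
    ("SamsClub", re.compile(r"sam.*club", re.I)),
]

def timeslot(created):
    # Round a Unix timestamp down to the start of its interval.
    return (created // INTERVAL) * INTERVAL

def retailer_for(venue_name):
    match = None
    for name, pattern in RETAILERS:
        if pattern.search(venue_name):
            match = name  # a later match overrides an earlier one, as in the Perl
    return match

def event_key(venue_name, created):
    # Returns "Retailer.timeslot", or None when no retailer is recognized.
    retailer = retailer_for(venue_name)
    if retailer is None:
        return None
    return f"{retailer}.{timeslot(created)}"

# Hypothetical timestamp, 123 seconds into its slot.
key = event_key("Walmart Neighborhood Market", 1325376123)
```

Because the key includes the timeslot, each retailer gets a fresh slate every 15 minutes, which is what lets the updater keep a simple running count per slate.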

19 The Update (Foursquare::RetailerUpdater)
    use Muppet::Updater;
    # The package name and @ISA are reconstructed; the transcript shows
    # only "package = qw( Muppet::Updater );".
    package Foursquare::RetailerUpdater;
    our @ISA = qw( Muppet::Updater );
    use strict;

    sub update {
        my $self = shift;
        my $event = shift;
        my $slate = shift;
        my $config = shift;
        my $key = shift;

        # Copy the identifying fields, then bump the running count.
        $slate->{timeslot} = $event->{kosmix}->{timeslot};
        $slate->{interval} = $event->{kosmix}->{interval};
        $slate->{retailer} = $event->{kosmix}->{retailer};
        $slate->{count} += 1;
        return $slate;
    }
    1;

20 The application configuration (flow graph)
    {
      "performer" : "foursquare_mapper",
      "type" : "perl",
      "class" : "Foursquare::CheckinMapper",
      "muppet_type" : "Mapper",
      "subscribes_to" : [ "FoursquareCheckin" ],
      "publishes_to" : [ "FoursquareRetailerCheckin" ]
    },
    {
      "performer" : "foursquare_retailer",
      "type" : "perl",
      "class" : "Foursquare::RetailerUpdater",
      "muppet_type" : "Updater",
      "workers" : 4,
      "slate_cache_max" : 10000,
      "slate_cache_write_after" : 1,
      "subscribes_to" : [ "FoursquareRetailerCheckin" ]
    }
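To see how such a configuration describes a flow graph, the Python sketch below derives a stream-to-subscriber routing table from the `subscribes_to` fields. It is not Muppet's loader: the enclosing JSON array, the trimmed field set, and the `build_routes` helper are assumptions for illustration.

```python
import json
from collections import defaultdict

# The two performers from the slide, trimmed to the routing fields and
# wrapped in an array (the wrapper is an assumption).
CONFIG = json.loads("""
[
  {"performer": "foursquare_mapper", "muppet_type": "Mapper",
   "subscribes_to": ["FoursquareCheckin"],
   "publishes_to": ["FoursquareRetailerCheckin"]},
  {"performer": "foursquare_retailer", "muppet_type": "Updater",
   "subscribes_to": ["FoursquareRetailerCheckin"]}
]
""")

def build_routes(config):
    # Map each stream name to the performers that subscribe to it.
    routes = defaultdict(list)
    for performer in config:
        for stream in performer.get("subscribes_to", []):
            routes[stream].append(performer["performer"])
    return dict(routes)

routes = build_routes(CONFIG)
```

Here the mapper's `publishes_to` stream is the updater's `subscribes_to` stream, which is the single edge of this application's graph.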

21 Example results [figure]

22 Implementation [figure]

23 Implementation
Slate management
  – Slates are cached for performance
  – Cache is sharded by key for load distribution across machines
  – Slates are written to a distributed key-value store for durability
Event flow
  – Event queues buffer transient load spikes within an application
  – Host failover remaps load away from an unresponsive machine
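The slate-management points above can be sketched as follows in Python. This is one possible arrangement, not Muppet's implementation: `shard_for`, `SlateCache`, the MD5-modulo sharding, the LRU eviction policy, and the plain dictionary standing in for the distributed key-value store are all assumptions.

```python
import hashlib
from collections import OrderedDict

def shard_for(key, hosts):
    # Stable hash of the key, so every process picks the same host.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]

class SlateCache:
    """Bounded in-memory slate cache backed by a durable store."""

    def __init__(self, store, max_slates):
        self.store = store            # dict standing in for the key-value store
        self.max_slates = max_slates
        self.cache = OrderedDict()    # ordered from least to most recently used

    def _evict_if_full(self):
        if len(self.cache) > self.max_slates:
            self.cache.popitem(last=False)  # drop the least recently used slate

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
        else:
            # Cache miss: load a copy from the store, or start an empty slate.
            self.cache[key] = dict(self.store.get(key, {}))
            self._evict_if_full()
        return self.cache[key]

    def put(self, key, slate):
        self.cache[key] = slate
        self.cache.move_to_end(key)
        self.store[key] = dict(slate)  # write through, so eviction loses nothing
        self._evict_if_full()

store = {}
cache = SlateCache(store, max_slates=2)
slate = cache.get("Walmart.1325376000")
slate["count"] = slate.get("count", 0) + 1
cache.put("Walmart.1325376000", slate)
```

Writing through on every `put` corresponds roughly to a write-after setting of 1; a real system could batch writes to trade durability for throughput. Failover in this sketch would amount to calling `shard_for` with the unresponsive host removed from `hosts`, although plain modulo hashing remaps more keys than strictly necessary.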

24 Challenges
Host failover
Hotspots (uneven load)
Parallelization
Slate caching
Overload stability

25 Hotspots
Some key distributions are highly nonuniform (e.g., Zipfian)
  – Keys based on natural-language word usage
  – Keys based on a set of varying popularity
Mappers: Run any event anywhere.
Updaters: Popular keys need access to the same slate.
  – Split associative and commutative computations
      Split computation parallelizes partial results.
      Propagate partial results to final result.
  – Reduce slate serialization/deserialization overhead
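Splitting an associative and commutative computation can be illustrated with counting, where partial counts kept by separate workers are summed into the final result. The Python sketch below is an assumption-laden illustration, not Muppet's mechanism: `split_update`, `merge`, the random assignment of events to partial counters, and the sample workload are all hypothetical.

```python
import random
from collections import Counter

def split_update(events, num_partials, seed=0):
    # Spread events for every key, including hot ones, over several
    # partial counters so that no single slate absorbs all the updates.
    rng = random.Random(seed)
    partials = [Counter() for _ in range(num_partials)]
    for key in events:
        partials[rng.randrange(num_partials)][key] += 1
    return partials

def merge(partials):
    # Addition is associative and commutative, so the partial counts can
    # be combined in any order and grouping.
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts
    return total

# Skewed workload: one hot key and one rare key.
events = ["Walmart"] * 1000 + ["ToysRUs"] * 10
partials = split_update(events, num_partials=4)
totals = merge(partials)
```

The same trick does not apply to updates whose result depends on event order, which is why the slide limits it to associative and commutative computations.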

26 Usage
Time
  – Running since mid-2010
Developers
  – More than a dozen developers at WalmartLabs have used Muppet to develop their applications
Data
  – Billions of events, tens of millions of slates processed

27 Related work
MapReduce: work toward incremental batch runs of MapReduce, rather than continuous event processing in a revised framework (e.g., MapUpdate)
  – MapReduce Online (Condie et al.)
  – Nova (Olston et al.)
Event-flow systems: systems that focus on the dispatch of events, leaving application state and storage (cf. MapUpdate slates) as a problem for the application developer
  – S4 (Neumeyer et al.)
  – Storm (Marz et al.)
Streaming-query systems: systems that run and optimize queries in a prescribed query language (contrast low-level, general-purpose MapUpdate operators)
  – Aurora (StreamBase Systems) (Zdonik et al.)
  – SPADE for System S (InfoSphere Streams) (Gedik et al.)

28 Conclusion
Big Data : MapReduce :: Fast Data : MapUpdate
Create soft-real-time applications on a simple programming model.
Distributed stream-processing infrastructure scales computation across cores.

29 Muppet
Scalable data-stream processing
Big Fast

