Muppet Scalable MapUpdate data-stream processing


1 Muppet Scalable MapUpdate data-stream processing
Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan
@WalmartLabs

2 Road Map
Motivation
The MapUpdate framework
An example data-stream computation
Muppet implementation

3 The challenge
Growing numbers of large, fast data streams
  300+ million Twitter status updates daily
  5+ million Foursquare checkins daily
  3+ billion Facebook Likes and comments daily
  Streams never stop
Growing numbers of applications for data streams
  Computations need to scale with the data
  Applications need to stay up-to-date (“What’s going on now?”)
  Machines fail

4 The wish list
Deliver low-latency processing
  Application stays near real-time with its input stream
  Computed data can be queried live
Scale up on commodity hardware with computation and stream rate
Easy to program
  Simple model to enable rapid development of many applications
  Ideally resemble widely adopted MapReduce

5 Data-stream computation
Big data: MapReduce (Hadoop)
  Map and Reduce steps
  Batch process large input (e.g., from HDFS)
  Hadoop distributes computation
Fast data: MapUpdate (Muppet)
  Map and Update steps
  Continuously process streaming input (e.g., from network)
  Muppet maintains computation and manages memory/storage

6 The MapReduce framework (Hadoop)
Event: A <key, value> pair of data
Map: A function that performs (stateless) computation on incoming events
Reduce: A function that combines all input for a particular key
Application: Map -> Reduce

7 The MapUpdate framework (Muppet)
Event: A <key, value> pair of data
Map: A function that performs (stateless) computation on incoming events
Update: A function that updates a slate using incoming events
Application: A directed graph of Mappers and Updaters
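The definitions above can be sketched as a small single-process Perl program. This is only an illustration of the model under stated assumptions, not Muppet's API: the names add_performer, publish, %slates and %subscribers, and the word-count application, are all hypothetical. Events are delivered synchronously here, whereas Muppet distributes them across workers.

```perl
use strict;
use warnings;

# Hypothetical in-memory runner; every name here is an assumption.
my %slates;        # updater name -> { key -> slate }
my %subscribers;   # stream name  -> list of performers

sub add_performer {
    my (%p) = @_;  # name, kind ('map' or 'update'), stream, code
    push @{ $subscribers{ $p{stream} } }, \%p;
}

sub publish {
    my ($stream, $key, $value) = @_;
    for my $p (@{ $subscribers{$stream} || [] }) {
        if ($p->{kind} eq 'map') {
            # A mapper is stateless: it sees only the event and may publish.
            $p->{code}->($key, $value);
        } else {
            # An updater receives its own slate for this key and returns it.
            my $slate = $slates{ $p->{name} }{$key} ||= {};
            $slates{ $p->{name} }{$key} = $p->{code}->($key, $value, $slate);
        }
    }
}

# Application graph: lines -> splitter (Mapper) -> words -> counter (Updater)
add_performer(name => 'splitter', kind => 'map', stream => 'lines',
    code => sub {
        my ($k, $line) = @_;
        publish('words', lc($_), 1) for split ' ', $line;
    });
add_performer(name => 'counter', kind => 'update', stream => 'words',
    code => sub {
        my ($k, $v, $slate) = @_;
        $slate->{count} += $v;
        return $slate;
    });

publish('lines', 'l1', 'to be or not to be');
print "$_ => $slates{counter}{$_}{count}\n" for sort keys %{ $slates{counter} };
```

Each updater has its own slate namespace in %slates, so two updaters subscribed to the same stream would keep separate slates for the same key.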

8 A MapUpdate application
Note that each Update function has its own universe of slates. (Each Update function maintains its own separate slate for events of key k.)

9 An example Muppet application
Checkin counts on Foursquare
  Identify Foursquare checkins at various retailers
  Maintain a live count of retailer checkins
  Enable a display of the current counts at any time

10 An example Muppet application
Checkin counts on Foursquare
  Source: Read Foursquare stream and create key-value-pair events.
  Map: For each checkin event, identify a retailer and publish if found.
  Update: For each retailer checkin, increment appropriate count.
  Updater slates hold live retailer checkin counts.

11 An example Muppet application
Source: Read Foursquare stream and create key-value-pair events.
Input (excerpt):
  { "checkin": {
      "created": ,
      "venue": { "id": , "name": "Walmart Neighborhood Market" }

12 An example Muppet application
Source: Read Foursquare stream and create key-value-pair events.
Output:
  453407, { "checkin": {
      "created": ,
      "venue": { "id": , "name": "Walmart Neighborhood Market" }

13 An example Muppet application
Map: For each checkin event, identify a retailer and publish if found.
Input:
  453407, { "checkin": {
      "created": ,
      "venue": { "id": , "name": "Walmart Neighborhood Market" }

14 An example Muppet application
Map: For each checkin event, identify a retailer and publish if found.
Output:
  Walmart , {
    "checkin": {
      "created": ,
      "venue": { "id": , "name": "Walmart Neighborhood Market" } },
    "kosmix": { "timeslot": , "interval": 900, "retailer": "Walmart"

15 An example Muppet application
Update: For each retailer checkin, increment appropriate count.
Input:
  Walmart , {
    "checkin": {
      "created": ,
      "venue": { "id": , "name": "Walmart Neighborhood Market" } },
    "kosmix": { "timeslot": , "interval": 900, "retailer": "Walmart"

16 An example Muppet application
Update: For each retailer checkin, increment appropriate count.
Slate:
  Walmart , { "retailer": "Walmart", "timeslot": , "interval": 900, "count": 1 }

17 The Source (stream receiver)
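    while ($checkin = <$sock>) {
        # Drop anything that precedes the JSON object on the line.
        $checkin =~ s/^[^{]*//;
        next if ($checkin eq "");
        $checkin_count++;
        my $event;
        eval { $event = decode_json($checkin); };
        # Count the line as invalid if parsing failed or it is not a checkin.
        if ($@ or (!defined($event->{checkin}))) {
            $invalid_count++;
        } else {
            $event = $event->{checkin};
            my $checkin_time = $event->{created};
            my $venue = $event->{venue}->{id};
            # Publish the event to the FoursquareCheckin stream, keyed by venue id.
            $self->publish("FoursquareCheckin", $event, $venue);
        }
    }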

18 The Map (Foursquare::CheckinMapper)
    sub map {
        my $self = shift;
        my $event = shift;
        my $checkin = $event->{checkin};
        # Round the checkin time down to the start of its 900-second slot.
        my $timeslot = int($checkin->{created} / 900) * 900;
        $event->{kosmix}->{timeslot} = $timeslot;
        $event->{kosmix}->{interval} = 900;
        my $venue_name = $checkin->{venue}->{name};
        # Identify the retailer from the venue name, if any.
        my $retailer = 0;
        $retailer = 'ToysRUs' if ($venue_name =~ /toys.*r.*us/i);
        $retailer = 'Walmart' if ($venue_name =~ /wal.*mart/i);
        $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);
        if ($retailer) {
            $event->{kosmix}->{retailer} = $retailer;
            # Key the outgoing event by "<retailer>.<timeslot>".
            $self->publish("FoursquareRetailerCheckin", $event, $retailer.".".$timeslot);
        }
    }
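The time-slot arithmetic in the mapper can be checked in isolation. The following is a minimal sketch; the helper names timeslot and slate_key and the timestamps are hypothetical, chosen only to show that checkins within the same 900-second window share a key and therefore a slate.

```perl
use strict;
use warnings;

# Round a Unix timestamp down to the start of its interval (900 s = 15 min).
sub timeslot {
    my ($created, $interval) = @_;
    return int($created / $interval) * $interval;
}

# Build a key in the same "<retailer>.<timeslot>" form the mapper publishes.
sub slate_key {
    my ($retailer, $created, $interval) = @_;
    return $retailer . "." . timeslot($created, $interval);
}

# Hypothetical timestamps, for illustration only.
print slate_key('Walmart', 1_000_000_000, 900), "\n";  # Walmart.999999900
print slate_key('Walmart', 1_000_000_500, 900), "\n";  # Walmart.999999900
print slate_key('Walmart', 1_000_000_900, 900), "\n";  # Walmart.1000000800
```

The first two checkins fall in the same slot and would increment the same count; the third starts a new slate.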

19 The Update (Foursquare::RetailerUpdater)
    use Muppet::Updater;
    package Foursquare::RetailerUpdater;
    our @ISA = qw( Muppet::Updater );
    use strict;

    sub update {
        my $self = shift;
        my $event = shift;
        my $slate = shift;
        my $config = shift;
        my $key = shift;
        # Copy the slot metadata from the event into the slate.
        $slate->{timeslot} = $event->{kosmix}->{timeslot};
        $slate->{interval} = $event->{kosmix}->{interval};
        $slate->{retailer} = $event->{kosmix}->{retailer};
        # Increment the live checkin count for this retailer and time slot.
        $slate->{count} += 1;
        return $slate;
    }

    1;

20 The application configuration (flow graph)
    {
      "performer" : "foursquare_mapper",
      "type" : "perl",
      "class" : "Foursquare::CheckinMapper",
      "muppet_type" : "Mapper",
      "subscribes_to" : [ "FoursquareCheckin" ],
      "publishes_to" : [ "FoursquareRetailerCheckin" ]
    },
    {
      "performer" : "foursquare_retailer",
      "class" : "Foursquare::RetailerUpdater",
      "muppet_type" : "Updater",
      "workers" : 4,
      "slate_cache_max" : 10000,
      "slate_cache_write_after" : 1,
      "subscribes_to" : [ "FoursquareRetailerCheckin" ]
    }
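One way to read this configuration is as the edge list of the flow graph. The sketch below is not part of Muppet; it assumes the performer entries are wrapped in a JSON array (the slide shows only a fragment) and uses the core JSON::PP module to list the streams that no performer publishes, which must therefore be fed by a Source.

```perl
use strict;
use warnings;
use JSON::PP;

# Assumption: the performer entries sit inside a JSON array.
my $config = decode_json(<<'JSON');
[
  { "performer": "foursquare_mapper", "muppet_type": "Mapper",
    "subscribes_to": ["FoursquareCheckin"],
    "publishes_to": ["FoursquareRetailerCheckin"] },
  { "performer": "foursquare_retailer", "muppet_type": "Updater",
    "subscribes_to": ["FoursquareRetailerCheckin"] }
]
JSON

# Streams that some performer in the graph publishes to.
my %published = map { $_ => 1 }
                map { @{ $_->{publishes_to} || [] } } @$config;

# Subscribed streams with no publisher inside the graph.
my @external = grep { !$published{$_} }
               map { @{ $_->{subscribes_to} || [] } } @$config;

print "streams fed from outside the graph: @external\n";
```

For the example application, the only such stream is FoursquareCheckin, the one the Source publishes to.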

21 Example results

22 Implementation

23 Implementation
Slate management
  Slates are cached for performance
  Cache is sharded by key for load distribution across machines
  Slates are written to distributed key-value store for durability
Event flow
  Event queues buffer transient load spikes within an application
  Host failover remaps load away from an unresponsive machine
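The slate-management points can be illustrated with a minimal single-process sketch. Everything here is an assumption made for illustration: the hash function, the plain hash standing in for the distributed key-value store, and the names shard_for and update_slate. It is not Muppet's implementation. The write-after threshold is loosely modeled on the "slate_cache_write_after" setting in the configuration slide.

```perl
use strict;
use warnings;

my $WORKERS     = 4;  # cf. "workers" in the example configuration
my $WRITE_AFTER = 2;  # flush a slate after this many updates (illustrative)

my @caches = map { +{} } 1 .. $WORKERS;  # one slate cache per worker
my %store;                               # stand-in for the durable store
my %dirty;                               # key -> updates since last write

# Pick a worker for a key with a simple string hash (assumed, not Muppet's).
sub shard_for {
    my ($key) = @_;
    my $h = 0;
    $h = ($h * 31 + ord($_)) % 2_147_483_648 for split //, $key;
    return $h % $WORKERS;
}

sub update_slate {
    my ($key, $updater) = @_;
    my $cache = $caches[ shard_for($key) ];
    # Cache miss: load a copy from the store, or start with an empty slate.
    $cache->{$key} //= { %{ $store{$key} || {} } };
    $updater->($cache->{$key});
    # Write a copy back to the store once enough updates have accumulated.
    if (++$dirty{$key} >= $WRITE_AFTER) {
        $store{$key} = { %{ $cache->{$key} } };
        $dirty{$key} = 0;
    }
}

update_slate('Walmart', sub { $_[0]{count}++ }) for 1 .. 3;
printf "cached=%d stored=%d\n",
    $caches[ shard_for('Walmart') ]{Walmart}{count}, $store{Walmart}{count};
```

After three updates the cached count is 3 but the stored count is 2, which shows the trade-off: a larger write-after threshold reduces store traffic but leaves more updates exposed if a host fails before the next write.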

24 Challenges
Host failover
Hotspots (uneven load)
Parallelization
Slate caching
Overload stability

25 Hotspots
Some key distributions are highly nonuniform (e.g., Zipfian)
  Keys based on natural-language word usage
  Keys based on a set of varying popularity
Mappers: Run any event anywhere.
Updaters: Popular keys need access to the same slate.
  Split associative and commutative computations
    Split computation parallelizes partial results.
    Propagate partial results to final result.
  Reduce slate serialization/deserialization overhead
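The splitting idea for associative and commutative computations can be sketched for a simple count. The names update_partial and final_count, the round-robin assignment, and the "<key>#<i>" partial-slate keys are assumptions for illustration, not Muppet's mechanism.

```perl
use strict;
use warnings;

my $SPLITS = 3;  # illustrative number of partial slates per hot key
my %partial;     # "<key>#<i>" -> partial count slate
my $next = 0;    # round-robin counter

# Spread events for a hot key across several partial slates.
sub update_partial {
    my ($key) = @_;
    my $i = $next++ % $SPLITS;
    $partial{ join('#', $key, $i) }{count}++;
}

# Addition is associative and commutative, so the partial counts can be
# combined in any order to produce the final result.
sub final_count {
    my ($key) = @_;
    my $total = 0;
    for my $i (0 .. $SPLITS - 1) {
        my $slate = $partial{ join('#', $key, $i) } or next;
        $total += $slate->{count};
    }
    return $total;
}

update_partial('Walmart') for 1 .. 10;
print "final count: ", final_count('Walmart'), "\n";  # final count: 10
```

Because each partial slate can live on a different worker, no single slate has to absorb every event for the popular key.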

26 Usage
Time: Running since mid-2010
Developers: More than a dozen developers at WalmartLabs have used Muppet to develop their applications
Data: Billions of events, tens of millions of slates processed

27 Related work
MapReduce: work toward incremental batch runs of MapReduce, rather than continuous event processing in a revised framework (e.g., MapUpdate)
  MapReduce Online (Condie et al.)
  Nova (Olston et al.)
Event-flow systems: systems that focus on the dispatch of events, leaving application state and storage (cf. MapUpdate slates) as a problem for the application developer
  S4 (Neumeyer et al.)
  Storm (Marz et al.)
Streaming-query systems: systems that run and optimize queries in a prescribed query language (contrast low-level, general-purpose MapUpdate operators)
  Aurora (StreamBase Systems) (Zdonik et al.)
  SPADE for System S (InfoSphere Streams) (Gedik et al.)

28 Conclusion
Big Data : MapReduce :: Fast Data : MapUpdate
Create soft-real-time applications on a simple programming model.
Distributed stream-processing infrastructure scales computation across cores.

29 Muppet Scalable data-stream processing Big Fast

