Muppet Scalable MapUpdate data-stream processing

Name: Muppet Scalable MapUpdate data-stream processing
Uploaded: 2017-10-02T02:52:48+00:00
Duration: PTM13S40
Channel: Dominique Cabell
Description: Muppet Scalable MapUpdate data-stream processing

Muppet Scalable MapUpdate data-stream processing
Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan
@WalmartLabs

Road Map
  Motivation
  The MapUpdate framework
  An example data-stream computation
  Muppet implementation

The challenge
  Growing numbers of large, fast data streams
    300+ million Twitter status updates daily
    5+ million Foursquare checkins daily
    3+ billion Facebook Likes and comments daily
    Streams never stop
  Growing numbers of applications for data streams
    Computations need to scale with the data
    Applications need to stay up-to-date (“What’s going on now?”)
    Machines fail

The wish list
  Deliver low-latency processing
    Application stays near real-time with its input stream
    Computed data can be queried live
  Scale up on commodity hardware with computation and stream rate
  Easy to program
    Simple model to enable rapid development of many applications
    Ideally resemble widely adopted MapReduce

Data-stream computation
  Big data: MapReduce (Hadoop)
    Map and Reduce steps
    Batch process large input (e.g., from HDFS)
    Hadoop distributes computation
  Fast data: MapUpdate (Muppet)
    Map and Update steps
    Continuously process streaming input (e.g., from network)
    Muppet maintains computation and manages memory/storage

The MapReduce framework (Hadoop)
  Event: A <key, value> pair of data
  Map: A function that performs (stateless) computation on incoming events
  Reduce: A function that combines all input for a particular key
  Application: Map -> Reduce

The MapUpdate framework (Muppet)
  Event: A <key, value> pair of data
  Map: A function that performs (stateless) computation on incoming events
  Update: A function that updates a slate using incoming events
  Application: A directed graph of Mappers and Updaters
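The definitions above can be made concrete with a small sketch. It is written in Python for brevity (the deck's own code is Perl) and is not Muppet's API: run_map_update, split_words, and count are hypothetical names, and the driver wires just one mapper to one updater instead of a general graph.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, Iterator, Tuple

Event = Tuple[str, Any]   # a <key, value> pair
Slate = Dict[str, Any]    # mutable per-key state held for an updater


def run_map_update(
    events: Iterable[Event],
    mapper: Callable[[Event], Iterator[Event]],
    updater: Callable[[Event, Slate], None],
) -> Dict[str, Slate]:
    """Hypothetical driver: pass each event through one mapper, then one updater."""
    slates: Dict[str, Slate] = defaultdict(dict)
    for event in events:
        # Map: stateless, may emit zero or more events per input event.
        for out_key, out_value in mapper(event):
            # Update: mutates the slate that belongs to the emitted key.
            updater((out_key, out_value), slates[out_key])
    return dict(slates)


def split_words(event: Event) -> Iterator[Event]:
    """Mapper: emit one <word, 1> event per word in the input text."""
    _, text = event
    for word in text.split():
        yield (word, 1)


def count(event: Event, slate: Slate) -> None:
    """Updater: keep a running total in the slate."""
    _, increment = event
    slate["count"] = slate.get("count", 0) + increment


result = run_map_update([("s1", "a b"), ("s2", "a")], split_words, count)
print(result)
```

Unlike a Reduce step, the updater never sees all input for a key at once; it sees one event at a time plus whatever it previously stored in the slate.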

A MapUpdate application
Note that each Update function has its own universe of slates. (Each Update function maintains its own separate slate for events of key k.)
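One way to picture "its own universe of slates" is a store addressed by the pair (updater, key). The Python sketch below is an illustration only; SlateStore and the updater names are hypothetical and say nothing about how Muppet lays out slates internally.

```python
from typing import Any, Dict, Tuple


class SlateStore:
    """Holds slates addressed by (updater name, event key)."""

    def __init__(self) -> None:
        self._slates: Dict[Tuple[str, str], Dict[str, Any]] = {}

    def slate_for(self, updater: str, key: str) -> Dict[str, Any]:
        # Create the slate on first use; later lookups return the same dict.
        return self._slates.setdefault((updater, key), {})


store = SlateStore()
# Two hypothetical updaters both receive events with key "Walmart",
# yet each one reads and writes only its own slate for that key.
store.slate_for("checkin_counter", "Walmart")["count"] = 3
store.slate_for("last_seen", "Walmart")["venue"] = "Walmart Neighborhood Market"
print(store.slate_for("checkin_counter", "Walmart"))
```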

An example Muppet application
  Checkin counts on Foursquare
    Identify Foursquare checkins at various retailers
    Maintain a live count of retailer checkins
    Enable a display of the current counts at any time

Checkin counts on Foursquare
  Source: Read Foursquare stream and create key-value-pair events.
  Map: For each checkin event, identify a retailer and publish if found.
  Update: For each retailer checkin, increment appropriate count.
  Updater slates hold live retailer checkin counts.
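The three stages can be sketched end to end in a few lines of Python (the deck's own implementation is Perl). This is a simplified illustration under stated assumptions: the venue ids and input lines are made-up sample data, events are keyed by retailer only (no timeslot), and source, map_checkin, and update_count are hypothetical names, not Muppet functions.

```python
import json
import re
from typing import Any, Dict, Iterable, Iterator, Tuple

Event = Tuple[str, Dict[str, Any]]

# Retailer name patterns; later entries override earlier ones on a match.
RETAILERS = [
    ("ToysRUs", re.compile(r"toys.*r.*us", re.IGNORECASE)),
    ("Walmart", re.compile(r"wal.*mart", re.IGNORECASE)),
    ("SamsClub", re.compile(r"sam.*club", re.IGNORECASE)),
]


def source(lines: Iterable[str]) -> Iterator[Event]:
    """Source: parse each JSON line and key the event by venue id."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict) or "checkin" not in record:
            continue  # skip records that are not checkins
        yield (str(record["checkin"]["venue"]["id"]), record)


def map_checkin(event: Event) -> Iterator[Event]:
    """Map: publish the event under its retailer's name, if one matches."""
    _, record = event
    name = record["checkin"]["venue"]["name"]
    retailer = None
    for label, pattern in RETAILERS:
        if pattern.search(name):
            retailer = label
    if retailer:
        yield (retailer, record)


def update_count(event: Event, slate: Dict[str, Any]) -> None:
    """Update: increment the live count held in the retailer's slate."""
    retailer, _ = event
    slate["retailer"] = retailer
    slate["count"] = slate.get("count", 0) + 1


# Made-up sample input; ids and venue names are for illustration only.
lines = [
    '{"checkin": {"venue": {"id": 1, "name": "Walmart Neighborhood Market"}}}',
    '{"checkin": {"venue": {"id": 2, "name": "Corner Coffee"}}}',
    "not json",
    '{"checkin": {"venue": {"id": 3, "name": "Walmart Supercenter"}}}',
    '{"checkin": {"venue": {"id": 4, "name": "Sam\'s Club"}}}',
]

slates: Dict[str, Dict[str, Any]] = {}
for checkin_event in source(lines):
    for retailer_event in map_checkin(checkin_event):
        update_count(retailer_event, slates.setdefault(retailer_event[0], {}))
print(slates)
```

Reading the slates dictionary at any moment gives the current counts, which is the "display at any time" requirement in miniature.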

Source: Read Foursquare stream and create key-value-pair events.
  Input (excerpt):
    { "checkin": { "created": , "venue": { "id": , "name": "Walmart Neighborhood Market" }

Source: Read Foursquare stream and create key-value-pair events.
  Output:
    453407, { "checkin": { "created": , "venue": { "id": , "name": "Walmart Neighborhood Market" }

Map: For each checkin event, identify a retailer and publish if found.
  Input:
    453407, { "checkin": { "created": , "venue": { "id": , "name": "Walmart Neighborhood Market" }

Map: For each checkin event, identify a retailer and publish if found.
  Output:
    Walmart , { "checkin": { "created": , "venue": { "id": , "name": "Walmart Neighborhood Market" } },
      "kosmix": { "timeslot": , "interval": 900, "retailer": "Walmart"

Update: For each retailer checkin, increment appropriate count.
  Input:
    Walmart , { "checkin": { "created": , "venue": { "id": , "name": "Walmart Neighborhood Market" } },
      "kosmix": { "timeslot": , "interval": 900, "retailer": "Walmart"

Update: For each retailer checkin, increment appropriate count.
  Slate:
    Walmart , { "retailer": "Walmart", "timeslot": , "interval": 900, "count": 1 }

The Source (stream receiver)
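    use JSON qw(decode_json);    # provides decode_json

    # Read one checkin per line from the stream socket.
    while ($checkin = <$sock>) {
        $checkin =~ s/^[^{]*//;          # drop anything before the first "{"
        next if ($checkin eq "");
        $checkin_count++;
        my $event;
        eval { $event = decode_json($checkin); };
        if ($@ or (!defined($event->{checkin}))) {
            $invalid_count++;            # unparseable, or not a checkin
        } else {
            $event = $event->{checkin};
            my $checkin_time = $event->{created};
            my $venue = $event->{venue}->{id};
            # Publish to the FoursquareCheckin stream, keyed by venue id.
            $self->publish("FoursquareCheckin", $event, $venue);
        }
    }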

The Map (Foursquare::CheckinMapper)
    sub map {
        my $self = shift;
        my $event = shift;
        my $checkin = $event->{checkin};

        # Round the checkin time down to the start of its 900-second slot.
        my $timeslot = int($checkin->{created} / 900) * 900;
        $event->{kosmix}->{timeslot} = $timeslot;
        $event->{kosmix}->{interval} = 900;

        # Match the venue name against known retailers; the last match wins.
        my $venue_name = $checkin->{venue}->{name};
        my $retailer = 0;
        $retailer = 'ToysRUs' if ($venue_name =~ /toys.*r.*us/i);
        $retailer = 'Walmart' if ($venue_name =~ /wal.*mart/i);
        $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

        if ($retailer) {
            $event->{kosmix}->{retailer} = $retailer;
            # Key the outgoing event by retailer and timeslot.
            $self->publish("FoursquareRetailerCheckin", $event, $retailer.".".$timeslot);
        }
    }
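The timeslot arithmetic in the mapper is worth a worked example. The Python sketch below mirrors int(created / 900) * 900 and the retailer.timeslot key; the function names are hypothetical and the timestamps are small made-up values chosen so the rounding is easy to check by hand.

```python
INTERVAL = 900  # seconds per slot, as in the mapper above


def timeslot(created: int, interval: int = INTERVAL) -> int:
    """Round a timestamp down to the start of its slot."""
    # Floor division matches Perl's int() truncation for non-negative timestamps.
    return (created // interval) * interval


def event_key(retailer: str, created: int) -> str:
    """Build the retailer.timeslot key used for the published event."""
    return f"{retailer}.{timeslot(created)}"


print(timeslot(1000))               # 900: 1000 falls in the slot [900, 1800)
print(timeslot(1799))               # 900: last second of the same slot
print(timeslot(1800))               # 1800: first second of the next slot
print(event_key("Walmart", 1000))   # Walmart.900
```

Because the timeslot is part of the key, every 900-second interval gets its own slate per retailer, so counts are bucketed by interval rather than accumulated forever.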

The Update (Foursquare::RetailerUpdater)
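    package Foursquare::RetailerUpdater;
    use Muppet::Updater;
    our @ISA = qw( Muppet::Updater );    # inherit from the Muppet updater base class
    use strict;

    sub update {
        my $self = shift;
        my $event = shift;
        my $slate = shift;
        my $config = shift;
        my $key = shift;

        # Copy the identifying fields from the event into the slate.
        $slate->{timeslot} = $event->{kosmix}->{timeslot};
        $slate->{interval} = $event->{kosmix}->{interval};
        $slate->{retailer} = $event->{kosmix}->{retailer};

        # One more checkin for this retailer and timeslot.
        $slate->{count} += 1;
        return $slate;
    }

    1;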

The application configuration (flow graph)
    {
      "performer" : "foursquare_mapper",
      "type" : "perl",
      "class" : "Foursquare::CheckinMapper",
      "muppet_type" : "Mapper",
      "subscribes_to" : [ "FoursquareCheckin" ],
      "publishes_to" : [ "FoursquareRetailerCheckin" ]
    },
    {
      "performer" : "foursquare_retailer",
      "class" : "Foursquare::RetailerUpdater",
      "muppet_type" : "Updater",
      "workers" : 4,
      "slate_cache_max" : 10000,
      "slate_cache_write_after" : 1,
      "subscribes_to" : [ "FoursquareRetailerCheckin" ]
    }
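The flow graph is implied by the stream names: a performer receives whatever is published to a stream it subscribes to. The Python sketch below derives that routing from the two entries shown. Wrapping them in a JSON array is an assumption (the enclosing structure is not shown on the slide), and subscribers is a hypothetical helper, not a Muppet function.

```python
import json
from typing import List

# The two performer entries from the slide, wrapped in an array (assumed).
CONFIG_TEXT = """
[
  { "performer": "foursquare_mapper", "type": "perl",
    "class": "Foursquare::CheckinMapper", "muppet_type": "Mapper",
    "subscribes_to": ["FoursquareCheckin"],
    "publishes_to": ["FoursquareRetailerCheckin"] },
  { "performer": "foursquare_retailer",
    "class": "Foursquare::RetailerUpdater", "muppet_type": "Updater",
    "workers": 4, "slate_cache_max": 10000, "slate_cache_write_after": 1,
    "subscribes_to": ["FoursquareRetailerCheckin"] }
]
"""

performers = json.loads(CONFIG_TEXT)


def subscribers(stream: str) -> List[str]:
    """Names of the performers that receive events published to a stream."""
    return [p["performer"] for p in performers
            if stream in p.get("subscribes_to", [])]


# Streams that some performer publishes to but nobody subscribes to.
unrouted = [stream
            for p in performers
            for stream in p.get("publishes_to", [])
            if not subscribers(stream)]

print(subscribers("FoursquareCheckin"))          # ['foursquare_mapper']
print(subscribers("FoursquareRetailerCheckin"))  # ['foursquare_retailer']
print(unrouted)                                  # []
```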

Example results

Implementation

Implementation
  Slate management
    Slates are cached for performance
    Cache is sharded by key for load distribution across machines
    Slates are written to distributed key-value store for durability
  Event flow
    Event queues buffer transient load spikes within an application
    Host failover remaps load away from an unresponsive machine
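The slate-management and failover points above can be sketched as follows. This Python sketch rests on assumptions rather than Muppet's actual mechanisms: host_for and SlateCache are hypothetical names, a plain dict stands in for the distributed key-value store, eviction is least-recently-used, and failover simply walks to the next live host in the list.

```python
import hashlib
from collections import OrderedDict
from typing import AbstractSet, Any, Dict, List

Slate = Dict[str, Any]


def host_for(key: str, hosts: List[str],
             down: AbstractSet[str] = frozenset()) -> str:
    """Hash the key to a host; if that host is down, walk on to the next live one."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(hosts)
    for offset in range(len(hosts)):
        candidate = hosts[(start + offset) % len(hosts)]
        if candidate not in down:
            return candidate
    raise RuntimeError("no live hosts")


class SlateCache:
    """In-memory slate cache in front of a durable store (a dict stands in here)."""

    def __init__(self, store: Dict[str, Slate], max_slates: int) -> None:
        if max_slates < 1:
            raise ValueError("max_slates must be at least 1")
        self.store = store
        self.max_slates = max_slates
        self.cache: "OrderedDict[str, Slate]" = OrderedDict()

    def get(self, key: str) -> Slate:
        if key in self.cache:
            self.cache.move_to_end(key)            # mark as most recently used
        else:
            self.cache[key] = dict(self.store.get(key, {}))
            if len(self.cache) > self.max_slates:
                old_key, old_slate = self.cache.popitem(last=False)
                self.store[old_key] = old_slate    # write back on eviction
        return self.cache[key]

    def flush(self) -> None:
        """Write every cached slate to the store."""
        for key, slate in self.cache.items():
            self.store[key] = dict(slate)


store: Dict[str, Slate] = {}
cache = SlateCache(store, max_slates=2)
for retailer in ["Walmart", "SamsClub", "ToysRUs"]:
    cache.get(retailer)["count"] = 1
print(store)  # only the evicted slate has reached the store so far
```

With this scheme, only keys that hashed to the failed host move; keys on healthy hosts keep their cached slates where they are.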

Challenges
  Host failover
  Hotspots (uneven load)
  Parallelization
  Slate caching
  Overload stability

Hotspots
  Some key distributions are highly nonuniform (e.g., Zipfian)
    Keys based on natural-language word usage
    Keys based on a set of varying popularity
  Mappers: Run any event anywhere.
  Updaters: Popular keys need access to the same slate.
    Split associative and commutative computations
      Split computation parallelizes partial results.
      Propagate partial results to final result.
    Reduce slate serialization/deserialization overhead
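Splitting works for counting because addition is associative and commutative: partial counts can be kept under several sub-keys and summed later in any order. The Python sketch below illustrates the idea under assumptions; the "#n" sub-key suffix, the round-robin choice of sub-key, and the function names are all hypothetical.

```python
from typing import Dict


def split_key(key: str, sequence: int, ways: int) -> str:
    """Spread events for one hot key over several sub-keys, round-robin."""
    return f"{key}#{sequence % ways}"


def base_key(sub_key: str) -> str:
    """Recover the original key from a sub-key."""
    return sub_key.rsplit("#", 1)[0]


def count_partials(key: str, n_events: int, ways: int) -> Dict[str, int]:
    """Partial updaters: each sub-key keeps its own independent count."""
    partials: Dict[str, int] = {}
    for sequence in range(n_events):
        sub_key = split_key(key, sequence, ways)
        partials[sub_key] = partials.get(sub_key, 0) + 1
    return partials


def merge(partials: Dict[str, int]) -> Dict[str, int]:
    """Final updater: propagate partial counts into one total per original key."""
    totals: Dict[str, int] = {}
    for sub_key, partial in partials.items():
        key = base_key(sub_key)
        totals[key] = totals.get(key, 0) + partial
    return totals


partials = count_partials("Walmart", n_events=10, ways=4)
totals = merge(partials)
print(partials)  # {'Walmart#0': 3, 'Walmart#1': 3, 'Walmart#2': 2, 'Walmart#3': 2}
print(totals)    # {'Walmart': 10}
```

Each sub-key can live on a different worker, so no single slate has to absorb every event for a popular key.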

Usage
  Time: Running since mid-2010
  Developers: More than a dozen developers at WalmartLabs have used Muppet to develop their applications
  Data: Billions of events, tens of millions of slates processed

Related work
  MapReduce
    Work toward incremental batch runs of MapReduce, rather than continuous event processing in a revised framework (e.g., MapUpdate)
    MapReduce Online (Condie et al.)
    Nova (Olston et al.)
  Event-flow systems
    Systems that focus on the dispatch of events, leaving application state and storage (cf. MapUpdate slates) as a problem for the application developer
    S4 (Neumeyer et al.)
    Storm (Marz et al.)
  Streaming-query systems
    Systems that run and optimize queries in a prescribed query language (contrast low-level, general-purpose MapUpdate operators)
    Aurora (StreamBase Systems) (Zdonik et al.)
    SPADE for System S (InfoSphere Streams) (Gedik et al.)

Conclusion
  Big Data : MapReduce :: Fast Data : MapUpdate
  Create soft-real-time applications on a simple programming model.
  Distributed stream-processing infrastructure scales computation across cores.

Muppet Scalable data-stream processing Big Fast
