Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied)
natalia_ponomareva (at) yahoo.com

Hadoop
Hadoop is an open-source distributed platform for data storage and computation that runs on commodity hardware
Adapted from the slides of Donald Miner

HDFS
Works on top of a native file system (for example ext3, xfs, etc.)
Data is organized into files & directories
– Files are divided into blocks (64-128 MB)
– Files are distributed across cluster nodes
– Files are write-once
– The location of blocks can be used to optimize the Map/Reduce execution
– Blocks are replicated for fault tolerance
– Data integrity is ensured via checksums
HDFS is not good for random reads
HDFS is optimized for streaming reads of files
HDFS is based on the design of the Google File System
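The block, replication and checksum ideas above can be sketched in a few lines of Python. This is a toy model, not the HDFS implementation: the 8-byte block size, the node names, the round-robin placement and the use of CRC32 from `zlib` are all assumptions made for illustration.

```python
import zlib

BLOCK_SIZE = 8      # bytes; tiny on purpose (real HDFS blocks are 64-128 MB)
REPLICATION = 3     # hypothetical replication factor for this sketch
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster nodes


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Record a checksum and `replication` distinct nodes for every block."""
    placement = []
    for index, block in enumerate(blocks):
        # Round-robin placement: an assumption, not the real HDFS policy.
        replicas = [nodes[(index + r) % len(nodes)] for r in range(replication)]
        placement.append(
            {"block": block, "checksum": zlib.crc32(block), "nodes": replicas}
        )
    return placement


def is_intact(entry):
    """Recompute the checksum and compare it with the stored one."""
    return zlib.crc32(entry["block"]) == entry["checksum"]


data = b"to be or not to be, that is the question"
placement = place_blocks(split_into_blocks(data))
```

Because each block lives on several nodes, a reader that finds a failed checksum on one replica can fall back to another copy.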

Map/Reduce Paradigm
Jobs are described in terms of Mappers and Reducers
Mappers receive input records and emit key/value pairs
Pairs from mappers are automatically
– Grouped by key
– Sorted for each reducer
Reducers receive each key with its grouped values and emit the key/value result(s)
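The flow above (map, group by key, sort, reduce) can be imitated on a single machine. The sketch below is not Hadoop code; `run_mapreduce`, `length_mapper` and `words_reducer` are hypothetical names, and the job (grouping words by their length) is chosen only to show the mechanics.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Single-process imitation of the map -> group/sort -> reduce flow."""
    grouped = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            grouped[key].append(value)  # shuffle: values are grouped by key
    results = []
    for key in sorted(grouped):  # keys reach the reducer in sorted order
        results.extend(reducer(key, grouped[key]))
    return results


def length_mapper(line):
    """Emit (word length, word) for every word in the line."""
    for word in line.split():
        yield len(word), word


def words_reducer(length, words):
    """Emit one (length, distinct words of that length) pair per key."""
    yield length, sorted(set(words))


results = run_mapreduce(["to be or not", "to be"], length_mapper, words_reducer)
```

The job author writes only the mapper and the reducer; the grouping and sorting in between is what the framework does automatically.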

Example 1: word count

Mapper class

Reducer class

Distribute the documents among K computers
Map: for each doc, return a set of (word, frequency) pairs
Reduce: count the occurrences of each word
Example
– Input: To be or not to be ….
– Each Map emits: (to,1), (to,1), (be,1), (be,1), (or,1), ….
– Each Reduce receives the grouped pairs: (to,1,1,..) or (be,1,1), (come,1,1,1), …
– Output: To: 180, Be: 251, Come: 123
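The Mapper and Reducer classes from the original slides were not preserved in this transcript. As a stand-in, here is a minimal single-machine sketch of the same word count in Python; the function names and the two sample documents are assumptions, and a real Hadoop job would be written against the Hadoop Java API instead.

```python
from collections import defaultdict


def map_words(document):
    """Mapper: emit (word, 1) for every word in the document."""
    for raw in document.lower().split():
        word = raw.strip(".,!?;:")
        if word:
            yield word, 1


def reduce_counts(word, ones):
    """Reducer: sum the 1s that were grouped under one word."""
    return word, sum(ones)


def word_count(documents):
    grouped = defaultdict(list)
    for document in documents:  # on a cluster, documents are spread over K machines
        for word, one in map_words(document):
            grouped[word].append(one)
    return dict(reduce_counts(word, ones) for word, ones in sorted(grouped.items()))


counts = word_count(["To be or not to be", "To come or not"])
```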

Example 2: inner join
From “MapReduce Design Patterns” by Donald Miner and Adam Shook

Mapper class: user records

Mapper class: comment records

Reducer: The actual join logic
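The Java classes for this example were not preserved in the transcript. The sketch below shows the general shape of a reduce-side inner join in Python, assuming both mappers key their output by `userID` and tag each record with its source ("U" for users, "C" for comments). The tags, field names and sample records are hypothetical, not taken from the book.

```python
from collections import defaultdict


def map_user(user):
    """Mapper for user records: key by userID, tag the record as 'U'."""
    yield user["userID"], ("U", user)


def map_comment(comment):
    """Mapper for comment records: key by userID, tag the record as 'C'."""
    yield comment["userID"], ("C", comment)


def reduce_inner_join(user_id, tagged_records):
    """Reducer: pair every user with every comment that shares the key."""
    users = [rec for tag, rec in tagged_records if tag == "U"]
    comments = [rec for tag, rec in tagged_records if tag == "C"]
    for user in users:  # inner join: nothing is emitted if either side is empty
        for comment in comments:
            yield {"userID": user_id, "name": user["name"], "text": comment["text"]}


def inner_join(users, comments):
    grouped = defaultdict(list)
    for user in users:
        for key, value in map_user(user):
            grouped[key].append(value)
    for comment in comments:
        for key, value in map_comment(comment):
            grouped[key].append(value)
    joined = []
    for key in sorted(grouped):
        joined.extend(reduce_inner_join(key, grouped[key]))
    return joined


users = [{"userID": 1, "name": "ann"}, {"userID": 2, "name": "bob"}]
comments = [
    {"userID": 1, "text": "hi"},
    {"userID": 1, "text": "bye"},
    {"userID": 3, "text": "orphan"},
]
joined = inner_join(users, comments)
```

User 2 has no comments and the comment from user 3 has no matching user, so neither appears in the result.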

Cool things about Hadoop
No schema imposed: decide what you want when loading
– Keep the full original data!
– Store anything: media, text, logs
Transparent parallelism and network programming
Fault tolerance
– Blocks are replicated
– Only active nodes get assigned to jobs
– Map/Reduce handles slow mappers: a duplicate of a slow-running mapper is launched automatically, and the result of whichever copy finishes first is used
Scalability
Check it out http://developer.yahoo.com/hadoop/tutorial/module2.html
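The handling of slow mappers can be illustrated with threads: run two attempts of the same task and keep whichever finishes first. This is only a sketch of the idea using Python's `concurrent.futures`, not how Hadoop schedules tasks; the delays and the function names are made up.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def mapper_attempt(delay_seconds):
    """A mapper whose speed depends on the (simulated) node it runs on."""
    time.sleep(delay_seconds)
    return len("to be or not to be".split())


def run_speculatively(task, attempt_delays):
    """Start one attempt per delay and return the first result that arrives."""
    with ThreadPoolExecutor(max_workers=len(attempt_delays)) as pool:
        attempts = [pool.submit(task, delay) for delay in attempt_delays]
        finished, _still_running = wait(attempts, return_when=FIRST_COMPLETED)
        return next(iter(finished)).result()


# One slow attempt and one fast duplicate; both compute the same answer.
result = run_speculatively(mapper_attempt, [0.2, 0.01])
```

This works because mappers are deterministic over their input split, so it does not matter which copy wins.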

Hadoop ecosystem
Higher-level languages like Pig and Hive
Cascalog

Pig
Pig is a SQL-like query language that computes using MapReduce jobs
It is higher-level than Map/Reduce: FOREACH, GROUP BY, JOIN, DISTINCT, FILTER, etc.
Custom loaders and storage functions
Reads both structured and unstructured data
It is a data flow language

Why use Pig
Easier to adopt for non-Java programmers
Runs without a compilation step
Faster to write (not necessarily faster to execute)
Word count example
    A = load './input.txt';
    B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
    C = group B by word;
    D = foreach C generate COUNT(B), group;
    store D into './wordcount';
Join example
    A = JOIN comments BY userID, users BY userID;
Built-in functions: count, group by, joins, filter
Built-in optimizations of execution
Can still use map/reduce from Pig (use the MAPREDUCE keyword)
Very good for quick analytics

Pig drawbacks
Might be clumsy to write tests for (but usually you don’t need tests for one-off analytics)
– But cool for development: use Hawk!
– You can’t do everything (for example, ifs)
Pig is not good for
– Advanced string manipulations (can use UDFs)
– Complex joins
– Math
– Complex aggregates
– Iterative algorithms
But the majority can be addressed with UDFs
Hard to reuse code (macros have limited functionality)

Pig UDF
    REGISTER mylibrary.jar;
    DEFINE ToUpperCase com.mine.pig.udf.ToUpperCase();
    A = LOAD 'words_data' AS (word: chararray, position: int);
    B = FOREACH A GENERATE ToUpperCase(word);

Cascalog
Cascalog: a compiler that produces sequences of Map/Reduce programs
Clojure-based (functional programming language)
Compiles to Java byte code => can directly access all your Java-based code
Granular testing and mocking
Runs directly on Hadoop and EMR
Wide variety of built-in functionality
– Inner and outer joins
– Aggregators
– Functions
– Subqueries
– Sorting
High performance
Check it out
https://github.com/nathanmarz/cascalog/wiki
http://www.slideshare.net/nathanmarz/cascalog
http://www.slideshare.net/nathanmarz/cascalog-at-strange-loop

Examples
Example 1: Clojure
    (+ 1 2 3)
    (* 3 5)
Check it out http://www.slideshare.net/nathanmarz/cascalog-at-strange-loop

More examples
Inner join
    user=> (?<- (stdout) [?person ?age ?gender] (age ?person ?age) (gender ?person ?gender))
Full outer join
    user=> (?<- (stdout) [?person !!age !!gender] (age ?person !!age) (gender ?person !!gender))
Count of followers
    user=> (?<- (stdout) [?person ?count] (person ?person) (follows ?person !!follower) (c/!count !!follower :> ?count))
The numbers that equal their squares
    user=> (?<- (stdout) [?n] (integer ?n) (* ?n ?n :> ?n))
Cascalog detects that we are trying to rebind the ?n variable and automatically filters out tuples where the output of the * predicate is not equal to the input.
Check it out http://nathanmarz.com/blog/new-cascalog-features-outer-joins-combiners-sorting-and-more.html
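For readers who do not know Clojure, the semantics of three of these queries can be mirrored in plain Python. This is only an analogy, assuming that `?` variables must be bound (non-null) and `!!` variables may be null; the `age` and `gender` rows below are hypothetical sample data, not the dataset used in the talk.

```python
# Hypothetical sample relations: (person, age) and (person, gender).
age = [("alice", 28), ("bob", 33), ("chris", 40)]
gender = [("alice", "f"), ("bob", "m"), ("emily", "f")]


def inner_join(left, right):
    """Keep only people present in both relations (the ?age ?gender query)."""
    right_by_person = dict(right)
    return [(p, a, right_by_person[p]) for p, a in left if p in right_by_person]


def full_outer_join(left, right):
    """Keep everyone; a missing side becomes None (the !!age !!gender query)."""
    left_by_person, right_by_person = dict(left), dict(right)
    people = sorted(set(left_by_person) | set(right_by_person))
    return [(p, left_by_person.get(p), right_by_person.get(p)) for p in people]


def equal_to_own_square(numbers):
    """Rebinding ?n to the output of (* ?n ?n) acts as a filter n * n == n."""
    return [n for n in numbers if n * n == n]
```

With this data, the inner join returns alice and bob only, while the full outer join also returns chris (no gender) and emily (no age).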

What’s hot in the Big Data arena in New York
Etsy
Foursquare
Spotify
Knewton
IntentMedia

Etsy’s Skyline
Etsy: the world’s largest handmade and vintage marketplace
Practices continuous deployment (30-60 deploys per day)
Optimized for recovering from failure, rather than avoiding it
A bunch of metrics (250K) are output and routed to failure detection software: Skyline
Kind of real time: approx. 70 seconds lag
Runs on a 150-node Hadoop cluster
Check it out http://g33ktalk.com/etsy-a-deep-dive-into-monitoring-with-skyline/

Skyline: continued
Anomalies are detected through a consensus model over several detectors
– A metric is anomalous if its latest value is more than 3 standard deviations above its moving average (statistical process control)
– By histogram
– By linear regression (distribution of residuals)
– Exponentially weighted moving averages (time series with decay factor)
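The consensus idea can be sketched as a vote among simple detectors. The code below is not Skyline's implementation: the three detectors, their thresholds, the `alpha` decay factor, the bin count and the "2 out of 3" rule are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def three_sigma(series):
    """Latest value is more than 3 standard deviations above the mean."""
    history, latest = series[:-1], series[-1]
    return latest > mean(history) + 3 * stdev(history)


def ewma_three_sigma(series, alpha=0.3):
    """Same test, but against an exponentially weighted moving average."""
    history, latest = series[:-1], series[-1]
    average = history[0]
    for value in history[1:]:
        average = alpha * value + (1 - alpha) * average
    return latest > average + 3 * stdev(history)


def histogram_rare_bin(series, bins=10, min_count=2):
    """Latest value falls outside the history or into a sparsely filled bin."""
    history, latest = series[:-1], series[-1]
    low, high = min(history), max(history)
    if latest < low or latest > high:
        return True
    width = (high - low) / bins or 1.0

    def bin_of(value):
        return min(int((value - low) / width), bins - 1)

    in_same_bin = sum(1 for value in history if bin_of(value) == bin_of(latest))
    return in_same_bin < min_count


DETECTORS = (three_sigma, ewma_three_sigma, histogram_rare_bin)


def is_anomalous(series, detectors=DETECTORS, consensus=2):
    """Flag the latest point only if enough detectors agree."""
    votes = sum(1 for detector in detectors if detector(series))
    return votes >= consensus
```

Requiring agreement between detectors is a way to cut false alarms from any single test, at the cost of missing anomalies that only one detector can see.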

Skyline: continued
Problems
– Seasonality
– Spike influence (a spike raises the moving average)
– Normality
– Parameters
– As of now, generates too much noise

Spotify
Swedish company that lets users search for songs and play them on demand
20m tracks, 20K more are added per day
Runs on a 700-node Hadoop cluster
Trying to
– Recommend music to users
– Provide intelligent search functionality
Recommendations
– Precomputed overnight
– Collaborative-filtering type
– Use signals like the time the user started streaming the track, when she stopped, IP address location; no rating info (can use the number of streams)
– Build vectors (fingerprints) of users and tracks
– Use cosine similarity to find the top-scoring recommendations
Algos: matrix factorization, probabilistic latent semantic filtering, k-nearest neighbors to narrow down the potential candidates for recommendations
Problems: new users and new tracks
Check it out http://vimeo.com/71889190
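Scoring tracks against a user vector with cosine similarity can be sketched as follows. The two-dimensional vectors and the track names are invented for illustration; how Spotify actually builds its user and track vectors is not described in the slides beyond the list of algorithms.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0


def recommend(user_vector, track_vectors, already_streamed, k=2):
    """Return the k unseen tracks whose vectors are closest to the user's."""
    scored = [
        (cosine(user_vector, vector), track)
        for track, vector in track_vectors.items()
        if track not in already_streamed
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [track for _score, track in scored[:k]]


# Hypothetical fingerprints in a 2-dimensional latent space.
tracks = {
    "track_a": [1.0, 0.0],
    "track_b": [0.9, 0.1],
    "track_c": [0.0, 1.0],
    "track_d": [0.5, 0.5],
}
user = [1.0, 0.2]
top = recommend(user, tracks, already_streamed={"track_a"})
```

A brand-new user or track has no vector yet, which is the cold-start problem listed on the slide.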

Foursquare
Mobile app that lets users explore the city and connect with friends
Utilizes location data
Based on people checking in at restaurants, events, etc.
– 30m people
– 50m places
– 3.5b check-ins
– 5m check-ins per day
Use big data for
– Place recommendation: how to influence users to go to some place
– Place matching (where the user is checking in from)
Algos: ensemble of simple models, Naïve Bayes, linear models, random forests, Gaussian mixture combined with personal history and friends’ history
Check it out http://vimeo.com/71889190

Foursquare
Spatial models: they compile Gaussian mixture models, e.g. what is the probability of being at this place given the info received from the phone
Sentiment detection based on users’ reviews (Naïve Bayes)
Collaborative filtering, Amazon style: people who like this also like that
Real-time place recommendations based on
– Location
– Time of day
– Personal check-in history
– Friends’ preferences
– Venue similarities
– Aggregate historical data
– Familiarity
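The spatial question "what is the probability of being at this place given the phone's location" can be sketched with one Gaussian per venue. This is a simplified illustration, not Foursquare's model: the venue names, coordinates, spreads and priors are hypothetical, and each venue is reduced to a single isotropic 2D Gaussian.

```python
import math

# Hypothetical venues: name -> (center_x, center_y, spread_sd, prior popularity).
VENUES = {
    "coffee_shop": (0.0, 0.0, 20.0, 0.6),
    "book_store": (50.0, 0.0, 20.0, 0.4),
}


def gaussian_2d(x, y, center_x, center_y, sd):
    """Density of an isotropic 2D Gaussian at the point (x, y)."""
    squared_distance = (x - center_x) ** 2 + (y - center_y) ** 2
    return math.exp(-squared_distance / (2 * sd * sd)) / (2 * math.pi * sd * sd)


def venue_posterior(x, y, venues=VENUES):
    """P(venue | reported location), via Bayes' rule over all venues."""
    weighted = {
        name: prior * gaussian_2d(x, y, cx, cy, sd)
        for name, (cx, cy, sd, prior) in venues.items()
    }
    total = sum(weighted.values())
    return {name: weight / total for name, weight in weighted.items()}


near_coffee = venue_posterior(5.0, 0.0)
near_books = venue_posterior(45.0, 0.0)
```

The prior is where personal and friends' check-in history could enter: a venue the user visits often would get a larger prior weight.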

Knewton
Adaptive learning platform
Real-time recommendations tailored for a student
– Trying to determine what the student should work on next and how to learn it (depending on the learning style: visual, geometric approach, etc.)
Their big clients: Arizona State University and University of Alabama
Models engagement, boredom, frustration, proficiency, the extent to which a student knows or doesn’t know a particular topic
Algos: Item Response Theory model (estimates the probability that a student is able to do something based on an answer to a particular question)
Signals: click stream history (did they check the review page? Did they check the hint? How long did it take them to answer? Did they change their mind when answering a question?)
Runs on Amazon Web Services
Check it out http://www.knewton.com/
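Item Response Theory can be made concrete with the common logistic form, in which the chance of a correct answer depends on the gap between student ability and question difficulty. The slides do not say which IRT variant Knewton uses, so the two-parameter logistic model, the grid search and the sample responses below are assumptions.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """Logistic IRT model: probability that the student answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def log_likelihood(ability, responses):
    """Log-probability of the observed (difficulty, was_correct) responses."""
    total = 0.0
    for difficulty, was_correct in responses:
        p = p_correct(ability, difficulty)
        total += math.log(p if was_correct else 1.0 - p)
    return total


def estimate_ability(responses):
    """Pick the ability on a coarse grid (-4.0 .. 4.0) that best explains the answers."""
    grid = [step / 10 for step in range(-40, 41)]
    return max(grid, key=lambda ability: log_likelihood(ability, responses))


# Hypothetical student: right on the two easier items, wrong on the two harder.
responses = [(-1.0, True), (0.0, True), (1.0, False), (2.0, False)]
ability = estimate_ability(responses)
```

Once an ability estimate exists, `p_correct` can rank candidate questions, for example by preferring items the student has a moderate chance of getting right.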

IntentMedia
End-to-end solution for e-commerce sites seeking to monetize their website traffic through advertising while still protecting conversions
Online travel agencies convert perhaps 3% to 5% of site visitors
– IntentMedia can help sites monetize the rest of the visitors
Combines consumer-intent data with Intent Media predictive analysis to serve up competitors’ ads to consumers who are deemed unlikely to convert on the initial publisher’s site
Runs on: Amazon Web Services, uses Pig, Cascalog, Hadoop
Largest job: 25m records, 440 feature signals
Check it out http://intentmedia.com

Q&A? Thanks!
