1
BIG DATA ANALYTICS Maryam Amir Haeri
2
2 The origins of these slides are as follows: http://www.mmds.org/ http://occc.ir/presentations/ http://web.cs.wpi.edu/~cs561/s14/Lectures/
3
Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value and knowledge from them 3
4
Introduction to Big Data What is Big Data? What makes data “Big” Data? 4
5
Big Data Definition No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… 5
6
Characteristics of Big Data: 1-Scale (Volume) Data volume: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB. Data volume is increasing exponentially. 6
7
Characteristics of Big Data: 2-Complexity (Variety) Various formats, types, and structures: text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc. Static data vs. streaming data A single application can be generating/collecting many types of data 7 To extract knowledge, all these types of data need to be linked together
8
Characteristics of Big Data: 3-Speed (Velocity) Data is being generated fast and needs to be processed fast Online Data Analytics Late decisions mean missing opportunities Examples E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you Healthcare monitoring: sensors monitoring your activities and body; any abnormal measurements require immediate reaction 8
9
Some Make it 4V’s 9
10
Veracity 10 Veracity refers to the messiness or trustworthiness of data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos, and colloquial speech, as well as the reliability and accuracy of content), but technology now allows us to work with this type of data.
11
Value 11 There is another V to take into account when looking at big data: Value. Having access to big data is no good unless we can turn it into value. Companies are starting to generate amazing value from their big data.
12
Who’s Generating Big Data Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data) Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion 12
13
The Model Has Changed… The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data 13
14
Challenges in Handling Big Data The Bottleneck is in technology New architecture, algorithms, techniques are needed Also in technical skills Experts in using the new technology and dealing with big data 14
15
15 Data contains value and knowledge
16
Data Mining 16 But to extract the knowledge, data needs to be stored, managed, and ANALYZED (the focus of this presentation) Data Mining ≈ Big Data Analytics ≈ Predictive Analytics ≈ Data Science
17
Good news: Demand for Data Mining J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17
18
What is Data Mining? J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 18 Given lots of data Discover patterns and models that are: Valid: hold on new data with some certainty Useful: should be possible to act on the item Unexpected: non-obvious to the system Understandable: humans should be able to interpret the pattern
19
Data Mining Tasks J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19 Descriptive methods Find human-interpretable patterns that describe the data Example: Clustering Predictive methods Use some variables to predict unknown or future values of other variables Example: Recommender systems
20
What matters when dealing with data? J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20 Scalability Streaming Context Quality Usage
21
What is Data Mining J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 21 Non-trivial extraction of implicit, previously unknown and useful information from data
22
Mining Large Data Sets – Motivation There is often information “hidden” in the data that is not readily evident Human analysts take weeks to discover useful information Much of the data is never analyzed at all 23
23
Data Streams What are Data Streams? Continuous streams Huge, Fast, and Changing Why Data Streams? The arrival speed of streams and the huge amount of data are beyond our capability to store them. “Real-time” processing Window Models Landmark window (entire data stream) Sliding window Damped window Mining Data Streams 23
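As a rough illustration of the sliding-window model above, here is a minimal Python sketch (not from the original slides): it keeps only the most recent window_size items of a stream and reports a running mean over them. The stream values and the window size are made up for the example.

```python
from collections import deque

def sliding_window_mean(stream, window_size=100):
    """Running mean over only the most recent `window_size` stream items.

    A minimal sketch of the sliding-window model: old items fall out of the
    window instead of being stored forever.
    """
    window = deque()
    running_sum = 0.0
    for x in stream:
        window.append(x)
        running_sum += x
        if len(window) > window_size:
            running_sum -= window.popleft()  # evict the oldest element
        yield running_sum / len(window)

# Example: average of the last 3 readings of a toy sensor stream
print(list(sliding_window_mean([1, 2, 3, 10, 10, 10], window_size=3)))
```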
24
Important Data Mining Tasks Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive]
25
Big Data Analytics 25 Web mining Graph mining Clustering documents and users Classification Document duplications Spam detection Advertising Social network analysis Fraud detection Cognitive science Collaborative filtering
26
Collaborative Filtering J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26 Given a database of user preferences, predict the preferences of a new user Example: predict what new movies you will like based on your past preferences, others with similar past preferences, and their preferences for the new movies Example: predict what books/CDs a person may want to buy (and suggest them, or give discounts to tempt the customer)
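To make the idea concrete, here is a minimal sketch of user-based collaborative filtering (not from the slides): an unseen rating is predicted as a similarity-weighted average of ratings by users with similar past preferences. The rating matrix, the user/item indices, and the choice of cosine similarity are assumptions made purely for illustration.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: movies); 0 means unrated.
# The numbers and indices are invented purely for illustration.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def predict(ratings, user, item):
    """Predict `user`'s rating of `item` from users with similar past preferences."""
    target = ratings[user]
    weighted_sum, total_weight = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue                      # skip self and users who never rated `item`
        co_rated = (target > 0) & (ratings[other] > 0)
        if not co_rated.any():
            continue
        # Cosine similarity on the items both users rated
        sim = np.dot(target[co_rated], ratings[other][co_rated]) / (
            np.linalg.norm(target[co_rated]) * np.linalg.norm(ratings[other][co_rated]))
        weighted_sum += sim * ratings[other, item]
        total_weight += sim
    return weighted_sum / total_weight if total_weight else None

print(predict(ratings, user=0, item=2))   # user 0 mostly resembles user 1, who disliked item 2
```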
27
Anomaly Detection J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27 Detect significant deviations from normal behavior Applications: - Credit Card Fraud Detection - Network Intrusion Detection
28
Association Rule Discovery Supermarket shelf management: - Goal: identify items that are bought together by sufficiently many customers. - Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items. - A classic rule: if a customer buys diapers and milk, then he is likely to buy Coke. So, don’t be surprised if you find six-packs stacked next to diapers!
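As a small worked example (the baskets below are invented, not from the slides), the rule {diaper, milk} → {coke} can be scored by its support and confidence over point-of-sale baskets:

```python
# Toy point-of-sale baskets, invented purely to illustrate support and confidence
baskets = [
    {"diaper", "milk", "coke"},
    {"diaper", "milk", "coke", "bread"},
    {"diaper", "milk", "beer"},
    {"milk", "bread"},
    {"coke", "bread"},
]

def support(itemset):
    """Fraction of baskets containing every item of `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Rule: {diaper, milk} -> {coke}
antecedent, consequent = {"diaper", "milk"}, {"coke"}
rule_support = support(antecedent | consequent)
rule_confidence = rule_support / support(antecedent)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")  # 0.40 and 0.67
```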
29
Statistical Limits on Data Mining 29 Total Information Awareness In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity. The concern raised by many is that if you look at so much data, and you try to find within it activities that look like terrorist behavior, are you not going to find many innocent activities – or even illicit activities that are not terrorism – that will result in visits from the police and maybe worse than just a visit?
30
Meaningfulness of Analytic Answers 30 A risk with “Data mining” is that an analyst can “discover” patterns that are meaningless Statisticians call it Bonferroni’s principle: Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap
31
Meaningfulness of Analytic Answers 31 Example: We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day 10^9 people being tracked 1,000 days Each person stays in a hotel 1% of the time (1 day out of 100) Hotels hold 100 people (so 10^5 hotels) If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious? Expected number of “suspicious” pairs of people: 250,000 … too many combinations to check; we need to have some additional evidence to find “suspicious” pairs of people in some more efficient way
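The estimate on the slide can be reproduced with a short back-of-the-envelope calculation; the script below (an illustration, not part of the original slides) just multiplies the probabilities and counts stated above.

```python
# Back-of-the-envelope check of the "suspicious pairs" estimate above
people = 1e9            # 10^9 people being tracked
days = 1_000            # observation period
p_hotel = 0.01          # each person is in some hotel on 1% of days
hotels = 1e5            # 10^5 hotels

# Probability that two given people visit the SAME hotel on a given day
p_same_hotel_same_day = p_hotel * p_hotel / hotels        # = 10^-9

# They look "suspicious" if this happens on (at least) two different days
p_two_days = p_same_hotel_same_day ** 2                   # = 10^-18
people_pairs = people * (people - 1) / 2                  # ~ 5 * 10^17
day_pairs = days * (days - 1) / 2                         # ~ 5 * 10^5

expected_suspicious_pairs = people_pairs * day_pairs * p_two_days
print(f"{expected_suspicious_pairs:,.0f}")                # roughly 250,000
```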
32
Processing Big Data 32 Sampling Map-Reduce Hashing Extracting Statistical Properties Approximation
33
Large-scale Computing 33 Large-scale computing for data mining problems on commodity hardware Challenges: How do you distribute computation? How can we make it easy to write distributed programs? Machines fail: one server may stay up 3 years (1,000 days) If you have 1,000 servers, expect to lose one per day People estimated Google had ~1M machines in 2011, so about 1,000 machines fail every day!
34
Idea and Solution 34 Issue: Copying data over a network takes time Idea: Bring computation close to the data Store files multiple times for reliability Map-reduce Google’s computational/data manipulation model Elegant way to work with big data Storage Infrastructure – File system Google: GFS. Hadoop: HDFS Programming model Map-Reduce
35
Storage Infrastructure J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 35 Problem: If nodes fail, how to store data persistently? Answer: Distributed File System: Provides global file namespace Google GFS; Hadoop HDFS; Typical usage pattern Huge files (100s of GB to TB) Data is rarely updated in place Reads and appends are common
36
Distributed File System J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 36 Reliable distributed file system Data kept in “chunks” spread across machines Each chunk replicated on different machines Seamless recovery from disk or machine failure [Diagram: chunks C0, C1, C2, C3, C5 and D0, D1 replicated across chunk servers 1 through N]
37
Programming Model: MapReduce J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37 Warm-up task: We have a huge text document Count the number of times each distinct word appears in the file Sample application: Analyze web server logs to find popular URLs
38
MapReduce: Overview 38 Sequentially read a lot of data Map: Extract something you care about Group by key: Sort and Shuffle Reduce: Aggregate, summarize, filter or transform Write the result Outline stays the same, Map and Reduce change to fit the problem
39
MapReduce: The Reduce Step 39 [Diagram: intermediate key-value pairs are grouped by key into key-value groups; reduce turns each group into output key-value pairs]
40
MapReduce: Word Counting Big document: “The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need …'” MAP: read the input and produce a set of key-value pairs, e.g. (The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) … Group by key: collect all pairs with the same key, e.g. (crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) … Reduce: collect all values belonging to the key and output (key, value), e.g. (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) … Map and Reduce are provided by the programmer; the data is only read sequentially 40
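Below is a minimal in-memory Python sketch of the word-counting pipeline just described: a map phase emitting (word, 1) pairs, a group-by-key step, and a reduce phase summing the counts. It only illustrates the programming model, not the Hadoop or Google MapReduce API; the sample sentence is taken from the slide's example document.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # MAP: emit a (word, 1) pair for every word in the input
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # REDUCE: sum all counts collected for one word
    return (word, sum(counts))

document = "the crew of the space shuttle Endeavor recently returned to Earth"

# Group by key: sort the intermediate pairs so equal keys become adjacent (the shuffle)
pairs = sorted(map_phase(document), key=itemgetter(0))
word_counts = [reduce_phase(word, (count for _, count in group))
               for word, group in groupby(pairs, key=itemgetter(0))]
print(word_counts)   # e.g. ('the', 2) with its total count
```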
41
HASHING 41
42
10 nearest neighbors from a collection of 20,000 images Scene Completion Problem 42 [Hays and Efros, SIGGRAPH 2007]
43
10 nearest neighbors from a collection of 2 million images Scene Completion Problem 43 [Hays and Efros, SIGGRAPH 2007]
44
A Common Metaphor Many problems can be expressed as finding “similar” sets: Find near-neighbors in high-dimensional space Examples: Pages with similar words For duplicate detection, classification by topic Customers who purchased similar products Products with similar customer sets Images with similar features Users who visited similar websites 44
45
Problem for Today’s Lecture 45
46
Min-Hashing Goal: find a hash function h(·) such that: if sim(C1, C2) is high, then with high probability h(C1) = h(C2); if sim(C1, C2) is low, then with high probability h(C1) ≠ h(C2) Clearly, the hash function depends on the similarity metric: not all similarity metrics have a suitable hash function There is a suitable hash function for the Jaccard similarity: it is called Min-Hashing 46
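A minimal sketch of Min-Hashing (an illustration, not the exact construction on the slides): each set is summarized by the minimum value of several random hash functions, and the fraction of matching signature positions estimates the Jaccard similarity. The hash family, the prime modulus, and the toy sets below are assumptions chosen for the example.

```python
import random

def make_hash_funcs(n, prime=2_147_483_647, seed=42):
    """n random hash functions of the form h(x) = (a*x + b) mod prime."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n)]

def minhash_signature(item_set, hash_funcs, prime=2_147_483_647):
    # For each hash function, keep the minimum hash value over the set's elements
    return [min((a * x + b) % prime for x in item_set) for a, b in hash_funcs]

def estimated_jaccard(sig1, sig2):
    # The fraction of positions where two signatures agree estimates Jaccard similarity
    return sum(v1 == v2 for v1, v2 in zip(sig1, sig2)) / len(sig1)

hash_funcs = make_hash_funcs(200)
A = set(range(0, 80))      # toy "documents" represented as sets of shingle IDs
B = set(range(20, 100))    # true Jaccard similarity = 60 / 100 = 0.6
print(estimated_jaccard(minhash_signature(A, hash_funcs),
                        minhash_signature(B, hash_funcs)))   # should be close to 0.6
```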
47
USING STATISTICAL FEATURES 47
48
k–means Algorithm(s) Assumes Euclidean space/distance Start by picking k, the number of clusters Initialize clusters by picking one point per cluster Example: Pick one point at random, then k-1 other points, each as far away as possible from the previous points 48
49
Populating Clusters 1) For each point, place it in the cluster whose current centroid is nearest 2) After all points are assigned, update the locations of the centroids of the k clusters 3) Reassign all points to their closest centroid Sometimes this moves points between clusters Repeat 2 and 3 until convergence Convergence: points don’t move between clusters and centroids stabilize 49
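The assign/update loop above is easy to sketch directly; the NumPy code below (an illustration, not tied to the slides) initializes centroids with k random points rather than the farthest-point heuristic mentioned earlier, and repeats the two steps until no point changes cluster.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means: assign every point to its nearest centroid, then move each
    centroid to the mean of its assigned points; repeat until assignments stabilize."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignment = None
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                   # converged: no point changed cluster
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members
        for j in range(k):
            members = points[assignment == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assignment

points = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
print(kmeans(points, k=2))
```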
50
Example: Assigning Clusters 50 [Figure: data points and centroids (x); clusters after round 1]
51
Example: Assigning Clusters J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 51 [Figure: data points and centroids (x); clusters after round 2]
52
Example: Assigning Clusters 52 [Figure: data points and centroids (x); clusters at the end]
53
BFR Algorithm BFR [Bradley-Fayyad-Reina] is a variant of k-means designed to handle very large (disk-resident) data sets Assumes that clusters are normally distributed around a centroid in a Euclidean space Standard deviations in different dimensions may vary Clusters are axis-aligned ellipses Efficient way to summarize clusters (want memory required O(clusters) and not O(data)) 53
54
BFR Algorithm Points are read from disk one main-memory-full at a time Most points from previous memory loads are summarized by simple statistics To begin, from the initial load we select the initial k centroids by some sensible approach: Take k random points Take a small random sample and cluster optimally Take a sample; pick a random point, and then k–1 more points, each as far from the previously selected points as possible 54
55
Three Classes of Points 3 sets of points which we keep track of: Discard set (DS): Points close enough to a centroid to be summarized Compression set (CS): Groups of points that are close together but not close to any existing centroid These points are summarized, but not assigned to a cluster Retained set (RS): Isolated points waiting to be assigned to a compression set 55
56
BFR: “Galaxies” Picture 56 [Figure: a cluster whose points are in the DS, with its centroid; compressed sets whose points are in the CS; isolated points in the RS] Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
57
Summarizing Sets of Points For each cluster, the discard set (DS) is summarized by: The number of points, N The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension The vector SUMSQ: i-th component = sum of squares of coordinates in the i-th dimension 57
58
Summarizing Points: Comments 2d + 1 values represent any size cluster d = number of dimensions Average in each dimension (the centroid) can be calculated as SUM_i / N SUM_i = i-th component of SUM Variance of a cluster’s discard set in dimension i is: (SUMSQ_i / N) − (SUM_i / N)^2 And standard deviation is the square root of that Next step: actual clustering 58 Note: dropping the “axis-aligned” clusters assumption would require storing the full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dimensional vector, it would be a d x d matrix, which is too big!
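The formulas above are easy to check on a toy cluster; in the sketch below (invented data, not from the slides) the centroid, variance, and standard deviation computed from N, SUM, and SUMSQ match those computed directly from the raw points.

```python
import numpy as np

# Toy 2-D cluster, invented for illustration
cluster_points = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

# The 2d + 1 numbers that summarize the cluster's discard set
N = len(cluster_points)
SUM = cluster_points.sum(axis=0)              # per-dimension sums
SUMSQ = (cluster_points ** 2).sum(axis=0)     # per-dimension sums of squares

centroid = SUM / N                            # SUM_i / N
variance = SUMSQ / N - (SUM / N) ** 2         # (SUMSQ_i / N) - (SUM_i / N)^2
std_dev = np.sqrt(variance)

print(centroid, variance, std_dev)
# Matches the values computed directly from the raw points:
print(cluster_points.mean(axis=0), cluster_points.var(axis=0))
```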
59
The “Memory-Load” of Points Processing the “Memory-Load” of points (1): 1) Find those points that are “sufficiently close” to a cluster centroid and add those points to that cluster and the DS These points are so close to the centroid that they can be summarized and then discarded 2) Use any main-memory clustering algorithm to cluster the remaining points and the old RS Clusters go to the CS; outlying points to the RS 59 Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
60
The “Memory-Load” of Points Processing the “Memory-Load” of points (2): 3) DS set: Adjust statistics of the clusters to account for the new points Add Ns, SUMs, SUMSQs 4) Consider merging compressed sets in the CS 5) If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster 60 Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
61
BFR: “Galaxies” Picture 61 [Figure: a cluster whose points are in the DS, with its centroid; compressed sets whose points are in the CS; isolated points in the RS] Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
62
A Few Details… Q1) How do we decide if a point is “close enough” to a cluster that we will add the point to that cluster? Q2) How do we decide whether two compressed sets (CS) deserve to be combined into one? 62
63
How Close is Close Enough? Q1) We need a way to decide whether to put a new point into a cluster (and discard) BFR suggests two ways: The Mahalanobis distance is less than a threshold High likelihood of the point belonging to currently nearest centroid 63
64
Mahalanobis Distance 64 The (normalized) Mahalanobis distance of a point x = (x_1, …, x_d) from a cluster centroid c = (c_1, …, c_d) is d(x, c) = sqrt( Σ_i ((x_i − c_i) / σ_i)^2 ), where σ_i is the standard deviation of the points in the cluster in the i-th dimension
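A small sketch of this distance under the axis-aligned assumption (illustrative only; the centroid, standard deviations, and threshold below are made up):

```python
import numpy as np

def mahalanobis(point, centroid, std_dev):
    """Normalized, axis-aligned Mahalanobis distance: each dimension's deviation
    from the centroid is scaled by that dimension's standard deviation."""
    return float(np.sqrt((((point - centroid) / std_dev) ** 2).sum()))

centroid = np.array([2.0, 4.0])
std_dev = np.array([0.8, 1.6])    # per-dimension sigma, e.g. derived from N, SUM, SUMSQ
point = np.array([3.0, 5.0])

d = mahalanobis(point, centroid, std_dev)
# One possible acceptance rule (a choice, not mandated by the slides): add the point
# to the nearest cluster's DS if d is within a few standard deviations of the centroid
print(d, d < 2 * np.sqrt(len(point)))
```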
65
Mahalanobis Distance 65
66
Should 2 CS clusters be combined? Q2) Should 2 CS subclusters be combined? Compute the variance of the combined subcluster N, SUM, and SUMSQ allow us to make that calculation quickly Combine if the combined variance is below some threshold Many alternatives: Treat dimensions differently, consider density 66
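Since N, SUM, and SUMSQ are additive, the combined subcluster's variance can be computed without touching the original points; the sketch below (toy summaries and an arbitrary threshold, both invented for illustration) shows one way to make the merge decision.

```python
import numpy as np

def merge_summaries(s1, s2):
    """BFR summaries are additive: add N, SUM, and SUMSQ component-wise."""
    (n1, sum1, sumsq1), (n2, sum2, sumsq2) = s1, s2
    return n1 + n2, sum1 + sum2, sumsq1 + sumsq2

def variance(summary):
    n, s, sq = summary
    return sq / n - (s / n) ** 2              # per-dimension variance

def should_combine(s1, s2, threshold=1.5):
    """Combine two CS subclusters only if the merged per-dimension variance stays
    below `threshold` (an arbitrary value chosen for illustration)."""
    return bool(np.all(variance(merge_summaries(s1, s2)) < threshold))

# Toy (N, SUM, SUMSQ) summaries of two 2-D subclusters, invented for illustration
a = (3, np.array([6.0, 12.0]), np.array([14.0, 56.0]))
b = (2, np.array([5.0, 9.0]), np.array([12.7, 40.7]))
print(should_combine(a, b))
```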