1
BIG DATA ANALYTICS Maryam Amir Haeri
2
2 The origins of these slides are as follows: http://www.mmds.org/ http://occc.ir/presentations/ http://web.cs.wpi.edu/~cs561/s14/Lectures/
3
Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value and knowledge from them 3
4
Introduction to Big Data What is Big Data? What makes data “Big” Data? 4
5
Big Data Definition No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it… 5
6
Characteristics of Big Data: 1-Scale (Volume) Data volume: a 44x increase from 2009 to 2020, from 0.8 zettabytes to 35 ZB. Data volume is increasing exponentially. 6
7
Characteristics of Big Data: 2-Complexity (Variety) Various formats, types, and structures: text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc. Static data vs. streaming data A single application can be generating/collecting many types of data 7 To extract knowledge, all these types of data need to be linked together
8
Characteristics of Big Data: 3-Speed (Velocity) Data is being generated fast and needs to be processed fast Online Data Analytics Late decisions mean missing opportunities Examples E-Promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you Healthcare monitoring: sensors monitoring your activities and body; any abnormal measurements require immediate reaction 8
9
Some Make it 4V’s 9
10
Veracity 10 Veracity refers to the messiness or trustworthiness of data. With many forms of big data, quality and accuracy are less controllable (just think of Twitter posts with hashtags, abbreviations, typos, and colloquial speech, as well as the reliability and accuracy of content), but technology now allows us to work with this type of data.
11
Value 11 There is another V to take into account when looking at big data: Value. Having access to big data is no good unless we can turn it into value. Companies are starting to generate amazing value from their big data.
12
Who’s Generating Big Data Social media and networks (all of us are generating data) Scientific instruments (collecting all sorts of data) Mobile devices (tracking all objects all the time) Sensor technology and networks (measuring all kinds of data) Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion 12
13
The Model Has Changed… The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data 13
14
Challenges in Handling Big Data The Bottleneck is in technology New architecture, algorithms, techniques are needed Also in technical skills Experts in using the new technology and dealing with big data 14
15
15 Data contains value and knowledge
16
Data Mining 16 But to extract the knowledge, data needs to be stored, managed, and ANALYZED (the focus of this presentation) Data Mining ≈ Big Data Analytics ≈ Predictive Analytics ≈ Data Science
17
Good news: Demand for Data Mining J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 17
18
What is Data Mining? J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 18 Given lots of data Discover patterns and models that are: Valid: hold on new data with some certainty Useful: should be possible to act on the item Unexpected: non-obvious to the system Understandable: humans should be able to interpret the pattern
19
Data Mining Tasks J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 19 Descriptive methods Find human-interpretable patterns that describe the data Example: Clustering Predictive methods Use some variables to predict unknown or future values of other variables Example: Recommender systems
20
What matters when dealing with data? J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 20 Scalability Streaming Context Quality Usage
21
What is Data Mining J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 21 Non-trivial extraction of implicit, previously unknown and useful information from data
22
Mining Large Data Sets – Motivation There is often information “hidden” in the data that is not readily evident Human analysts take weeks to discover useful information Much of the data is never analyzed at all 23
23
Data Streams What are Data Streams? Continuous streams Huge, Fast, and Changing Why Data Streams? The arrival speed of streams and the huge amount of data are beyond our capability to store them. “Real-time” processing Window Models Landmark window (entire data stream) Sliding window Damped window Mining Data Streams 23
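As a rough illustration of the sliding-window model above, here is a minimal Python sketch (not from the original slides): it keeps only the most recent window_size items of a stream and reports a running mean over them. The stream values and the window size are made up for the example.

```python
from collections import deque

def sliding_window_mean(stream, window_size=100):
    """Running mean over only the most recent `window_size` stream items.

    A minimal sketch of the sliding-window model: old items fall out of the
    window instead of being stored forever.
    """
    window = deque()
    running_sum = 0.0
    for x in stream:
        window.append(x)
        running_sum += x
        if len(window) > window_size:
            running_sum -= window.popleft()  # evict the oldest element
        yield running_sum / len(window)

# Example: average of the last 3 readings of a toy sensor stream
print(list(sliding_window_mean([1, 2, 3, 10, 10, 10], window_size=3)))
```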
24
Important Data Mining Tasks Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Sequential Pattern Discovery [Descriptive] Regression [Predictive]
25
Big Data Analytics 25 Web mining Graph mining Clustering documents and users Classification Document duplications Spam detection Advertising Social network analysis Fraud detection Cognitive science Collaborative filtering
26
Collaborative Filtering J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 26 Given a database of user preferences, predict the preferences of a new user Example: predict what new movies you will like based on your past preferences, others with similar past preferences, and their preferences for the new movies Example: predict what books/CDs a person may want to buy (and suggest them, or give discounts to tempt the customer)
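To make the idea concrete, here is a minimal sketch of user-based collaborative filtering (not from the slides): an unseen rating is predicted as a similarity-weighted average of ratings by users with similar past preferences. The rating matrix, the user/item indices, and the choice of cosine similarity are assumptions made purely for illustration.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: movies); 0 means unrated.
# The numbers and indices are invented purely for illustration.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def predict(ratings, user, item):
    """Predict `user`'s rating of `item` from users with similar past preferences."""
    target = ratings[user]
    weighted_sum, total_weight = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue                      # skip self and users who never rated `item`
        co_rated = (target > 0) & (ratings[other] > 0)
        if not co_rated.any():
            continue
        # Cosine similarity on the items both users rated
        sim = np.dot(target[co_rated], ratings[other][co_rated]) / (
            np.linalg.norm(target[co_rated]) * np.linalg.norm(ratings[other][co_rated]))
        weighted_sum += sim * ratings[other, item]
        total_weight += sim
    return weighted_sum / total_weight if total_weight else None

print(predict(ratings, user=0, item=2))   # user 0 mostly resembles user 1, who disliked item 2
```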
27
Anomaly Detection J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 27 Detect significant deviations from normal behavior Applications: - Credit Card Fraud Detection - Network Intrusion Detection
28
Association Rule Discovery Supermarket shelf management: - Goal: identify items that are bought together by sufficiently many customers. - Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items. - A classic rule: if a customer buys diapers and milk, then he is likely to buy Coke. So, don’t be surprised if you find six-packs stacked next to diapers!
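As a small worked example (the baskets below are invented, not from the slides), the rule {diaper, milk} → {coke} can be scored by its support and confidence over point-of-sale baskets:

```python
# Toy point-of-sale baskets, invented purely to illustrate support and confidence
baskets = [
    {"diaper", "milk", "coke"},
    {"diaper", "milk", "coke", "bread"},
    {"diaper", "milk", "beer"},
    {"milk", "bread"},
    {"coke", "bread"},
]

def support(itemset):
    """Fraction of baskets containing every item of `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

# Rule: {diaper, milk} -> {coke}
antecedent, consequent = {"diaper", "milk"}, {"coke"}
rule_support = support(antecedent | consequent)
rule_confidence = rule_support / support(antecedent)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")  # 0.40 and 0.67
```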
29
Statistical Limits on Data Mining 29 Total Information Awareness In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity. The concern raised by many is that if you look at so much data, and you try to find within it activities that look like terrorist behavior, are you not going to find many innocent activities – or even illicit activities that are not terrorism – that will result in visits from the police and maybe worse than just a visit?
30
Meaningfulness of Analytic Answers 30 A risk with “Data mining” is that an analyst can “discover” patterns that are meaningless Statisticians call it Bonferroni’s principle: Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap
31
Meaningfulness of Analytic Answers 31 Example: We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day 10^9 people being tracked 1,000 days Each person stays in a hotel 1% of the time (1 day out of 100) Hotels hold 100 people (so 10^5 hotels) If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious? Expected number of “suspicious” pairs of people: 250,000 … too many combinations to check; we need to have some additional evidence to find “suspicious” pairs of people in some more efficient way
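The estimate on the slide can be reproduced with a short back-of-the-envelope calculation; the script below (an illustration, not part of the original slides) just multiplies the probabilities and counts stated above.

```python
# Back-of-the-envelope check of the "suspicious pairs" estimate above
people = 1e9            # 10^9 people being tracked
days = 1_000            # observation period
p_hotel = 0.01          # each person is in some hotel on 1% of days
hotels = 1e5            # 10^5 hotels

# Probability that two given people visit the SAME hotel on a given day
p_same_hotel_same_day = p_hotel * p_hotel / hotels        # = 10^-9

# They look "suspicious" if this happens on (at least) two different days
p_two_days = p_same_hotel_same_day ** 2                   # = 10^-18
people_pairs = people * (people - 1) / 2                  # ~ 5 * 10^17
day_pairs = days * (days - 1) / 2                         # ~ 5 * 10^5

expected_suspicious_pairs = people_pairs * day_pairs * p_two_days
print(f"{expected_suspicious_pairs:,.0f}")                # roughly 250,000
```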
32
Processing Big Data 32 Sampling Map-Reduce Hashing Extracting Statistical Properties Approximation
33
Large-scale Computing 33 Large-scale computing for data mining problems on commodity hardware Challenges: How do you distribute computation? How can we make it easy to write distributed programs? Machines fail: one server may stay up 3 years (1,000 days) If you have 1,000 servers, expect to lose one per day People estimated Google had ~1M machines in 2011, so about 1,000 machines fail every day!
34
Idea and Solution 34 Issue: Copying data over a network takes time Idea: Bring computation close to the data Store files multiple times for reliability Map-reduce Google’s computational/data manipulation model Elegant way to work with big data Storage Infrastructure – File system Google: GFS. Hadoop: HDFS Programming model Map-Reduce
35
Storage Infrastructure J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 35 Problem: If nodes fail, how to store data persistently? Answer: Distributed File System: Provides global file namespace Google GFS; Hadoop HDFS; Typical usage pattern Huge files (100s of GB to TB) Data is rarely updated in place Reads and appends are common
36
Distributed File System J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 36 Reliable distributed file system Data kept in “chunks” spread across machines Each chunk replicated on different machines Seamless recovery from disk or machine failure [Diagram: chunks C0, C1, C2, C3, C5 and D0, D1 replicated across chunk servers 1 through N]
37
Programming Model: MapReduce J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 37 Warm-up task: We have a huge text document Count the number of times each distinct word appears in the file Sample application: Analyze web server logs to find popular URLs
38
MapReduce: Overview 38 Sequentially read a lot of data Map: Extract something you care about Group by key: Sort and Shuffle Reduce: Aggregate, summarize, filter or transform Write the result Outline stays the same, Map and Reduce change to fit the problem
39
MapReduce: The Reduce Step 39 [Diagram: intermediate key-value pairs are grouped by key into key-value groups; reduce turns each group into output key-value pairs]
40
MapReduce: Word Counting Big document: “The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need …'” MAP: read the input and produce a set of key-value pairs, e.g. (The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) … Group by key: collect all pairs with the same key, e.g. (crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) … Reduce: collect all values belonging to the key and output (key, value), e.g. (crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) … Map and Reduce are provided by the programmer; the data is only read sequentially 40
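Below is a minimal in-memory Python sketch of the word-counting pipeline just described: a map phase emitting (word, 1) pairs, a group-by-key step, and a reduce phase summing the counts. It only illustrates the programming model, not the Hadoop or Google MapReduce API; the sample sentence is taken from the slide's example document.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(document):
    # MAP: emit a (word, 1) pair for every word in the input
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # REDUCE: sum all counts collected for one word
    return (word, sum(counts))

document = "the crew of the space shuttle Endeavor recently returned to Earth"

# Group by key: sort the intermediate pairs so equal keys become adjacent (the shuffle)
pairs = sorted(map_phase(document), key=itemgetter(0))
word_counts = [reduce_phase(word, (count for _, count in group))
               for word, group in groupby(pairs, key=itemgetter(0))]
print(word_counts)   # e.g. ('the', 2) with its total count
```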
41
HASHING 41
42
10 nearest neighbors from a collection of 20,000 images Scene Completion Problem 42 [Hays and Efros, SIGGRAPH 2007]
43
10 nearest neighbors from a collection of 2 million images Scene Completion Problem 43 [Hays and Efros, SIGGRAPH 2007]
44
A Common Metaphor Many problems can be expressed as finding “similar” sets: Find near-neighbors in high-dimensional space Examples: Pages with similar words For duplicate detection, classification by topic Customers who purchased similar products Products with similar customer sets Images with similar features Users who visited similar websites 44
45
Problem for Today’s Lecture 45
46
Min-Hashing Goal: find a hash function h(·) such that: if sim(C1, C2) is high, then with high probability h(C1) = h(C2); if sim(C1, C2) is low, then with high probability h(C1) ≠ h(C2) Clearly, the hash function depends on the similarity metric: not all similarity metrics have a suitable hash function There is a suitable hash function for the Jaccard similarity: it is called Min-Hashing 46
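A minimal sketch of Min-Hashing (an illustration, not the exact construction on the slides): each set is summarized by the minimum value of several random hash functions, and the fraction of matching signature positions estimates the Jaccard similarity. The hash family, the prime modulus, and the toy sets below are assumptions chosen for the example.

```python
import random

def make_hash_funcs(n, prime=2_147_483_647, seed=42):
    """n random hash functions of the form h(x) = (a*x + b) mod prime."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n)]

def minhash_signature(item_set, hash_funcs, prime=2_147_483_647):
    # For each hash function, keep the minimum hash value over the set's elements
    return [min((a * x + b) % prime for x in item_set) for a, b in hash_funcs]

def estimated_jaccard(sig1, sig2):
    # The fraction of positions where two signatures agree estimates Jaccard similarity
    return sum(v1 == v2 for v1, v2 in zip(sig1, sig2)) / len(sig1)

hash_funcs = make_hash_funcs(200)
A = set(range(0, 80))      # toy "documents" represented as sets of shingle IDs
B = set(range(20, 100))    # true Jaccard similarity = 60 / 100 = 0.6
print(estimated_jaccard(minhash_signature(A, hash_funcs),
                        minhash_signature(B, hash_funcs)))   # should be close to 0.6
```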
47
USING STATISTICAL FEATURES 47
48
k–means Algorithm(s) Assumes Euclidean space/distance Start by picking k, the number of clusters Initialize clusters by picking one point per cluster Example: Pick one point at random, then k-1 other points, each as far away as possible from the previous points 48
49
Populating Clusters 1) For each point, place it in the cluster whose current centroid is nearest 2) After all points are assigned, update the locations of the centroids of the k clusters 3) Reassign all points to their closest centroid Sometimes this moves points between clusters Repeat 2 and 3 until convergence Convergence: points don’t move between clusters and centroids stabilize 49
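The assign/update loop above is easy to sketch directly; the NumPy code below (an illustration, not tied to the slides) initializes centroids with k random points rather than the farthest-point heuristic mentioned earlier, and repeats the two steps until no point changes cluster.

```python
import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means: assign every point to its nearest centroid, then move each
    centroid to the mean of its assigned points; repeat until assignments stabilize."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    assignment = None
    for _ in range(max_iters):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break                                   # converged: no point changed cluster
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members
        for j in range(k):
            members = points[assignment == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, assignment

points = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
print(kmeans(points, k=2))
```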
50
Example: Assigning Clusters 50 [Figure: data points and centroids (x); clusters after round 1]
51
Example: Assigning Clusters J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 51 [Figure: data points and centroids (x); clusters after round 2]
52
Example: Assigning Clusters 52 [Figure: data points and centroids (x); clusters at the end]
53
BFR Algorithm BFR [Bradley-Fayyad-Reina] is a variant of k-means designed to handle very large (disk-resident) data sets Assumes that clusters are normally distributed around a centroid in a Euclidean space Standard deviations in different dimensions may vary Clusters are axis-aligned ellipses Efficient way to summarize clusters (want memory required O(clusters) and not O(data)) 53
54
BFR Algorithm Points are read from disk one main-memory-full at a time Most points from previous memory loads are summarized by simple statistics To begin, from the initial load we select the initial k centroids by some sensible approach: Take k random points Take a small random sample and cluster optimally Take a sample; pick a random point, and then k–1 more points, each as far from the previously selected points as possible 54
55
Three Classes of Points 3 sets of points which we keep track of: Discard set (DS): Points close enough to a centroid to be summarized Compression set (CS): Groups of points that are close together but not close to any existing centroid These points are summarized, but not assigned to a cluster Retained set (RS): Isolated points waiting to be assigned to a compression set 55
56
BFR: “Galaxies” Picture 56 [Figure: a cluster whose points are in the DS, with its centroid; compressed sets whose points are in the CS; isolated points in the RS] Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
57
Summarizing Sets of Points For each cluster, the discard set (DS) is summarized by: The number of points, N The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension The vector SUMSQ: i-th component = sum of squares of coordinates in the i-th dimension 57
58
Summarizing Points: Comments 2d + 1 values represent any size cluster d = number of dimensions Average in each dimension (the centroid) can be calculated as SUM_i / N SUM_i = i-th component of SUM Variance of a cluster’s discard set in dimension i is: (SUMSQ_i / N) − (SUM_i / N)^2 And standard deviation is the square root of that Next step: actual clustering 58 Note: dropping the “axis-aligned” clusters assumption would require storing the full covariance matrix to summarize the cluster. So, instead of SUMSQ being a d-dimensional vector, it would be a d x d matrix, which is too big!
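The formulas above are easy to check on a toy cluster; in the sketch below (invented data, not from the slides) the centroid, variance, and standard deviation computed from N, SUM, and SUMSQ match those computed directly from the raw points.

```python
import numpy as np

# Toy 2-D cluster, invented for illustration
cluster_points = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

# The 2d + 1 numbers that summarize the cluster's discard set
N = len(cluster_points)
SUM = cluster_points.sum(axis=0)              # per-dimension sums
SUMSQ = (cluster_points ** 2).sum(axis=0)     # per-dimension sums of squares

centroid = SUM / N                            # SUM_i / N
variance = SUMSQ / N - (SUM / N) ** 2         # (SUMSQ_i / N) - (SUM_i / N)^2
std_dev = np.sqrt(variance)

print(centroid, variance, std_dev)
# Matches the values computed directly from the raw points:
print(cluster_points.mean(axis=0), cluster_points.var(axis=0))
```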
59
The “Memory-Load” of Points Processing the “Memory-Load” of points (1): 1) Find those points that are “sufficiently close” to a cluster centroid and add those points to that cluster and the DS These points are so close to the centroid that they can be summarized and then discarded 2) Use any main-memory clustering algorithm to cluster the remaining points and the old RS Clusters go to the CS; outlying points to the RS 59 Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
60
The “Memory-Load” of Points Processing the “Memory-Load” of points (2): 3) DS set: Adjust statistics of the clusters to account for the new points Add Ns, SUMs, SUMSQs 4) Consider merging compressed sets in the CS 5) If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster 60 Discard set (DS): Close enough to a centroid to be summarized. Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
61
BFR: “Galaxies” Picture 61 [Figure: a cluster whose points are in the DS, with its centroid; compressed sets whose points are in the CS; isolated points in the RS] Discard set (DS): Close enough to a centroid to be summarized Compression set (CS): Summarized, but not assigned to a cluster Retained set (RS): Isolated points
62
A Few Details… Q1) How do we decide if a point is “close enough” to a cluster that we will add the point to that cluster? Q2) How do we decide whether two compressed sets (CS) deserve to be combined into one? 62
63
How Close is Close Enough? Q1) We need a way to decide whether to put a new point into a cluster (and discard) BFR suggests two ways: The Mahalanobis distance is less than a threshold High likelihood of the point belonging to currently nearest centroid 63
64
Mahalanobis Distance 64 The (normalized) Mahalanobis distance of a point x = (x_1, …, x_d) from a cluster centroid c = (c_1, …, c_d) is d(x, c) = sqrt( Σ_i ((x_i − c_i) / σ_i)^2 ), where σ_i is the standard deviation of the points in the cluster in the i-th dimension
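A small sketch of this distance under the axis-aligned assumption (illustrative only; the centroid, standard deviations, and threshold below are made up):

```python
import numpy as np

def mahalanobis(point, centroid, std_dev):
    """Normalized, axis-aligned Mahalanobis distance: each dimension's deviation
    from the centroid is scaled by that dimension's standard deviation."""
    return float(np.sqrt((((point - centroid) / std_dev) ** 2).sum()))

centroid = np.array([2.0, 4.0])
std_dev = np.array([0.8, 1.6])    # per-dimension sigma, e.g. derived from N, SUM, SUMSQ
point = np.array([3.0, 5.0])

d = mahalanobis(point, centroid, std_dev)
# One possible acceptance rule (a choice, not mandated by the slides): add the point
# to the nearest cluster's DS if d is within a few standard deviations of the centroid
print(d, d < 2 * np.sqrt(len(point)))
```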
65
Mahalanobis Distance 65
66
Should 2 CS clusters be combined? Q2) Should 2 CS subclusters be combined? Compute the variance of the combined subcluster N, SUM, and SUMSQ allow us to make that calculation quickly Combine if the combined variance is below some threshold Many alternatives: Treat dimensions differently, consider density 66
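Since N, SUM, and SUMSQ are additive, the combined subcluster's variance can be computed without touching the original points; the sketch below (toy summaries and an arbitrary threshold, both invented for illustration) shows one way to make the merge decision.

```python
import numpy as np

def merge_summaries(s1, s2):
    """BFR summaries are additive: add N, SUM, and SUMSQ component-wise."""
    (n1, sum1, sumsq1), (n2, sum2, sumsq2) = s1, s2
    return n1 + n2, sum1 + sum2, sumsq1 + sumsq2

def variance(summary):
    n, s, sq = summary
    return sq / n - (s / n) ** 2              # per-dimension variance

def should_combine(s1, s2, threshold=1.5):
    """Combine two CS subclusters only if the merged per-dimension variance stays
    below `threshold` (an arbitrary value chosen for illustration)."""
    return bool(np.all(variance(merge_summaries(s1, s2)) < threshold))

# Toy (N, SUM, SUMSQ) summaries of two 2-D subclusters, invented for illustration
a = (3, np.array([6.0, 12.0]), np.array([14.0, 56.0]))
b = (2, np.array([5.0, 9.0]), np.array([12.7, 40.7]))
print(should_combine(a, b))
```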