Download presentation
Presentation is loading. Please wait.
1
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine ICS 278: Data Mining Lecture 1: Introduction to Data Mining Padhraic Smyth Department of Information and Computer Science University of California, Irvine
2
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Philosophy behind this class Develop an overall sense of how to extract information from data in a systematic way Emphasis on the process of data mining –understanding specific algorithms and methods is important –But also…emphasize the “big picture” of why, not just how –Less emphasis on theory and mathematics (than in 274) Builds on knowledge from ICS 273, 274, etc.
3
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Logistics Grading –30% homeworks Every 2 weeks Guidelines for collaboration Homework 1 due in 2 weeks (on the Web page) –70% class project Will discuss in next lecture Office hours –Fridays, 8:30 to 10 Web page –www.ics.uci.edu/~smyth/courses/ics278www.ics.uci.edu/~smyth/courses/ics278 Prerequisites –Either ICS 273 or 274 or equivalent Text –Principles of Data Mining, Hand/Mannila/Smyth, MIT Press, 2001
4
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Quiz 5 minute quiz to quickly assess your background Will not be used in grading for the class, for information purposes only
5
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Lecture 1: Introduction to Data Mining What is data mining? Data sets –The “data matrix” –Other data formats Data mining tasks –Prediction and description Data mining algorithms –Score functions, models, and optimization methods The dark side of data mining Required reading: Chapter 1 of PDM (Principles of Data Mining)
6
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine What is data mining?
7
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine What is data mining? “The magic phrase used to..….. - put in your resume - use in a proposal to NSF, NIH, NASA, etc - market database software - sell statistical analysis software - sell parallel computing hardware - sell consulting services”
8
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine What is data mining? “ Data-driven discovery of models and patterns from massive observational data sets ”
9
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine What is data mining? “ Data-driven discovery of models and patterns from massive observational data sets ” Statistics, Inference
10
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine What is data mining? “ Data-driven discovery of models and patterns from massive observational data sets ” Statistics, Inference Languages and Representations
11
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine What is data mining? “ Data-driven discovery of models and patterns from massive observational data sets ” Statistics, Inference Languages and Representations Engineering, Data Management
12
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine What is data mining? “ Data-driven discovery of models and patterns from massive observational data sets ” Statistics, Inference Languages and Representations Engineering, Data Management Retrospective Analysis
13
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Technological Driving Factors Larger, cheaper memory – Moore’s law for magnetic disk density “capacity doubles every 18 months” –storage cost per byte falling rapidly Faster, cheaper processors – the CRAY of 15 years ago is now on your desk Success of Relational Database technology –everybody is a “data owner” New ideas in machine learning/statistics –Boosting, SVMs, decision trees, etc
14
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Examples of data volumes MEDLINE text database –12 million published articles Google –4.2 billion Web pages indexed –80 million site visitors per day CALTRANS loop sensor data –Every 30 seconds, thousands of sensors, 2Gbytes per day NASA MODIS satellite –Coverage at 250m resolution, 37 bands, whole earth, every day Walmart transaction data –Order of 100 million transactions per day
15
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Two Types of Data Experimental Data –Hypothesis H –design an experiment to test H –collect data, infer how likely it is that H is true –e.g., clinical trials in medicine Observational or Retrospective or Secondary Data –massive non-experimental data sets e.g., human genome, atmospheric simulations, etc –assumptions of experimental design no longer valid –how can we use such data to do science? data must support model exploration, hypothesis testing
16
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Data-Driven Discovery Observation data –cheap relative to experimental data Examples: –Transaction data archives for retail stores, airlines, etc –Web logs for Amazon, Google, etc –The human/mouse/rat genome –Etc., etc makes sense to leverage available data useful (?) information may be hidden in vast archives of data Contrast data mining with traditional statistics – traditional stats: first hypothesize, then collect data, then analyze – data mining: few if any a priori hypotheses, data is usually already there analysis is typically data-driven not hypothesis driven –Nonetheless, statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful
17
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Let the data speak…
18
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Let the data speak… The data may have quite a lot to say….. but it may just be noise!
19
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Origins of Data Mining pre 19601960’s1970’s 1980’s 1990’s “Pencil and Paper” EDA “Flexible Models” Hardware (sensors, storage, computation) Relational Databases AI Pattern Recognition Machine Learning “Data Dredging” Data Mining
20
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine DM: Intersection of Many Fields Data Mining Machine Learning (ML) Databases (DB) Statistics (stats) Computer Science (CS) Human Computer Interaction (HCI) Visualization (viz) High-Performance Parallel Computing
21
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Flat File or Vector Data Rows = objects Columns = measurements on objects –Represent each row as a p-dimensional vector, where p is the dimensionality In efffect, embed our objects in a p-dimensional vector space Often useful, but always appropriate Both n and p can be very large in certain data mining applications n p
22
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Sparse Matrix (Text) Data Word IDs Text Documents
23
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine “Market Basket” Data
24
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine 128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -, 128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -, 128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -, 128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -, 128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -, 128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -, 128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -, … 5115 11111151511151 77777777 111333 3333131113332232 … User 5 User 4 User 3 User 2 User 1 Sequence (Web) Data
25
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Time Series Data
26
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Image Data
27
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine
29
Spatio-temporal data
30
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Tucker Balch and Frank Dellaert, Computer Science Department, Georgia Tech
31
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Tucker Balch and Frank Dellaert, Computer Science Department, Georgia Tech
32
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Steve Gross, Physics and Biology, UC Irvine Movement of Pigment in Skin Cells of Frogs
33
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Steve Gross, Physics and Biology, UC Irvine Movement of Pigment in Skin Cells of Frogs
34
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Steve Gross, Physics and Biology, UC Irvine Movement of Pigment in Skin Cells of Frogs
35
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Steve Gross, Physics and Biology, UC Irvine Movement of Pigment in Skin Cells of Frogs
36
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Movement of Pigment in Skin Cells of Frogs Steve Gross, Physics and Biology, UC Irvine
37
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Relational Data
38
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Different Data Mining Tasks Exploratory Data Analysis Descriptive Modeling Predictive Modeling Discovering Patterns and Rules + others….
39
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Exploratory Data Analysis Getting an overall sense of the data set –Computing summary statistics: Number of distinct values, max, min, mean, median, variance, skewness,.. Visualization is widely used –1d histograms –2d scatter plots –Higher-dimensional methods Useful for data checking –E.g., finding that a variable is always integer valued or positive –Finding the some variables are highly skewed Simple exploratory analysis can be extremely valuable –You should always “look” at your data before applying any data mining algorithms
40
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Example of Exploratory Data Analysis (Pima Indians data, scatter plot matrix)
41
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Descriptive Modeling Goal is to build a “generative” or “descriptive” model, –E.g., a model that could simulate the data if needed –models the underlying process Examples: –Density estimation: estimate the joint distribution P(x 1,……x p ) –Cluster analysis: Find natural groups in the data –Dependency models among the p variables Learning a Bayesian network for the data
42
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Example of Descriptive Modeling Anemia Group Control Group
43
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Another Example of Descriptive Modeling Learning Directed Graphical Models (aka Bayes Nets) –goal: learn directed relationships among p variables –techniques: directed (causal) graphs –challenge: distinguishing between correlation and causation canceryellow fingers ? smoking –example: Do yellow fingers cause lung cancer? hidden cause: smoking
44
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Predictive Modeling Predict one variable Y given a set of other variables X –Here X could be a p-dimensional vector Classification: Y is categorical Regression: Y is real-valued In effect this is function approximation, learning the relationship between Y and X Many, many algorithms for predictive modeling in statistics and machine learning Often the emphasis is on predictive accuracy, less emphasis on understanding the model
45
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Example of Predictive Modeling Background –AT&T has about 100 million customers –It logs 200 million calls per day, 40 attributes each –250 million unique telephone numbers –Which are business and which are residential? Solution (Pregibon and Cortes, AT&T,1997) –Proprietary model, using a few attributes, trained on known business customers to adaptively track p(business|data) –Significant systems engineering: data are downloaded nightly, model updated (20 processors, 6Gb RAM, terabyte disk farm) Status: –running daily at AT&T –HTML interface used by AT&T marketing
46
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Pattern Discovery Goal is to discover interesting “local” patterns in the data rather than to characterize the data globally given market basket data we might discover that If customers buy wine and bread then they buy cheese with probability 0.9 These are known as “association rules” Given multivariate data on astronomical objects We might find a small group of previously undiscovered objects that are very self-similar in our feature space, but are very far away in feature space from all other objects
47
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Example of Pattern Discovery ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDD DCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBB BDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBAC DCADCBACCADCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDAB DDBCADADBCBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADC CBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDDDADAA BADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCD CCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBC BDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBC CDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
48
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Example of Pattern Discovery ADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADAADABDBBDABABBCDD DCDDABDCBBDBDBCBBABBBCBBABCBBACBBDBAACCADDADBDBBCBBCCBB BDCABDDBBADDBBBBCCACDABBABDDCDDBBABDBDDBDDBCACDBBCCBBAC DCADCBACCADCCCACCDDADCBCADADBAACCDDDCBDBDCCCCACACACCDAB DDBCADADBCBDDADABCCABDAACABCABACBDDDCBADCBDADDDDCDDCADC CBBADABBAAADAAABCCBCABDBAADCBCDACBCABABCCBACBDABDDDADAA BADCDCCDBBCDBDADDCCBBCDBAADADBCAAAADBDCADBDBBBCDCCBCCCD CCADAADACABDABAABBDDBCADDDDBCDDBCCBBCCDADADACCCDABAABBC BDBDBADBBBBCDADABABBDACDCDDDBBCDBBCBBCCDABCADDADBACBBBC CDBAAADDDBDDCABACBCADCDCBAAADCADDADAABBACCBB
49
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Example of Pattern Discovery IBM “Advanced Scout” System –Bhandari et al. (1997) –Every NBA basketball game is annotated, e.g., time = 6 mins, 32 seconds event = 3 point basket player = Michael Jordan This creates a huge untapped database of information –IBM algorithms search for rules of the form “If player A is in the game, player B’s scoring rate increases from 3.2 points per quarter to 8.7 points per quarter” –IBM claimed around 1998 that all NBA teams except 1 were using this software…… the “other team” was Chicago.
50
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Structure: Models and Patterns Model = abstract representation of a process e.g., very simple linear model structure Y = a X + b –a and b are parameters determined from the data –Y = aX + b is the model structure –Y = 0.9X + 0.3 is a particular model – “All models are wrong, some are useful” (G.E. Box) Pattern represents “local structure” in a data set –E.g., if X>x then Y >y with probability p – or a pattern might be a small cluster of outliers in multi- dimensional space
51
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Components of Data Mining Algorithms Representation: –Determining the nature and structure of the representation to be used; Score function –quantifying and comparing how well different representations fit the data Search/Optimization method –Choosing an algorithmic process to optimize the score function; and Data Management –Deciding what principles of data management are required to implement the algorithms efficiently.
52
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Task What’s in a Data Mining Algorithm? Representation Score Function Search/Optimization Data Management Models, Parameters
53
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Task An Example: Multivariate Linear Regression Representation Score Function Search/Optimization Data Management Models, Parameters Regression Y = Weighted linear sum of X’s Least-squares Gaussian elimination None specified Regression coefficients
54
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Task An Example: Decision Trees (C4.5 or CART) Representation Score Function Search/Optimization Data Management Models, Parameters Classification Hierarchy of axis-parallel linear class boundaries Cross-validated accuracy Greedy Search None specified Decision tree classifier
55
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Task An Example: Hierarchical Clustering Representation Score Function Search/Optimization Data Management Models, Parameters Clustering Tree of clusters Various Greedy search None specified Dendrogram
56
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Task An Example: Association Rules Representation Score Function Search/Optimization Data Management Models, Parameters Pattern Discovery Rules: if A and B then C with prob p No explicit score Systematic search Multiple linear scans Set of Rules
57
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Data Mining: the downside Hype Data dredging, snooping and fishingData dredging –Finding spurious structure in data that is not real historically, ‘data mining’ was a derogatory term in the statistics community –making inferences from small samples The Super Bowl fallacy Bangladesh butter prices and the US stock market The challenges of being interdisciplinary –computer science, statistics, domain discipline
58
Data Mining Lectures Lecture 1: Introduction Padhraic Smyth, UC Irvine Next Lecture Discussion of class projects Chapter 2 –Measurement and data –Distance measures –Data quality
Similar presentations
© 2025 SlidePlayer.com Inc.
All rights reserved.