Part I: Introductory Materials
Introduction to Data Mining
Dr. Nagiza F. Samatova
Department of Computer Science, North Carolina State University
and Computer Science and Mathematics Division, Oak Ridge National Laboratory
Who are the data producers? What data? (Application → Data)
– Application Category: Finance; Producer: Wall Street; Data: stocks, stock prices, stock purchases, …
– Application Category: Academia; Producer: NCSU; Data: student admission data (name, DOB, GRE scores, transcripts, GPA, university/school attended, recommendation letters, personal statement, etc.)
What questions to ask about the data? (Data → Questions)
Academia: NCSU: Admission data
1. Is there any correlation between the students’ GRE scores and their successful completion of a PhD program?
2. What are the groups of students that share common academic performance?
3. Are there any admitted students who would stand out as an anomaly? What type of anomaly is that?
4. If a student majors in Physics, what other major is he/she likely to double-major in?
Questions by Types?
– Correlation, similarity, comparison, …
– Association, causality, co-occurrence, …
– Grouping, clustering, …
– Categorization, classification, …
– Frequency or rarity of occurrence, …
– Anomalous or normal objects, events, behaviors, …
– Forecasting: future classes, future activity, …
– …
What information do we need to answer the questions? (Questions → Data Objects and Object Features)
Academia: NCSU: Admission data
– Objects: Students
– Object’s Features = Variables = Attributes = Dimensions, and their Types:
  Name: String (e.g., Name=Neil Shah)
  GPA: Numeric (e.g., GPA=5.0)
  Recommendation: Text (e.g., “… the top 2% in my career …”)
  Etc.
How to compare two objects? (Data Object → Object Pairs)
Academia: NCSU: Admission data
– Objects: Students
– Based on a single feature:
  Similar GPA
  The same first letter in the last name
– Based on a set of features:
  Similar academic records (GPA, GRE, etc.)
  Similar demographic records
– Can you compute a numerical value for the similarity measure used for comparison? Why or why not?
How to represent data mathematically? (Data Object & its Features → Data Model)
What mathematical objects have you studied?
– Scalars
– Points
– Vectors
– Vector spaces
– Matrices
– Sets
– Graphs, networks (maybe)
– Tensors (maybe)
– Time series (maybe)
– Topological manifolds (maybe)
– …
Data object as vector with components…
City = (Latitude, Longitude): a 2-dimensional object
Vector components: Features, or Attributes, or Dimensions
Raleigh = (35.46, 78.39)
Boston = (42.21, 71.5)
Proximity(Raleigh, Boston) = ?
– Geodesic distance
– Euclidean distance
– Length of the interstate route
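Two of the proximity measures listed above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: it assumes the slide's coordinates are decimal degrees with both longitudes measured West, and it uses the haversine formula on a spherical Earth as a stand-in for "geodesic distance". The function names are hypothetical.

```python
import math

# Coordinates exactly as written on the slide: (latitude, longitude).
# Assumption: decimal degrees, both longitudes in degrees West.
RALEIGH = (35.46, 78.39)
BOSTON = (42.21, 71.5)

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def euclidean(p, q):
    """Straight-line distance in raw coordinate units (degrees here)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def geodesic_km(p, q):
    """Great-circle distance via the haversine formula, in kilometres."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


print(f"Euclidean: {euclidean(RALEIGH, BOSTON):.2f} degrees")
print(f"Geodesic:  {geodesic_km(RALEIGH, BOSTON):.0f} km")
```

The two numbers are in different units and answer different questions, which is the point of the slide: the choice of proximity measure is part of the data model.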
A set of data objects as vector spaces
3-dimensional vector space with axes Latitude, Longitude, Altitude; example points: Raleigh, Moscow
Mining such data ~ studying vector spaces
Multi-dimensional vectors…
Student = (Name, GPA, Weight, Height, Income in K, …): multi-dimensional
Vector components: Features, or Attributes, or Dimensions
S1 = (John Smith, 5.0, 180, 6.0, 200)
S2 = (Jane Doe, 3.0, 140, 5.4, 70)
Proximity(S1, S2) = ?
How to compare when vector components are of heterogeneous types or on different scales? How to show the results of the comparison?
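One common way to handle the different-scales question is to rescale every numeric feature to [0, 1] before measuring distance. The sketch below uses the numeric components of S1 and S2 from the slide. The per-feature ranges in `RANGES` are hypothetical values chosen only for illustration, and the name (a string) is simply left out rather than compared.

```python
import math

# Numeric features from the slide: (GPA, Weight, Height, Income in K).
# The name is a string and is excluded from this numeric comparison.
S1 = (5.0, 180, 6.0, 200)
S2 = (3.0, 140, 5.4, 70)

# Hypothetical (min, max) range per feature, used only for rescaling.
# These ranges are illustrative assumptions, not values from the slides.
RANGES = [(0.0, 5.0), (80, 300), (4.0, 7.5), (0, 500)]


def rescale(vector, ranges):
    """Map each component to [0, 1] using its assumed (min, max) range."""
    return [(x - lo) / (hi - lo) for x, (lo, hi) in zip(vector, ranges)]


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


raw = euclidean(S1, S2)
scaled = euclidean(rescale(S1, RANGES), rescale(S2, RANGES))
print(f"raw distance:    {raw:.2f}")     # dominated by income and weight
print(f"scaled distance: {scaled:.2f}")  # each feature contributes on [0, 1]
```

Without rescaling, the income difference (130) swamps the GPA difference (2.0); after rescaling, GPA becomes the largest single contributor. Other choices, such as z-scores or a weighted distance, would be equally reasonable.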
… as matrices
Example: A collection of text documents on the Web
Figure labels: Original Documents; Parsed Documents; t-d term-document matrix; Terms = Features = Dimensions
Mining such data ~ studying matrices
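A term-document matrix can be built from parsed documents with nothing more than word counts. The sketch below is a minimal version under simplifying assumptions: the three documents are a hypothetical toy corpus, and "parsing" is just lowercasing and splitting on whitespace.

```python
from collections import Counter

# Hypothetical toy corpus; the documents are illustrative, not from the slides.
documents = {
    "D1": "the president leads the government",
    "D2": "the party won the election",
    "D3": "exports of rice and fish drive the economy",
}


def term_document_matrix(docs):
    """Return (terms, doc_ids, matrix); matrix[i][j] counts term i in doc j."""
    counts = {doc_id: Counter(text.lower().split())
              for doc_id, text in docs.items()}
    terms = sorted(set().union(*counts.values()))  # one row per distinct term
    doc_ids = list(docs)                           # one column per document
    matrix = [[counts[d][t] for d in doc_ids] for t in terms]
    return terms, doc_ids, matrix


terms, doc_ids, matrix = term_document_matrix(documents)
for term, row in zip(terms, matrix):
    print(f"{term:<12}", row)
```

Each row is a term (a feature, or dimension) and each column is a document, so every document becomes a vector in term space.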
… or as trees
Figure labels: t-d term-document matrix; document; terms; D2; D3. Term groups shown:
– president, government, party, election, political, elected, national, districts, held, district, independence, vice, minister, parties
– population, area, climate, city, miles, province, land, topography, total, season, 1999, square, rate
– economy, million, products, 1996, growth, copra, economic, 1997, food, scale, exports, rice, fish
Is D2 similar to D3? What if there are 10,000 terms?
Mining such data ~ studying trees
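One way to put a number on "Is D2 similar to D3?" is cosine similarity between the two documents' term-count vectors. The slide does not name a measure, so this is an assumed choice. The terms below appear on the slide, but the counts are hypothetical. Storing each document as a dict of nonzero counts keeps the computation cheap even with 10,000 terms, since absent terms are never touched.

```python
import math

# Hypothetical term counts; the terms appear on the slide, the counts do not.
D2 = {"president": 3, "government": 2, "election": 1, "economy": 1}
D3 = {"economy": 4, "exports": 2, "rice": 1, "government": 1}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors (dicts)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for an empty document
    return dot / (norm_a * norm_b)


print(f"cos(D2, D3) = {cosine_similarity(D2, D3):.3f}")
```

A value near 1 means the documents use terms in similar proportions; a value near 0 means they share almost no terms.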
… or as networks, or graphs with nodes & links
Figure: the same three term groups as on the previous slide:
– population, area, climate, city, miles, province, land, topography, total, season, 1999, square, rate
– president, government, party, election, political, elected, national, districts, held, district, independence, vice, minister, parties
– economy, million, products, 1996, growth, copra, economic, 1997, food, scale, exports, rice, fish
Nodes = Documents
Links = Document similarity (e.g., if a document references another document)
Mining such data ~ studying graphs, or graph mining
What apps naturally deal with graphs?
– Semantic Web
– Social Networks
– World Wide Web
– Drug Design, Chemical compounds
– Computer networks
– Sensor networks
Credit: Images are from Google Images, found via keyword search.
What questions to ask about graph data? (Graph Data → Graph Mining Questions)
Academia: NCSU: Admission data
1. Nodes = students; links = similar academics/demographics
2. How many distinct academic-performance groups of students were admitted to NCSU?
3. Which academic group is the largest?
4. Given a new student applicant, can we predict which academic group the student will likely belong to?
5. Do groups of students with similar demographics usually share similar academic performance?
6. Over the last decade, has the diversity in demographics of accepted student groups increased or decreased?
7. …
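Questions 2 and 3 can be approximated, very crudely, by finding the connected components of the similarity graph: every set of students joined by a chain of links counts as one group. This is a deliberately simple stand-in for a real graph-clustering method, and the student ids and links below are hypothetical.

```python
from collections import deque

# Hypothetical similarity links between students (nodes).
# ("s6", "s6") is a self-link so that an isolated student still appears.
links = [("s1", "s2"), ("s2", "s3"), ("s4", "s5"), ("s6", "s6")]


def connected_components(edges):
    """Group nodes so that any two nodes in a group are joined by a path."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    seen, groups = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), []
        while queue:  # breadth-first search from an unvisited node
            node = queue.popleft()
            group.append(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        groups.append(sorted(group))
    return groups


groups = connected_components(links)
largest = max(groups, key=len)
print(len(groups), "groups; largest:", largest)
```

Connected components merge two groups as soon as a single link joins them, so denser-subgraph or modularity-based methods are usually preferred when links are noisy.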
Recap: Data Mining and Graph Mining
Diagram labels: Data; Application; Questions; Data Objects + Features; Mathematical Data Representation (Data Model)
Data models: Vectors, Matrices, Graphs, Time series, Tensors, Sets, Manifolds
– One hat does not fit all
– More than one model is needed
– Models are related
How much data?
Domains: Astrophysics, Cosmology, Climate, Biology, Ecology, Web
Volumes shown: 30TB/day; 20-40TB/simulation; 1PB/year; 850TB
1 TB (TeraByte) = 10^12 Bytes
1 PB (PetaByte) = 10^15 Bytes
My laptop: 60 GB (GigaBytes); 1 GB = 10^9 Bytes
It is not just the Size, but the Complexity
– Petabytes of data
– Noisy
– Non-linear correlations
– ‘+’ and ‘-’ feedbacks
– High-dimensional
Data Describes Complex Patterns/Phenomena
How to untangle the riddles of the complexity?
Figure labels: Single gene; ~30k genes; Complex regulation; 50 trans elements control single gene expression
Challenge: How to “connect the dots” to answer important science/business questions?
Analytical tools that find the “dots” from data significantly reduce data.
Connecting the Dots
Stages: Finding the Dots; Connecting the Dots; Understanding the Dots
Sheer Volume of Data:
– Climate: now 20-40 Terabytes/year; in 5 years 5-10 Petabytes/year
– Fusion: now 100 Megabytes/15 min; in 5 years 1000 Megabytes/2 min
Advanced Math + Algorithms:
– Huge dimensional space
– Combinatorial challenge
– Complicated by noisy data
– Requires high-performance computers
Providing Predictive Understanding:
– Produce bioenergy
– Stabilize CO2
– Clean toxic waste
Why Would Data Mining Matter?
Enables solving many large-scale data problems: Finding the Dots; Connecting the Dots; Understanding the Dots
Science Questions:
– How to effectively produce bioenergy?
– How to stabilize carbon dioxide?
– How to convert toxic into non-toxic waste?
– ...
How to Move and Access the Data?
Technology trends are a rate-limiting factor. Most of these data will NEVER be touched!
– Naturally distributed but effectively immovable
– Streaming/Dynamic but not re-computable
– Data doubles every 9 months; CPU, every 18 months.
Figure: CPU, Disk, Network Trend (axes: MIPS/$M, GB/$M, kB/s). Doubling: CPU every 1.2 years; Disk every 1.4 years; WAN every 0.7 years. Src: Richard Mount, SLAC
Figure: Latency and Speed, Storage Performance (retrieval rate in Mbytes/s versus log10(object size in bytes), for Memory, Disk, Tape). J. W. Toigo, Avoiding a Data Crunch, Scientific American, May 2000
How to Make Sense of Data? Know Your Limits & Be Smart
– Not humanly possible to browse a petabyte of data. Analysis must reduce data to quantities of interest.
– To see 1 percent of a petabyte at 10 megabytes per second takes 35 8-hour days!
– Ultrascale Computations: Must be smart about which probe combinations to see!
– Physical Experiments: Must be smart about probe placement!
Figure labels: Megabytes, Gigabytes, Terabytes, Petabytes; More data; More analysis; Scalability of analysis in full context; Human Bandwidth Overload?
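The "35 8-hour days" figure can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes), matching the definitions given earlier in the deck.

```python
PETABYTE = 10**15  # bytes (decimal definition)
MEGABYTE = 10**6   # bytes

bytes_to_view = PETABYTE // 100  # 1 percent of a petabyte
rate = 10 * MEGABYTE             # viewing rate: 10 megabytes per second

seconds = bytes_to_view / rate
hours = seconds / 3600
working_days = hours / 8         # 8-hour days, as on the slide

print(f"{seconds:.0f} s = {hours:.1f} h = {working_days:.1f} eight-hour days")
```

The result is one million seconds, or just under 35 eight-hour days, which agrees with the slide.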
What Analysis Algorithms to Use?
Even a simple big O analysis can elucidate simplicity.
Algorithmic Complexity:
– Calculate means: O(n)
– Calculate FFT: O(n log(n))
– Calculate SVD: O(r c)
– Clustering algorithms: O(n^2)
If n=10GB, then what is O(n) or O(n^2) on a teraflop computer? (1GB = 10^9 bytes; 1Tflop = 10^12 op/sec)
Estimated run time by data size n and algorithm complexity. For illustration, the chart assumes 10^-12 sec. (1Tflop/sec) calculation time per data point:
Data size n | n | n log(n) | n^2
100B | 10^-10 sec. | 10^-10 sec. | 10^-8 sec.
10KB | 10^-8 sec. | 10^-8 sec. | 10^-4 sec.
1MB | 10^-6 sec. | 10^-5 sec. | 1 sec.
100MB | 10^-4 sec. | 10^-3 sec. | 3 hrs.
10GB | 10^-2 sec. | 0.1 sec. | 3 yrs.
Analysis algorithms fail for a few gigabytes.
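The run-time chart can be regenerated from the slide's stated assumption of 10^-12 seconds per data point. The sketch below is an illustration under that assumption. Using a base-10 logarithm for the n log(n) column is a guess that happens to reproduce the slide's order-of-magnitude values, and constant factors are ignored throughout.

```python
import math

SECONDS_PER_OP = 1e-12  # slide's assumption: 10^-12 s per data point (1 Tflop/s)

COSTS = {
    "n": lambda n: n,
    "n log10(n)": lambda n: n * math.log10(n),  # base 10 is an assumption
    "n^2": lambda n: n ** 2,
}


def runtime_seconds(n, cost):
    """Estimated time for cost(n) operations at SECONDS_PER_OP each."""
    return cost(n) * SECONDS_PER_OP


sizes = [("100B", 1e2), ("10KB", 1e4), ("1MB", 1e6), ("100MB", 1e8), ("10GB", 1e10)]
for label, n in sizes:
    row = "  ".join(f"{name}: {runtime_seconds(n, f):9.2e} s"
                    for name, f in COSTS.items())
    print(f"{label:>6}  {row}")

years = runtime_seconds(1e10, COSTS["n^2"]) / (365 * 24 * 3600)
print(f"O(n^2) at 10GB takes about {years:.1f} years")
```

At 10GB the linear and n log(n) algorithms finish in well under a second, while the quadratic one needs roughly three years, which is why quadratic clustering stops being practical at a few gigabytes.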