Published by Ethelbert Foster. Modified over 6 years ago.
1
Data Mining Motivation: “Necessity is the Mother of Invention”
Automated data collection tools and mature database technology have led to tremendous amounts of stored data. We are drowning in data, but starving for knowledge! Solution: data mining. Extract interesting information (rules, patterns, constraints) from the data, reducing its volume while raising information/knowledge levels.
2
What Is Data Mining?
Data mining: extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases. Alternative names: knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, data prospecting, data archeology, data dredging, information harvesting, business intelligence, etc. What is not data mining? (Deductive) query processing.
3
Applications
Database analysis and decision support: Market analysis and management (target marketing, customer relationship management, market basket analysis, market segmentation). Risk analysis and management (forecasting, customer retention, improved underwriting, quality control, competitive analysis). Fraud detection and management. Other applications: Text mining (newsgroups, documents) and Web analysis. Intelligent query answering.
4
More Applications
Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat. Astronomy: 22 quasars were discovered with the help of data mining. Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs to discover customer preferences and behaviors, analyze the effectiveness of Web marketing, improve Web site organization, etc.
5
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process. (Process diagram, from bottom to top: Databases, then Data Cleaning/Integration (missing data, outliers, noise, errors), then Data Warehouse, then Selection (feature extraction, attribute selection), then Task-relevant Data, then Data Mining (classification, clustering, association rule mining (ARM)), then Pattern Evaluation, then Knowledge.)
6
Association Rule Mining: The “Walmart” Example
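Rule: {Diaper, Milk} => Beer
Support = count({Diaper, Milk, Beer}) / |D| = 2/5 = 0.4
Confidence = count({Diaper, Milk, Beer}) / count({Diaper, Milk}) = 2/3 = 0.67
Here count(X) is the number of transactions that contain every item in X, and |D| is the total number of transactions.
TID  Items
1    Bread, Milk
2    Beer, Diaper, Bread, Eggs
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Bread, Diaper, Milk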
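The support and confidence of this rule can be checked with a few lines of Python. This is a minimal sketch over the five transactions on the slide; the helper names count, support, and confidence are chosen here for illustration and are not from the presentation.

```python
# Transactions from the slide, keyed by TID.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Beer", "Diaper", "Bread", "Eggs"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Bread", "Diaper", "Milk"},
}


def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)


def support(antecedent, consequent):
    """Fraction of all transactions containing antecedent and consequent."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


rule_support = support({"Diaper", "Milk"}, {"Beer"})
rule_confidence = confidence({"Diaper", "Milk"}, {"Beer"})
print(rule_support, round(rule_confidence, 2))  # prints 0.4 0.67
```

Transactions 3 and 4 contain all three items, and transactions 3, 4, and 5 contain {Diaper, Milk}, which is where 2/5 and 2/3 come from.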
7
Precision Ag example: Find image antecedents that imply high yield
Inputs: TIFF image and yield map. High green reflectance => high yield (obvious). High (NearInfraRed - Red) => high yield (higher confidence).
8
Grasshopper Infestation Prediction
Grasshoppers caused significant economic loss last year. These insects are likely to return this year. Early prediction of the infestation is a key step in reducing damage. Association rule mining on remotely sensed imagery holds significant potential for early detection. How do we identify the signature of an initial infestation from the RGB bands?
9
Gene Regulation Pathway Discovery example
Results of clustering may indicate that nine genes are involved in a pathway. High-confidence rule mining on that cluster will discover the relationships among the genes in which the expression of one gene (e.g., Gene2) is regulated by others. Other genes (e.g., Gene4 and Gene7) may not be directly involved in regulating Gene2 and can therefore be excluded. (Diagram: Gene1 through Gene9 pass through Clustering and then ARM; Gene4 and Gene7 are shown apart from Gene1, Gene3, Gene5, Gene6, Gene8, Gene9, and Gene2.)
10
Data Mining: Confluence of Multiple Disciplines
(Diagram: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines.)
11
Spatial Data Formats (Cont.)
(Figure: pixel values of BAND-1 and BAND-2.) BSQ (band sequential) format (2 files): Band 1 and Band 2, one file per band.
12
Spatial Data Formats (Cont.)
(Figure: pixel values of BAND-1 and BAND-2.) BSQ format (2 files): Band 1 and Band 2. BIL (band interleaved by line) format (1 file).
13
Spatial Data Formats (Cont.)
(Figure: pixel values of BAND-1 and BAND-2.) BSQ format (2 files): Band 1 and Band 2. BIL format (1 file). BIP (band interleaved by pixel) format (1 file).
14
Spatial Data Formats (Cont.)
(Figure: pixel values of BAND-1 and BAND-2.) BSQ format (2 files): Band 1 and Band 2. BIL format (1 file). BIP format (1 file). bSQ (bit sequential) format (16 files, one per band per bit position): B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B28.
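The four layouts differ only in the order in which the same values are written out. The sketch below illustrates this on a hypothetical 2-band, 2 x 2 raster of 8-bit values (the pixel values are made up, since the slide's values are not available), and it assumes that Bbj denotes bit j of band b with j = 1 as the most significant bit.

```python
# Hypothetical raster: 2 bands, 2 rows x 2 columns, 8-bit pixel values.
band1 = [[254, 127],
         [14, 193]]
band2 = [[37, 240],
         [200, 19]]
bands = [band1, band2]
rows, cols = 2, 2

# BSQ (band sequential): one file per band, pixels in raster order.
band_seq = [[v for row in band for v in row] for band in bands]

# BIL (band interleaved by line): row 0 of every band, then row 1, and so on.
bil = [v for r in range(rows) for band in bands for v in band[r]]

# BIP (band interleaved by pixel): every band's value for pixel 0, then pixel 1, ...
bip = [band[r][c] for r in range(rows) for c in range(cols) for band in bands]

# bSQ (bit sequential): one file per band per bit position.
# Key (b, j) plays the role of Bbj; j = 1 is assumed to be the most
# significant bit of an 8-bit value.
bit_seq = {
    (b + 1, j): [(v >> (8 - j)) & 1 for v in band_seq[b]]
    for b in range(len(bands))
    for j in range(1, 9)
}

print(len(band_seq), "BSQ files;", len(bit_seq), "bSQ files")
```

With 2 bands of 8 bits each, the bSQ layout yields 2 x 8 = 16 bit files, matching the count on the slide.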
15
Peano Count Tree (P-tree)
P-trees are a lossless, compressed representation of data in a recursive quadrant orientation. NDSU holds patents on P-tree technology.
16
An example of P-tree
(Figure: example P-tree with root count 55; the quadrant counts below the root include 16, 8, 15, and 16. Figure labels: Peano or Z-ordering, quadrant, root count.)
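The recursive quadrant counting behind a Peano count tree can be sketched as follows. This is an illustrative sketch, not NDSU's implementation: the function build_ptree and the tuple-based node layout are invented here, and the 8 x 8 bit grid is an assumption, chosen so that its root count is 55 and its first-level quadrant counts are 16, 8, 15, and 16.

```python
# Assumed 8 x 8 bit grid (one bit per pixel), not taken from the slide.
GRID = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_ptree(grid, top=0, left=0, size=None):
    """Return (count, children) for the square quadrant at (top, left).

    children is None for a pure quadrant (all 0s or all 1s), which is
    where the compression comes from. Otherwise it lists the four
    sub-quadrants in Z-order: upper-left, upper-right, lower-left,
    lower-right.
    """
    if size is None:
        size = len(grid)
    count = sum(grid[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size))
    if count == 0 or count == size * size:
        return (count, None)  # pure-0 or pure-1: no need to descend
    half = size // 2
    children = [
        build_ptree(grid, top, left, half),
        build_ptree(grid, top, left + half, half),
        build_ptree(grid, top + half, left, half),
        build_ptree(grid, top + half, left + half, half),
    ]
    return (count, children)


root_count, level1 = build_ptree(GRID)
print(root_count, [count for count, _ in level1])  # prints 55 [16, 8, 15, 16]
```

The two pure-1 quadrants (count 16) are stored as single nodes, while the mixed quadrants (counts 8 and 15) are expanded one more level.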
17
An example of P-tree
(Figure: the same P-tree, root count 55, annotated with level, fan-out, QID (quadrant ID), pure (pure-1/pure-0) quadrants, and root count. The pixel at (7, 1) is written in binary as (111, 001).)
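One way to see how a pixel's binary coordinates relate to quadrant IDs under Peano (Z) ordering is to pair up the coordinate bits level by level. The sketch below assumes the first coordinate is the row, the second is the column, and quadrants are numbered 0 (upper-left), 1 (upper-right), 2 (lower-left), 3 (lower-right); the slide does not confirm these conventions, and quadrant_path is a name invented for this example.

```python
def quadrant_path(row, col, levels):
    """Quadrant index (0..3) chosen at each level, from the root downward.

    At each level the next most significant bit of the row and of the
    column together select one of the four sub-quadrants.
    """
    path = []
    for level in range(levels - 1, -1, -1):
        row_bit = (row >> level) & 1
        col_bit = (col >> level) & 1
        path.append(2 * row_bit + col_bit)
    return path


# Pixel (7, 1) in an 8 x 8 image: 7 = 111 and 1 = 001 in binary.
path = quadrant_path(7, 1, levels=3)
print(path)  # prints [2, 2, 3]
```

Under these assumptions the bit pairs (1,0), (1,0), (1,1) select quadrant 2, then 2, then 3.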
18
Tuple Count Cube (T-cube)
The (v1, v2, v3)th cell of the T-cube contains the root count of P_{(v1,v2,v3)} = P_{1,v1} AND P_{2,v2} AND P_{3,v3}.
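The cell definition can be illustrated with plain bit masks standing in for P-trees. In this sketch each P_{i,v} is modeled as an integer whose bit k is set when pixel k has value v in band i, AND is the bitwise &, and the root count is the number of set bits. The pixel data and the names value_mask and root_count are hypothetical.

```python
from itertools import product

# Hypothetical data: one (band1, band2, band3) tuple of 1-bit values per pixel.
pixels = [(0, 0, 1), (0, 0, 1), (1, 0, 1), (1, 1, 0),
          (0, 1, 0), (1, 1, 0), (0, 0, 1), (1, 0, 0)]


def value_mask(band, value):
    """Stand-in for P_{band,value}: bit k is set if pixel k has that value."""
    mask = 0
    for k, pixel in enumerate(pixels):
        if pixel[band] == value:
            mask |= 1 << k
    return mask


def root_count(mask):
    """Number of pixels represented by the mask (its count of 1 bits)."""
    return bin(mask).count("1")


# Each T-cube cell is the root count of the AND of one mask per band.
t_cube = {}
for v1, v2, v3 in product((0, 1), repeat=3):
    combined = value_mask(0, v1) & value_mask(1, v2) & value_mask(2, v3)
    t_cube[(v1, v2, v3)] = root_count(combined)

print(t_cube[(0, 0, 1)])  # prints 3
```

Because every pixel falls in exactly one cell, the cell counts always sum to the number of pixels.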
19
High confidence Association Rules
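Assume a minimum confidence threshold of 80% and a minimum support threshold of 10%. Start with 1-bit values and 2 bands, B1 and B2. (Figure: 2 x 2 T-cube over the values 1,0 / 1,1 and 2,0 / 2,1, with cell counts 5, 19, 25, and 15; sums 30 and 34 with thresholds 24 and 27.2, which are 80% of each sum; further labels 32 and 40.) Rule C: B1={0} => B2={0}, c = 25/30 = 83.3%.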
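The threshold test on this slide can be reproduced from the four cell counts. The assignment of counts to cells below is an assumption: it is one layout consistent with the sums 30 and 34 and with c = 83.3% for B1={0} => B2={0}, but the figure itself is not recoverable from the transcript. All names in the code are illustrative.

```python
# Assumed layout: t_cube[(b1, b2)] = number of pixels with B1=b1 and B2=b2.
t_cube = {(0, 0): 25, (0, 1): 5, (1, 0): 15, (1, 1): 19}
MIN_CONFIDENCE = 0.80
MIN_SUPPORT = 0.10

total = sum(t_cube.values())  # 64 pixels


def axis_sum(axis, value):
    """Number of pixels whose band (axis 0 = B1, axis 1 = B2) has this value."""
    return sum(n for cell, n in t_cube.items() if cell[axis] == value)


rules = []
for (b1, b2), n in t_cube.items():
    if n / total < MIN_SUPPORT:
        continue  # the cell is too rare to support any rule
    # Try the rule in both directions: B1 => B2 and B2 => B1.
    for ante_axis, ante_val, cons_axis, cons_val in ((0, b1, 1, b2),
                                                     (1, b2, 0, b1)):
        confidence = n / axis_sum(ante_axis, ante_val)
        if confidence >= MIN_CONFIDENCE:
            rules.append((f"B{ante_axis + 1}={{{ante_val}}}",
                          f"B{cons_axis + 1}={{{cons_val}}}",
                          round(confidence, 3)))

print(rules)  # prints [('B1={0}', 'B2={0}', 0.833)]
```

Under this layout the nearest miss is B2={1} => B1={1} at 19/24 = 79.2%, just below the 80% threshold.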
20
The End Thank you |:~)