1 Data Mining

2 Data Mining (DM) / Knowledge Discovery in Databases (KDD)
“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data” [Frawley et al., 1992]

3 Need for Data Mining
• Increased ability to generate data
  • Remote sensors and satellites
  • Bar codes for commercial products
  • Computerization of businesses

4 Need for Data Mining
• Increased ability to store data
  • Media: bigger magnetic disks, CD-ROMs
  • Better database management systems
  • Data warehousing technology

5 Need for Data Mining
• Examples
  • Wal-Mart records 20,000,000 transactions/day
  • Healthcare transactions yield multi-GB databases
  • Mobil Oil exploration storing 100 terabytes
  • Human Genome Project, multi-GBs and increasing
  • Astronomical object catalogs, terabytes of images
  • NASA EOS, 1 terabyte/day

6 Something for Everyone
• Bell Atlantic
• MCI
• Land’s End
• Visa
• Bank of New York
• FedEx

7 Market Analysis and Management
• Customer profiling
  • Data mining can tell you what types of customers buy what products (clustering or classification) or what products are often bought together (association rules).
• Identifying customer requirements
  • Discover relationships between personal characteristics and probability of purchase
  • Discover correlations between purchases

8 Fraud Detection and Management
• Applications:
  • Widely used in health care, retail, credit card services, telecommunications, etc.
• Approach:
  • Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
• Examples:
  • Auto Insurance
  • Money Laundering
  • Medical Insurance

9 IBM Advertisement

10 DM step in KDD Process

11 [Diagram: Data Mining at the intersection of Database, AI, Statistics, and Hardware]

12 Mining Association Rules

13
• Association rule mining:
  • Finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses.
• Applications:
  • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.
• Examples:
  • Rule form: “Body ⇒ Head [support, confidence]”
  • Buys=Diapers ⇒ Buys=Beer [0.5%, 60%]
  • Major=CS ∧ Class=DataMining ⇒ Grade=A [1%, 75%]

14 Rule Measures: Support and Confidence
• Find all the rules X ∧ Y ⇒ Z with minimum confidence and support
  • support, s: probability that a transaction contains {X, Y, Z}
  • confidence, c: conditional probability that a transaction having {X, Y} also contains Z
• For minimum support 50%, minimum confidence 50%:
  • A ⇒ C (50%, 66.6%)
  • C ⇒ A (50%, 100%)
• [Venn diagram: customer buys beer, customer buys diaper, customer buys both]

15 Association Rule
• Given
  • Set of items I = {i_1, i_2, ..., i_m}
  • Set of transactions D
  • Each transaction T in D is a set of items
• An association rule is an implication X ⇒ Y, where X and Y are itemsets
• The rule meets a minimum confidence c (c% of the transactions in D which contain X also contain Y): c = (# of transactions containing X ∪ Y) / (# of transactions containing X)
• A minimum support s is also met: s = (# of transactions containing X ∪ Y) / |D|

16 Mining Strong Association Rules in Transaction DBs
• Measurement of rule strength in a transaction DB: A ⇒ B [support, confidence]
  • support = Prob(A ∪ B) = (# of transactions containing all the items in A ∪ B) / (total # of transactions)
  • confidence = Prob(B | A) = (# of transactions that contain both A and B) / (# of transactions containing A)
• We are often interested in only strong associations, i.e. support ≥ min_sup and confidence ≥ min_conf.
• Examples:
  • milk ⇒ bread [5%, 60%]
  • tire ∧ auto_accessories ⇒ auto_services [2%, 80%]
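To make the two ratios concrete, here is a minimal Python sketch; the transaction list is made up for illustration, and the helper names `support` and `confidence` are my own, not from the slides:

```python
# Made-up transactions; each transaction is a set of items.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "diapers", "beer"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    """Of the transactions containing A, the fraction that also contain B."""
    return support(A | B, transactions) / support(A, transactions)

A, B = {"milk"}, {"bread"}
print(support(A | B, transactions))    # support of milk => bread: 0.5
print(confidence(A, B, transactions))  # confidence of milk => bread: 0.666...
```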

17 Methods for Mining Associations
• Apriori
• Partition technique
• Sampling technique
• Anti-Skew
• Multi-level or generalized association
• Constraint-based or query-based association

18 Apriori (Levelwise)
• Scan the database multiple times
• For the ith scan, find all large itemsets of size i with minimum support
• Use the large itemsets from scan i as input to scan i+1
  • Create candidates: subsets of size i+1 which contain only large itemsets as subsets
• Notation: large k-itemset, L_k; set of candidate large itemsets of size k, C_k
• Note: if {A, B} is not a large itemset, then no superset of it can be either.
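The level-wise procedure can be summarized in a short, unoptimized Python sketch; function and variable names, and the toy transactions at the end, are my own rather than the slides':

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining (a minimal sketch, not optimized).

    transactions: list of sets of items
    min_support:  minimum support as a fraction of all transactions
    Returns a dict mapping each large itemset (frozenset) to its support count.
    """
    n = len(transactions)
    min_count = min_support * n

    # L_1: large 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c >= min_count}
    all_large = dict(large)

    k = 2
    while large:
        # Candidate generation: join L_{k-1} with itself, keep size-k unions,
        # and prune any candidate with an infrequent (k-1)-subset.
        prev = list(large)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                    frozenset(sub) in large for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Count the candidates in one scan of the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {s: c for s, c in counts.items() if c >= min_count}
        all_large.update(large)
        k += 1
    return all_large

# Made-up transactions (the slides' own database is not reproduced in the transcript).
transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(transactions, min_support=0.5))
```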

19 Mining Association Rules -- Example
• Min. support 50%, min. confidence 50%
• For rule A ⇒ C:
  • support = support({A, C}) = 50%
  • confidence = support({A, C}) / support({A}) = 66.6%
• Apriori principle: any subset of a frequent itemset must be frequent.

20
Minsup = 0.25, Minconf = 0.5
L_1 = {(A, 3), (B, 2), (C, 2), (D, 1), (E, 1), (F, 1)}
C_2 = {(A,B), (A,C), (A,D), (A,E), (A,F), (B,C), ..., (E,F)}
L_2 = {(A,B, 1), (A,C, 2), (A,D, 1), (B,C, 1), (B,E, 1), (B,F, 1), (E,F, 1)}
C_3 = {(A,B,C), (A,B,D), (A,C,D), (B,C,E), (B,C,F), (B,E,F)}
L_3 = {(A,B,C, 1), (B,E,F, 1)}
C_4 = {}, L_4 = {}, end of program
Possible rules:
A=>B (c=.33, s=1), B=>A (c=.5, s=1), A=>C (c=.67, s=2), C=>A (c=1.0, s=2),
A=>D (c=.33, s=1), D=>A (c=1.0, s=1), B=>C (c=.5, s=1), C=>B (c=.5, s=1),
B=>E (c=.5, s=1), E=>B (c=1, s=1), B=>F (c=.5, s=1), F=>B (c=1, s=1),
A=>B&C (c=.33, s=1), B=>A&C (c=.5, s=1), C=>A&B (c=.5, s=1), A&B=>C (c=1, s=1),
A&C=>B (c=.5, s=1), B&C=>A (c=1, s=1), B=>E&F (c=.5, s=1), E=>B&F (c=1, s=1),
F=>B&E (c=1, s=1), B&E=>F (c=1, s=1), B&F=>E (c=1, s=1), E&F=>B (c=1, s=1)

21 Example

22 Partitioning
• Requires only two passes through the external database
• Divide the database into n partitions, each of which fits in main memory
• Scan 1: process one partition in memory at a time, finding local large itemsets
• Candidate large itemsets are the union of all local large itemsets (a superset of the actual large itemsets; it contains false positives)
• Scan 2: calculate support and determine the actual large itemsets
• If the data is skewed, partitioning may not work well: the chance that a local large itemset is a global large itemset may be small.
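A rough sketch of the two-pass idea, reusing the `apriori` sketch above for the in-memory step; partition boundaries and names are illustrative assumptions, not the original algorithm's exact details:

```python
def partitioned_frequent_itemsets(transactions, min_support, n_partitions):
    """Two-pass partition approach (a rough sketch).

    Pass 1: mine each partition in memory for locally large itemsets.
    Pass 2: count the union of the local results over the whole database
            and keep only the globally large ones.
    """
    n = len(transactions)
    size = (n + n_partitions - 1) // n_partitions

    # Pass 1: local large itemsets per partition (same support fraction locally).
    candidates = set()
    for start in range(0, n, size):
        part = transactions[start:start + size]
        candidates.update(apriori(part, min_support).keys())

    # Pass 2: global counts for the candidate union; drop the false positives.
    counts = {c: 0 for c in candidates}
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return {s: c for s, c in counts.items() if c >= min_support * n}
```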

23 Partitioning
• Will any large itemsets be missed?
• No. If l is not in any local L_i, then t_1(l)/t_1 < MS and t_2(l)/t_2 < MS and ... and t_n(l)/t_n < MS,
  thus t_1(l) + t_2(l) + ... + t_n(l) < MS × (t_1 + t_2 + ... + t_n),
  so l cannot meet the minimum support globally either.
  (Here t_i(l) is l's count in partition i, t_i is the size of partition i, and MS is the minimum support.)

24 How do run times compare?

25 PlayTennis Training Examples
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

26 Association Rule Visualization: DBMiner

27 DBMiner

28 Association Rule Graph

29 Clementine (UK, bought by SPSS)
• The Web Node shows the strength of associations in the data, i.e. how often field values coincide

30 Multi-Level Association
• A descendant of an infrequent itemset cannot be frequent
• A transaction database can be encoded by dimensions and levels
• [Concept hierarchy: food → bread (white, wheat) and milk (skim, 2%); brands such as Sunset and Fraser]

31 Encoding Hierarchical Information in Transaction Database
• A taxonomy for the relevant data items
• Conversion of bar_code into generalized_item_id
• [Taxonomy: food → milk (2%, chocolate; Dairyland, Foremost) and bread (Old Mills, Wonder)]
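One simple way to picture the bar_code to generalized_item_id conversion is to walk each item up a taxonomy table; the hierarchy and product names below are made up for illustration:

```python
# Toy taxonomy: child -> parent (made-up product codes).
taxonomy = {
    "dairyland_2pct_milk": "2pct_milk",
    "foremost_2pct_milk": "2pct_milk",
    "2pct_milk": "milk",
    "chocolate_milk": "milk",
    "milk": "food",
    "wonder_bread": "bread",
    "old_mills_bread": "bread",
    "bread": "food",
}

def generalize(bar_code):
    """Return the item plus all of its ancestors in the taxonomy."""
    items = [bar_code]
    while items[-1] in taxonomy:
        items.append(taxonomy[items[-1]])
    return items

transaction = ["dairyland_2pct_milk", "wonder_bread"]
encoded = {ancestor for item in transaction for ancestor in generalize(item)}
print(encoded)  # the encoded transaction supports mining rules at any level of the hierarchy
```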

32 Mining Surprising Temporal Patterns (Chakrabarti et al.)
• Find prevalent rules that hold over large fractions of data
• Useful for promotions and store arrangement
• Intensively researched
• [Cartoon: 1990 -- "Milk and cereal sell together!"]

33 Prevalent != Interesting
• Analysts already know about prevalent rules
• Interesting rules are those that deviate from prior expectation
• Mining’s payoff is in finding surprising phenomena
• [Cartoon: 1995 -- "Milk and cereal sell together!"; 1998 -- "Milk and cereal sell together!" "Zzzz..."]

34 Association Rules - Strengths & Weaknesses
• Strengths
  • Understandable and easy to use
  • Useful
• Weaknesses
  • Brute force methods can be expensive (memory and time)
  • Apriori is O(C·D), where C = sum of sizes of candidates (2^n possible, n = # of items) and D = size of the database
  • Association does not necessarily imply correlation
  • Validation?
  • Maintenance?

35 Clustering

36 Clustering
• Group similar items together
  • Example: sorting laundry
• Similar items may have important attributes / functionality in common
  • Group customers together with similar interests and spending patterns
• Form of unsupervised learning
• Cluster objects into classes using the rule:
  • Maximize intraclass similarity, minimize interclass similarity

37 Clustering Techniques
• Partition
  • Enumerate all partitions
  • Score by some criteria
  • K-means
• Hierarchical
• Model based
  • Hypothesize a model for each cluster
  • Find the model that best fits the data
  • AutoClass, Cobweb

38 Clustering Goal
• Suppose you transmit coordinates of points drawn randomly from this dataset
• Only allowed 2 bits/point
• What encoder/decoder will lose least information?

39 Idea One
• Break into a grid
• Decode each bit-pair as the middle of its grid cell
• [Figure: 2x2 grid with cells labeled 00, 01, 11, 10]

40 Idea Two
• Break into a grid
• Decode each bit-pair as the centroid of all data in the grid cell
• [Figure: 2x2 grid with cells labeled 00, 01, 11, 10]
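As a rough illustration of why the centroid decoder loses less information, here is a small numpy sketch with made-up data; the median grid split and the squared-error measure are my own assumptions, not details from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [0.0, 5.0], [5.0, 0.0], [5.0, 5.0]])
points = (centers[:, None, :] + rng.normal(size=(4, 200, 2))).reshape(-1, 2)  # made-up data

# Encoder: split the plane into a 2x2 grid at the medians; each point gets a 2-bit code.
split = np.median(points, axis=0)
codes = (points[:, 0] > split[0]).astype(int) * 2 + (points[:, 1] > split[1]).astype(int)

lo, hi = points.min(axis=0), points.max(axis=0)

# Idea One: decode each code as the middle of its grid cell.
mid_x = np.where(np.arange(4) // 2 == 0, (lo[0] + split[0]) / 2, (split[0] + hi[0]) / 2)
mid_y = np.where(np.arange(4) % 2 == 0, (lo[1] + split[1]) / 2, (split[1] + hi[1]) / 2)
grid_decode = np.column_stack([mid_x, mid_y])

# Idea Two: decode each code as the centroid of the points assigned to it.
centroid_decode = np.array([points[codes == c].mean(axis=0) for c in range(4)])

sse = lambda decode: np.sum((points - decode[codes]) ** 2)
print("grid-middle SSE:", sse(grid_decode), " centroid SSE:", sse(centroid_decode))
```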

41 K Means Clustering 1. Ask user how many clusters (e.g., k=5)

42 K Means Clustering
1. Ask user how many clusters (e.g., k=5)
2. Randomly guess k cluster center locations

43 K Means Clustering
1. Ask user how many clusters (e.g., k=5)
2. Randomly guess k cluster center locations
3. Each data point finds closest center

44 K Means Clustering
1. Ask user how many clusters (e.g., k=5)
2. Randomly guess k cluster center locations
3. Each data point finds closest center
4. Each cluster finds new centroid of its points

45 K Means Clustering
1. Ask user how many clusters (e.g., k=5)
2. Randomly guess k cluster center locations
3. Each data point finds closest center
4. Each cluster finds new centroid of its points
5. Repeat until…
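A minimal numpy sketch of these steps; the initialization strategy and convergence test are my own choices rather than anything prescribed by the slides:

```python
import numpy as np

def k_means(points, k, n_iters=100, seed=0):
    """K-means sketch following the steps above; points is an (n, d) numpy array."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster centers (here: k distinct data points).
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each data point finds its closest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 4: each cluster finds the new centroid of its points.
        new_centers = np.array([
            points[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: repeat until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign
```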

46 K Means Issues
• Computationally efficient
• Initialization
• Termination condition
• Distance measure
• What should k be?

47 Hierarchical Clustering 1. Each point is its own cluster

48 Hierarchical Clustering
1. Each point is its own cluster
2. Find most similar pair of clusters

49 Hierarchical Clustering
1. Each point is its own cluster
2. Find most similar pair of clusters
3. Merge it into a parent cluster

50 Hierarchical Clustering
1. Each point is its own cluster
2. Find most similar pair of clusters
3. Merge it into a parent cluster
4. Repeat
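A small, deliberately inefficient Python sketch of the agglomerative procedure; single-link (closest-pair) distance is an assumed choice of cluster similarity, not one specified in the slides:

```python
import numpy as np

def agglomerative(points, n_clusters=1):
    """Bottom-up clustering sketch; points is an (n, d) numpy array.

    Returns a list of clusters, each a list of point indices.
    """
    # Step 1: each point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Step 2: find the most similar (closest, single-link) pair of clusters.
        best, best_pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best:
                    best, best_pair = d, (a, b)
        # Step 3: merge the pair into a parent cluster.
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
        # Step 4: repeat until the desired number of clusters remains.
    return clusters
```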

51 Hierarchical Clustering Issues
• This was bottom-up clustering (agglomerative clustering)
• Can also perform top-down clustering (divisive clustering)
• How is similarity between clusters defined?
• What is the stopping criterion?

52 Cluster Visualization: IBM
• One row per cluster (# is % size)
• Charts show fields, ordered by influence
• Pie: outer ring shows the distribution for the whole data set, inner ring for the cluster
• Bar: solid bar for the cluster, transparent bar for the whole data set

53 Structure-based Brushing
• User selects region of interest with the magenta triangle
• User selects level of detail with the colored polyline

54 Spatial Visualization: GeoMiner

55 Spatial Associations
FIND SPATIAL ASSOCIATION RULE
DESCRIBING "Golf Course"
FROM Washington_Golf_courses, Washington
WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km")
  AND Washington.CFCC <> "D81"
IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC
SET SUPPORT THRESHOLD 0.5

56 Spatial Clustering
• How can we cluster points?
• What are the distinct features of the clusters?
  • There are more customers with university degrees in clusters located in the West. Thus, we can use different marketing strategies!

57 Partek

58

59

60 Conclusions

