
1 Berendt: Advanced databases, first semester 2011, http://people.cs.kuleuven.be/~bettina.berendt/teaching
Advanced databases – Inferring new knowledge from data(bases): Knowledge Discovery in Databases
Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science
Last update: 15 November 2011

2 Agenda
Motivation I: Application examples
Motivation II: Types of reasoning
The process of knowledge discovery (KDD)
A short overview of key KDD techniques: clustering (k-means), classification / classifier learning (ID3), association-rule learning (apriori)

3 Which cells are cancerous? Proof positive: the difference between a normal and a cancerous liver cell is shown clearly by the location of mitochondria [...]. The healthy cell shows very few mitochondria near the outer cell wall; they cluster densely (red coloration) as they approach the cell's nucleus (depicted here as the black central hole). In the cancerous cell, the mitochondria are spread throughout the cell, do not cluster, and under the same lighting produce a more subdued effect.

4 What is the impact of genetically modified organisms?

5 What's spam and what isn't?

6 What makes people happy?

7 What "circles" of friends do you have?

8 What should we recommend to a customer/user?

9 What topics exist in a collection of texts, and how do they evolve? News texts, scientific publications, …


11 Type of inference used before in this course - example: foaf:mbox
Domain: Agent; range: Thing (well, in fact a Mailbox); an inverse functional property.
=> If mbox(MaryPoppins, weasel@somedomain.com) and mbox(PeterParker, weasel@somedomain.com), then MaryPoppins = PeterParker.

12 Styles of reasoning: "All swans are white"
Deductive: towards the consequences. All swans are white. Tessa is a swan. => Tessa is white.
Inductive: towards a generalisation of observations. Joe and Lisa and Tex and Wili and ... (all observed swans) are swans. Joe and Lisa and Tex and Wili and ... (all observed swans) are white. => All swans are white.
Abductive: towards the (most likely) explanation of an observation. Tessa is white. All swans are white. => Tessa is a swan.

13 What about truth?
Deductive: given the truth of the assumptions, a valid deduction guarantees the truth of the conclusion.
Inductive: the premises of an argument (are believed to) support the conclusion but do not ensure it; induction has been attacked several times by logicians and philosophers.
Abductive: formally equivalent to the logical fallacy of affirming the consequent.

14 What about new knowledge? C.S. Peirce introduced "abduction" to modern logic; after 1900 he used "abduction" to mean creating new rules to explain new observations (this meaning is actually closest to induction). => Essential for scientific discovery.


16 "Data mining" and "knowledge discovery"
(Informal definition): data mining is about discovering knowledge in (huge amounts of) data. Therefore, it is clearer to speak about "knowledge discovery in data(bases)" (KDD).
Second reason for preferring the term "KDD": "data mining" is not uniquely defined; some people use it to denote certain types of knowledge discovery (e.g., finding association rules, but not classifier learning).

17 "Data mining" is generally inductive. (Informal definition): data mining is about discovering knowledge in (huge amounts of) data: looking at all the empirically observed swans ... finding they are white ... concluding that all swans are white.

18 The KDD process: "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" - Fayyad, Piatetsky-Shapiro, Smyth (1996).
non-trivial process: a multi-step process
valid: justified patterns/models
novel: previously unknown
useful: can be used
understandable: by human and machine

19 The process part of knowledge discovery: CRISP-DM (CRoss Industry Standard Process for Data Mining), a data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

20 Knowledge discovery, machine learning, data mining
Knowledge discovery = the whole process.
Machine learning = the application of induction algorithms and other algorithms that can be said to "learn" = the "modeling" phase.
Data mining: sometimes = KD, sometimes = ML.

21 Data Mining: Confluence of Multiple Disciplines. [Diagram: data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.]


23 [Diagram: Data + mining algorithm => general patterns. Example: cancerous-cell data + a classification algorithm answers "What factors determine cancerous cells?". Classification algorithms include rule induction, decision trees, and neural networks.]

24 Classification: Rule Induction. "What factors determine whether a cell is cancerous?"
If Color = light and Tails = 1 and Nuclei = 2 then Healthy Cell (certainty = 92%)
If Color = dark and Tails = 2 and Nuclei = 2 then Cancerous Cell (certainty = 87%)

25 Classification: Decision Trees. [Tree diagram: the root splits on Color (light/dark), inner nodes split on #nuclei (1/2) and #tails (1/2), and the leaves are labelled healthy or cancerous.]

26 Classification: Neural Networks. "What factors determine whether a cell is cancerous?" [Network diagram: inputs such as Color = dark, # nuclei = 1, ..., # tails = 2; outputs Healthy and Cancerous.]

27 Clustering. "Are there clusters of similar cells?" [Diagram: example clusters such as light color with 1 nucleus; dark color with 2 tails; 2 nuclei; 1 nucleus and 1 tail; dark color with 1 tail and 2 nuclei.]

28 Association Rule Discovery. Task: discovering association rules among items in a transaction database. An association between two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B. In general: A1, A2, … => B.

29 Association Rule Discovery. "Are there any associations between the characteristics of the cells?"
If color = light and # nuclei = 1 then # tails = 1 (support = 12.5%; confidence = 50%)
If # nuclei = 2 and Cell = Cancerous then # tails = 2 (support = 25%; confidence = 100%)
If # tails = 1 then Color = light (support = 37.5%; confidence = 75%)

30 Many Other Data Mining Techniques: genetic algorithms, statistics, Bayesian networks, rough sets, time series, text mining, ...


32 The basic idea of clustering: group similar things. [Scatter plot over Attribute 1 and Attribute 2 showing two groups of points, Group 1 and Group 2.]

33 Concepts in Clustering
Defining distance between points: Euclidean distance, or any other distance (city-block metric, Levenshtein, Jaccard similarity, ...).
A good clustering is one where (intra-cluster distance) the sum of distances between objects in the same cluster is minimized, while (inter-cluster distance) the distances between different clusters are maximized. Objective to minimize: F(Intra, Inter).
Clusters can be evaluated with "internal" as well as "external" measures: internal measures are related to the inter/intra-cluster distance; external measures are related to how representative the current clusters are of the "true" classes.

34 K-Means Example (K = 2). [Figure: pick seeds; reassign clusters; compute centroids; reassign clusters; compute centroids; reassign clusters; converged!] Based on http://rakaposhi.eas.asu.edu/cse494/notes/f02-clustering.ppt

35 K-means algorithm. [Algorithm figure.]
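The alternating reassign/recompute loop sketched on the last two slides can be written in a few lines of Python. This is a minimal illustration, not the version from the course; the example points and the seeding via `random.sample` are made up for demonstration:

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means: pick K seeds, then alternate reassignment and centroid update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick K seeds
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Reassignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:         # converged!
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups of 2-D points (K = 2):
pts = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (8.5, 9.0), (7.8, 8.2)]
centroids, clusters = kmeans(pts, 2)
```

Note that the result depends on the seeds; for well-separated groups like these, any choice of two distinct seeds converges to the same two clusters.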


37 Input data ... Q: when does this person play tennis?
Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

38 Terminology (using a popular data example; see the table on the previous slide). Rows: instances (think of them as objects), here days, described by columns: features (Outlook, Temp, ...). In this case, there is a feature with a special role: the class, Play (does X play tennis on this day?). This is "relational DB mining"; we will later see other types of data and the mining applied to them.

39 The goal: a decision tree for classification / prediction. In which weather will someone play (tennis etc.)?

40 Constructing decision trees. Strategy: top down, in recursive divide-and-conquer fashion.
First: select an attribute for the root node and create a branch for each possible attribute value.
Then: split the instances into subsets, one for each branch extending from the node.
Finally: repeat recursively for each branch, using only the instances that reach the branch. Stop if all instances have the same class.
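The top-down strategy above can be sketched as a short recursive function. This is an illustrative Python sketch, not the course's implementation; the row encoding and the pluggable `choose_attribute` heuristic are assumptions made for the example (ID3 would plug in information gain here):

```python
def build_tree(rows, attributes, choose_attribute):
    """Top-down, divide-and-conquer decision-tree construction.

    rows: list of (features_dict, class_label) pairs.
    choose_attribute: heuristic that picks the best attribute to split on.
    """
    classes = {label for _, label in rows}
    # Stop if all instances have the same class (pure node),
    # or no attributes are left to split on: return the majority class.
    if len(classes) == 1 or not attributes:
        labels = [label for _, label in rows]
        return max(classes, key=labels.count)          # leaf
    # First: select the attribute for this node.
    best = choose_attribute(rows, attributes)
    branches = {}
    # Then: split the instances into subsets, one per attribute value,
    # and recurse on each branch with the remaining attributes.
    for value in {feats[best] for feats, _ in rows}:
        subset = [(f, c) for f, c in rows if f[best] == value]
        branches[value] = build_tree(
            subset, [a for a in attributes if a != best], choose_attribute)
    return (best, branches)

# Toy run with a trivial heuristic (take the first attribute):
toy = [({"Outlook": "Sunny"}, "No"), ({"Outlook": "Overcast"}, "Yes")]
tree = build_tree(toy, ["Outlook"], lambda rows, attrs: attrs[0])
```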

41 Which attribute to select? [Figure: the weather data split on each of the four attributes.]


43 Criterion for attribute selection. Which is the best attribute? We want to get the smallest tree. Heuristic: choose the attribute that produces the "purest" nodes. A popular impurity criterion is information gain: information gain increases with the average purity of the subsets. Strategy: choose the attribute that gives the greatest information gain.

44 Computing information. Measure information in bits. Given a probability distribution, the info required to predict an event is the distribution's entropy; entropy gives the required information in bits (this can involve fractions of bits!). Formula for computing the entropy:
entropy(p1, p2, ..., pn) = -p1 log2 p1 - p2 log2 p2 - ... - pn log2 pn
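The formula translates directly into a small Python helper. As a sketch, it takes class counts (e.g. the 9 "yes" and 5 "no" days of the weather data) rather than probabilities; the value of info([9,5]) matches the number used on the following slides:

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

info = entropy([9, 5])   # 9 "yes" and 5 "no" days in the weather data
# info is about 0.940 bits
```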

45 Example: attribute Outlook. [Worked computation of the information for each of the Outlook branches.]

46 Computing information gain. Information gain = information before splitting - information after splitting.
gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2]) = 0.940 - 0.693 = 0.247 bits
Information gain for the attributes from the weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
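The gain(Outlook) computation above can be checked numerically with a short sketch (entropy is redefined here so the snippet stands alone; info takes the class counts of each subset produced by the split):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info(partitions):
    """Weighted average entropy of a split, e.g. [[2, 3], [4, 0], [3, 2]]."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

# gain(Outlook) = information before splitting - information after splitting
gain_outlook = entropy([9, 5]) - info([[2, 3], [4, 0], [3, 2]])
# gain_outlook is about 0.247 bits
```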

47 Continuing to split (within the Outlook = Sunny branch):
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits

48 Final decision tree. Note: not all leaves need to be pure; sometimes identical instances have different classes. Splitting stops when data can't be split any further.

49 Wishlist for a purity measure. Properties we require from a purity measure:
When a node is pure, the measure should be zero.
When impurity is maximal (i.e. all classes equally likely), the measure should be maximal.
The measure should obey the multistage property (i.e. decisions can be made in several stages), e.g. measure([2,3,4]) = measure([2,7]) + (7/9) x measure([3,4]).
Entropy is the only function that satisfies all three properties!

50 Properties of the entropy.
The multistage property: entropy(p, q, r) = entropy(p, q + r) + (q + r) x entropy(q / (q + r), r / (q + r)).
Simplification of computation, e.g.: info([2,3,4]) = -(2/9) log2(2/9) - (3/9) log2(3/9) - (4/9) log2(4/9) = [-2 log2 2 - 3 log2 3 - 4 log2 4 + 9 log2 9] / 9.
Note: instead of maximizing info gain we could just minimize information.

51 Discussion / outlook: decision trees. Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan. Various improvements, e.g. C4.5: deals with numeric attributes, missing values, noisy data; gain ratio instead of information gain [see Witten & Frank slides, ch. 4, pp. 40-45]. A similar approach: CART. ...


53 Motivation for association-rule learning/mining: store layout (Amazon, earlier: Wal-Mart, ...). Where to put: spaghetti, butter?

54 Data. "Market basket data": attributes with boolean domains; in a table, each row is a basket (aka transaction).
Transaction ID  Attributes (basket items)
1               spaghetti, tomato sauce
2               spaghetti, bread
3               spaghetti, tomato sauce, bread
4               bread, butter
5               bread, tomato sauce

55-58 Solution approach: the apriori principle and the pruning of the search tree. [Lattice diagram over the itemsets: the single items spaghetti, tomato sauce, bread, butter; the pairs {spaghetti, tomato sauce}, {spaghetti, bread}, {spaghetti, butter}, {tomato sauce, bread}, {tomato sauce, butter}, {bread, butter}; the triples; and the full set {spaghetti, tomato sauce, bread, butter}. Over four slides, the branches below infrequent itemsets are pruned away.]

59 More formally: generating large k-itemsets with Apriori (min. support = 40%).
Step 1: candidate 1-itemsets:
spaghetti: support = 3 (60%)
tomato sauce: support = 3 (60%)
bread: support = 4 (80%)
butter: support = 1 (20%)

60 Contd. Step 2: large 1-itemsets: spaghetti, tomato sauce, bread. Candidate 2-itemsets:
{spaghetti, tomato sauce}: support = 2 (40%)
{spaghetti, bread}: support = 2 (40%)
{tomato sauce, bread}: support = 2 (40%)

61 Contd. Step 3: large 2-itemsets: {spaghetti, tomato sauce}, {spaghetti, bread}, {tomato sauce, bread}. Candidate 3-itemsets: {spaghetti, tomato sauce, bread}: support = 1 (20%). Step 4: large 3-itemsets: { }.
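The level-wise search of steps 1-4 can be reproduced with a short sketch. This is illustrative Python, not the original Apriori pseudocode; candidate generation here is a simple "join large (k-1)-itemsets and count" version, which suffices for this toy basket table:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent-itemset mining over a list of item sets."""
    n = len(transactions)
    support = lambda items: sum(items <= t for t in transactions) / n
    # Step 1: candidate 1-itemsets are the individual items.
    items = sorted({i for t in transactions for i in t})
    large = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]
    result = {s: support(s) for s in large}
    k = 2
    while large:
        # Candidates of size k: unions of large (k-1)-itemsets. By the
        # apriori principle, no superset of an infrequent set can be frequent.
        candidates = {a | b for a, b in combinations(large, 2) if len(a | b) == k}
        large = [c for c in candidates if support(c) >= min_support]
        result.update({s: support(s) for s in large})
        k += 1
    return result

baskets = [{"spaghetti", "tomato sauce"}, {"spaghetti", "bread"},
           {"spaghetti", "tomato sauce", "bread"}, {"bread", "butter"},
           {"bread", "tomato sauce"}]
frequent = apriori(baskets, 0.4)   # min. support = 40%
```

On these five baskets it finds exactly the large itemsets of slides 59-61: three singletons and three pairs, with butter and the triple pruned.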

62 From itemsets to association rules. Schema: if subset then large k-itemset, with support s and confidence c:
s = (support of large k-itemset) / # tuples
c = (support of large k-itemset) / (support of subset)
Example: if {spaghetti} then {spaghetti, tomato sauce}. Support: s = 2 / 5 (40%). Confidence: c = 2 / 3 (66%).
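The support and confidence formulas translate directly into code. A sketch over the same five baskets; the numbers match the example rule above:

```python
def rule_stats(transactions, subset, itemset):
    """Support and confidence of the rule: if `subset` then `itemset`."""
    count = lambda items: sum(items <= t for t in transactions)
    s = count(itemset) / len(transactions)   # support of the large itemset
    c = count(itemset) / count(subset)       # ... relative to the subset
    return s, c

baskets = [{"spaghetti", "tomato sauce"}, {"spaghetti", "bread"},
           {"spaghetti", "tomato sauce", "bread"}, {"bread", "butter"},
           {"bread", "tomato sauce"}]
s, c = rule_stats(baskets, {"spaghetti"}, {"spaghetti", "tomato sauce"})
# s = 0.4 (40%), c = 2/3 (about 66%)
```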

63 Outlook: text mining.

64 References / background reading. Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:
A databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook
A machine learning perspective: Witten, I.H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
A statistics perspective: Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2
The CRISP-DM manual can be found at http://www.spss.ch/upload/1107356429_CrispDM1.0.pdf

65 Acknowledgements. The overview of data mining was taken from (with minor modifications):
Tzacheva, A.A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewI.ppt
Tzacheva, A.A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewII.ppt
P. 21 was taken from: Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques, Chapter 1: Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt
The ID3 part is based on: Witten, I.H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
In particular, the instructor slides for that book, available at http://books.elsevier.com/companions/9780120884070/ (chapters 1-4): http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter1.pdf (and ...chapter2.pdf, chapter3.pdf, chapter4.pdf) or http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter1.odp (and ...chapter2.odp, chapter3.odp, chapter4.odp)

