
1 Advanced databases, winter term 2007/08 – Inferring implicit/new knowledge from data(bases): Single-table and multirelational data mining
Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/
Last update: 26 November 2007

2 Agenda
Input: Concepts
Input: Instances
Input: Attributes and levels of measurement
Data preparation for WEKA and beyond
Output: Decision trees (and related patterns)
Algorithm: ID3 (and variants)
Multirelational data mining
Motivation / Focus on decision tree learning

3 Recall: Knowledge discovery and styles of reasoning
1. Business understanding → Evaluation: learn a model from the data (observed instances). This generally involves induction (during Modelling).
2. Deployment: apply the model to new instances. This corresponds to deduction (if one assumes that the model is true).

4 Phases talked about today
1. Business understanding → Evaluation: learn a model from the data (observed instances). This generally involves induction (during Modelling).

5 Decision tree learning (1): Decision rules
What contact lenses to give to a patient? (Could be based on background knowledge, but can also be learned from the WEKA contact lens data)

6 Decision tree learning (2): Classification / prediction
In which weather will someone play (tennis etc.)? (Learned from the WEKA weather data)

7 Learned from data like this (the WEKA weather data):

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

8 Agenda (repeated; same items as slide 2)

9 What's a concept?
Styles of learning:
- Classification learning: predicting a discrete class
- Association learning: detecting associations between features
- Clustering: grouping similar instances into clusters
- Numeric prediction: predicting a numeric quantity
Concept: the thing to be learned
Concept description: the output of the learning scheme

10 Classification learning
Example problems: weather data, contact lenses, irises, labor negotiations
Classification learning is supervised: the scheme is provided with the actual outcome
The outcome is called the class of the example
Measure success on fresh data for which class labels are known (test data)
In practice, success is often measured subjectively

11 Example weather data (the same 14-instance table as on slide 7)

12 Association learning
Can be applied if no class is specified and any kind of structure is considered "interesting"
Differences to classification learning:
- Can predict any attribute's value, not just the class, and more than one attribute's value at a time
- Hence: far more association rules than classification rules
- Thus: constraints are necessary (minimum coverage and minimum accuracy)

13 Clustering
Finding groups of items that are similar
Clustering is unsupervised: the class of an example is not known
Success is often measured subjectively

     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
…

14 Numeric prediction
Variant of classification learning where the "class" is numeric (also called "regression")
Learning is supervised: the scheme is provided with a target value
Measure success on test data

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False  5
Sunny     Hot          High      True   0
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40
…

15 Agenda (repeated; same items as slide 2)

16 What's in an example?
Instance: the specific kind of example used here
- The thing to be classified, associated, or clustered
- An individual, independent example of the target concept
- Characterized by a predetermined set of attributes
Input to the learning scheme: a set of instances (a dataset), represented as a single relation / flat file
This is a rather restricted form of input: no relationships between objects can be expressed
Still, it is the most common form in practical data mining

17 Instances in the weather data (the same 14-instance table as on slide 7)

18 Agenda (repeated; same items as slide 2)

19 What's in an attribute?
Each instance is described by a fixed, predefined set of features, its "attributes"
But: the number of attributes may vary in practice
- Possible solution: an "irrelevant value" flag
Related problem: the existence of an attribute may depend on the value of another one
Possible attribute types ("levels of measurement"): nominal, ordinal, interval, and ratio

20 Nominal quantities
Values are distinct symbols
- The values themselves serve only as labels or names
- "Nominal" comes from the Latin word for name
Example: attribute "outlook" from the weather data
- Values: "sunny", "overcast", and "rainy"
No relation is implied among nominal values (no ordering or distance measure)
Only equality tests can be performed

21 Ordinal quantities
Impose an order on values
But: no distance between values is defined
Example: attribute "temperature" in the weather data
- Values: "hot" > "mild" > "cool"
Note: addition and subtraction don't make sense
Example rule: temperature < hot → play = yes
The distinction between nominal and ordinal is not always clear (e.g. attribute "outlook")

22 Interval quantities
Interval quantities are not only ordered but measured in fixed and equal units
Example 1: attribute "temperature" expressed in degrees Fahrenheit
Example 2: attribute "year"
The difference of two values makes sense
The sum or product doesn't make sense: a zero point is not defined!

23 Ratio quantities
Ratio quantities are ones for which the measurement scheme defines a zero point
Example: attribute "distance"
- The distance between an object and itself is zero
Ratio quantities are treated as real numbers: all mathematical operations are allowed
But: is there an "inherently" defined zero point?
- The answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)

24 Agenda (repeated; same items as slide 2)

25 Notes on preparing the input (CRISP-DM Data preparation stage)
Denormalization is not the only issue
Problem: different data sources (e.g. sales department, customer billing department, …)
- Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
- Data must be assembled, integrated, and cleaned up
- "Data warehouse": a consistent point of access
External data may be required ("overlay data")
Critical: the type and level of data aggregation

26 The ARFF format

  %
  % ARFF file for weather data with some numeric features
  %
  @relation weather

  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {true, false}
  @attribute play? {yes, no}

  @data
  sunny, 85, 85, false, no
  sunny, 80, 90, true, no
  overcast, 83, 86, false, yes
  ...
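To make the file layout concrete, here is a minimal sketch of a reader for the ARFF subset shown above (a hypothetical helper, not WEKA's own parser; it ignores comments, does not handle quoting, sparse rows, or missing values):

```python
def parse_arff(text):
    """Minimal ARFF reader sketch: returns (attribute names, data rows).

    Handles only the subset shown on this slide: '%' comment lines,
    '@attribute name ...' declarations, and comma-separated '@data' rows.
    """
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):
            continue                          # skip blanks and comments
        lower = line.lower()
        if lower.startswith('@attribute'):
            attributes.append(line.split()[1])  # second token is the name
        elif lower.startswith('@data'):
            in_data = True                      # everything below is data
        elif in_data:
            rows.append([v.strip() for v in line.split(',')])
    return attributes, rows
```

Applied to the weather file above, it would yield the five attribute names and one string list per data row.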

27 Additional attribute types
ARFF supports string attributes:
- Similar to nominal attributes, but the list of values is not pre-specified
It also supports date attributes:
- Uses the ISO 8601 combined date and time format, yyyy-MM-dd'T'HH:mm:ss
  @attribute description string
  @attribute today date

28 Sparse data
In some applications most attribute values in a dataset are zero
- E.g.: word counts in a text categorization problem
ARFF supports sparse data: only the nonzero values are stored, each prefixed by its attribute index
  0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
  0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
becomes
  {1 26, 6 63, 10 "class A"}
  {3 42, 10 "class B"}
This also works for nominal attributes (where the first value corresponds to "zero")
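The dense-to-sparse conversion above can be sketched as follows (a hypothetical illustration; real sparse ARFF additionally quotes nominal values, which is omitted here for brevity):

```python
def to_sparse(row):
    """Dense ARFF row -> sparse '{index value, ...}' notation sketch.

    Keeps only the entries that differ from 'zero'; here 'zero' is the
    literal 0 (for nominal attributes, the first declared value plays
    the same role, as the slide notes).
    """
    entries = [f"{i} {v}" for i, v in enumerate(row) if v != 0]
    return "{" + ", ".join(entries) + "}"
```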

29 Attribute types
The interpretation of attribute types in ARFF depends on the learning scheme
- Numeric attributes are interpreted as ordinal scales if less-than and greater-than are used, and as ratio scales if distance calculations are performed (normalization/standardization may be required)
- Instance-based schemes define a distance between nominal values (0 if the values are equal, 1 otherwise)
Integers in some given data file: nominal, ordinal, or ratio scale?

30 Nominal vs. ordinal
Attribute "age" nominal:
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
Attribute "age" ordinal (e.g. "young" < "pre-presbyopic" < "presbyopic"), so both rules collapse into one:
If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

31 Missing values
Frequently indicated by out-of-range entries
- Types: unknown, unrecorded, irrelevant
- Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible
A missing value may have significance in itself (e.g. a missing test in a medical examination)
- Most schemes assume that is not the case: "missing" may need to be coded as an additional value
- "?" in ARFF denotes a missing value

32 Inaccurate values
Reason: the data has typically not been collected for the purpose of mining it
Result: errors and omissions that don't affect the original purpose of the data (e.g. age of customer)
Typographical errors in nominal attributes → values need to be checked for consistency
Typographical and measurement errors in numeric attributes → outliers need to be identified
Errors may be deliberate (e.g. wrong zip codes)
Other problems: duplicates, stale data

33 Getting to know the data
Simple visualization tools are very useful
- Nominal attributes: histograms (is the distribution consistent with background knowledge?)
- Numeric attributes: graphs (any obvious outliers?)
2-D and 3-D plots show dependencies
Need to consult domain experts
Too much data to inspect? Take a sample!

34 Agenda (repeated; same items as slide 2)

35 Decision trees
A "divide-and-conquer" approach produces the tree
Nodes involve testing a particular attribute
Usually, an attribute value is compared to a constant
Other possibilities:
- Comparing the values of two attributes
- Using a function of one or more attributes
Leaves assign a classification, a set of classifications, or a probability distribution to instances
An unknown instance is routed down the tree

36 Nominal and numeric attributes
Nominal: the number of children is usually equal to the number of values → the attribute won't get tested more than once
- Other possibility: division into two subsets
Numeric: test whether the value is greater or less than a constant → the attribute may get tested several times
- Other possibility: three-way split (or multi-way split)
  Integer: less than, equal to, greater than
  Real: below, within, above

37 Missing values
Does the absence of a value have some significance?
Yes → "missing" is a separate value
No → "missing" must be treated in a special way
- Solution A: assign the instance to the most popular branch
- Solution B: split the instance into pieces
  Pieces receive weight according to the fraction of training instances that go down each branch
  Classifications from leaf nodes are combined using the weights that have percolated to them
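Solution B can be sketched as follows (a hypothetical illustration, not WEKA's implementation; the node layout `{'attr': name, 'branches': {value: (fraction, subtree)}}` is an assumption made for this sketch, with leaves as plain class labels):

```python
from collections import Counter

def classify(node, instance):
    """Route an instance down the tree, splitting it across all branches
    when the tested attribute is missing.

    Each piece is weighted by the stored fraction of training instances
    that went down that branch; leaf votes are combined into a Counter
    mapping class -> accumulated weight.
    """
    if isinstance(node, str):                      # leaf: one full vote
        return Counter({node: 1.0})
    value = instance.get(node['attr'])             # None if missing
    votes = Counter()
    for branch_value, (fraction, subtree) in node['branches'].items():
        if value is None:                          # missing: follow every branch
            for cls, w in classify(subtree, instance).items():
                votes[cls] += fraction * w
        elif value == branch_value:                # known: follow one branch
            votes += classify(subtree, instance)
    return votes
```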

38 Classification rules
A popular alternative to decision trees
Antecedent (pre-condition): a series of tests (just like the tests at the nodes of a decision tree)
Tests are usually logically ANDed together (but may also be general logical expressions)
Consequent (conclusion): the class, set of classes, or probability distribution assigned by the rule
Individual rules are often logically ORed together
- Conflicts arise if different conclusions apply

39 An example
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes

40 Trees for numeric prediction
Regression: the process of computing an expression that predicts a numeric quantity
Regression tree: a "decision tree" where each leaf predicts a numeric quantity
- The predicted value is the average value of the training instances that reach the leaf
Model tree: a "regression tree" with linear regression models at the leaf nodes
- Linear patches approximate a continuous function

41 An example (the same play-time table as on slide 14)

42 Agenda (repeated; same items as slide 2)

43 Constructing decision trees
Strategy: top down, in recursive divide-and-conquer fashion
- First: select an attribute for the root node and create a branch for each possible attribute value
- Then: split the instances into subsets, one for each branch extending from the node
- Finally: repeat recursively for each branch, using only the instances that reach the branch
- Stop if all instances have the same class
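The recursive strategy above can be sketched in a few lines of Python (a hypothetical illustration, not WEKA's code). The attribute-selection heuristic is passed in as `choose`, since the deck introduces information gain only on the following slides:

```python
from collections import Counter

def build_tree(instances, attributes, target, choose):
    """Top-down divide-and-conquer tree induction sketch.

    instances: list of dicts mapping attribute name -> value.
    choose(instances, attributes): picks the split attribute
    (e.g. by information gain).  Leaves are class labels; internal
    nodes are {attribute: {value: subtree}}.
    """
    classes = [inst[target] for inst in instances]
    if len(set(classes)) == 1:                 # all same class: stop
        return classes[0]
    if not attributes:                         # nothing left to test: majority
        return Counter(classes).most_common(1)[0][0]
    attr = choose(instances, attributes)       # select attribute for this node
    branches = {}
    for value in set(inst[attr] for inst in instances):
        subset = [inst for inst in instances if inst[attr] == value]
        rest = [a for a in attributes if a != attr]
        branches[value] = build_tree(subset, rest, target, choose)
    return {attr: branches}
```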

44 Which attribute to select?

45 Which attribute to select?

46 Criterion for attribute selection
Which is the best attribute?
- We want to get the smallest tree
- Heuristic: choose the attribute that produces the "purest" nodes
A popular impurity criterion: information gain
- Information gain increases with the average purity of the subsets
Strategy: choose the attribute that gives the greatest information gain

47 Computing information
Measure information in bits
- Given a probability distribution, the information required to predict an event is the distribution's entropy
- Entropy gives the information required in bits (this can involve fractions of bits!)
Formula for computing the entropy:
entropy(p1, p2, …, pn) = −p1 log2 p1 − p2 log2 p2 … − pn log2 pn
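As a sketch, the entropy of a class distribution given as counts can be computed like this (a hypothetical helper for illustration):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as counts,
    e.g. [9, 5] for the weather data's 9-yes / 5-no split.

    Zero counts contribute nothing (lim p->0 of p*log p is 0)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)
```

For the weather data this gives entropy([9, 5]) ≈ 0.940 bits, the info([9,5]) value used on the next slides.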

48 Example: attribute Outlook (the worked computation appeared as a figure; its result, info([2,3],[4,0],[3,2]) = 0.693 bits, is used on slide 49)

49 Computing information gain
Information gain = information before splitting − information after splitting
gain(Outlook) = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.940 − 0.693 = 0.247 bits
Information gain for the attributes from the weather data:
gain(Outlook) = 0.247 bits
gain(Temperature) = 0.029 bits
gain(Humidity) = 0.152 bits
gain(Windy) = 0.048 bits
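The gain computation above can be reproduced directly (a sketch for illustration; `before` is the overall class distribution and `splits` the per-branch distributions):

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(before, splits):
    """Information gain = info before splitting - weighted info after.

    Each branch's entropy is weighted by the fraction of instances
    that reach that branch."""
    n = sum(before)
    after = sum(sum(s) / n * entropy(s) for s in splits)
    return entropy(before) - after
```

With the weather data's distributions this reproduces gain(Outlook) = 0.247 bits and gain(Windy) = 0.048 bits.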

50 Continuing to split (within the sunny branch)
gain(Temperature) = 0.571 bits
gain(Humidity) = 0.971 bits
gain(Windy) = 0.020 bits

51 Final decision tree
Note: not all leaves need to be pure; sometimes identical instances have different classes
- Splitting stops when the data can't be split any further

52 Wishlist for a purity measure
Properties we require from a purity measure:
- When a node is pure, the measure should be zero
- When impurity is maximal (i.e. all classes equally likely), the measure should be maximal
- The measure should obey the multistage property (i.e. decisions can be made in several stages)
Entropy is the only function that satisfies all three properties!

53 Properties of the entropy
The multistage property:
entropy(p, q, r) = entropy(p, q + r) + (q + r) · entropy(q / (q + r), r / (q + r))
Simplification of the computation, e.g.:
info([2,3,4]) = −2/9 · log(2/9) − 3/9 · log(3/9) − 4/9 · log(4/9) = [−2 log 2 − 3 log 3 − 4 log 4 + 9 log 9] / 9
Note: instead of maximizing information gain we could just minimize information

54 Discussion / outlook: decision trees
Top-down induction of decision trees: ID3, an algorithm developed by Ross Quinlan
Various improvements, e.g.
- C4.5: deals with numeric attributes, missing values, noisy data
- Gain ratio instead of information gain (see Witten & Frank slides, ch. 4, pp. 40-45)
A similar approach: CART

55 Weather data with mixed attributes
Some attributes have numeric values:

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
…

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes

56 Dealing with numeric attributes
Discretize numeric attributes: divide each attribute's range into intervals
- Sort the instances according to the attribute's values
- Place breakpoints where the (majority) class changes
- This minimizes the total error
Example: temperature from the weather data
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

57 The problem of overfitting
This procedure is very sensitive to noise
- One instance with an incorrect class label will probably produce a separate interval
- Also: a time stamp attribute would have zero errors
Simple solution: enforce a minimum number of instances in the majority class per interval
Example (with min = 3):
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No
becomes
64 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
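The naive breakpoint placement (before the minimum-count fix) can be sketched as follows (a hypothetical helper; tied values are not split, and the minimum-per-interval remedy from this slide is deliberately left out to show why it is needed):

```python
def naive_breakpoints(values, classes):
    """Naive discretization sketch: sort instances by attribute value and
    place a breakpoint midway between adjacent instances whose classes
    differ (skipping ties in the value, which cannot be separated).

    As the slide notes, this overfits: a single mislabeled instance
    creates its own interval."""
    pairs = sorted(zip(values, classes))
    cuts = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if c1 != c2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts
```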

58 Agenda (repeated; same items as slide 2)

59 Motivation: Fictitious e-commerce example – who is a big spender?
Typical possible result so far: (figure)
However, this might be the case: (figure)

60 On terminology (1)
Reduces to propositional logic → "propositional rule"
Info from multiple relations of a relational DB, expressed in a subset of first-order logic (aka predicate logic aka relational logic) → "relational rule"

61 On terminology (2)
Info from multiple relations of a relational DB, expressed in a subset of first-order logic (aka predicate logic aka relational logic) → "relational rule"
Hence the names "relational data mining" and "multirelational data mining"
The intersection of machine learning and logic programming: "inductive logic programming"

62 The (most common) problem: Learning the logical definition of a relation
Given:
- (Training) examples: tuples that belong or do not belong to the target relation
- Background knowledge: other relation definitions etc. in the same logical language
Find:
- A predicate definition that defines the target relation in terms of the relations from the background knowledge
Formally: the learning-from-entailment setting (Muggleton, 1991)
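The formal definition on this slide did not survive the transcript; the standard learning-from-entailment formulation, following Muggleton, reads:

```latex
% Given background knowledge B and examples E = E^+ \cup E^-,
% find a hypothesis H such that:
\forall e \in E^{+} : B \wedge H \models e \quad \text{(completeness)}
\forall e \in E^{-} : B \wedge H \not\models e \quad \text{(consistency)}
```

With real data, as the next slide notes, both conditions are relaxed into statistical criteria.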

63 Enter real data
In most cases the criteria of consistency and completeness must be relaxed → statistical criteria

64 Example
Given: (customer and purchase relations, shown as a figure)
Learn the predicate definition: (shown as a figure)

65 First solution idea: join or aggregate towards a single table
Slightly modify the example:
- customer(CustID, Name, Age, SpendsALot)
- purchase(CustID, ProductID, Date, Value, PaymentMode)
Combination 1: natural join → purchase1(CustID, ProductID, Date, Value, PaymentMode, Name, Age, SpendsALot)
Combination 2: aggregation → customer1(CustID, Age, NOfPurchases, TotalValue, SpendsALot)
Can we find the target pattern in these tables?
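Combination 2 (aggregation) can be sketched as follows (a hypothetical illustration of collapsing the 1:N purchase relation into one flat row per customer; the field names mirror customer1 above):

```python
def propositionalize(customers, purchases):
    """Aggregate a 1:N relation into a single flat table sketch.

    customers: {cust_id: (age, spends_a_lot)}
    purchases: list of {'cust_id': ..., 'value': ...} tuples.
    Returns one row per customer with purchase count and total value,
    i.e. customer1(CustID, Age, NOfPurchases, TotalValue, SpendsALot).
    """
    rows = []
    for cust_id, (age, spends_a_lot) in customers.items():
        mine = [p for p in purchases if p['cust_id'] == cust_id]
        rows.append({'cust_id': cust_id,
                     'age': age,
                     'n_of_purchases': len(mine),
                     'total_value': sum(p['value'] for p in mine),
                     'spends_a_lot': spends_a_lot})
    return rows
```

The price of this transformation is the point of the slide: any pattern that depends on properties of individual purchases (rather than their count or sum) is lost in the aggregate.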

66 Second solution idea: a cleverer way of propositionalization
- Early approach: LINUS (Lavrac, Dzeroski, & Grobelnik, 1991)
- Use the background knowledge to create new attributes
- Works for a restricted class of problems: function-free program clauses which are
  - typed (each variable is associated with a predetermined set of values),
  - constrained (all variables in the body of a clause also appear in the head), and
  - nonrecursive (the predicate symbol in the head does not appear in any of the literals in the body)

67 Third solution idea: upgrade propositional approaches
- MRDM algorithms have much in common with their propositional counterparts:
  - Learning as search (search for patterns valid in the given data)
  - A lattice of hypotheses as the search space
- Key differences:
  - Representation of data and patterns
  - Refinement operators
  - Testing coverage (whether a rule explains an example)
- A general "recipe" for upgrading (Van Laer & De Raedt, 2001): keep as much as possible, upgrade only the key notions needed

68 An example: first-order decision trees
TILDE (Blockeel & De Raedt, 1998), with propositional counterpart and special case C4.5
The tree (shown as a figure) predicts the maintenance action A that should be taken on machine M: maintenance(M,A)
How to expand a node? Search in the space of clauses (cf. propositional: attributes/values); use constraints to limit the number of candidates ("declarative bias")
Each refinement operator either applies a substitution to the clause or adds a literal to the body of the clause

69 Major difference between propositional and relational decision trees
- It lies in the tests that can appear in internal nodes
- Each test is a query: a conjunction of literals with existentially quantified variables
- If the query succeeds → take the "yes" branch
- Variables can be shared among nodes
- The actual test in a node = the conjunction of the literals in the node itself and the literals on the path from the root of the tree

70 Why order matters (another example)
Forall x Exists y : haspart(x,y) means every object has some part (possibly a different part for each x)
Exists y Forall x : haspart(x,y) means some single y is a part of every object
Swapping the quantifiers changes the meaning

71 Interpretation of the tree in terms of a decision list
List: the order of the tests expressed by the clauses matters!

72 Next lecture (continuing the agenda of slide 2)
How good are these models? Evaluation

73 References / background reading; acknowledgements
All parts on single-table processing are based on:
- Witten, I.H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
- In particular, pp. 8-57 are based on the instructor slides for that book (chapters 1-4), available in PDF and ODP form at http://books.elsevier.com/companions/9780120884070/
The MRDM part is based on:
- Džeroski, S. (2003). Multi-Relational Data Mining: An Introduction. SIGKDD Explorations, 5(1), 1-16. http://www.cs.wisc.edu/EDAM/Dzeroski.pdf

