1
数据挖掘 Introduction to Data Mining
Philippe Fournier-Viger, Full Professor, School of Natural Sciences and Humanities, Spring 2018
2
Course schedule (日程安排)
Lecture 1: Introduction – What is the knowledge discovery process?
Lecture 2: Exploring the data
Lecture 3: Classification (part 1)
Lecture 4: Classification (part 2)
Lecture 5: Association analysis (part 1)
Lecture 6: Association analysis (part 2)
Lecture 7: Clustering
Lecture 8: Anomaly detection and advanced topics
3
Introduction
Last time: Association analysis – part 1
Important: assignment
Important: QQ group; the PPTs are on the website.
4
Association analysis (关联分析) – part 2
Association analysis (关联分析) – part 2. Partly based on Chapter 6 of Tan & Kumar.
5
Frequent itemset mining (频繁项集挖掘)
A transaction database:
T1: {pasta, lemon, bread, orange}
T2: {pasta, lemon}
T3: {pasta, orange, cake}
T4: {pasta, lemon, orange, cake}
For minsup = 2, the frequent itemsets are: {lemon}, {pasta}, {orange}, {cake}, {lemon, pasta}, {lemon, orange}, {pasta, orange}, {pasta, cake}, {orange, cake}, {lemon, pasta, orange}, {pasta, orange, cake}
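For illustration, here is a minimal sketch (not from the slides; the variable names are invented) of how the support of an itemset can be counted over this small transaction database:

```python
# Minimal sketch: counting the support of itemsets in the example database.
database = [
    {"pasta", "lemon", "bread", "orange"},   # T1
    {"pasta", "lemon"},                      # T2
    {"pasta", "orange", "cake"},             # T3
    {"pasta", "lemon", "orange", "cake"},    # T4
]

def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for transaction in database if itemset <= transaction)

print(support({"lemon", "pasta"}, database))            # 3 -> frequent for minsup = 2
print(support({"lemon", "pasta", "orange"}, database))  # 2 -> frequent
print(support({"bread", "lemon"}, database))            # 1 -> infrequent
```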
6
[Itemset lattice for the items l = lemon, p = pasta, b = bread, o = orange, c = cake, showing which itemsets are frequent and which are infrequent for minsup = 2]
7
Property 2: Let there be an itemset Y. If there exists an itemset X ⊂ Y such that X is infrequent, then Y is infrequent.
Example: Consider {bread, lemon}. If we know that {bread} is infrequent, then we can infer that {bread, lemon} is also infrequent.
T1: {pasta, lemon, bread, orange}
T2: {pasta, lemon}
T3: {pasta, orange, cake}
T4: {pasta, lemon, orange, cake}
8
This property is useful to reduce the search space. Example: if «bread» is infrequent…
[Itemset lattice for minsup = 2, highlighting the node {bread}]
9
This property is useful to reduce the search space. Example: if «bread» is infrequent, all its supersets are infrequent.
[Itemset lattice for minsup = 2, with all supersets of {bread} marked as eliminated]
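To show how this pruning is exploited in practice, here is a rough Apriori-style sketch (an assumed implementation, not the course's reference code): candidates of size k are built only from frequent itemsets of size k−1, so supersets of infrequent itemsets such as {bread} are never even generated.

```python
from itertools import combinations

def apriori(database, minsup):
    """Levelwise frequent itemset mining using the anti-monotonicity of support."""
    items = sorted({item for transaction in database for item in transaction})
    def support(itemset):
        return sum(1 for t in database if itemset <= t)

    frequent = {}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
    while level:
        frequent.update({X: support(X) for X in level})
        k = len(level[0]) + 1
        # Join two frequent (k-1)-itemsets into a k-itemset...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ...and keep a candidate only if ALL of its (k-1)-subsets are frequent (the pruning property above).
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
        level = [c for c in candidates if support(c) >= minsup]
    return frequent

# Using the `database` list from the previous sketch:
# apriori(database, minsup=2) returns the frequent itemsets listed earlier with their supports.
```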
10
Frequent itemset mining
For more information: Fournier-Viger, P., Lin, J. C.-W., Vo, B., Chi, T. T., Zhang, J., Le, H. B. (2017). A Survey of Itemset Mining. WIREs Data Mining and Knowledge Discovery, e1207, doi: 10.1002/widm.1207, 18 pages.
Software:
11
Association rule mining (关联规则挖掘)
12
Introduction: Finding frequent patterns in a database makes it possible to discover useful information, but it has some limitations.
13
Introduction
A transactional database D:
T1: {pasta, lemon, bread, orange}
T2: {pasta, lemon}
T3: {pasta, orange, cake}
T4: {pasta, lemon, orange, cake}
If minsup = 2, then {pasta, cake} is frequent. Can we conclude that people who buy pasta will also buy cakes?
14
Association rule An association rule is a rule of the form 𝑋→𝑌 where
X and Y are itemsets, and X ∩ Y = ∅.
e.g. {orange, cake} → {pasta}, {lemon, orange} → {pasta}, {pasta} → {bread}, …
15
Support
The support of a rule X → Y is calculated as sup(X → Y) = sup(X ∪ Y) / |D|, where |D| is the number of transactions.
T1: {pasta, lemon, bread, orange}
T2: {pasta, lemon}
T3: {pasta, orange, cake}
T4: {pasta, lemon, orange, cake}
e.g. {lemon, orange} → {pasta} has a support of 0.5, i.e. two out of four transactions.
16
Confidence
The confidence of a rule X → Y is calculated as conf(X → Y) = sup(X ∪ Y) / sup(X).
T1: {pasta, lemon, bread, orange}
T2: {pasta, lemon}
T3: {pasta, orange, cake}
T4: {pasta, lemon, orange, cake}
{lemon, orange} → {pasta} has a confidence of 1.0 (100%)
17
Confidence
The confidence of a rule X → Y is calculated as conf(X → Y) = sup(X ∪ Y) / sup(X).
T1: {pasta, lemon, bread, orange}
T2: {pasta, lemon}
T3: {pasta, orange, cake}
T4: {pasta, lemon, orange, cake}
{pasta} → {lemon} has a confidence of 0.75
{lemon} → {pasta} has a confidence of 1.0
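As an illustration (function names assumed, not taken from the slides), both measures can be computed directly from the transaction list:

```python
def support_count(itemset, database):
    """Number of transactions containing all items of `itemset`."""
    return sum(1 for t in database if set(itemset) <= t)

def rule_support(X, Y, database):
    """sup(X -> Y) = sup(X union Y) / |D|."""
    return support_count(set(X) | set(Y), database) / len(database)

def rule_confidence(X, Y, database):
    """conf(X -> Y) = sup(X union Y) / sup(X)."""
    return support_count(set(X) | set(Y), database) / support_count(X, database)

database = [{"pasta", "lemon", "bread", "orange"}, {"pasta", "lemon"},
            {"pasta", "orange", "cake"}, {"pasta", "lemon", "orange", "cake"}]  # T1..T4
print(rule_support({"lemon", "orange"}, {"pasta"}, database))     # 0.5
print(rule_confidence({"lemon", "orange"}, {"pasta"}, database))  # 1.0
print(rule_confidence({"pasta"}, {"lemon"}, database))            # 0.75
```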
18
Association rule mining
Input:
A transaction database (a set of transactions)
A parameter minsup (0 ≤ minsup ≤ 1)
A parameter minconf (0 ≤ minconf ≤ 1)
Output: every association rule X → Y such that sup(X → Y) ≥ minsup and conf(X → Y) ≥ minconf.
19
Example
minsup = 0.4, minconf = 0.75
T1: {pasta, lemon, bread, orange}
T2: {pasta, lemon}
T3: {pasta, orange, cake}
T4: {pasta, lemon, orange, cake}
lemon ==> pasta  support: 3  confidence: 1
pasta ==> lemon  support: 3  confidence: 0.75
orange ==> pasta  support: 3  confidence: 1
pasta ==> orange  support: 3  confidence: 0.75
cake ==> pasta  support: 2  confidence: 1
cake ==> orange  support: 2  confidence: 1
lemon orange ==> pasta  support: 2  confidence: 1
orange cake ==> pasta  support: 2  confidence: 1
pasta cake ==> orange  support: 2  confidence: 1
cake ==> pasta orange  support: 2  confidence: 1
20
Why use the support and confidence?
The support allows us to:
find patterns that are less likely to be random,
reduce the number of patterns,
make the algorithms more efficient.
The confidence allows us to:
measure the strength of associations,
obtain an estimate of the conditional probability P(Y | X).
Warning: a strong association does not mean that there is causality!
21
How to find the association rules? Naïve approach
Create all possible association rules. Calculate their confidence and support by scanning the database. Keep only the valid rules. This approach is inefficient. For d items, there are R = 3^d − 2^(d+1) + 1 possible rules. For d = 6, this means 602 rules! For d = 100, this means about 5 × 10^47 rules!
22
Observation 1
T1: {pasta, lemon, bread, orange}
T2: {pasta, lemon}
T3: {pasta, orange, cake}
T4: {pasta, lemon, orange, cake}
lemon ==> pasta  support: 3  confidence: 1
pasta ==> lemon  support: 3  confidence: 0.75
orange ==> pasta  support: 3  confidence: 1
pasta ==> orange  support: 3  confidence: 0.75
Observation 1. All the rules containing the same items can be viewed as having been derived from the same frequent itemset, e.g. {pasta, lemon} for the first two rules.
23
Observation 2
Observation 2. All the rules containing the same items (see the rules above) have the same support, but may not have the same confidence, e.g. the two rules derived from {pasta, lemon}.
24
Observation 3
Observation 3. If an itemset is infrequent, all rules derived from that itemset can be ignored, e.g. if minsup = 4, the rules derived from {pasta, lemon} can be ignored, since its support is 3.
25
How to find association rules efficiently?
Agrawal & Srikant (1993). Two steps: (1) discover the frequent itemsets; (2) use the frequent itemsets to generate association rules having a confidence greater than or equal to minconf. Step 1 is the most difficult. Thus, most studies focus on improving the efficiency of Step 1.
26
Generating rules
Each frequent itemset X of size k can produce 2^k − 2 rules. A rule can be created by dividing the itemset X into two non-empty subsets to obtain a rule X′ → X − X′. Then, the confidence of the rule must be calculated.
27
Generating rules
Example: using the itemset X = {1, 2, 3}, we can generate:
{1,2} → {3}
{1,3} → {2}
{2,3} → {1}
{1} → {2,3}
{2} → {1,3}
{3} → {1,2}
28
Calculating the confidence
Example: using the itemset X = {1, 2, 3}, we can generate:
{1,2} → {3}
{1,3} → {2}
{2,3} → {1}
{1} → {2,3}
{2} → {1,3}
{3} → {1,2}
How can we calculate the confidence of the rules derived from X? We must know the support of all subsets of X. We already know it: if X is a frequent itemset, then all its subsets are frequent, so their supports were found during frequent itemset mining!
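A possible sketch (assumed code, not the official implementation) of deriving the 2^k − 2 rules from one frequent itemset and computing their confidences from a table of itemset supports:

```python
from itertools import combinations

def rules_from_itemset(X, supports, minconf):
    """Generate every rule X' -> (X minus X') from frequent itemset X with confidence >= minconf.
    `supports` maps every frequent itemset (as a frozenset) to its support count."""
    X = frozenset(X)
    rules = []
    for size in range(1, len(X)):                  # all proper non-empty antecedents
        for antecedent in combinations(X, size):
            antecedent = frozenset(antecedent)
            consequent = X - antecedent
            conf = supports[X] / supports[antecedent]  # subsets of X are frequent, so their supports are known
            if conf >= minconf:
                rules.append((antecedent, consequent, conf))
    return rules

# Example with supports taken from the slides:
supports = {
    frozenset({"pasta"}): 4, frozenset({"lemon"}): 3, frozenset({"orange"}): 3,
    frozenset({"lemon", "orange"}): 2, frozenset({"pasta", "lemon"}): 3,
    frozenset({"pasta", "orange"}): 3, frozenset({"pasta", "lemon", "orange"}): 2,
}
# Keeps only {lemon, orange} -> {pasta} with confidence 1.0, as on the earlier example slide.
print(rules_from_itemset({"pasta", "lemon", "orange"}, supports, minconf=0.75))
```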
29
Calculating the confidence
The result of a frequent itemset mining program looks like this:
{pasta} support: 4
{lemon} support: 3
{orange} support: 3
{cake} support: 2
{pasta, lemon} support: 3
{pasta, orange} support: 3
{pasta, cake} support: 2
{lemon, orange} support: 2
{orange, cake} support: 2
{pasta, lemon, orange} support: 2
{pasta, orange, cake} support: 2
How can we quickly search for the support of a set?
30
Calculating the confidence
Solution 1:
Itemsets are grouped by size.
Itemsets having the same size are sorted according to some total order ≻, e.g. pasta ≻ lemon ≻ bread ≻ orange ≻ cake.
The support of an itemset can then be retrieved by binary search within its size group.
{pasta, lemon} support: 3
{pasta, orange} support: 3
{pasta, cake} support: 2
{lemon, orange} support: 2
{orange, cake} support: 2
31
Calculating the confidence
Solution 2: itemsets are stored in a «trie» (prefix tree), so the support of an itemset can be retrieved in time proportional only to its length (nearly constant in practice).
[Trie of itemset supports with root ∅ and nodes such as pasta:4, lemon:3, orange:3, cake:2]
The highlighted path in the tree represents the itemset {pasta}, which has a support of 4.
32
Calculating the confidence
Solution 2 (continued):
[Same trie of itemset supports]
The highlighted path represents the itemset {pasta, orange}, which has a support of 3.
33
Calculating the confidence
Solution 2 (continued):
[Same trie of itemset supports]
The highlighted path represents the itemset {pasta, orange, cake}, which has a support of 2.
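A minimal sketch of this idea (a hypothetical implementation using nested dictionaries; a real mining library may use a different structure): the items of each itemset are inserted in a fixed total order, and the support is stored at the node reached by the last item.

```python
# Hypothetical prefix tree (trie) of itemset supports, using nested dicts.
ORDER = {"pasta": 0, "lemon": 1, "bread": 2, "orange": 3, "cake": 4}  # total order on items

def insert(trie, itemset, support):
    node = trie
    for item in sorted(itemset, key=ORDER.get):
        node = node.setdefault(item, {})
    node["support"] = support

def lookup(trie, itemset):
    node = trie
    for item in sorted(itemset, key=ORDER.get):
        node = node.get(item)
        if node is None:
            return 0
    return node.get("support", 0)

trie = {}
insert(trie, {"pasta"}, 4)
insert(trie, {"pasta", "orange"}, 3)
insert(trie, {"pasta", "orange", "cake"}, 2)
print(lookup(trie, {"pasta", "orange"}))  # 3, found by following the path pasta -> orange
```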
34
Reducing the search space
Can we reduce the search space using the confidence measure? Confidence is not an anti-monotone measure. However, the following relationship between two rules can be proved:
Theorem: If a rule X → Y − X does not satisfy the confidence threshold, then any rule X′ → Y − X′ such that X′ ⊆ X will also not satisfy the confidence threshold.
35
Theorem: If a rule X → Y − X does not satisfy the confidence threshold, then any rule X′ → Y − X′ such that X′ ⊆ X will also not satisfy the confidence threshold.
For example, consider X = {a,b,c}, Y = {a,b,c,d}, X′ = {a,b}.
By the above theorem, the confidence of {a,b,c} → {d} cannot be greater than the confidence of {a,b} → {c,d}.
36
Proof. Let there be two rules X → Y − X and X′ → Y − X′ such that X′ ⊆ X.
The confidences of these rules are conf(X → Y − X) = sup(X ∪ Y) / sup(X) and conf(X′ → Y − X′) = sup(X′ ∪ Y) / sup(X′), where both numerators equal sup(Y) since X and X′ are subsets of Y.
Since X′ ⊆ X, it follows that sup(X′) ≥ sup(X). Thus conf(X′ → Y − X′) ≤ conf(X → Y − X), and the theorem holds.
37
Illustration
[Lattice of all the rules that can be made with the itemset {a,b,c,d}: starting from a low-confidence rule, all the rules below it in the lattice can be eliminated.]
38
Generating rules: for each itemset F in the set of all frequent itemsets, and for each A in the set of all proper non-empty subsets of F, output the rule A → F − A if conf(A → F − A) ≥ minconf.
39
Evaluating associations
40
Evaluating associations
A large number of patterns can be discovered. How can we find the most interesting ones? Interestingness measures:
Objective measures: statistical reasons for selecting patterns.
Subjective measures: discover surprising or interesting patterns (e.g. {diaper} → {beer} is more surprising than {mouse} → {keyboard}).
It is more difficult to consider subjective measures in the search for patterns.
41
Objective measures: independent from any domain,
e.g. support and confidence.
Several objective measures can be calculated using a contingency table, e.g. a table for two binary attributes A and B:
        B       ¬B
A       f11     f10     f1+
¬A      f01     f00     f0+
        f+1     f+0     N
42
Limitations of the support and confidence
If we increase the minsup threshold, we will find fewer results and mining will be faster, but we may eliminate some rare patterns that are interesting.
43
Another problem. Consider the rule {tea} → {coffee} with support 15% and confidence 75%. This seems like an interesting pattern…
            Coffee   ¬Coffee
Tea         150      50        200
¬Tea        650      150       800
            800      200       1000
However, 80% of the people drink coffee whether they drink tea or not. In fact, the probability of drinking coffee is lower for tea drinkers (75%) than for non-tea drinkers (about 81%)! This problem occurs because the confidence measure does not consider the support of the right side (consequent) of rules.
44
The lift: lift(X → Y) = conf(X → Y) / sup(Y) = sup(X ∪ Y) / (sup(X) × sup(Y))
If lift = 1, X and Y are independent; if lift > 1, X and Y are positively correlated; if lift < 1, X and Y are negatively correlated.
Example: lift({tea} → {coffee}) = 0.15 / (0.20 × 0.80) ≈ 0.94, which indicates a slightly negative correlation.
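A short sketch (assumed helper code) computing the lift of the tea/coffee rule from the counts in the contingency table above:

```python
def lift(sup_xy, sup_x, sup_y):
    """lift(X -> Y) = sup(X union Y) / (sup(X) * sup(Y)), with supports as relative frequencies."""
    return sup_xy / (sup_x * sup_y)

# Tea/coffee example from the previous slide (N = 1000 customers):
n = 1000
sup_tea_coffee = 150 / n
sup_tea = 200 / n
sup_coffee = 800 / n
print(lift(sup_tea_coffee, sup_tea, sup_coffee))   # 0.9375 -> slightly negative correlation
```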
45
Limitations of the lift
Example:
Table 1 (items p and q):
        q       ¬q
p       880     50      930
¬p      50      20      70
        930     70      1000
Table 2 (items r and s):
        s       ¬s
r       20      50      70
¬r      50      880     930
        70      930     1000
lift({p} → {q}) = 0.88 / (0.93 × 0.93) ≈ 1.02, even though p and q appear together in 88% of the transactions.
lift({r} → {s}) = 0.02 / (0.07 × 0.07) ≈ 4.08, even though r and s rarely appear together.
In this case, using the confidence provides better results: conf({p} → {q}) = 94.6%, conf({r} → {s}) = 28.6%.
46
Many other measures… Tan et al. (2004) list many measures.
They have different properties, e.g. symmetry, anti-monotonicity, etc.
47
Mining patterns in sequences
48
Introduction: Association rule mining and frequent itemset mining do not consider the time or sequential ordering between events. Several techniques exist to find patterns in one or multiple sequences.
49
Sequential pattern mining (序列模式挖掘)
Input:
A sequence database (a set of sequences)
A minsup threshold
Output: all subsequences having a support greater than or equal to minsup.
Example (minsup = 50%):
ID  Sequence
1   <{a}, {a,b,c}, {a,c}, {d}, {c,f}>
2   <{a,d}, {c}, {b,c}, {a,e}>
3   <{e,f}, {a,b}, {d,f}, {c}, {b}>
4   <{e}, {g}, {a,f}, {c}, {b}, {c}>
Pattern       Support
<{a}>         100%
<{a},{b}>     100%
<{a,b}>       50%
…
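For illustration, a rough sketch (assumed code) of checking whether a sequence contains a sequential pattern, which is the basic operation behind computing a pattern's support:

```python
def contains(sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs in `sequence` in the same order,
    each pattern itemset being included in a distinct, later itemset of the sequence."""
    pos = 0
    for itemset in pattern:
        while pos < len(sequence) and not set(itemset) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def pattern_support(pattern, sequences):
    """Fraction of the sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)

sequences = [
    [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
]
print(pattern_support([{"a"}, {"b"}], sequences))  # 1.0 (appears in every sequence)
print(pattern_support([{"a", "b"}], sequences))    # 0.5
```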
50
Sequential pattern mining (序列模式挖掘)
Several algorithms: AprioriAll, GSP (1996), PrefixSpan (2001), SPADE, Fast (2011), CM-SPAM (2014), …
Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential Pattern Mining. Data Science and Pattern Recognition (DSPR), vol. 1(1).
Software:
51
Sequential rule mining
Input: a sequence database (a set of sequences), a minsup threshold, and a minconf threshold.
Output: rules of the form X → Y, meaning that if the items of X appear, then the items of Y will appear afterward.
52
Periodic pattern mining
Periodic Frequent Pattern Mining: discovering groups of items that appear periodically in a sequence of transactions. Example: {pasta, cookies, orange juice} may be a frequent periodic pattern for a particular customer, occurring every week.
53
Periodicity of an itemset
54
Periodicity of an itemset
55
Novel definition of periodic pattern
An itemset X is periodic if:
minAvg ≤ avgper(X) ≤ maxAvg
minper(X) ≥ minPer
maxper(X) ≤ maxPer
where minAvg, maxAvg, minPer and maxPer are parameters set by the user. These parameters give more flexibility to the user.
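A possible sketch (function and parameter names assumed; the exact definition of periods in PFPM may differ slightly) of computing the periods of an itemset over a sequence of transactions and applying these constraints:

```python
def periods(itemset, database):
    """Periods of `itemset`: gaps between consecutive transactions containing it,
    including the gap before its first occurrence and after its last one."""
    positions = [i + 1 for i, t in enumerate(database) if set(itemset) <= t]
    boundaries = [0] + positions + [len(database)]
    return [b - a for a, b in zip(boundaries, boundaries[1:])]

def is_periodic(itemset, database, min_avg, max_avg, min_per, max_per):
    p = periods(itemset, database)
    if not p:
        return False
    return (min_avg <= sum(p) / len(p) <= max_avg
            and min(p) >= min_per
            and max(p) <= max_per)

# e.g. if X occurs in transactions 2 and 4 of a database with 5 transactions,
# its periods are [2, 2, 1] and its average period is 5/3.
```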
56
Example
Goal: find each pattern X such that minAvg ≤ avgper(X) ≤ maxAvg, minper(X) ≥ minPer, and maxper(X) ≤ maxPer.
Several algorithms: PFPM, etc.
57
[Bar charts: number of periodic patterns found by PFPM vs. number of frequent itemsets found by Eclat on four datasets: Retail, Mushroom, Chainstore, Foodmart]
The PFPM algorithm can filter out many non-periodic patterns.
58
Frequent subgraph mining (频繁子图挖掘)
59
Frequent subgraph mining
A graph (图表) is a set of vertices (顶点) and edges (边). E.g., this graph has four vertices (shown in yellow); each vertex has a label (10 or 11) that may not be unique. The graph has five edges (black lines); each edge has a label (20, 21, 22 or 23) that may not be unique.
60
Types of graphs
Connected graph: by following the edges, it is possible to go from any vertex to any other vertex, e.g. a road network where any city can be reached from any other city by following the roads.
Disconnected graph: a graph that is not connected. The example graph is disconnected because vertex A cannot be reached from the other vertices by following the edges.
61
Types of graphs
Undirected graph (无向图): edges are bidirectional.
Directed graph (有向图): edges are unidirectional.
A real-life example: graphs where vertices are cities and edges are roads; some roads are «one-way» while others are bidirectional.
62
Analyzing graphs
Many data mining tasks can be performed on graphs: detecting communities, predicting friendship links, detecting influence between users, etc. The appropriate task depends on the goal and on the kind of data: a single graph or multiple graphs? directed or undirected graphs? etc.
Frequent subgraph mining: discover interesting subgraph(s) appearing often in a set of graphs (a graph database).
63
Frequent subgraph mining
Input: a graph database (a set of graphs), a minimum support threshold (minsup). Example: minsup = 3
64
Output: all subgraphs appearing in at least minsup graphs. minsup = 3
65
Output: all subgraphs appearing in at least minsup graphs. Example with minsup = 3: the highlighted subgraph has a support of 3.
66
Frequent subgraph mining with a single graph
A variation of the previous problem. We want to find frequent subgraphs in a single large graph. The support of a subgraph is the number of times that it appears in the single input graph
67
Frequent subgraph mining with a single graph
minsup = 2
68
Frequent subgraph mining with a single graph
minsup = 2 This subgraph has a support of 2
69
Algorithms for subgraph mining
Several algorithms: FFSM, gSpan, Gaston, etc. The same algorithm can usually be applied to a single graph or to multiple graphs. Other variations: finding frequent paths, finding frequent trees, finding closed/maximal subgraphs, …
70
Performance comparison
Authors of data mining papers often do not compare their algorithms with the best ones published before.
[Figure: comparison graph of frequent subgraph mining algorithms published between 2001 and 2011 (before 2014)]
Legend: an arrow X → Y from an algorithm X to an algorithm Y indicates that X was shown to be faster than Y by the authors of X in an experiment.
71
High-utility itemset mining
72
Limitations of frequent itemsets
Frequent itemset mining has many applications. However, it has important limitations:
many frequent patterns are not interesting,
quantities of items in transactions must be 0 or 1,
all items are considered equally important (having the same weight).
73
High Utility Itemset Mining
A generalization of frequent itemset mining:
items can appear more than once in a transaction (e.g. a customer may buy 3 bottles of milk),
items have a unit profit (e.g. a bottle of milk generates 1$ of profit),
the goal is to find patterns that generate a high profit.
Example: {caviar, wine} is a pattern that generates a high profit, although it is rare.
74
High-utility itemset mining
Input:
a transaction database
a unit profit table
minutil: a minimum utility threshold set by the user (a positive integer)
75
High-utility itemset mining
Input:
a transaction database
a unit profit table
minutil: a minimum utility threshold set by the user (a positive integer)
Output: all high-utility itemsets (itemsets having a utility ≥ minutil).
For example, if minutil = 33$, the high-utility itemsets are:
{b,d,e}     36$   (appears in 2 transactions)
{b,c,d}     34$
{b,c,d,e}   40$
{b,c,e}     37$   (appears in 3 transactions)
76
Utility calculation
Input: a transaction database and a unit profit table.
The utility of the itemset {b,d,e} is calculated as follows:
u({b,d,e}) = (5×2) + (3×2) + (3×1)   ← utility in transaction T1
           + (4×2) + (2×3) + (1×3)   ← utility in transaction T2
           = 36$
Challenge: utility is not anti-monotonic.
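To make the calculation concrete, here is a minimal sketch (assumed representation: each transaction maps items to purchase quantities, and a separate table gives unit profits; the quantities and profits below are hypothetical values chosen only to reproduce the 36$ total above):

```python
def utility(itemset, database, profit):
    """Utility of `itemset`: over every transaction containing all its items,
    sum quantity(item) * unit_profit(item) for each item of the itemset."""
    total = 0
    for transaction in database:                       # transaction: {item: quantity}
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * profit[item] for item in itemset)
    return total

# Hypothetical quantities and unit profits (not given explicitly on the slide):
profit = {"b": 2, "d": 2, "e": 3}
database = [
    {"b": 5, "d": 3, "e": 1},   # T1
    {"b": 4, "d": 3, "e": 1},   # T2
]
print(utility({"b", "d", "e"}, database, profit))  # 36
```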
77
How to solve this problem?
Several algorithms: Two-Phase (PAKDD 2005), IHUP (TKDE 2010), UP-Growth (KDD 2011), HUI-Miner (CIKM 2012), FHM (ISMIS 2014), EFIM (2015), mHUIMiner (2017). Key idea: calculate an upper bound on the utility of itemsets (e.g. the TWU) that respects the Apriori property, in order to prune the search space.
78
Conclusion
Today, we discussed several other techniques for discovering patterns:
association rules,
frequent subgraphs,
patterns in sequences,
…
Next week, we will discuss the discovery of clusters in data (clustering).
79
References
Chapters 8 and 9: Han and Kamber (2011), Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann Publishers.
Chapter 4: Tan, Steinbach & Kumar (2006), Introduction to Data Mining, Pearson Education.
Other sources.