Mining High Utility Itemset in Big Data


Presentation on theme: "Mining High Utility Itemset in Big Data"— Presentation transcript:

1 Mining High Utility Itemset in Big Data
Ying Chun Lin, Cheng-Wei Wu, Vincent S. Tseng
Intelligent Database System Lab, Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan
Good morning, ladies and gentlemen. Today I am going to talk about "Mining High Utility Itemsets in Big Data". My name is Ying Chun Lin, and I am from the Department of Computer Science and Information Engineering at National Cheng Kung University. The co-authors of this work are Cheng-Wei Wu and my advisor, Vincent S. Tseng.

2 Outline
Introduction
Definition
PHUI-Growth
Experiment
Conclusion
This is the outline of the presentation. In the introduction, I will explain what high utility itemset mining is and why we should study it in big data. Second, I will briefly introduce the definitions. The third part is PHUI-Growth, the main framework of our work. Then I will show the performance of PHUI-Growth in the experiment part. Finally, I will close the presentation with a brief conclusion.

3 Introduction
Let's start with the introduction.

4 Introduction
What is high utility itemset (HUI) mining?
It is one of the most important tasks of frequent pattern mining, and it can be used to discover sets of items carrying high utilities (e.g., high profits).
However, in the era of big data, there is currently no parallel solution for high utility itemset mining.
(Read the first point.) For example, you go to a shopping mall and buy some products, which are called items. Each product also has a profit, and we want to find out which sets of items are the most profitable. (Read the second point.)

5 Major Problems for High Utility Mining in Big Data
Large number of transactions and varied items in big data
  High computational complexity: large search space, combination explosion
  Scalability issue: data cannot be held or processed on a single machine
A parallel algorithm is needed
What are the major problems for HUI mining when we face big data? (1) The number of transactions is large and the items vary across the dataset. (1-1) First, the computational complexity is high: the mining task faces a large search space and the combination explosion problem, which makes it very expensive in practice. (1-2) Second, the data cannot be processed on a single machine; it may take months or years to solve a big data problem on a single machine. (2) As a result, a well-designed algorithm built on a parallel programming architecture is needed.

6 A New Framework for High Utility Mining
Parallel mining High Utility Itemsets by pattern-Growth (PHUI-Growth)
  Implemented on the Hadoop platform
  Stores large datasets in a distributed manner on HDFS
  Designs a new pruning strategy: Discarding Local Unpromising items in the MapReduce framework (DLU-MR)
We propose a new framework, Parallel mining High Utility Itemsets by pattern-Growth (PHUI-Growth). (1) It is implemented on the Hadoop platform, so it inherits several nice properties from Hadoop, such as easy deployment in a high-level language, fault tolerance, low communication overhead, and high scalability on commodity hardware. (2) The dataset is distributed across HDFS. (3) A new pruning strategy, discarding local unpromising items in the MapReduce framework (DLU-MR), is proposed. Pattern explosion is a common problem in traditional itemset mining, so we propose a strategy that prunes unpromising patterns efficiently and in parallel.

7 Definition
Next is the definition part.

8 Transactional Database
Problem Definition
Set of items and their unit profits (external utility):
Item  Unit Profit
A     2
B     3
C     1
D     3
E     4
F     4
G     8
Transactional database (the number in parentheses is the purchased quantity, i.e. the internal utility; Total is the utility of the transaction):
TID  Transaction              Total
T1   A(4), B(2), C(8), D(2)   28
T2   A(4), B(2), C(8)         22
T3   C(4), D(2), E(2), F(2)   26
T4   E(2), F(2), G(1)         24
Which itemsets are the high utility itemsets?
Let's introduce some definitions with this example. First, we have the items A, B, C, D, E, F, and G. (press) These items form the set of items, and the right column shows their unit profits; for example, the unit profit of A is 2 dollars. (press) This is the external utility. Then we move on to our transactional database. Take T4 as an example. (press) This is a transaction. T4 contains items E, F, and G with quantities 2, 2, and 1, respectively. (press) We call those quantities the internal utility. What about the utility of T4? We simply multiply the quantities by the unit profits and sum them up: 2×4 + 2×4 + 1×8 = 24. (press) This is the utility of the transaction. Finally, we come to our main goal. (press) Which itemsets, or combinations of items, are the high utility itemsets?
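The transaction totals on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the names `UNIT_PROFIT`, `DATABASE`, and `transaction_utility` are illustrative, and the unit profits of D and F are assumed to be 3 and 4, the values implied by the totals of T1 and T3.

```python
# Unit profits (external utility). D=3 and F=4 are assumptions inferred
# from the transaction totals (T1 = 28 and T3 = 26).
UNIT_PROFIT = {"A": 2, "B": 3, "C": 1, "D": 3, "E": 4, "F": 4, "G": 8}

# Purchased quantities (internal utility) of each item in each transaction.
DATABASE = {
    "T1": {"A": 4, "B": 2, "C": 8, "D": 2},
    "T2": {"A": 4, "B": 2, "C": 8},
    "T3": {"C": 4, "D": 2, "E": 2, "F": 2},
    "T4": {"E": 2, "F": 2, "G": 1},
}


def transaction_utility(transaction, unit_profit):
    """Sum of quantity * unit profit over every item in the transaction."""
    return sum(qty * unit_profit[item] for item, qty in transaction.items())


totals = {tid: transaction_utility(t, UNIT_PROFIT) for tid, t in DATABASE.items()}
print(totals)  # {'T1': 28, 'T2': 22, 'T3': 26, 'T4': 24}
```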

9 High Utility Itemset
If the utility of an itemset $X$ is no less than a user-specified minimum utility threshold $\theta$, we call the itemset a high utility itemset:
$u(X) \ge \theta \Rightarrow X$ is a high utility itemset
What is a high utility itemset? (Read the text on the slide.) (press) This is the formal definition of a high utility itemset.
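For reference, the quantities used in this definition can be written out as follows. The symbols $q(i, T)$ (quantity of item $i$ in transaction $T$), $p(i)$ (unit profit of $i$), and $D$ (the transactional database) are notation assumed here for illustration; the slides only name them internal utility, external utility, and transactional database.

```latex
\begin{align*}
u(i, T) &= q(i, T) \times p(i) && \text{utility of item } i \text{ in transaction } T \\
u(X, T) &= \sum_{i \in X} u(i, T) && \text{utility of itemset } X \subseteq T \text{ in } T \\
u(X)    &= \sum_{T \in D,\; X \subseteq T} u(X, T) && \text{utility of } X \text{ in the database} \\
X \text{ is a high utility itemset} &\iff u(X) \ge \theta
\end{align*}
```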

10 High Utility Itemset Mining
Item  Unit Profit
A     2
B     3
C     1
D     3
E     4
F     4
G     8
TID  Transaction              Total
T1   A(4), B(2), C(8), D(2)   28
T2   A(4), B(2), C(8)         22
T3   C(4), D(2), E(2), F(2)   26
T4   E(2), F(2), G(1)         24
Minimum utility threshold $\theta = 30$
Is {A, C} a high utility itemset?
$u(\{A, C\}) = \underbrace{(4 \times 2 + 8 \times 1)}_{T_1} + \underbrace{(4 \times 2 + 8 \times 1)}_{T_2} = 32 > \theta$, so {A, C} is an HUI.
After the definitions, let's do utility mining manually. First, (press) we set the minimum utility threshold, 30 in this example. Then, I want to know (read the text that pops up) whether {A, C} is a high utility itemset. So we scan the transactional database to find which transactions contain both A and C. We find them in T1 (point to A and C in T1) and T2 (point to A and C in T2), and we calculate the utility of {A, C} in T1 (press) and T2 (press). Because the utility of {A, C} in the transactional database, 32, is higher than the minimum utility threshold, 30, it is called a high utility itemset in this transactional database.
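The manual check for {A, C} can be sketched in Python as a single scan over the database. The function name `itemset_utility` is illustrative, and the unit profits of D and F (3 and 4) are assumptions inferred from the transaction totals; they do not affect the result for {A, C}.

```python
# D=3 and F=4 are assumed values inferred from the transaction totals.
UNIT_PROFIT = {"A": 2, "B": 3, "C": 1, "D": 3, "E": 4, "F": 4, "G": 8}
DATABASE = {
    "T1": {"A": 4, "B": 2, "C": 8, "D": 2},
    "T2": {"A": 4, "B": 2, "C": 8},
    "T3": {"C": 4, "D": 2, "E": 2, "F": 2},
    "T4": {"E": 2, "F": 2, "G": 1},
}
THETA = 30  # minimum utility threshold from the slide


def itemset_utility(itemset, database, unit_profit):
    """Add up the utility of `itemset` in every transaction that contains it."""
    total = 0
    for transaction in database.values():
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * unit_profit[item] for item in itemset)
    return total


u_ac = itemset_utility({"A", "C"}, DATABASE, UNIT_PROFIT)
print(u_ac, u_ac >= THETA)  # 32 True
```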

11 Basic Pruning Strategy – TWU Downward Closure Property
Transaction-Weighted Utilization (TWU) of an itemset X
  E.g. $TWU(\{E, F\}) = 26 + 24 = 50$
High TWU itemset
  E.g. $TWU(\{E, F\}) = 50 \ge (\theta = 30)$
TWU downward closure property
  The TWU downward closure property states that any superset of a low TWU itemset is not a high utility itemset [Y. Liu et al., UBDM, 2005]
TID  Transaction              Total
T1   A(4), B(2), C(8), D(2)   28
T2   A(4), B(2), C(8)         22
T3   C(4), D(2), E(2), F(2)   26
T4   E(2), F(2), G(1)         24
As in frequent itemset mining, we could enumerate all possible combinations of items, calculate the utility of each itemset, and then determine whether it is an HUI. But this runs into the pattern explosion problem, especially with big data, so we need an efficient pruning strategy. Let's define a few things before we go into our pruning strategy. First, the transaction-weighted utilization of an itemset X. When the itemset is contained in a transaction, the TWU of the itemset in that transaction is the utility of the transaction. Take E and F as an example. They show up in T3 (point to E and F) and T4 (point to E and F), so the TWU of {E, F} in the transactional database is 26 (point to T3) plus 24 (point to T4), which equals 50. Second, what is a high TWU itemset? Again, take the itemset {E, F} as an example. Its TWU is greater than the minimum utility threshold, 30, so it is a high TWU itemset. Last, we can get a feeling for what the TWU downward closure property is. The TWU of an itemset is an upper bound on the utility that the itemset, or any of its supersets, can reach in the transactional database. As a result, low TWU itemsets can be pruned, because neither they nor their supersets can become HUIs in the transactional database.
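A short Python sketch of the TWU computation follows. It works on item utilities per transaction (quantity times unit profit, as they appear later in the counting-phase slide); the names `UTILITY_DB` and `twu` are illustrative.

```python
# Utility of each item in each transaction (quantity * unit profit).
UTILITY_DB = {
    "T1": {"A": 8, "B": 6, "C": 8, "D": 6},  # transaction utility 28
    "T2": {"A": 8, "B": 6, "C": 8},          # transaction utility 22
    "T3": {"C": 4, "D": 6, "E": 8, "F": 8},  # transaction utility 26
    "T4": {"E": 8, "F": 8, "G": 8},          # transaction utility 24
}
THETA = 30


def twu(itemset, utility_db):
    """Sum the utilities of all transactions that contain every item of `itemset`."""
    return sum(
        sum(transaction.values())
        for transaction in utility_db.values()
        if all(item in transaction for item in itemset)
    )


print(twu({"E", "F"}, UTILITY_DB))  # 50, a high TWU itemset
print(twu({"G"}, UTILITY_DB))       # 24, below the threshold of 30
```

Since {G} has a TWU of 24, no itemset containing G can reach a utility of 30, so G can be dropped before any combinations are generated.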

12 Proposed method PHUI-Growth
After the definitions, let's talk about our method.

13 Mining High Utility Itemset in Big Data
PHUI-Growth
Counting Phase (basic pruning strategy)
  Apply the TWU downward closure property (DCP) to prune low TWU 1-itemsets
  Reorganize each transaction
Mining Phase (iterative MapReduce, pruned by DLU-MR)
  Produces k-HUIs and conditional u-transactions in each iteration
  Outputs the HUIs
[Figure: in both phases the distributed database is read by Mapper 1, Mapper 2, Mapper 3, ..., Mapper n, whose output goes to Reducer 1, Reducer 2, Reducer 3, ..., Reducer m.]
This is the overview of our framework, (press) PHUI-Growth. First, the counting phase computes the TWU of each item. Then we prune the low TWU items (press) according to the TWU downward closure property, and the database is transformed for the mining phase. In the mining phase, the high utility itemsets are discovered in each iteration, and local unpromising items are pruned in the reducers by (press) DLU-MR, which is introduced later.

14 Counting Phase
Calculate the TWU of all items
Key-value pair
  Key is an item in a transaction
  Value is the utility of that transaction, i.e. the transaction's contribution to the TWU of the key item
Distributed database (the number in parentheses is the utility of the item in the transaction):
T1  A(8) B(6) C(8) D(6)
T2  A(8) B(6) C(8)
T3  C(4) D(6) E(8) F(8)
T4  E(8) F(8) G(8)
Mappers:
Mapper 1  ABCD  <A, 28>, <B, 28>, <C, 28>, <D, 28>
Mapper 2  ABC   <A, 22>, <B, 22>, <C, 22>
Mapper 3  CDEF  <C, 26>, <D, 26>, <E, 26>, <F, 26>
Mapper 4  EFG   <E, 24>, <F, 24>, <G, 24>
Reducers:
Reducer    Key  Value                       Output
Reducer 1  A    <A, 28>, <A, 22>            <A, 50>
           B    <B, 28>, <B, 22>            <B, 50>
Reducer 2  C    <C, 28>, <C, 22>, <C, 26>   <C, 76>
           D    <D, 28>, <D, 26>            <D, 54>
Reducer 3  E    <E, 26>, <E, 24>            <E, 50>
           F    <F, 26>, <F, 24>            <F, 50>
           G    <G, 24>                     <G, 24>
Let's look at the counting phase. The distributed transactional database is the same as in the previous example, and the utility of each item in each transaction has already been calculated. The minimum utility threshold is 30. In the map phase, we take the item as the key and the TWU of the item in the transaction, that is, the transaction utility, as the value. For example, T1 is split into the key-value pairs <A, 28>, <B, 28>, <C, 28>, and <D, 28>. In the reduce phase, all key-value pairs with the same key go to the same reducer, and the TWU of each item in the transactional database is summed. For example, B is in T1 and T2 (point to T1 and T2). After the map phase they become <B, 28> and <B, 22> (point to the pairs). Both pairs then go to the same reducer (point to the reducer for B). We simply add the values, so the TWU of B in the transactional database is 50.
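The counting phase can be imitated on a single machine with plain Python, which makes the key-value flow easy to follow. This is only a sketch of the map, shuffle, and reduce steps, not Hadoop code; the function names `map_counting` and `reduce_counting` are illustrative.

```python
from collections import defaultdict

# Item utilities per transaction, as shown on the slide.
UTILITY_DB = {
    "T1": {"A": 8, "B": 6, "C": 8, "D": 6},
    "T2": {"A": 8, "B": 6, "C": 8},
    "T3": {"C": 4, "D": 6, "E": 8, "F": 8},
    "T4": {"E": 8, "F": 8, "G": 8},
}


def map_counting(transaction):
    """Emit <item, transaction utility> for every item in the transaction."""
    tu = sum(transaction.values())
    return [(item, tu) for item in transaction]


def reduce_counting(pairs):
    """Group the pairs by item and add up the values to get each item's TWU."""
    item_twu = defaultdict(int)
    for item, tu in pairs:
        item_twu[item] += tu
    return dict(item_twu)


pairs = [pair for t in UTILITY_DB.values() for pair in map_counting(t)]
item_twu = reduce_counting(pairs)
print(item_twu)
```

On a real cluster the grouping by key is done by the MapReduce shuffle, so each reducer only sees the pairs of the items assigned to it.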

15 Database Transformation
Item  TWU
A     50
B     50
C     76
D     54
E     50
F     50
G     24
Minimum utility threshold $\theta = 30$
Step 1: Prune the low TWU items
Step 2: Sort the items in TWU-increasing order
Original Database           Transformed Database
A(8), B(6), C(8), D(6)      A(8), B(6), D(6), C(8)
A(8), B(6), C(8)            A(8), B(6), C(8)
C(4), D(6), E(8), F(8)      E(8), F(8), D(6), C(4)
E(8), F(8), G(8)            E(8), F(8)
Then the database is transformed for the mining phase. The counting phase gives us the TWU of each item (point to the table). (press) We prune the low TWU items based on the TWU downward closure property; in this example, G is a low TWU item. Last, (press) the items of each transaction are sorted in TWU-increasing order.
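The two transformation steps can be sketched as one small Python function. The tie-break between items with equal TWU (alphabetical order here) is an assumption chosen because it reproduces the order shown on the slide; the slides do not state the rule. The names `ITEM_TWU` and `transform` are illustrative.

```python
# TWU of each item as produced by the counting phase.
ITEM_TWU = {"A": 50, "B": 50, "C": 76, "D": 54, "E": 50, "F": 50, "G": 24}
THETA = 30

ORIGINAL_DB = [
    {"A": 8, "B": 6, "C": 8, "D": 6},
    {"A": 8, "B": 6, "C": 8},
    {"C": 4, "D": 6, "E": 8, "F": 8},
    {"E": 8, "F": 8, "G": 8},
]


def transform(transaction, item_twu, theta):
    """Drop low TWU items, then sort the rest in TWU-increasing order.

    Ties are broken alphabetically (an assumption, see the text above).
    """
    kept = [(item, u) for item, u in transaction.items() if item_twu[item] >= theta]
    return sorted(kept, key=lambda pair: (item_twu[pair[0]], pair[0]))


transformed_db = [transform(t, ITEM_TWU, THETA) for t in ORIGINAL_DB]
for u_transaction in transformed_db:
    print(u_transaction)
```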

16 Mining Phase
All combinations of items in a transactional database
  Pattern explosion problem
  Pruning strategy: discarding local unpromising items in the MapReduce framework (DLU-MR)
The utility of an itemset in the transactional database
  With the help of conditional u-transactions, the utility of an itemset is calculated easily
In the mining phase, if we solve these two tasks efficiently, the HUIs in the transactional database can be found easily.

17 Combination of Items
Mapper of the mining phase
  A transaction becomes conditional u-transactions
  Key contains an itemset
  Value contains the utility of the key itemset and the items that can still be combined with it into new itemsets
  Combination of items is done in parallel
Example, first iteration: the mapper receives the u-transaction A(8), B(6), C(8) and emits the conditional u-transactions
  <{A}, 8, {B(6), C(8)}>
  <{B}, 6, {C(8)}>
  <{C}, 8, {ϕ}>
Example, next iteration: the mapper receives <{A}, 8, {B(6), C(8)}> and emits
  <{A, B}, 14, {C(8)}>
  <{A, C}, 16, {ϕ}>
Let's look at the first task in the mining phase. (press) A transaction that contains the utility of each of its items is called a u-transaction. First we need to define the key-value pairs of the map phase: we split the u-transaction into several key-value pairs. Let's look at an example directly. The u-transaction is sent to a mapper (point to the mapper), and A, B, and C are separated into three key-value pairs. The key of the first pair is A, and its value is the utility of A in the transaction together with the items that follow A in the transaction. Likewise, the key of the second pair is B, and its value is the utility of B and the rest of the u-transaction. The key of the last pair is C; no item follows it, so its value contains only the utility of C and an empty set. Length-2 itemsets are combined as follows: B and C are each added to the key, the utility of the key is updated, and the rest is the remaining part of the transaction. This solves the task of combining items, and it can be done in parallel. However, it may run into the pattern explosion problem, so an efficient pruning strategy is proposed later.
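Both mapper examples on this slide follow the same rule, which can be sketched as one Python function: each remaining item is appended to the key, its utility is added to the key utility, and only the items after it are kept. The name `expand` and the tuple layout `(key itemset, key utility, remaining items)` are illustrative, not the authors' data format.

```python
def expand(prefix, prefix_utility, rest):
    """Emit one conditional u-transaction per item in `rest`.

    `rest` is an ordered list of (item, utility) pairs; the new key is the
    prefix plus the chosen item, and only the items after it are kept.
    """
    emitted = []
    for k, (item, utility) in enumerate(rest):
        emitted.append((prefix + (item,), prefix_utility + utility, rest[k + 1:]))
    return emitted


u_transaction = [("A", 8), ("B", 6), ("C", 8)]

# First iteration: the key starts empty, so the keys are the single items.
first = expand((), 0, u_transaction)
for record in first:
    print(record)

# Next iteration: extend the conditional u-transaction of key {A}.
second = expand(*first[0])
for record in second:
    print(record)
```

Because every item is only combined with the items that follow it in the sorted transaction, each itemset is generated once per transaction rather than once per ordering.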

18 Utility of An Itemset
Reducer in the mining phase
  The utilities of an itemset in each transaction go to the same reducer
  The utility of the key itemset is calculated by adding up its utilities in each transaction
  Each itemset goes to a different reducer, so the calculation of utility is done in parallel
Reducer input for key A:
  <{A}, 8, {B(6), D(6), C(8)}>
  <{A}, 8, {B(6), C(8)}>
Before the pruning strategy, let's look at the second task: calculating the utility of an itemset with the help of conditional u-transactions. (press) All the utilities of an itemset in each transaction go to the same reducer. In the example, all the utilities of A in each transaction are in the reducer, so we can calculate the utility of A by simply adding them up: 8 + 8 = 16. So the utility of A in the transactional database is 16. Thanks to this structure, the calculation of utility is also done in parallel.
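The reducer step can be sketched in Python by grouping the mapper output by key and summing the key utilities. The grouping function stands in for the MapReduce shuffle; the names `map_mining`, `shuffle`, and `reduce_utility` are illustrative, and the input is the transformed database from the database transformation slide.

```python
from collections import defaultdict

# Transformed database: low TWU items pruned, items in TWU-increasing order.
TRANSFORMED_DB = [
    [("A", 8), ("B", 6), ("D", 6), ("C", 8)],
    [("A", 8), ("B", 6), ("C", 8)],
    [("E", 8), ("F", 8), ("D", 6), ("C", 4)],
    [("E", 8), ("F", 8)],
]


def map_mining(u_transaction):
    """Emit <{item}, utility, following items> for each item of a u-transaction."""
    return [
        ((item,), utility, u_transaction[k + 1:])
        for k, (item, utility) in enumerate(u_transaction)
    ]


def shuffle(records):
    """Group the emitted records by key itemset, as the MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, utility, rest in records:
        groups[key].append((utility, rest))
    return groups


def reduce_utility(values):
    """Utility of the key itemset: the sum of its utilities over all records."""
    return sum(utility for utility, _rest in values)


groups = shuffle(record for t in TRANSFORMED_DB for record in map_mining(t))
utilities = {key: reduce_utility(values) for key, values in groups.items()}
print(utilities[("A",)])  # 16
```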

19 Pruning Strategy
Discarding local unpromising items in the MapReduce framework (DLU-MR)
  The strategy states that any local superset of a low local TWU itemset is a low utility itemset
  The pruning task is done in parallel
Reducer input for key A:
  <{A}, 8, {B(6), D(6), C(8)}>
  <{A}, 8, {B(6), C(8)}>
Item  Local TWU
B     50
C     50
D     28
Prune local unpromising item: D (local TWU 28 < θ = 30)
In the mining phase of PHUI-Growth, the pruning strategy DLU-MR is proposed for discarding local unpromising items in the framework. Take the conditional u-transactions of key A as an example. First, the local TWU of each item in the reducer is calculated. Then (press) the items with low local TWU are pruned from the conditional u-transactions of the key. The local TWU of the remaining items decreases each time an item is pruned from the conditional u-transactions. In this way, pattern explosion is handled efficiently and in parallel.
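The local pruning for key A can be sketched in Python as follows. The sketch assumes that the utility of a conditional u-transaction is the key utility plus the utilities of its remaining items, which reproduces the local TWU values 50 and 28 shown on the slide; the function names are illustrative and only a single pruning pass is shown.

```python
from collections import defaultdict

THETA = 30

# Conditional u-transactions of key {A}: (utility of A, remaining items).
CONDITIONAL_A = [
    (8, [("B", 6), ("D", 6), ("C", 8)]),
    (8, [("B", 6), ("C", 8)]),
]


def local_twu(values):
    """Local TWU of each remaining item within one reducer's records.

    Assumption: a record's utility is the key utility plus the utilities
    of its remaining items.
    """
    twu = defaultdict(int)
    for key_utility, rest in values:
        record_utility = key_utility + sum(u for _item, u in rest)
        for item, _u in rest:
            twu[item] += record_utility
    return dict(twu)


def prune_local_unpromising(values, theta):
    """Remove every item whose local TWU is below the threshold."""
    twu = local_twu(values)
    return [
        (key_utility, [(item, u) for item, u in rest if twu[item] >= theta])
        for key_utility, rest in values
    ]


before = local_twu(CONDITIONAL_A)
pruned = prune_local_unpromising(CONDITIONAL_A, THETA)
after = local_twu(pruned)
print(before)  # {'B': 50, 'D': 28, 'C': 50}
print(after)   # {'B': 44, 'C': 44}
```

After D is removed, the first record shrinks from 28 to 22, so the local TWU of B and C drops from 50 to 44, which illustrates the remark that the values decrease as items are pruned.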

20 Experiment
Next, we evaluate the performance of our method.

21 Experiment Settings
Environment settings
  5-node Hadoop cluster
  Each node: 2.6 GHz CPU and 4 GB memory
Dataset
Dataset              # Trans.    # Items  Average Trans. Length  Maximum Trans. Length
Retail               88,162      16,470   10                     76
Chainstore           1,112,949   46,086   7                      170
T10I4N10K|D|2,000K   2,000,000   10,000   10                     33
Chainstore x 5       5,564,745
All experiments were conducted on a five-node Hadoop cluster. Each node is equipped with a 2.60 GHz CPU and 4 GB of memory. The Retail dataset was obtained from the FIMI Repository. Chainstore is a real-life dataset. The synthetic dataset T10I4N10K|D|2,000K was generated with the IBM data generator. Chainstore x 5, the Chainstore dataset enlarged five times, is used to check the scalability of our method.

22 Performance on Small Dataset
Compared algorithms
  HUI-Miner [M. Liu et al., CIKM, 2012]
  PHUI-Growth(Baseline)
  PHUI-Growth(DLU-MR)
[Figure panels: Execution Time; Number of Candidates]
In this section, we compare the performance of PHUI-Growth with HUI-Miner [7], a state-of-the-art non-parallel HUI mining algorithm. To evaluate the effectiveness of the DLU-MR strategy, we prepared two versions of PHUI-Growth, called PHUI-Growth(Baseline) and PHUI-Growth(DLU-MR). First, let's look at the figure of execution time. When the minimum utility is higher than 0.02, we do spend more time than HUI-Miner. That is because, when the data is small, the communication overhead dominates the execution time. However, when the minimum utility is 0.01, HUI-Miner takes about 4429 seconds, whereas our method takes 556 seconds. What about the number of candidates? When the threshold decreases, the number of HUIs increases dramatically and HUI-Miner needs to produce a large number of intermediate itemsets. However, the number of candidates produced by PHUI-Growth(DLU-MR) is up to two orders of magnitude smaller than that produced by HUI-Miner.

23 Performance on Large Datasets
PHUI-Growth(Baseline) and PHUI-Growth(DLU-MR) outperform HUI-Miner significantly
Mining HUIs in parallel greatly improves the performance
[Figure panels: Chainstore; T10I4N10K|D|2,000K; Chainstore x 5]
Regarding the execution time on large datasets, the results show that PHUI-Growth(Baseline) and PHUI-Growth(DLU-MR) outperform HUI-Miner significantly. They perform so well because they effectively use the nodes of a cluster to mine HUIs in parallel across multiple machines, while HUI-Miner runs on a single machine without parallelism.

24 Conclusion

25 Conclusion
A new parallel framework, PHUI-Growth, for mining high utility itemsets in big data
It discovers HUIs in parallel from data distributed across multiple computers
DLU-MR is proposed to prune the search space in parallel and greatly improve the performance of mining HUIs
Empirical evaluations show that PHUI-Growth has good scalability on large datasets
We propose a new framework, PHUI-Growth, for mining high utility itemsets in big data. The proposed algorithm efficiently mines high utility itemsets in parallel from data distributed across multiple commodity computers. A novel strategy called DLU-MR is proposed to effectively prune the search space and greatly improve the performance of mining HUIs. Empirical evaluations on different types of real and synthetic datasets show that PHUI-Growth has good scalability on large datasets and outperforms the state-of-the-art algorithms.

26 Q & A

