Data Mining Research at SMU
Margaret H. Dunham, DBGroup: Yu Meng, Jie Huang, Lin Lu, Donya Quick, Michael Pierce, CSE Department (07/03/06, Tunisia)


1 Data Mining Research at SMU
Margaret H. Dunham
DBGroup: Yu Meng, Jie Huang, Lin Lu, Donya Quick, Michael Pierce
CSE Department, Southern Methodist University, Dallas, Texas 75275
mhd@engr.smu.edu
07/03/06 - Tunisia

2 Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003.
DILBERT reprinted by permission of United Feature Syndicate, Inc.

3 Outline
- What is Data Mining?
- EMM
  - Spatio-temporal modeling
  - Rare Event Detection
- Bioinformatics
  - TCGR: DNA/RNA visualization
  - miRNA prediction
- Web Usage Mining

4 Data Mining Definition
- Finding hidden information in a database
- Fit data to a model
- Similar terms
  - Exploratory data analysis
  - Data driven discovery
  - Deductive learning

5 Query Examples
- Database
  - Find all customers who have purchased milk.
  - Find all credit applicants with last name of Smith.
  - Identify customers who have purchased more than $10,000 in the last month.
- Data Mining
  - Find all items which are frequently purchased with milk. (association rules)
  - Find all credit applicants who are poor credit risks. (classification)
  - Identify customers with similar buying habits. (clustering)
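A minimal sketch of the contrast between the first database query and the first data mining query on this slide. The transactions, the function names, and the 0.5 confidence cutoff are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

# Hypothetical toy transactions; the item names are illustrative only.
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"milk", "bread", "cereal"},
]

def customers_who_bought(item, transactions):
    """Database-style query: indexes of the transactions that contain the item."""
    return [i for i, t in enumerate(transactions) if item in t]

def frequently_bought_with(item, transactions, min_confidence=0.5):
    """Mining-style query: items whose confidence given `item` meets the cutoff."""
    with_item = [t for t in transactions if item in t]
    counts = Counter(other for t in with_item for other in t if other != item)
    return {other: n / len(with_item)
            for other, n in counts.items()
            if n / len(with_item) >= min_confidence}

milk_buyers = customers_who_bought("milk", transactions)
milk_partners = frequently_bought_with("milk", transactions)
```

The first function only retrieves stored facts. The second derives a pattern (a one-antecedent association rule) that is not stored anywhere in the data.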

6 Outline
- What is Data Mining?
- EMM
  - Spatio-temporal modeling
  - Rare Event Detection
- Bioinformatics
  - TCGR: DNA/RNA visualization
  - miRNA prediction
- Web Usage Mining

7 Spatiotemporal Environment
- Events arriving in a stream
- At any time, t, we can view the state of the problem as represented by a vector of n numeric values: $V_t = \langle S_{1t}, S_{2t}, \ldots, S_{nt} \rangle$

        V_1    V_2    ...   V_q
  S_1   S_11   S_12   ...   S_1q
  S_2   S_21   S_22   ...   S_2q
  ...   ...    ...    ...   ...
  S_n   S_n1   S_n2   ...   S_nq
        Time ->

8 Technique
- Spatiotemporal modeling technique based on Markov models (MM).
- However:
  - Size of MM depends on size of dataset.
  - The required structure of the MM is not known at model construction time.
  - As the real world being modeled by the MM changes, so should the structure of the MM.

9 Extensible Markov Model (EMM)
- Time Varying Discrete First Order Markov Model
- Nodes are clusters of real world states.
- Learning continues during application phase.
- Learning:
  - Transition probabilities between nodes
  - Node labels (centroid/medoid of cluster)
  - Nodes are added and removed as data arrives

10 EMM Learning
Input vectors, in arrival order:
  <18,10,3,3,1,0,0>
  <17,10,2,3,1,0,0>
  <16,9,2,3,1,0,0>
  <14,8,2,3,1,0,0>
  <14,8,2,3,0,0,0>
  <18,10,3,3,1,1,0>
[Figure: successive EMM graphs over nodes N1, N2, N3, with edges labeled by transition probabilities (1/1, 2/2, 1/2, 1/3, 2/3)]
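A rough sketch of the kind of incremental learning described on the EMM slides, using the six input vectors above. Several choices here are assumptions rather than the authors' method: nodes are formed by nearest-centroid matching under a Euclidean distance threshold, the threshold value 1.5 is arbitrary, and node removal is omitted. The class name `SimpleEMM` is hypothetical, and the resulting graph is not claimed to match the slide's figure.

```python
import math
from collections import defaultdict

class SimpleEMM:
    """Toy incremental first-order Markov model over clusters of state vectors."""

    def __init__(self, threshold):
        self.threshold = threshold      # assumed: max distance to join an existing node
        self.centroids = []             # node label = running mean of its members
        self.sizes = []                 # number of vectors absorbed by each node
        self.counts = defaultdict(int)  # (from_node, to_node) -> transition count
        self.current = None             # node matched by the previous vector

    def _nearest(self, v):
        best, best_d = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.dist(v, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def learn(self, v):
        """Assign v to a node (creating one if needed) and count the transition."""
        node, d = self._nearest(v)
        if node is None or d > self.threshold:
            self.centroids.append([float(x) for x in v])
            self.sizes.append(1)
            node = len(self.centroids) - 1
        else:
            n = self.sizes[node] + 1
            self.centroids[node] = [c + (x - c) / n
                                    for c, x in zip(self.centroids[node], v)]
            self.sizes[node] = n
        if self.current is not None:
            self.counts[(self.current, node)] += 1
        self.current = node
        return node

    def transition_prob(self, a, b):
        """Estimated probability of moving from node a to node b."""
        total = sum(n for (src, _), n in self.counts.items() if src == a)
        return self.counts.get((a, b), 0) / total if total else 0.0

vectors = [
    (18, 10, 3, 3, 1, 0, 0), (17, 10, 2, 3, 1, 0, 0), (16, 9, 2, 3, 1, 0, 0),
    (14, 8, 2, 3, 1, 0, 0), (14, 8, 2, 3, 0, 0, 0), (18, 10, 3, 3, 1, 1, 0),
]
emm = SimpleEMM(threshold=1.5)
nodes = [emm.learn(v) for v in vectors]
```

Because both the node set and the transition counts are updated on every call to `learn`, the model keeps adapting while it is being used, which is the property the EMM slide emphasizes.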

11 Growth of EMM: Servent Data

12 EMM Performance: Growth Rate, Minnesota Traffic Data

13 EMM Water Level Prediction: Ouse Data

14 Rare Event
- Rare, anomalous, surprising
- Out of the ordinary
- Not outlier detection
- Ex: Snow in upstate New York is not rare
  - Snow in upstate New York in June is rare
- Rare events may change over time
- Applications
  - Intrusion detection
  - Fraud
  - Flooding
  - Unusual automobile/network traffic

15 Rare Event in Cisco Data

16 Outline
- What is Data Mining?
- EMM
  - Spatio-temporal modeling
  - Rare Event Detection
- Bioinformatics
  - TCGR: DNA/RNA visualization
  - miRNA prediction
- Web Usage Mining

17 Chaos Game Representation (CGR)
- 2D technique to visually see the distribution of subpatterns
- Our technique is based on the following:
  - Generate totals for each subpattern
  - Scale totals to a [0,1] range. (Note scaling can be a problem)
  - Convert range to red/blue
    - 0-0.5: White to Blue
    - 0.5-1: Blue to Red
[Figure: CGR square with corners labeled A, C, G, U, repeated at successively finer subdivisions]
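The three steps on this slide (totals, scaling, color conversion) can be sketched as follows. The slide does not say how totals are scaled, so dividing by the largest count is an assumption, as are the function names and the linear RGB interpolation. Placement of each subpattern in the 2D CGR square is left out.

```python
from collections import Counter
from itertools import product

def subpattern_counts(seq, k, alphabet="ACGU"):
    """Total every overlapping length-k subpattern, including absent ones."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return {"".join(p): counts.get("".join(p), 0)
            for p in product(alphabet, repeat=k)}

def scale_to_unit(counts):
    """Assumed scaling: divide by the largest total so values fall in [0, 1]."""
    top = max(counts.values())
    return {p: (c / top if top else 0.0) for p, c in counts.items()}

def to_rgb(x):
    """Map [0, 0.5] from white to blue and [0.5, 1] from blue to red."""
    if x <= 0.5:
        t = x / 0.5
        level = round(255 * (1 - t))
        return (level, level, 255)
    t = (x - 0.5) / 0.5
    return (round(255 * t), 0, round(255 * (1 - t)))

counts = subpattern_counts("ACGUAC", 2)
colors = {p: to_rgb(v) for p, v in scale_to_unit(counts).items()}
```

Scaling by the maximum makes one dominant subpattern push every other cell toward white, which is one way the scaling problem noted on the slide can show up.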

18 CGR Example
Homo Sapiens: all mature miRNA
Patterns of length 3
[Figure labels: UUC, GUG]

19 Temporal CGR (TCGR)
- Temporal version of Frequency CGR
  - In our context, temporal means the starting location of a window
- 2D array
  - Each row represents counts for a particular window in the sequence
    - First row: first window
    - Last row: last window
    - We start successive windows at the next character location
  - Each column represents the counts for the associated pattern in that window
    - Initially we have assumed the order of patterns is alphabetic
  - Size of TCGR depends on sequence length and subpattern length
- As sequence lengths vary, we only examine complete windows
- We only count patterns completely contained in each window.

20 TCGR Example
Sequence: acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
[Figure: moving window over the sequence; pattern lengths 1, 2, 3]
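A small sketch of the TCGR array as described on the previous slide, applied to the example sequence. The function name `tcgr` and the window and pattern sizes used in the call are illustrative assumptions.

```python
from itertools import product

def tcgr(seq, window, k, alphabet="acgt"):
    """Return (patterns, rows): one row of pattern counts per complete window."""
    # Columns follow alphabetic pattern order, as the slides assume.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=k)]
    index = {p: j for j, p in enumerate(patterns)}
    rows = []
    # Successive windows start at the next character; only complete windows count.
    for start in range(len(seq) - window + 1):
        win = seq[start:start + window]
        row = [0] * len(patterns)
        # Only patterns completely contained in the window are counted.
        for i in range(window - k + 1):
            j = index.get(win[i:i + k])
            if j is not None:
                row[j] += 1
        rows.append(row)
    return patterns, rows

seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
patterns, rows = tcgr(seq, window=5, k=2)
```

The array has `len(seq) - window + 1` rows and `4 ** k` columns, which is why its size depends on both the sequence length and the subpattern length.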

21 TCGR: Mature miRNA (Window=5; Pattern=2)
[Figure panels: All Mature, Mus Musculus, Homo Sapiens, C Elegans]

22 Outline
- What is Data Mining?
- EMM
  - Spatio-temporal modeling
  - Rare Event Detection
- Bioinformatics
  - TCGR: DNA/RNA visualization
  - miRNA prediction
- Web Usage Mining

23 The BIG PICTURE
2003-10-0515:49:20050721435700000026210000000000 0265202652 000000000
2003-10-0516:40:49050832595900000872710001142380 0710707107 000000000
2003-10-0504:55:10050767799900000191300000670518 0000000000 000000000
2003-10-0509:43:10050781766100000603030000000000 0365700469 000000000
2003-10-0514:49:360508182420000007066200000000000811a39 0914207107 000000000
2003-10-0521:23:57050759031600000465050002794335 1199207107 000000000
2003-10-0511:30:16050730512600000465050000195747 1684600597corduroy+coats
CAN’T SEE THE FOREST FOR THE TREES

24 [Flow diagram labels]
- Web Server
- Web Log
- Preprocess Web Data: Cleanse, Sessionize, …, URL Abstraction
- Cluster Web Sessions
- Markov Model (Markov Model per Cluster)
- User Preferred Navigation Trail (user-defined beginning/ending Web pages; Normalized Probability)
- Significant Usage Pattern
- Interests…, Motivations…
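A hedged sketch of the last stages of this flow for a single cluster: estimate a first-order Markov model from that cluster's sessions, then list trails between user-defined beginning and ending pages whose normalized probability meets a threshold. The slide does not define the normalization, so using the geometric mean of the transition probabilities is an assumption. The function names, the `max_len` bound, and the toy sessions are likewise illustrative.

```python
from collections import defaultdict

def transition_probs(sessions):
    """Estimate first-order transition probabilities from page sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

def significant_trails(probs, start, end, threshold, max_len=10):
    """Trails from start to end whose normalized probability is >= threshold."""
    found = []

    def walk(path, prob):
        steps = len(path) - 1
        if path[-1] == end and steps > 0:
            norm = prob ** (1 / steps)  # assumed: geometric mean of the steps
            if norm >= threshold:
                found.append((path, norm))
            return
        if len(path) >= max_len:        # bound the search, since pages can repeat
            return
        for nxt, p in probs.get(path[-1], {}).items():
            walk(path + [nxt], prob * p)

    walk([start], 1.0)
    return sorted(found, key=lambda item: -item[1])

# Hypothetical sessions for one cluster; S and E mark session start and end.
sessions = [
    ["S", "C1", "C2", "E"],
    ["S", "C1", "C2", "E"],
    ["S", "C1", "E"],
    ["S", "P1", "E"],
]
probs = transition_probs(sessions)
trails = significant_trails(probs, "S", "E", threshold=0.6)
```

Normalizing by trail length keeps longer trails from being penalized simply for having more steps. The exhaustive walk is exponential in the worst case, so it only suits small abstracted page sets.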

25 Experimental Result (WebKDD’05)
- On average, purchase sessions are longer than sessions without a purchase:
  - review the information, compare the price, the quality, etc.
  - fill out the billing and shipping information to commit the purchase

26 Experimental Result (WebKDD’05): SUPs in non-purchase cluster

Cluster No.   No. of Sessions   Threshold   Average Session Length   No. of States
1             1746              0.3         9.6                      98
2             241               0.37        6.6                      38
3             13                0.3         3.0                      6

SUPs, cluster 1:
  1. S-C1-C1-C2-C3-C4-C5-C6-C7-E
  2. S-C1-C1-C2-C3-C4-C5-E
  3. S-C1-C1-C2-C3-E
  4. S-C1-C2-C3-C3-C4-C5-C6-C7-E
  5. S-C1-C2-C3-C4-C4-C5-C6-C7-E
  …
SUPs, cluster 2:
  1. S-P1-P2-P3-P3-E
  2. S-P1-P2-P3-P4-P4-P5-E
  3. S-P1-P2-P3-P4-P4-E
  4. S-P1-P2-P3-P4-P5-P4-E
  5. S-P1-P2-P3-P4-P5-P5-E
  …
SUPs, cluster 3:
  1. S-C1-P1-P2-E
  2. S-C1-P1-E
  3. S-I1-P1-P1-P2-E
  4. S-I1-P1-P1-E
  5. S-I1-P1-E
  …
Further trails shown on the slide:
  S-C1-C1-C2-C3-C4-C5-C5-I1-E
  S-C1-C1-I1-C1-C2-C3-C4-C5-E
  S-I1-C1-C2-C3-C4-C5-C6-C7-E

Annotations:
- Interested in gathering information of products in different categories.
- Not serious visitors (the average session length is 3).
- Interested in reviewing general pages (to gather general information).

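27 Experimental Result (WebKDD’05): SUPs in BNF notation

Non-purchase
  Cluster 1 (1746 sessions, average session length 9.6, 98 states)
    Threshold 0.3, beginning Web page S: S-{C}-E
    Threshold 0.25, beginning Web page P86806: P86806-{C}-E
  Cluster 2 (241 sessions, average session length 6.6, 38 states)
    Threshold 0.37, beginning Web page S: S-{P}-[C]-E
    Threshold 0.34, beginning Web page P86806: P86806-[I]-{P}-E
  Cluster 3 (13 sessions, average session length 3.0, 6 states)
    Threshold 0.3, beginning Web page S: S- -{P}-E
    Threshold 0.2, beginning Web page P86806: P86806-[{P}-[P86806]]-E

Purchase (cluster number, session count, average session length, and state count are fused in the transcript and reproduced as extracted)
  1185814.955
    Threshold 0.47, beginning Web page S: S-[C]-[I]-{P}-E
    Threshold 0.51, beginning Web page P86806: P86806-[I]-{P}-E
  213239.1100
    Threshold 0.457, beginning Web page S: S-[{{C}|{I}}]-{P}-E
    Threshold 0.434, beginning Web page P86806: P86806-[{C}]-{P}-E
  31031.647
    Threshold 0.52, beginning Web page S: S-{P}-[{I}]-[{P}]-{C}-E
    Threshold 0.43, beginning Web page P86806: P86806-[I]-[{P}]-{C}-E

- The average length of SUPs is longer in the purchase cluster than in the non-purchase cluster:
  - review the information,
  - compare among products,
  - and fill out the payment and shipping information
- SUPs in the purchase cluster have higher probability than those in the non-purchase cluster:
  - have purchase in mind vs.
  - random browsing behavior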
