DATA MINING APPLICATIONS. Margaret H. Dunham, Southern Methodist University, Dallas, Texas 75275. SEDE'07, 7/10/07.


2 7/10/07 - SEDE'07 1 DATA MINING APPLICATIONS
Margaret H. Dunham
Southern Methodist University, Dallas, Texas 75275
mhd@engr.smu.edu
This material is based in part upon work supported by the National Science Foundation under Grant No. 9820841.
Some slides used by permission from Dr Eamonn Keogh, University of California Riverside; eamonn@cs.ucr.edu

3 The 2000 ozone hole over the Antarctic, seen by EPTOMS. http://jwocky.gsfc.nasa.gov/multi/multi.html#hole

4 OBJECTIVE: Explore some of the applications of data mining techniques.

5 Data Mining Applications Outline
- Introduction – Data Mining Overview
  - Classification (Prediction, Forecasting)
  - Clustering
  - Association Rules (Link Analysis)
- Applications
  - Fraud Detection & Illegal Activities
  - Facial Recognition
  - Cheating & Plagiarism
  - Bioinformatics
- Conclusions

6 Data Mining Overview
- Finding hidden information in a database
- Fit data to a model
- You must know what you are looking for
- You must know how to look for it

7 “If it looks like a duck, walks like a duck, and quacks like a duck, then it’s a duck.”
Description, Behavior, Associations
Classification (Profiling), Clustering (Similarity), Link Analysis
“If it looks like a terrorist, walks like a terrorist, and quacks like a terrorist, then it’s a terrorist.”

8 Classification Applications
- Teachers classify students’ grades as A, B, C, D, or F.
- Letter Recognition
- Handwriting Recognition
- Phishing: http://computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=cybercrime_hacking&articleId=9002996&taxonomyId=82
- Pluto: http://www.npr.org/templates/story/story.php?storyId=5705254

9 Classification Example
Given a collection of annotated data (in this case five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is.
[Figure: labeled images of Grasshoppers and Katydids, plus one unlabeled insect]
(c) Eamonn Keogh, eamonn@cs.ucr.edu

10 [Figure: scatter plot of Antenna Length against Abdomen Length, both axes running from 1 to 10, with Grasshoppers and Katydids as the two groups]
(c) Eamonn Keogh, eamonn@cs.ucr.edu
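
The insect example can be made concrete with a small sketch. The slides do not name a specific classifier; nearest neighbor is swapped in here only because it is easy to read. The measurements, the `TRAINING` list, and the function name `classify_nearest` are all hypothetical and are not taken from the slide's figure.

```python
import math

# Hypothetical (abdomen_length, antenna_length) pairs with labels.
# These numbers are invented for illustration, not read off the slide.
TRAINING = [
    ((2.0, 1.5), "Grasshopper"),
    ((3.0, 2.0), "Grasshopper"),
    ((4.0, 1.0), "Grasshopper"),
    ((5.0, 8.0), "Katydid"),
    ((6.0, 9.0), "Katydid"),
    ((7.0, 7.5), "Katydid"),
]


def classify_nearest(point, training=TRAINING):
    """Return the label of the training example closest to `point`."""
    nearest = min(training, key=lambda item: math.dist(item[0], point))
    return nearest[1]


if __name__ == "__main__":
    # An unlabeled insect with a short antenna lands in the grasshopper group.
    print(classify_nearest((3.5, 1.8)))
```

With labeled examples available, the unlabeled insect simply inherits the label of its closest neighbor in the two-feature space.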

11 Clustering Applications
- Targeted Marketing
- Determining Gene Functionality
- Identifying Species
- Clustering vs. Classification
  - No prior knowledge
  - Number of clusters
  - Meaning of clusters
- Unsupervised learning

12 http://149.170.199.144/multivar/ca.htm

13 What is Similarity? (c) Eamonn Keogh, eamonn@cs.ucr.edu

14 Association Rules Applications
- People who buy diapers also buy beer
- If gene A is highly expressed in this disease then gene B is also expressed
- Relationships between people
- www.amazon.com
- Book Stores
- Department Stores
- Advertising
- Product Placement
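
A rule such as "people who buy diapers also buy beer" is usually judged by its support and confidence. The sketch below computes both on a toy basket list. The `BASKETS` data and the function names are hypothetical and chosen only for illustration; the slides themselves give no data or formulas.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical market baskets, invented for this example.
BASKETS = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

if __name__ == "__main__":
    print(support(BASKETS, {"diapers", "beer"}))       # share of baskets with both
    print(confidence(BASKETS, {"diapers"}, {"beer"}))  # diapers -> beer
```

Support says how often the items appear together at all; confidence says how often the consequent shows up once the antecedent is present.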

15 Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham, Prentice Hall, 2003. DILBERT reprinted by permission of United Feature Syndicate, Inc.


18 Fraud Detection
- Identify fraudulent behavior
- Used extensively in the financial, law enforcement, health care, and other sectors
- http://www.aaai.org/AITopics/html/fraud.html
- SPSS: http://www.spss.com/predictiveclaims/fraud_detection.htm
- Neural Technologies: http://www.neuralt.com/fraud_management.html

19 Law Enforcement
- Identify suspect behavior and relationships
- I2 Inc.
  - Investigative analytic/visualization software
  - http://www.i2inc.com
- Social Network Analysis – Analyze patterns of relationships
- Relationships: personal, religious, operational, etc.

20 Jialun Qin, Jennifer J. Xu, Daning Hu, Marc Sageman, and Hsinchun Chen, “Analyzing Terrorist Networks: A Case Study of the Global Salafi Jihad Network,” Lecture Notes in Computer Science, Springer-Verlag GmbH, Volume 3495 / 2005, p. 287.


22 How Stuff Works, “Facial Recognition,” http://computer.howstuffworks.com/facial-recognition1.htm

23 Facial Recognition
- Based upon features in face
- Convert face to a feature vector
- Less invasive than other biometric techniques
- http://www.face-rec.org
- http://computer.howstuffworks.com/facial-recognition.htm
- SIMS: http://www.casinoincidentreporting.com/Products.aspx

24 (c) Eamonn Keogh, eamonn@cs.ucr.edu


26 Cheating on Multiple Choice Tests
- Similarity between tests based on number of common wrong answers (George O. Wesolowsky, “Detecting Excessive Similarity in Answers on Multiple Choice Exams,” Journal of Applied Statistics, vol 27, no 7, 2000, pp 909-923).
- The number of common correct answers is often ignored.
- H-H Index (D.N. Harpp, J.J. Hogan, and J.S. Jennings, 1996, “Crime in the Classroom – Part II, an update,” Journal of Chemical Education, vol 73, no 4, pp 349-351):
  H-H = (Number of exact answers in common) / (Number of different answers)
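
The two similarity measures on this slide can be sketched in a few lines. This assumes the H-H index is the count of identical answers divided by the count of differing answers, as the slide words it, and that each test is a string with one character per question. The function names and sample answer strings are hypothetical.

```python
def hh_index(answers_a, answers_b):
    """Identical answers divided by differing answers, per the slide's wording."""
    if len(answers_a) != len(answers_b):
        raise ValueError("both tests must cover the same questions")
    same = sum(a == b for a, b in zip(answers_a, answers_b))
    different = len(answers_a) - same
    # Two identical answer sheets have no differences; report infinity.
    return float("inf") if different == 0 else same / different


def common_wrong_answers(answers_a, answers_b, key):
    """Count questions where both students gave the same wrong answer."""
    return sum(
        a == b and a != k for a, b, k in zip(answers_a, answers_b, key)
    )


if __name__ == "__main__":
    key = "ABDDA"        # hypothetical answer key
    student_1 = "ABCDA"  # hypothetical answer sheets
    student_2 = "ABCDB"
    print(hh_index(student_1, student_2))
    print(common_wrong_answers(student_1, student_2, key))
```

A high ratio by itself proves nothing, since two strong students also agree on most questions; that is why the shared wrong answers are counted separately.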

27 Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

28 No/Little Cheating. Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.

29 Rampant Cheating. Joshua Benton and Holly K. Hacker, “At Charters, Cheating’s off the Charts,” Dallas Morning News, June 4, 2007.


31 DNA
- Basic building blocks of organisms
- Located in nucleus of cells
- Composed of 4 nucleotides
- Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63

32 Central Dogma: DNA -> RNA -> Protein
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
RNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
www.bioalgorithms.info; chapter 6; Gene Prediction
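
The slide's own sequence can be traced through both steps in code. This is a simplified sketch: transcription is modeled as replacing T with U on the coding strand, and `CODON_TABLE` holds only the seven codons of the standard genetic code that this sequence uses. The function names are illustrative.

```python
# Partial codon table: only the codons that occur in the slide's sequence.
CODON_TABLE = {
    "CCU": "P", "CCA": "P",  # proline
    "GAG": "E", "GAA": "E",  # glutamic acid
    "ACU": "T",              # threonine
    "AUU": "I",              # isoleucine
    "GAU": "D",              # aspartic acid
}


def transcribe(dna):
    """Simplified transcription: copy the coding strand, replacing T with U."""
    return dna.replace("T", "U")


def translate(rna):
    """Translate complete codons; raises KeyError for codons not in the table."""
    usable = len(rna) - len(rna) % 3
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[c] for c in codons)


if __name__ == "__main__":
    rna = transcribe("CCTGAGCCAACTATTGATGAA")
    print(rna)             # CCUGAGCCAACUAUUGAUGAA
    print(translate(rna))  # PEPTIDE
```

The example sequence was evidently chosen so that its one-letter amino acid codes spell the word PEPTIDE.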

33 miRNA
- Short (20-25nt) sequence of noncoding RNA
- Known since 1993 but significance not widely appreciated until 2001
- Impact / Prevent translation of mRNA
- Generally reduce protein levels without impacting mRNA levels (animal cells)
- Functions
  - Causes some cancers
  - Guide embryo development
  - Regulate cell differentiation
  - Associated with HIV
  - …

34 Questions
- If each cell in an organism contains the same DNA –
  - How does each cell behave differently?
  - Why do cells behave differently during childhood?
  - What causes some cells to act differently – such as during disease?
- DNA contains many genes, but only a few are being transcribed – why?
- One answer - miRNA

35 http://www.time.com/time/magazine/article/0,9171,1541283,00.html

36 Human Genome
- Scientists originally thought there would be about 100,000 genes
- Appear to be about 20,000
- WHY?
- Almost identical to that of Chimps. What makes the difference?
- Visualization from UCR dnaQT.mov
- Answers appear to lie in the noncoding regions of the DNA (formerly thought to be junk)

37 RNAi – Nobel Prize in Medicine 2006
- Double stranded RNA
- Short Interfering RNA (~20-25 nt)
- RNA-Induced Silencing Complex
- Binds to mRNA
- Cuts RNA
siRNA may be artificially added to cell!
Image source: http://nobelprize.org/nobel_prizes/medicine/laureates/2006/adv.html, Advanced Information, Image 3

38 Computer Science & Bioinformatics
- Algorithms
- Data Structures
- Improving efficiency
- Data Mining
- Biologists don’t usually understand or even appreciate what Computer Science can do
- Issues:
  - Scalability
  - Fuzzy
- We will look at:
  - Microarray Clustering
  - TCGR

39 Affymetrix GeneChip® Array http://www.affymetrix.com/corporate/outreach/lesson_plan/educator_resources.affx

40 Microarray Data Analysis
- Each probe location associated with gene
- Measure the amount of mRNA
- Color indicates degree of gene expression
- Compare different samples (normal/disease)
- Track same sample over time
- Questions
  - Which genes are related to this disease?
  - Which genes behave in a similar manner?
  - What is the function of a gene?
- Clustering
  - Hierarchical
  - K-means
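
As a rough illustration of the K-means option, the sketch below groups genes whose expression profiles are similar. It is a minimal version of the standard algorithm, not the analysis used in the talk. The `PROFILES` values, the choice of starting centroids, and the function name `kmeans` are hypothetical.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic K-means: assign points to the nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(old)  # keep a centroid that lost all members
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return assignment, centroids


# Hypothetical expression levels of four genes across three samples.
PROFILES = [
    (1.0, 1.2, 0.9),
    (0.8, 1.0, 1.1),
    (5.0, 5.2, 4.9),
    (5.1, 4.8, 5.0),
]

if __name__ == "__main__":
    labels, centers = kmeans(PROFILES, centroids=[PROFILES[0], PROFILES[2]])
    print(labels)
```

Genes that end up in the same cluster behave in a similar manner across the samples, which is one way to approach the second question on the slide.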

41 Microarray Data - Clustering. "Gene expression profiling identifies clinically relevant subtypes of prostate cancer," Proc. Natl. Acad. Sci. USA, Vol. 101, Issue 3, 811-816, January 20, 2004.

42 miRNA Research Issues
- Predict / Find miRNA in genomic sequence
- Predict miRNA targets
- Identify miRNA functions

43 Temporal CGR (TCGR)
- 2D Array
  - Each Row represents counts for a particular window in sequence
    - First row – first window
    - Last row – last window
    - We start successive windows at the next character location
  - Each Column represents the counts for the associated pattern in that window
    - Initially we have assumed order of patterns is alphabetic
  - Size of TCGR depends on sequence length and subpattern length
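
The array described above can be sketched as follows. This is one reading of the slide, not the author's implementation: it assumes windows lie fully inside the sequence, that overlapping pattern occurrences are each counted, and that the alphabet is A, C, G, U. The function name `tcgr` and its parameters are hypothetical.

```python
from itertools import product


def tcgr(sequence, window, pattern_len, alphabet="ACGU"):
    """Return (patterns, rows): one row of pattern counts per sliding window."""
    # Columns: every pattern of the given length, in alphabetic order.
    patterns = ["".join(p) for p in product(sorted(alphabet), repeat=pattern_len)]
    column = {p: i for i, p in enumerate(patterns)}
    rows = []
    # Successive windows start at the next character location.
    for start in range(len(sequence) - window + 1):
        win = sequence[start:start + window]
        counts = [0] * len(patterns)
        for i in range(len(win) - pattern_len + 1):
            sub = win[i:i + pattern_len]
            if sub in column:  # skip characters outside the assumed alphabet
                counts[column[sub]] += 1
        rows.append(counts)
    return patterns, rows


if __name__ == "__main__":
    patterns, rows = tcgr("ACGUACG", window=5, pattern_len=3)
    print(len(rows), "rows x", len(patterns), "columns")
```

Under these assumptions the array has (sequence length - window + 1) rows and 4^pattern_len columns, which matches the slide's point that its size depends on sequence length and subpattern length.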

44 TCGR Example (cont’d): TCGRs for Sub-patterns of length 1, 2, and 3

45 TCGR – Mature miRNA (Window=5; Pattern=3)
[Figure panels: All Mature, Mus Musculus, Homo Sapiens, C Elegans; pattern labels: ACG, CGC, GCG, UCG]

46 TCGRs for Xue Training Data
[Figure panels: POSITIVE, NEGATIVE]
C. Xue, F. Li, T. He, G. Liu, Y. Li, and X. Zhang, “Classification of Real and Pseudo MicroRNA Precursors using Local Structure-Sequence Features and Support Vector Machine,” BMC Bioinformatics, vol 6, no 310.

47 TCGRs for Xue Test Data
[Figure panels: POSITIVE, NEGATIVE]


49 Conclusions
- Not magic
- Doesn’t work for all applications
- Stock Market Prediction
- Issues
  - Privacy
  - Data
- Here are some infamous examples of failed data mining applications


51 Dallas Morning News, October 7, 2005

52 http://ieeexplore.ieee.org/iel5/6/32236/01502526.pdf?tp=&arnumber=1502526&isnumber=32236

53 BIG BROTHER?
- Total Information Awareness
  - http://infowar.net/tia/www.darpa.mil/iao/index.htm
  - http://www.govtech.net/magazine/story.php?id=45918
  - http://en.wikipedia.org/wiki/Information_Awareness_Office
- Terror Watch List
  - http://www.businessweek.com/technology/content/may2005/tc20050511_8047_tc_210.htm
  - http://www.theregister.co.uk/2004/08/19/senator_on_terror_watch/
  - http://blogs.abcnews.com/theblotter/2007/06/fbi_terror_watc.html
  - http://www.thedenverchannel.com/news/9559707/detail.html
- CAPPS
  - http://www.theregister.co.uk/2004/04/26/airport_security_failures/
  - http://www.heritage.org/Research/HomelandDefense/BG1683.cfm
  - http://www.theregister.co.uk/2004/07/16/homeland_capps_scrapped/
  - http://en.wikipedia.org/wiki/CAPPS
