WPI Center for Research in Exploratory Data and Information Analysis (CREDIA): KDDRG Research Projects. Prof. Carolina Ruiz, Department of Computer Science, Worcester Polytechnic Institute


1 WPI Center for Research in Exploratory Data and Information Analysis (CREDIA)
KDDRG Research Projects
Prof. Carolina Ruiz
Department of Computer Science, Worcester Polytechnic Institute

2 Some Current Analytical Data Mining Research Projects at WPI
Mining Complex Data: Set and Sequence Mining
–Systems performance data
–Sleep data
–Financial data
–Web data
Data Mining for Genetic Analysis
–Correlating genetic information with diseases
–Predicting gene expression patterns
Data Mining for Electronic Commerce
–Collaborative and content-based filtering using association rules and using neural networks

3 Analyzing Sleep Data (WPI, UMass Medical, BC)
(Source: blsc.com)
DATA SET
Clinical (sequential)
–Electro-encephalogram (EEG)
–Electro-oculogram (EOG)
–Electro-myogram (EMG)
–Probe measuring flow of oxygen in blood, etc.
Diagnostic (tabular)
–Questionnaire responses
–Patient's demographic info
–Patient's medical history
Purpose:
–Associations between sleep patterns and health/pathology
–Obtain patterns of different sleep stages (4 sleep + REM + Wake)
Potential Rules:
(A) Association Rules
(Sleep latency [...] Narcolepsy), confidence = 92%, support = 13%
(B) Classification Rules
(snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian), confidence = 70%, support = 8%
*AHI = Apnea-Hypopnea Index, **OSA = Obstructive Sleep Apnea

4 Input Data
Each instance: [Tabular | set | sequential]* attributes
Patient | attr1: illnesses | attr2: heart rate | attr3: age | attr4: oxygen | attr5: gender | [class]: Epworth
P1 | {depression, fatigue} | | 27 | | M | 5
P2 | {stroke, dementia, fatigue} | 97,72,67,80,… | 73 | 90,92,96,89,86,… | F | 23
P3 | {arthritis} | 102,99,87,96,… | 49 | 97,100,82,80,70,… | M | 14
… | … | … | … | … | … | …
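As an illustration of an instance that mixes tabular, set-valued, and sequential attributes, here is a minimal Python sketch. The class name `PatientInstance` and its field names are hypothetical and not taken from the project's software; the values are patient P2's row, with each sequence cut at the point where the slide elides it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class PatientInstance:
    """One instance with set, sequential, and plain tabular attributes (hypothetical layout)."""
    illnesses: FrozenSet[str]        # set-valued attribute
    heart_rate: Tuple[float, ...]    # sequential attribute
    age: int                         # tabular attribute
    oxygen: Tuple[float, ...]        # sequential attribute
    gender: str                      # tabular attribute
    epworth: Optional[int] = None    # class attribute; None if unknown


# Patient P2 from the slide. Only the visible prefix of each sequence is used;
# the slide elides the remaining readings with "…".
p2 = PatientInstance(
    illnesses=frozenset({"stroke", "dementia", "fatigue"}),
    heart_rate=(97, 72, 67, 80),
    age=73,
    oxygen=(90, 92, 96, 89, 86),
    gender="F",
    epworth=23,
)
```

Keeping the set and the sequences as immutable values (`frozenset`, `tuple`) is only a design choice for this sketch; it lets instances be hashed and compared directly.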

5 Analyzing Financial Data
Sequential data
–daily stock values
"Normal" (tabular/relational) data
–sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
Desired rules:
–If DELL's stock value increases & 1999 IBM's stock value decreases

6 Events: Financial Data
Basic events: 16 or so financial templates [Little&Rhodes78]
Difficult pattern matching: alignments and time warping
Templates shown:
–Rounding Top Reversal
–Descending Triangle Reversal
–Panic Reversal
–Head & Shoulders Reversal
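The slide names time warping as part of the matching problem but does not give an algorithm. As a rough sketch of the general idea, the following implements classic dynamic time warping (DTW) between a price series and a template. The function name `dtw_distance` and the template values are invented for illustration; this is not claimed to be the method used in the project.

```python
from math import inf


def dtw_distance(series, template):
    """Classic O(n*m) dynamic time warping distance with absolute-difference cost."""
    n, m = len(series), len(template)
    # cost[i][j] = best alignment cost of series[:i] against template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(series[i - 1] - template[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # series advances, template point is reused
                cost[i][j - 1],      # template advances, series point is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical "rounding top" shape (made-up values, not from [Little&Rhodes78]).
rounding_top = [1, 2, 3, 3, 2, 1]
# The same shape stretched in time; DTW can absorb the stretching.
stretched = [1, 1, 2, 2, 3, 3, 3, 2, 2, 1]
```

Because warping lets one point align with several, `dtw_distance(stretched, rounding_top)` is zero here even though the two lists differ in length, which is the property that makes the technique attractive for matching templates at different time scales.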

7 WPI Weka Tool for mining complex temporal/spatial associations

8 Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
SNP analysis
–discovering correlations between sequence variations and diseases
Gene expression
–discovering patterns that cause a gene to be expressed in a particular cell

9 Correlating Genetics with Diseases
Utilize data mining techniques with actual genetic data sampled from research
Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.

10 Genomic Data Resources
Columns: Patient, Gender, SMA Type (Severity), SNP Location, C212 Father / Mother, AG1-CA
Female, Severe, Y272C, 31 / /
Male, Mild, Y272C, / / 114
Wirth, B. et al. Journal of Human Molecular Genetics

11 Our System: CAGE
To predict gene expression based on DNA sequences.
[Figure: CAGE; Gene 1, Gene 2, Gene 3 marked On or Off in Muscle Cell, Neural Cell, Seam Cells]

12 Gene Expression Analysis
PROMOTER(S)
ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC
ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA
TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA
GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC
ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA
CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA
GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA
AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA
CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT
[Figure labels: promoters PR1 to PR9; motifs M1, M2, M3, M4, M5; Gene 1 to Gene 9; CELL TYPES: neural, muscle]

13 Gene Expression
Transcription of DNA into RNA
[Figure: MUSCLE CELL; PROMOTER REGION ..CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGA with MOTIFS M1, M2, M4; TRANSCRIPTIONAL PROTEINS TF 1, TF 2, TF 3; GENE]

14 [Figure: the promoter sequences PR1 to PR9 of slide 12, with motifs M1 to M5, Gene 1 to Gene 9, and cell types neural/muscle]
R1: M1, M4, M5 => Neural, supp = 22%, conf = 100% [Supp. instances: PR1, PR2]
R2: M2, M4, M5 => Neural, supp = 22%, conf = 100% [Supp. instances: PR1, PR8]
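To make the numbers in R1 and R2 concrete: with 9 promoters, 2 supporting instances give a support of 2/9, about 22%, and a confidence of 100% means every promoter containing the antecedent motifs is neural. The sketch below computes both measures. Only the supporting instances (PR1, PR2, PR8) and their Neural label come from the slide; the motif contents of every promoter and the labels of the other six are invented so that the toy data reproduces the reported figures. The function name is also hypothetical.

```python
def support_and_confidence(instances, antecedent, consequent):
    """Support and confidence of the rule `antecedent => consequent`.

    instances: list of (set_of_motifs, cell_type) pairs.
    """
    antecedent = set(antecedent)
    covered = [label for motifs, label in instances if antecedent <= motifs]
    hits = sum(1 for label in covered if label == consequent)
    support = hits / len(instances)
    confidence = hits / len(covered) if covered else 0.0
    return support, confidence


# Toy data: motif sets and most labels are ASSUMED, chosen to match R1 and R2.
promoters = [
    ({"M1", "M2", "M4", "M5"}, "neural"),  # PR1
    ({"M1", "M4", "M5"}, "neural"),        # PR2
    ({"M1", "M3"}, "muscle"),              # PR3
    ({"M2", "M3"}, "muscle"),              # PR4
    ({"M3", "M4"}, "neural"),              # PR5
    ({"M1", "M2"}, "muscle"),              # PR6
    ({"M3", "M5"}, "muscle"),              # PR7
    ({"M2", "M4", "M5"}, "neural"),        # PR8
    ({"M4", "M5"}, "muscle"),              # PR9
]

r1 = support_and_confidence(promoters, {"M1", "M4", "M5"}, "neural")
r2 = support_and_confidence(promoters, {"M2", "M4", "M5"}, "neural")
```

On this toy data both `r1` and `r2` come out as support 2/9 and confidence 1.0, matching the 22% and 100% on the slide.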

15 "Well-clustered" motifs
[Figure: motifs M1, M2, M4, M5 placed along promoter sequences]
Coefficient of variation of distances (cvd) between two motifs:
IR1 = {M1, M2, M5}
μ(M1,M2) = …, σ(M1,M2) = …, cvd(M1,M2) = 0.55
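For reference, a coefficient of variation is a standard deviation divided by a mean. Assuming d_1, …, d_n are the distances between motifs M_i and M_j in the n instances that contain both, and assuming the population form of the standard deviation (the slide does not say which form is used), the quantities can be written as:

```latex
\mu(M_i, M_j) = \frac{1}{n}\sum_{k=1}^{n} d_k, \qquad
\sigma(M_i, M_j) = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\bigl(d_k - \mu(M_i, M_j)\bigr)^2}, \qquad
\mathrm{cvd}(M_i, M_j) = \frac{\sigma(M_i, M_j)}{\mu(M_i, M_j)}
```

Under this reading, a small cvd means the two motifs sit at a nearly constant distance from each other across instances.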

16 Distance-based Association Rules
Given thresholds:
–min-support
–min-confidence
–max-cvd
Mine:
–all distance-based association rules
[Figure: sample distance-based assoc. rule]
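The slide states the inputs and output of the mining task but not the procedure. The following brute-force sketch shows one way the three thresholds could interact. It assumes that each instance records a position for every motif it contains, and that a rule passes the max-cvd test when every pair of motifs in its antecedent has cvd at or below the threshold over the supporting instances. Both assumptions, the function names, and the toy data are illustrative only and are not the authors' algorithm.

```python
from itertools import combinations
from statistics import mean, pstdev


def cvd(distances):
    """Coefficient of variation (population std / mean) of a list of distances."""
    mu = mean(distances)
    return pstdev(distances) / mu if mu else float("inf")


def mine_rules(instances, min_support, min_confidence, max_cvd, max_size=3):
    """Enumerate rules `motif set => label` that pass all three thresholds.

    instances: list of (positions, label); positions maps motif name -> position.
    Returns tuples (motifs, label, support, confidence, worst_pairwise_cvd).
    """
    motifs = sorted({m for positions, _ in instances for m in positions})
    labels = sorted({label for _, label in instances})
    rules = []
    for size in range(2, max_size + 1):
        for itemset in combinations(motifs, size):
            covered = [(p, l) for p, l in instances if all(m in p for m in itemset)]
            if not covered:
                continue
            for label in labels:
                hits = [p for p, l in covered if l == label]
                if not hits:
                    continue
                support = len(hits) / len(instances)
                confidence = len(hits) / len(covered)
                if support < min_support or confidence < min_confidence:
                    continue
                # Largest cvd over all motif pairs in the antecedent (assumed criterion).
                worst = max(
                    cvd([abs(p[a] - p[b]) for p in hits])
                    for a, b in combinations(itemset, 2)
                )
                if worst <= max_cvd:
                    rules.append((itemset, label, support, confidence, worst))
    return rules


# Hypothetical instances: motif positions and labels are made up.
toy = [
    ({"M1": 10, "M2": 30, "M5": 50}, "neural"),
    ({"M1": 12, "M2": 32, "M5": 52}, "neural"),
    ({"M1": 5, "M2": 80}, "muscle"),
    ({"M3": 7, "M4": 20}, "muscle"),
]
found = mine_rules(toy, min_support=0.5, min_confidence=1.0, max_cvd=0.1)
```

Exhaustive enumeration is exponential in the number of motifs, so this is only suitable for tiny examples; a practical miner would prune candidates, for instance in the Apriori style.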

17 Grad. & Undergrad. Students
Ali Benamara. Dharmesh Thakkar. Senthil K Palanisamy. Zachary Stoecker-Sylvia. Keith A. Pray. Jonathan Freyberger. Maged El-Sayed. Parameshvyas Laxminarayan. Aleksandar Icev. Wendy Kogel. Michael Sao Pedro. Christopher Shoemaker. Weiyang Lin. Jonathan Rudolph Eduardo Paredes Iavor N. Trifonov. Takeshi Kawato Cindy Leung and Sam Holmes. John Baird (BB), Jay Farmer, Rebecca Gougian (BB), Ken Monterio (BB), Paul Young. Zachary Stoecker-Sylvia. Kristin Blitsch (BB), Ben Lucas, Sarah Towey (BB) Wendy Kogel, Brooke LeClair, Christopher St. Yves. Brian Murphy, David Phu (CS/BB), Ian Pushee, Frederick Tan (CS/BB). Daniel Doyle, Jared Judecki, James Lund, Bryan Padovano (BB). Christopher Cole. Michael Ciman and John Gulbrandsen. Tara Halwes Christopher Martino. Matthew Berube. Anna Novikov. Amy Kao and Dana Rock.

