Collaborative Information Management: Advanced Information Processing in Bioinformatics Joost N. Kok LIACS - Leiden Institute of Advanced Computer Science.
Collaborative Information Management: Advanced Information Processing in Bioinformatics Joost N. Kok LIACS - Leiden Institute of Advanced Computer Science & LUMC - Leiden University Medical Center
BioRange: Bioinformatics for microarray technology; Bioinformatics for proteomics and metabolomics; Integrative bioinformatics; VL-e informatics for bioinformatics applications; Test bed with “real-life applications”
BioRange CIM, AIM in BioINF. Five research lines: Information Structuring; Heterogeneous Data Integration; Advanced Mining Algorithms; Data Interlinking and Integration; Data Storage and Management
Data Mining: Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Key properties: useful; novel (surprising); comprehensible; valid (accurate).
Data Mining: It is somewhat comparable to statistics (and often based on it), but goes further: whereas statistics aims mainly at validating given hypotheses, data mining often generates and tests millions of potential patterns in the hope of finding some that are useful.
Case study: SNP data. Genome scan comprising 500K data points (Single Nucleotide Polymorphisms, or SNPs) in 900 subjects from families showing survival to extremely high ages (longevity). The analysis of this set of 450 million data points aims to recognize patterns specific to the genetic make-up of long survivors.
Case study: SNP data. The genetic scan data will be combined with gene expression data (30,000 data points per subject in 100 subjects), protein data (NMR spectra from blood parameters in hundreds of subjects) and imaging data (quantitative photography of facial ageing parameters).
Case study: SNP data. Subjects with SNPs; classes (Young, Old). Patterns of interest: above a certain support within Y, O; above a certain difference between classes Y, O; above a certain correlation with a class Y, O; etc.
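The support and class-difference criteria on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy subjects, the SNP identifiers, the threshold values, and the reading of "support within Y, O" as the larger of the two per-class supports. It is not the method used in the study.

```python
# Hypothetical toy data: each subject is (set of SNP variants carried, class label),
# with "O" for Old and "Y" for Young. Identifiers and values are made up.
subjects = [
    ({"rs1", "rs2"}, "O"),
    ({"rs1", "rs3"}, "O"),
    ({"rs1"}, "O"),
    ({"rs2", "rs3"}, "Y"),
    ({"rs3"}, "Y"),
    ({"rs1", "rs3"}, "Y"),
]

def class_support(pattern, subjects, label):
    """Fraction of subjects in class `label` that carry every SNP in `pattern`."""
    members = [snps for snps, cls in subjects if cls == label]
    if not members:
        return 0.0
    return sum(pattern <= snps for snps in members) / len(members)

def interesting(pattern, subjects, min_support=0.5, min_difference=0.3):
    """Keep a pattern if it is frequent in at least one class (assumed reading)
    and its support differs enough between the two classes."""
    s_old = class_support(pattern, subjects, "O")
    s_young = class_support(pattern, subjects, "Y")
    frequent_somewhere = max(s_old, s_young) >= min_support
    differs = abs(s_old - s_young) >= min_difference
    return frequent_somewhere and differs

for snp in ("rs1", "rs2", "rs3"):
    print(snp, interesting({snp}, subjects))
```

On real genome-scan data the same test would have to be applied to a very large number of candidate patterns, which is why the thresholds matter: they decide how much of the pattern space has to be examined at all.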
Substructures: Sequences (DNA); Trees (XML documents); Graphs (Molecules). GASTON. Tools: hms.liacs.nl
Mutagenicity data set of 4069 compounds (56% mutagenic) www.cheminformatics.org
Patternbases: Pattern Databases = Patterns + Data. Query languages work on Patterns + Data. Since patternbases provide an architecture for pattern discovery and a means to discover and use those patterns through the query language, data mining becomes in essence an interactive querying process.
Patternbases: Derive new patterns from data + old patterns. Apriori Algorithm: Frequent Item Sets. Frequent Item Sets + Data: Association Rules.
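The two-step derivation named on this slide (frequent item sets first, association rules second) can be sketched as follows. This is a small textbook-style Apriori written for illustration, not the implementation behind the slides; the transactions, function names, and thresholds are hypothetical.

```python
from itertools import combinations

# Hypothetical toy transactions (sets of items).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Return {frozenset: support} for all frequent item sets, level by level."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Step 1: count candidates against the data, keep the frequent ones.
        current = {}
        for candidate in level:
            s = support(candidate, transactions)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        # Step 2: build size-(k+1) candidates from frequent size-k sets,
        # keeping only those whose size-k subsets are all frequent.
        k += 1
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        level = list(candidates)
    return frequent

def association_rules(frequent, min_confidence):
    """Derive rules lhs -> rhs from already-mined item sets and their supports."""
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                confidence = s / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules

frequent = apriori(transactions, min_support=0.5)
for lhs, rhs, conf in association_rules(frequent, min_confidence=0.7):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

The rule step never rescans the transactions: it reuses the supports stored with the frequent item sets, which is one way to read "old patterns" feeding the derivation of new ones.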
Patternbases: Derive new patterns from data + old patterns. Find all item sets that are correlated with classes. Fix a We can prune the search space by only considering frequent item sets with minimum support.
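A rough sketch of the pruning idea: item sets are enumerated depth-first, any branch that falls below the minimum support is cut (no superset of an infrequent set can be frequent), and only the surviving item sets are scored for correlation with a class. The slide does not name a correlation measure, so the chi-square statistic of a 2x2 table is used here purely as an assumption; the data, names, and thresholds are likewise hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 table:
    pattern present/absent (rows) x target class / other class (columns)."""
    n = n11 + n10 + n01 + n00
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    if 0 in (row1, row0, col1, col0):
        return 0.0  # degenerate table: no association can be measured
    return n * (n11 * n00 - n10 * n01) ** 2 / (row1 * row0 * col1 * col0)

def correlated_itemsets(data, min_support, min_chi2, target="O"):
    """data: list of (set_of_items, class_label).
    Returns {frozenset: chi2} for frequent item sets scoring at least min_chi2."""
    items = sorted({i for t, _ in data for i in t})
    n = len(data)
    n_target = sum(cls == target for _, cls in data)
    results = {}

    def extend(prefix, start):
        for idx in range(start, len(items)):
            candidate = prefix | {items[idx]}
            covered = [cls for t, cls in data if candidate <= t]
            if len(covered) / n < min_support:
                continue  # prune: every superset is infrequent as well
            n11 = sum(cls == target for cls in covered)
            n10 = len(covered) - n11
            n01 = n_target - n11
            n00 = n - n11 - n10 - n01
            score = chi_square(n11, n10, n01, n00)
            if score >= min_chi2:
                results[frozenset(candidate)] = score
            extend(candidate, idx + 1)

    extend(frozenset(), 0)
    return results

# Hypothetical toy data: (items carried by a subject, class label).
data = [
    ({"rs1", "rs2"}, "O"),
    ({"rs1", "rs3"}, "O"),
    ({"rs1"}, "O"),
    ({"rs2", "rs3"}, "Y"),
    ({"rs3"}, "Y"),
    ({"rs1", "rs3"}, "Y"),
]
print(correlated_itemsets(data, min_support=0.5, min_chi2=2.0))
```

Restricting the search to frequent item sets trades completeness for speed: a rare item set that is strongly tied to one class would not be reported by this sketch.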