1 Survey of Biodata Analysis from a Data Mining Perspective Peter Bajcsy Jiawei Han Lei Liu Jiong Yang.

Presentation transcript:

1 Survey of Biodata Analysis from a Data Mining Perspective Peter Bajcsy Jiawei Han Lei Liu Jiong Yang

2 Introduction
How to bridge data mining and bioinformatics for successful data mining of biological data? Three major themes:
- Data cleaning, data preprocessing, and semantic integration of heterogeneous, distributed biomedical databases
- Exploration of existing data mining tools for biodata analysis
- Development of advanced, effective, and scalable data mining methods in biodata analysis

3 Research Topics on Advanced Data Mining Methods for Biodata Analysis
- Analysis of frequent patterns, sequential patterns, and structured patterns: identification of co-occurring or correlated biosequences or biostructure patterns
- Effective classification and comparison of biodata

4 Research Topics on Advanced Data Mining Methods for Biodata Analysis (continued)
- Various kinds of cluster analysis methods
  - Discovering pairwise frequent patterns and clustering biodata based on such frequent patterns
- Computational modeling of biological networks
  - Identifying the sequence of genetic activities across different stages of disease development
- Data visualization and visual data mining

5 Data Cleaning, Data Preprocessing, and Data Integration
- Biomedical data are stored in multiple distributed databases.
- Automated preprocessing techniques are needed.
- Data cleaning: to ensure data quality (data interpretability)
  - How do the data enter the system?
    - Minimum Information About a Microarray Experiment (MIAME)
    - MicroArray and Gene Expression (MAGE)

6 Data Cleaning (continued)
- How are the data delivered?
  - Verifying checksums or relationships between data streams
  - Using reliable transmission protocols
- Where do the data go after being received?
  - Hardware and software constraints
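
As an illustration of the checksum verification mentioned on this slide, here is a minimal sketch. It assumes the sender publishes a SHA-256 digest alongside the file; the slides do not name a hash function, and the payload below is a made-up placeholder.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of a payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def verify_delivery(payload: bytes, expected_hex: str) -> bool:
    """True when the received payload matches the digest published by the sender."""
    return sha256_hex(payload) == expected_hex


# Hypothetical payload: two tab-separated rows of a microarray export.
sent = b"probe_id\tintensity\nPROBE_1\t512.0\n"
published_digest = sha256_hex(sent)

received_intact = sent
received_corrupted = sent.replace(b"512.0", b"521.0")  # simulated transfer error
```

A mismatch only shows that the bytes changed in transit; it says nothing about whether the original measurements were correct.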

7 Data Cleaning (continued)
- Are the data combined with other data sets?
- How are the data retrieved?
- How are the data analyzed?
  - Computer science models and biomedical models have to come together

8 Data Preprocessing
Multidisciplinary efforts are needed:
- Process management: supporting standardization of content and format, automation of preprocessing
- Documentation of biomedical domain expertise: establishing a metadata standard (MAGE-ML), creating annotation files, developing text-mining software
- Statistical and database analyses: including data cleaning, integration, transformation, and reduction

9 Semantic Integration of Heterogeneous Data
- Combining multiple sources into a coherent data store and matching up semantically equivalent real-world entities from several biomedical sources
- Semantic integration is still an open problem due to the complexity of bio-ontology and the heterogeneous, distributed nature of the recorded high-dimensional data

10 Semantic Integration of Heterogeneous Data
Two approaches:
- Construction of integrated biodata warehouses or biodatabases: requires a common ontology and terminology and sophisticated data mapping rules
- Construction of a federation of heterogeneous distributed biodatabases: builds up mapping rules or semantic ambiguity resolution rules across multiple databases

11 Exploration of Existing Data Mining Tools for Biodata Analysis
DNA and Protein Sequence Analysis
- Three basic approaches: sequence comparison, similarity search, pattern finding
- Tools:
  - Pairwise alignment tools: the Basic Local Alignment Search Tool (BLAST)
  - Multiple sequence alignment tools: ClustalW
- Challenging problems: promoter search, protein functional motif search
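
To make "pairwise local alignment" concrete, the sketch below computes a Smith-Waterman local alignment score. This is the exact dynamic-programming method that heuristic tools such as BLAST approximate; it is not BLAST itself. The scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score_identical = local_alignment_score("ACGT", "ACGT")
score_shared_core = local_alignment_score("TTACGTTT", "GGACGGG")
score_unrelated = local_alignment_score("AAAA", "TTTT")
```

A similarity search in the spirit of the slide would report database sequences whose score against the query exceeds a user-chosen threshold.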

12 Genome Analysis
- How is the whole genome put together from many small pieces of sequences?
- Where are the genes located on a chromosome?
- Challenging problem: prediction of gene structures
Macromolecule Structure Analysis
- Prediction of secondary structure of RNA and proteins
- Comparison of protein structures
- Protein structure classification
- Visualization of protein structures
- Structure prediction is still an unsolved problem

13 Pathway Analysis
- To build, model, and visualize biological processes among gene products
Microarray Analysis
- Algorithms: hierarchical clustering, k-means, self-organizing map, support vector machine, association rules, neural networks
- Software: GeneSpring, Spotfire
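
Of the algorithms listed, k-means is the simplest to sketch. The version below is a plain-Python illustration under stated assumptions: the expression profiles are invented numbers, the initial centers are supplied by the caller, and the iteration count is fixed rather than tested for convergence.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda c: sum((x - y) ** 2 for x, y in zip(point, centers[c])))


def kmeans(points, centers, iterations=10):
    """Lloyd's algorithm: alternate assignment and center update."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # New center = per-condition mean of the members; keep the old center if empty.
        centers = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Hypothetical profiles: one tuple per gene, one value per condition.
profiles = [(1.0, 1.2, 0.9), (1.1, 1.0, 1.0), (5.0, 5.2, 4.9), (5.1, 4.8, 5.0)]
centers, labels = kmeans(profiles, centers=[profiles[0], profiles[2]])
```

Real microarray pipelines would normalize the data first and typically try several initializations, since k-means can settle in a poor local optimum.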

14 Discovery of Frequent Sequential and Structured Patterns
Most biodata patterns contain a substantial amount of noise or faults
Mining Sequential Patterns
- BLAST: For a protein or DNA sequence S, BLAST will find all similar sequences S’ in the database such that the aggregate mutation score from S to S’ is above some user-specified threshold.
- Tandem repeat detection: a tandem repeat is a segment that occurs more than a certain number of times within a DNA sequence
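
Tandem repeat detection can be sketched as a simple scan. Two assumptions go beyond the slide: the copies must be directly adjacent (the usual meaning of "tandem"), and the repeat unit length is fixed by the caller. The copies must also match exactly, so the noise mentioned above is not handled.

```python
def find_tandem_repeats(seq: str, unit_len: int, min_copies: int):
    """Return (start, unit, copies) for each run of exact back-to-back copies."""
    hits = []
    i = 0
    while i + unit_len <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        # Count how many times the unit repeats immediately after itself.
        while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
            copies += 1
        if copies >= min_copies:
            hits.append((i, unit, copies))
            i += copies * unit_len  # skip past the whole run
        else:
            i += 1
    return hits


# Made-up sequence containing three adjacent copies of "GCA".
repeats = find_tandem_repeats("GGCAGCAGCAGTT", unit_len=3, min_copies=3)
```

Practical detectors tolerate mismatches and search over many unit lengths; this sketch shows only the core idea.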

15 Mining Structured Patterns
- Apriori-like candidate generation and test approach: FSG
- Frequent pattern growth approach: gSpan
- Mining closed subgraph patterns rather than all subgraph patterns: a subgraph G is closed if there exists no supergraph G’ such that G ⊂ G’ and support(G) = support(G’)

16 Classification Methods
- Normal cells vs. cancer cells
- Support vector machine (SVM) is considered the most accurate classification tool for many bioinformatics applications
- Drawback of SVM: the complexity of training an SVM is O(N^2)

17 Cluster Analysis Methods
Clustering microarray data by biclustering or p-clustering
- In a microarray gene expression dataset, each column represents a condition, whereas each row represents a gene.
- A bicluster is a subset of genes and conditions such that the subset of genes exhibits similar fluctuations under a given subset of conditions
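
One way to test "similar fluctuations" is the pScore used in p-clustering. The slide does not give the formula, so the definition below is an assumption based on the common delta-pCluster formulation: for every pair of genes and every pair of conditions, the difference of differences must stay within a tolerance delta. The expression matrix is invented.

```python
from itertools import combinations


def p_score(m, g1, g2, c1, c2):
    """Difference of the two genes' changes between conditions c1 and c2."""
    return abs((m[g1][c1] - m[g1][c2]) - (m[g2][c1] - m[g2][c2]))


def is_delta_pcluster(m, genes, conditions, delta):
    """True when every 2x2 submatrix of the selection has pScore <= delta."""
    return all(p_score(m, g1, g2, c1, c2) <= delta
               for g1, g2 in combinations(genes, 2)
               for c1, c2 in combinations(conditions, 2))


# Hypothetical matrix: rows are genes, columns are conditions.
expr = [
    [1.0, 3.0, 2.0, 9.0],
    [4.0, 6.0, 5.0, 0.0],
    [2.0, 4.1, 3.0, 5.0],
    [7.0, 1.0, 8.0, 2.0],
]
coherent = is_delta_pcluster(expr, genes=[0, 1, 2], conditions=[0, 1, 2], delta=0.2)
with_outlier = is_delta_pcluster(expr, genes=[0, 1, 2, 3], conditions=[0, 1, 2], delta=0.2)
```

Genes 0 to 2 have different absolute levels but rise and fall together across the first three conditions, which is exactly the pattern a distance-based clustering on raw values would miss.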

18 Clustering sequential biodata
- The functionality of a gene depends largely on its layout or the sequential order of amino acids or nucleotides.
- If two genes or proteins have similar components, their functionality may be similar.
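
A rough way to quantify "similar components" is to compare the sets of k-mers (length-k substrings) of two sequences. The Jaccard measure and the choice of k below are assumptions for illustration; the slides do not prescribe a similarity function.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by total distinct k-mers (1.0 = same k-mer set)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 1.0  # both sequences shorter than k: treat as identical
    return len(ka & kb) / len(ka | kb)


same = kmer_jaccard("ACGTAC", "ACGTAC")
partial = kmer_jaccard("ACGT", "ACGA", k=2)
disjoint = kmer_jaccard("AAAA", "CCCC")
```

A pairwise similarity like this could feed any standard clustering method, though it ignores where in the sequence the shared k-mers occur.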

19 Computational Modeling of Biological Networks
- Molecular interactions in a cell can be represented using graphs of network connections.
- A set of connected molecular interactions can be considered as a pathway.
- Three subsystems: metabolic network or pathway, protein network, genetic or gene regulatory network
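
The graph view can be sketched with an adjacency map and a breadth-first search. Here a "pathway" is taken to be the set of molecules reachable from a starting molecule through recorded interactions; treating interactions as undirected is a simplifying assumption, and the molecule names are placeholders.

```python
from collections import deque


def pathway_from(interactions, start):
    """Set of molecules connected to start through any chain of interactions."""
    graph = {}
    for a, b in interactions:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


# Hypothetical interaction list: two separate connected groups.
interactions = [("mol_A", "mol_B"), ("mol_B", "mol_C"), ("mol_D", "mol_E")]
connected = pathway_from(interactions, "mol_A")
```

Regulatory networks are usually directed (activation or repression), in which case the adjacency map would store only one direction per edge.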

20 Data Visualization and Visual Data Mining
Three types of visualization tools:
- Generic data visualization tools
- Knowledge discovery in databases and model visualization tools
- Interactive visualization environments for integrating data mining and visualization processes

21 Emerging Frontiers
Text Mining in Bioinformatics
- To find all the related literature and publications studying the same genes and proteins from different aspects
- Automated mining of biochemical knowledge from digital repositories of scientific literature
- Two approaches for recognizing interactions between proteins and other molecules:
  - Using occurrence statistics of gene names in MEDLINE documents to predict the connections among genes
  - Using specific linguistic structures to extract protein interaction information from MEDLINE documents
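
The occurrence-statistics approach can be sketched as counting how often two gene names appear in the same abstract. Everything concrete below is an assumption: the abstracts are invented strings with placeholder gene names, tokenization is naive whitespace splitting, and no MEDLINE access is involved.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, gene_names):
    """Count, for each pair of gene names, the abstracts mentioning both."""
    counts = Counter()
    for text in abstracts:
        tokens = set(text.upper().replace(",", " ").replace(".", " ").split())
        present = sorted(g for g in gene_names if g.upper() in tokens)
        counts.update(combinations(present, 2))
    return counts


# Hypothetical abstracts with placeholder gene names.
abstracts = [
    "GENE_A interacts with GENE_B under stress.",
    "GENE_B and GENE_C form a feedback loop.",
    "GENE_A and GENE_B were studied together.",
]
counts = cooccurrence_counts(abstracts, ["GENE_A", "GENE_B", "GENE_C"])
```

Raw counts favor frequently studied genes, so a real analysis would normalize them, for example against each gene's overall mention frequency.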

22 Emerging Frontiers
Systems Biology
- To understand a system’s structure and dynamics
- Four key properties:
  - System structures: the network of gene interactions and biochemical pathways
  - System dynamics: how a system behaves over time under various conditions
  - The control method: the mechanisms that systematically control the state of the cell
  - The design method: strategies to modify and construct biological systems having desired properties

23 Open Research Problems
- Data quality maintenance
- Visualization difficulties with high-dimensional data
- File standards, data storage, access, data mining, and information retrieval
- How to integrate biological knowledge into the design and development of data mining models and algorithms
- Finding the rules or regularities that may disclose the mystery of the “dark matter” of a genome