1 Survey of Biodata Analysis from a Data Mining Perspective
Peter Bajcsy, Jiawei Han, Lei Liu, Jiong Yang


2 Introduction
- How to bridge data mining and bioinformatics for successful data mining of biological data?
- Three major themes:
  - Data cleaning, data preprocessing, and semantic integration of heterogeneous, distributed biomedical databases
  - Exploration of existing data mining tools for biodata analysis
  - Development of advanced, effective, and scalable data mining methods in biodata analysis

3 Research Topics on Advanced Data Mining Methods for Biodata Analysis
- Analysis of frequent patterns, sequential patterns, and structured patterns: identification of co-occurring or correlated biosequences or biostructure patterns
- Effective classification and comparison of biodata

4
- Various kinds of cluster analysis methods
  - Discovering pairwise frequent patterns and clustering biodata based on such frequent patterns
- Computational modeling of biological networks
  - Identifying the sequence of genetic activities across different stages of disease development
- Data visualization and visual data mining

5 Data Cleaning, Data Preprocessing, and Data Integration
- Biomedical data are stored in multiple distributed databases.
- Need automated preprocessing techniques
- Data cleaning: to ensure data quality (data interpretability)
  - How do the data enter the system?
    - Minimum Information About a Microarray Experiment (MIAME)
    - MicroArray and Gene Expression (MAGE)

6 Data Cleaning (continued)
- How are the data delivered?
  - Verifying checksums or relationships between data streams
  - Using reliable transmission protocols
- Where do the data go after being received?
  - Hardware and software constraints
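
The checksum verification mentioned above can be sketched as follows. This is a minimal illustration, assuming the sender publishes a SHA-256 digest alongside the data; the choice of SHA-256, the function names, and the sample payload are assumptions for this example, not something the slides specify.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_transfer(received: bytes, expected_hex: str) -> bool:
    """Check that the received bytes match the digest published by the sender."""
    # compare_digest does a constant-time comparison of the two hex strings
    return hmac.compare_digest(sha256_hex(received), expected_hex)


# Made-up payload standing in for a delivered sequence file
payload = b"ACGTACGT"
digest = sha256_hex(payload)          # what the sender would publish
corrupted = b"ACGTACGA"               # one byte changed in transit
```

A receiver would call `verify_transfer` on each delivered file and reject or re-request any file for which it returns `False`.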

7 Data Cleaning (continued)
- Are the data combined with other data sets?
- How are the data retrieved?
- How are the data analyzed?
  - Computer science models and biomedical models have to come together

8 Data Preprocessing
- Multidisciplinary efforts are needed
- Process management: supporting standardization of content and format, automation of preprocessing
- Documentation of biomedical domain expertise: establishing a metadata standard (MAGE-ML), creating annotation files, developing text-mining software
- Statistical and database analyses: including data cleaning, integration, transformation, and reduction

9 Semantic Integration of Heterogeneous Data
- Combining multiple sources into a coherent data store, and matching up semantically equivalent real-world entities from several biomedical sources
- Semantic integration is still an open problem due to the complexity of bio-ontologies and the heterogeneous, distributed nature of the recorded high-dimensional data

10 Semantic Integration of Heterogeneous Data
- Two approaches:
  - Construction of integrated biodata warehouses or biodatabases: requires common ontology and terminology and sophisticated data mapping rules
  - Construction of a federation of heterogeneous distributed biodatabases: builds up mapping rules or semantic ambiguity resolution rules across multiple databases
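
To make the idea of mapping rules concrete, here is a toy sketch in which each rule maps a (source, local identifier) pair to one canonical entity. All database names, identifiers, and fields are hypothetical, and real mapping rules would be far richer than a lookup table.

```python
def integrate(sources, mapping):
    """Merge records from several sources under canonical entity IDs.

    sources: {source_name: {local_id: {field: value}}}
    mapping: {(source_name, local_id): canonical_id}  (the "mapping rules")
    Returns the merged records and the list of records no rule covered.
    """
    merged = {}
    unmatched = []
    for source_name, records in sources.items():
        for local_id, fields in records.items():
            canonical = mapping.get((source_name, local_id))
            if canonical is None:
                # No rule for this record; a real system might flag it for curation
                unmatched.append((source_name, local_id))
                continue
            merged.setdefault(canonical, {}).update(fields)
    return merged, unmatched


# Hypothetical sources that describe the same entity under different IDs
sources = {
    "db_a": {"a1": {"symbol": "GENE1"}},
    "db_b": {"x9": {"length": 1200}, "x7": {"length": 300}},
}
mapping = {("db_a", "a1"): "ENTITY_1", ("db_b", "x9"): "ENTITY_1"}
merged, unmatched = integrate(sources, mapping)
```

In the warehouse approach the merged result would be stored centrally; in the federation approach the same rules would be applied at query time across the member databases.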

11 Exploration of Existing Data Mining Tools for Biodata Analysis
- DNA and Protein Sequence Analysis
  - Three basic approaches: sequence comparison, similarity search, pattern finding
  - Tools:
    - Pairwise alignment tools: the Basic Local Alignment Search Tool (BLAST)
    - Multiple sequence alignment tools: ClustalW
  - Challenging problems: promoter search, protein functional motif search
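
As an illustration of pairwise local alignment, the sketch below computes a Smith-Waterman local alignment score by dynamic programming. This is not BLAST's algorithm (BLAST uses faster heuristics); it is the exact method that such tools approximate. The scoring values (match 2, mismatch -1, gap -1) are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)   # scores for the previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = local_alignment_score("ACGT", "ACGT")        # four matches
unrelated = local_alignment_score("AAAA", "TTTT")        # nothing aligns
shared_core = local_alignment_score("TTACGTTT", "GGACGGG")  # shared "ACG"
```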

12
- Genome Analysis
  - How is the whole genome put together from many small pieces of sequences?
  - Where are the genes located on a chromosome?
  - Challenging problem: prediction of gene structures
- Macromolecule Structure Analysis
  - Prediction of secondary structure of RNA and proteins
  - Comparison of protein structures
  - Protein structure classification
  - Visualization of protein structures
  - Structure prediction is still an unsolved problem

13
- Pathway Analysis
  - To build, model, and visualize biological processes among gene products
- Microarray Analysis
  - Algorithms: hierarchical clustering, k-means, self-organizing map, support vector machine, association rules, neural networks
  - Software: GeneSpring, Spotfire
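
Of the algorithms listed, k-means is the simplest to sketch. The version below clusters toy expression profiles (one list of values per gene). Two simplifications are assumed here: the first k points serve as initial centroids (practical implementations usually randomize this), and the expression values are invented.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Plain k-means; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]   # assumed deterministic start
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break                                # assignments stable: converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                          # keep old centroid if cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented expression profiles for four genes under two conditions
profiles = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]]
labels, centroids = kmeans(profiles, k=2)
```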

14 Discovery of Frequent Sequential and Structured Patterns
- Most biodata patterns contain a substantial amount of noise or faults
- Mining Sequential Patterns
  - BLAST: For a protein or DNA sequence S, BLAST will find all similar sequences S’ in the database such that the aggregate mutation score from S to S’ is above some user-specified threshold.
  - Tandem repeat detection: A segment that occurs more than a certain number of times within a DNA sequence
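
A minimal sketch of tandem repeat detection follows. It assumes exact, back-to-back copies of a unit of fixed length; since the slides note that real biodata are noisy, practical detectors also tolerate mismatches, which this sketch does not. The function name and sample sequence are made up.

```python
def find_tandem_repeats(seq, unit_len, min_copies):
    """Find runs where a unit of length unit_len repeats back to back.

    Returns a list of (start_index, unit, copies) for runs with at
    least min_copies consecutive exact copies.
    """
    hits = []
    i = 0
    while i + unit_len <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        # Count how many times the unit immediately follows itself
        while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
            copies += 1
        if copies >= min_copies:
            hits.append((i, unit, copies))
            i += copies * unit_len      # skip past the whole run
        else:
            i += 1
    return hits


# Made-up sequence containing "CA" repeated four times
repeats = find_tandem_repeats("GGCACACACATT", unit_len=2, min_copies=3)
```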

15
- Mining Structured Patterns
  - Apriori-like candidate generation and test approach: FSG
  - Frequent pattern growth approach: gSpan
  - Mining closed subgraph patterns rather than all subgraph patterns: A subgraph G is closed if there exists no supergraph G’ such that G ⊂ G’ and support(G) = support(G’)
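
The closedness condition can be illustrated with itemsets standing in for subgraphs, since subset testing on sets is much simpler than subgraph isomorphism. This is a deliberate simplification for illustration, not how FSG or gSpan work; the patterns and support counts are invented.

```python
def closed_patterns(support):
    """Keep only closed patterns.

    support maps each frequent pattern (a frozenset) to its support count.
    A pattern is closed if no proper superset has the same support.
    """
    closed = {}
    for pattern, count in support.items():
        has_equal_superset = any(
            pattern < other and count == support[other]   # '<' is proper subset
            for other in support
        )
        if not has_equal_superset:
            closed[pattern] = count
    return closed


# Invented frequent patterns with their support counts
support = {
    frozenset("a"): 3,
    frozenset("b"): 4,
    frozenset("ab"): 3,
}
closed = closed_patterns(support)
```

Here {a} is dropped because {a, b} occurs just as often, so reporting {a} separately adds no information.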

16 Classification Methods
- Normal cells vs. cancer cells
- Support vector machine (SVM) is considered the most accurate classification tool for many bioinformatics applications
- Drawback of SVM: complexity of training an SVM is O(N^2)

17 Cluster Analysis Methods
- Clustering microarray data by biclustering or p-clustering
  - In a microarray gene expression dataset, each column represents a condition, whereas each row represents a gene.
  - A bicluster is a subset of genes and conditions such that the subset of genes exhibits similar fluctuations under a given subset of conditions
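
One way to make "similar fluctuations" precise is the pScore criterion commonly used for p-clustering: for genes x, y and conditions a, b, pScore = |(d_xa - d_xb) - (d_ya - d_yb)|, and a submatrix qualifies when every such 2x2 score is at most a threshold delta. The sketch below assumes that definition; the matrix values and the threshold are invented.

```python
from itertools import combinations


def p_score(m, x, y, a, b):
    """pScore of the 2x2 submatrix formed by genes x, y and conditions a, b."""
    return abs((m[x][a] - m[x][b]) - (m[y][a] - m[y][b]))


def is_delta_pcluster(matrix, genes, conditions, delta):
    """True if every 2x2 submatrix of the selection has pScore <= delta."""
    return all(
        p_score(matrix, x, y, a, b) <= delta
        for x, y in combinations(genes, 2)
        for a, b in combinations(conditions, 2)
    )


# Rows are genes, columns are conditions (invented values).
# Gene 1 is gene 0 shifted by +10, so the two fluctuate identically.
expression = [
    [1, 5, 3],
    [11, 15, 13],
    [2, 2, 9],
]
coherent = is_delta_pcluster(expression, genes=[0, 1], conditions=[0, 1, 2], delta=0)
with_outlier = is_delta_pcluster(expression, genes=[0, 1, 2], conditions=[0, 1, 2], delta=1)
```

Note that genes 0 and 1 cluster together even though their absolute levels differ by 10, which distance-based clustering on raw values would miss.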

18
- Clustering sequential biodata
  - The functionality of a gene depends largely on its layout or the sequential order of amino acids or nucleotides.
  - If two genes or proteins have similar components, their functionality may be similar.

19 Computational Modeling of Biological Networks
- Molecular interactions in a cell can be represented using graphs of network connections.
- A set of connected molecular interactions can be considered as a pathway.
- Three subsystems: metabolic network or pathway, protein network, genetic or gene regulatory network
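
The graph representation can be sketched as an adjacency list, with a pathway recovered as a chain of connected interactions. The example assumes directed interactions and uses breadth-first search to find a shortest chain; the molecule names A to E are placeholders.

```python
from collections import deque


def find_pathway(interactions, source, target):
    """Return a shortest chain of interactions from source to target, or None.

    interactions: list of directed (from_molecule, to_molecule) pairs.
    """
    graph = {}
    for u, v in interactions:
        graph.setdefault(u, []).append(v)
        graph.setdefault(v, [])
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Placeholder molecules and interactions
interactions = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
pathway = find_pathway(interactions, "A", "E")
```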

20 Data Visualization and Visual Data Mining
- Three types of visualization tools
  - Generic data visualization tools
  - Knowledge discovery in databases and model visualization tools
  - Interactive visualization environments for integrating data mining and visualization processes

21 Emerging Frontiers
- Text Mining in Bioinformatics
  - To find all the related literature and publications studying the same genes and proteins from different aspects
  - Automated mining of biochemical knowledge from digital repositories of scientific literature
  - Two approaches for recognizing interactions between proteins and other molecules:
    - Based on occurrence statistics of gene names from MEDLINE documents to predict the connections among genes
    - Use specific linguistic structures to extract protein interaction information from MEDLINE documents
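
The first approach, occurrence statistics, can be sketched as counting how often two gene names appear in the same abstract. The abstracts and gene names below are invented toy text, not MEDLINE records, and a raw co-occurrence count is only the simplest possible statistic.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, gene_names):
    """Count, for each pair of gene names, the abstracts mentioning both."""
    wanted = {g.upper() for g in gene_names}
    counts = Counter()
    for text in abstracts:
        # Crude tokenizer: runs of letters and digits, case-insensitive
        tokens = set(re.findall(r"[A-Za-z0-9]+", text.upper()))
        present = sorted(wanted & tokens)
        counts.update(combinations(present, 2))
    return counts


# Invented toy abstracts with placeholder gene names
abstracts = [
    "GENE1 interacts with GENE2.",
    "GENE2 and GENE1 are co-expressed; GENE3 is not.",
    "GENE3 was studied alone.",
]
counts = cooccurrence_counts(abstracts, ["GENE1", "GENE2", "GENE3"])
```

Pairs with high counts become candidate connections; the second approach would instead parse the sentence structure to decide whether an interaction is actually asserted.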

22 Emerging Frontiers
- Systems Biology
  - To understand a system’s structure and dynamics
  - Four key properties:
    - System structures: the network of gene interactions and biochemical pathways
    - System dynamics: how a system behaves over time under various conditions
    - The control method: the mechanisms that systematically control the state of the cell
    - The design method: strategies to modify and construct biological systems having desired properties

23 Open Research Problems
- Data quality maintenance
- Visualization difficulties with high-dimensional data
- File standards, data storage, access, data mining, and information retrieval
- How to integrate biological knowledge into the design and development of data mining models and algorithms
- Finding the rules or regularities that may disclose the mystery of the “dark matter” of a genome

