
Bioinformatics Research Overview
Li Liao
Develop new algorithms and (statistical) learning methods
> Capable of incorporating domain knowledge
> Effective, Expressive, Interpretable

Motivations
Understanding correlations between genotype and phenotype
Predicting genotype-phenotype relationships
Phenotypes:
– Protein function
– Drug/therapy response
– Drug-drug interactions for expression
– Drug mechanism
– Interacting pathways of metabolism

Projects
– Homology detection, protein family classification (funded by a DuPont S&E award)
  > Support Vector Machines
  > Hidden Markov models
  > Graph theoretic methods
– Probabilistic modeling for BioSequence (funded by NIH)
  > HMMs, and beyond
  > Motif finding
  > Secondary structure
– Comparative Genomics
  > Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant)
  > Evolution of metabolic pathways: tree and graph comparisons

Detect remote homologues
Attributes to be looked at:
– Sequence similarity, aggregate statistics (e.g., protein families), pattern/motif, and more attributes (presence in the phylogenetic tree)
How can domain-specific knowledge be incorporated into the model so that a classifier becomes more accurate?
Results:
– Quasi-consensus based comparison of profile HMM for protein sequences (submitted to Bioinformatics)
– Using extended phylogenetic profiles and support vector machines for protein family classification (SNPD 04)
– Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)

Support Vector Machines

[Figure: phylogenetic profiles x, y, z compared by Hamming distance and by tree-based distance]
Data: phylogenetic profiles
– How to account for correlations among profile components? → profile extension (Narra & Liao, SNPD 04)
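As a rough illustration of the Hamming distance named on this slide, the sketch below compares binary phylogenetic profiles position by position. The profiles `x`, `y`, `z` and the six-genome layout are made-up values for illustration, not data from the slides, and the tree-based distance is not shown because the transcript does not define it.

```python
from typing import Sequence


def hamming_distance(p: Sequence[int], q: Sequence[int]) -> int:
    """Count the positions (genomes) at which two profiles disagree."""
    if len(p) != len(q):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(p, q))


# Hypothetical presence (1) / absence (0) profiles over six reference genomes.
x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 0, 0]
z = [0, 1, 0, 0, 1, 1]

print(hamming_distance(x, y))  # 2
print(hamming_distance(x, z))  # 6
```

Each position contributes to the count independently of the others, so the measure by itself cannot reflect correlations among profile components, which is the question the slide raises.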

Quasi consensus based comparison of HMMs
– From MSA to profile HMMs using existing packages (SAM-T99 or HMMER)
– Generation of quasi consensus sequence from the model
– Alignment of consensus sequence of a model with the other model
– Extraction of two alignments in each direction
[Figure: seed alignments 1 and 2 give models M1 (consensus 1: V G A N V A E H) and M2 (consensus 2: V K A T I A E H); each consensus is aligned to the other model, yielding alignments Aln 12 and Aln 21 with scores S(c2|M1) and S(c1|M2)]
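The idea of scoring one model's consensus against the other model can be sketched in a much simplified form. The code below is an assumption-laden stand-in, not the method from the slides: it replaces the profile HMMs built by SAM-T99 or HMMER with plain per-column frequency tables, assumes ungapped seed alignments of equal length, and uses toy seed sequences chosen only so that the consensus strings match the two shown in the figure. The function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment, pseudocount=1.0):
    """Per-column residue probabilities with additive smoothing."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return profile


def consensus(profile):
    """Most probable residue in each column."""
    return "".join(max(col, key=col.get) for col in profile)


def log_score(sequence, profile):
    """Log-probability of a sequence under the column profile."""
    return sum(math.log(col[aa]) for aa, col in zip(sequence, profile))


# Toy, ungapped seed alignments (illustrative only).
seed1 = ["VGANVAEH", "VGAHVAEH", "VKANVAEY"]
seed2 = ["VKATIAKH", "VKATIAEH", "IKATIAEH"]

m1, m2 = column_profile(seed1), column_profile(seed2)
c1, c2 = consensus(m1), consensus(m2)

# Cross scores in both directions, loosely analogous to S(c2|M1) and S(c1|M2).
print(c1, c2)
print(log_score(c2, m1), log_score(c1, m2))
```

Scoring in both directions keeps the comparison symmetric: a pair of models counts as similar only if each consensus fits the other model reasonably well.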

Sequence Models (HMMs and beyond)
Motivations:
What is responsible for the function?
– Patterns/motifs
– Secondary structure
To capture long range correlations of bio sequences
– Transporter proteins
– RNA secondary structure
Methods: generative versus discriminative
– Linear dependent processes
– Stochastic grammars
– Model equivalence

TMMOD: An improved hidden Markov model for predicting transmembrane topology (to appear in IEEE ICTAI04)

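Columns: Mod., Reg., Data set, Correct topology, Correct location, Sensitivity, Specificity
Values for each TMMOD model are listed in the order Reg. (a), (b), (c).

Data set S-83
TMMOD 1
  Correct topology: (78.3%), 51 (61.4%), 64 (77.1%)
  Correct location: 67 (80.7%), 52 (62.7%), 65 (78.3%)
  Sensitivity: 97.4%, 71.3%, 97.1%
  Specificity: 97.4%, 71.3%, 97.1%
TMMOD 2
  Correct topology: (73.5%), 54 (65.1%), 65 (78.3%)
  Correct location: 61 (73.5%), 66 (79.5%)
  Sensitivity: 99.4%, 93.8%, 99.7%
  Specificity: 97.4%, 71.3%, 97.1%
TMMOD 3
  Correct topology: (84.3%), 64 (77.1%), 74 (89.2%)
  Correct location: 71 (85.5%), 65 (78.3%), 74 (89.2%)
  Sensitivity: 98.2%, 95.3%, 99.1%
  Specificity: 97.4%, 71.3%, 97.1%
TMHMM
  Correct topology: (77.1%)
  Correct location: 69 (83.1%)
  Sensitivity: 96.2%
PHDtm
  Correct topology: (85.5%)
  Correct location: (88.0%)
  Sensitivity: 98.8%
  Specificity: 95.2%

Second data set
TMMOD 1
  Correct topology: (73.1%), 92 (57.5%), 117 (73.1%)
  Correct location: 128 (80.0%), 103 (64.4%), 126 (78.8%)
  Sensitivity: 97.4%, 77.4%, 96.1%
  Specificity: 97.0%, 80.8%, 96.7%
TMMOD 2
  Correct topology: (75.0%), 97 (60.6%), 118 (73.8%)
  Correct location: 132 (82.5%), 121 (75.6%), 135 (84.4%)
  Sensitivity: 98.4%, 97.7%, 98.4%
  Specificity: 97.2%, 95.6%, 97.2%
TMMOD 3
  Correct topology: (75.0%), 110 (68.8%), 135 (84.4%)
  Correct location: 133 (83.1%), 124 (77.5%), 143 (89.4%)
  Sensitivity: 97.8%, 94.5%, 98.3%
  Specificity: 97.6%, 98.1%
TMHMM
  Correct topology: (76.9%)
  Correct location: 134 (83.8%)
  Sensitivity: 97.1%
  Specificity: 97.7%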

Genomics study of enterobacterial BT agents (funded by the US Army via Center for Biological Defense, USF)
Goals:
– Identification of genes and sequence tags as targets for novel diagnosis and therapy
– BT agents: Yersinia pestis, Salmonella, Escherichia coli O157:H7
Methods:
– Various bioinformatics tools and databases

Comparative Genomics
Motivation:
– Evolution of metabolic pathways
– Gene functions
– De novo (alternative pathways)
  > Genetic engineering
  > Drug discovery
Methods:
– Put data into a context: knowledge/data representation
  > Trees, graphs, etc.
– Learning models/methods

[Figure: matrix of organisms O1, O2, ..., Om against pathways P1, P2, ..., Pn]
Profiling: pairs of attribute-value

What we found:
– An informative way to compare genomes
– The majority of pathways (or rather their enzyme components) evolve in congruence with species

What we do next:
– Database and search engine
– Off-line self-consistent iteration
– Pathways in a network
  > Graph comparisons
– Identify key components of networks
– Small world topology
  > Cross-level interactions with regulatory networks