CIS 595 Bioinformatics, Lecture 2: Introduction to Bioinformatics. A number of slides taken/modified from: Russ B. Altman (http://www.smi.stanford.edu/projects/helix/bmi214/), Patrik Medstrand (www.cmb.lu.se/devbiol/bioinfo/old/download/intro2003/databases_handouts.pdf), Mark Gerstein (http://bioinfo.mbb.yale.edu/mbb452a/2002/sequences2002.pdf)

What is Bioinformatics? Every application of computer science to biology: sequence analysis, image analysis, sample management, population modeling, ... Analysis of data coming from large-scale biological projects: genomes, transcriptomes, proteomes, metabolomes, etc.

The New Biology: Traditional biology: a small team working on a specialized topic; a well-defined experiment to answer precise questions. New “high-throughput” biology: large international teams using cutting-edge technology defining the project; results are given raw to the scientific community without any underlying hypothesis.

Examples of “High-Throughput”: Complete genome sequencing. Simultaneous expression analysis of thousands of genes (DNA microarrays, SAGE). Large-scale sampling of the proteome. Protein-protein analysis: large-scale 2-hybrid (yeast, worm). Large-scale 3D structure production (yeast). Metabolism modeling. Biodiversity.

Role of Bioinformatics: Control and management of the data. Sequence, structure and function analysis. Analysis of primary data, e.g. mass spectra analysis, DNA microarray image analysis. Statistics. Database storage and access. Interpreting results in a biological context.

Sequence, Structure and Function Analysis: To gain insight into how genes and gene products (proteins) function, perform: SEQUENCE ANALYSIS: analyze DNA and protein sequences, searching for clues about structure, function, and control. STRUCTURE ANALYSIS: analyze biological structures, searching for clues about sequence, function, and control. FUNCTION ANALYSIS: understand how the sequences and structures lead to the functions.

Evolution and Bioinformatics: Common descent of organisms implies that they will share many “basic technologies.” Development of new phenotypes in response to environmental pressure can lead to “specialized technologies.” More recent divergence implies more shared technologies between species. All of biology is about two things: understanding shared or unshared features.

Biology is Fundamentally Information Science: Where is the information? DNA sequences: GenBank release 128 (2/02) contains 17,089,143,893 bases in 1,546,532 sequences. Protein sequences: PIR or Swiss-Prot (as of 3/02): 106,736 sequences, 39,242,287 total amino acids. Protein 3D structures: Protein Data Bank (PDB), as of March 2002: 17,679 coordinate entries (15,855 proteins, 1,060 nucleic acids, 746 protein/nucleic acid complexes, 18 carbohydrates).

Biology is Fundamentally Information Science: Where is the information? Online access to DNA microarray data: http://smd.stanford.edu/; 10,000 to 40,000 genes per chip; each set of experiments involves 3 to 100 “conditions”. Medical literature online: online database of published literature since 1966 (Medline = PubMed resource); 4,600 journals; 11,000,000+ articles (most with abstracts). Etc.

Topics: Sequence Alignment, Sequence Motifs, Gene Finding; Computing with Biological Structures; Phylogenetic Algorithms; Microarray Data Analysis; Genetic Networks; Comparative Genomics; Proteomics; Biological Ontologies, Biological Text Mining

Sequence Alignment: What is sequence alignment? Given two sequences and a scoring scheme, find the optimal pairing of letters:
RKVA--GMAKPNM
RKIAVAAASKPAV
Why align sequences? A few sequences have known structure and function; many more have unknown properties. If one sequence has known structure/function, then aligning it to another yields insight about the other. Similarity may be used as evidence of homology, but does not necessarily imply homology.
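A minimal sketch of "find the optimal pairing of letters" by dynamic programming, in the style of the Needleman-Wunsch global alignment algorithm. The slides do not specify a scoring scheme here, so the match/mismatch/gap values below are illustrative assumptions, and the function name `global_align` is hypothetical.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment with a toy scoring scheme (assumed values)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # pair two letters
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one optimal pairing.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best_score, top, bottom = global_align("ACGT", "AGT")
```

With these assumed scores, aligning "ACGT" with "AGT" pairs three letters and opens one gap. For protein sequences, a substitution matrix would normally replace the single match/mismatch value.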

Sequence Alignment: Types of alignment: local vs. global; pairwise vs. multiple.
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF

Sequence Alignment: How to measure alignment quality? Define a scoring matrix (PAM250).

Sequence Alignment: Alignment algorithms: dot matrix; dynamic programming; FASTA, BLAST, PSI-BLAST; Clustal. Similarity strength: percent identity; E-value (statistical measure).
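As an illustration of percent identity, here is a small sketch that scores an already-aligned pair of rows. Definitions of percent identity vary; this one assumes the denominator is the number of columns where neither row has a gap, and the function name is hypothetical.

```python
def percent_identity(row_a, row_b):
    """Percent identity over alignment columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = identical = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            continue  # gapped columns are skipped under this definition
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# The pairwise alignment shown on the earlier Sequence Alignment slide.
pid = percent_identity("RKVA--GMAKPNM", "RKIAVAAASKPAV")
```

For that example alignment, 5 of the 11 ungapped columns are identical, so `pid` is about 45%.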

Sequence Motifs: A motif is a biologically important subsequence that occurs in multiple sequences. Protein motifs often result from structural features. DNA motifs are sequences that provide signals for protein binding or nucleic acid folding.

Sequence Motifs: PROSITE database, a collection of motifs (1135 different motifs): a manually created collection of regular expressions associated with different protein families/functions. Globin sequence signature (PDOC00933): F-[LF]-x(5)-G-[PA]-x(4)-G-[KRA]-x-[LIVM]-x(3)-H
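Because PROSITE patterns are regular expressions in a compact notation, the signature above can be searched for with an ordinary regex engine. The sketch below translates the basic pattern elements (single residues, `[..]` alternatives, `{..}` exclusions, `x`, and repeat counts) into Python `re` syntax. It does not handle sequence anchors (`<`, `>`), and the helper name and the test sequence are made up for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split an optional repeat count such as (5) or (2,4) off the element.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # x = any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # {AB} = any residue except A or B
        # [AB] and single letters are already valid regex syntax.
        if lo and hi:
            core += "{%s,%s}" % (lo, hi)
        elif lo:
            core += "{%s}" % lo
        parts.append(core)
    return "".join(parts)


GLOBIN = "F-[LF]-x(5)-G-[PA]-x(4)-G-[KRA]-x-[LIVM]-x(3)-H"
globin_regex = prosite_to_regex(GLOBIN)

# Synthetic sequence constructed to contain the pattern (not a real protein).
toy_sequence = "MK" + "FL" + "AAAAA" + "GP" + "AAAA" + "GK" + "A" + "L" + "AAA" + "H" + "TT"
hit = re.search(globin_regex, toy_sequence)
```

Here `hit.start()` gives the 0-based position where the signature begins in the toy sequence.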

Gene Finding: Problem: identify the genes within a raw genomic DNA sequence. Input: raw DNA sequence. Output: location of gene elements in the raw sequence (including exons, introns, other sequence annotations).
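To make the input/output contract concrete, here is a deliberately naive sketch: it scans the three forward reading frames for open reading frames (ATG up to the next in-frame stop codon) and reports their coordinates. This is far simpler than real gene finding; it assumes no introns, ignores the reverse strand, and uses the standard stop codons. The function name and the `min_codons` threshold are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) 0-based half-open coordinates of forward ORFs.

    The codon count includes both the start codon and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3                    # include the stop codon
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end))
                start = None                     # look for the next ORF
    return sorted(orfs)


toy_dna = "CCATGAAATTTTAGGG"   # made-up sequence for illustration
orfs = find_orfs(toy_dna)
```

In the toy sequence, the only ORF found is `toy_dna[2:14]`, that is, `ATGAAATTTTAG`.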

Computing with Biological Structures: General issues: How do we represent structure for computation? How do we compare structures? How can we summarize structural families?

Computing with Biological Structures: Applications: structure alignment; building a fold library. [Figure: alignment of individual structures (Hb, Mb), fusing into a single fold “template”.]

Computing with Biological Structures: Why align structures? It provides the “gold standard” for sequence alignment. For nonhomologous proteins, it identifies common substructures of interest. It classifies proteins into clusters based on structural similarity (SCOP).

Computing with Biological Structures: Applications: predicting RNA secondary structure (the MFOLD program, http://www.bioinfo.rpi.edu/applications/mfold/old/rna/)

Computing with Biological Structures: Protein secondary structure prediction.
Sequence:  RPDFCLEPPYTGPCKARIIRYFYNAKAGLVQTFVYGGCRAKRNNFKSAEDAMRTCGGA
Structure: CCGGGGCCCCCCCCCCCEEEEEEETTTTEEEEEEECCCCCTTTTBTTHHHHHHHHHCC

Phylogenetic Algorithms: Why build an evolutionary tree? To understand the lineage of different species. To have an organizing principle for sorting species into a taxonomy. To understand how various functions evolved. To understand forces and constraints on evolution. To do multiple alignment.

Phylogenetic Algorithms: Multiple alignment and trees. Progressive alignment methods do multiple alignment and evolutionary tree construction at the same time. Sequence alignment provides scores which can be interpreted as inversely related to distances in evolution. Distances can be used to build trees. Trees can be used to give multiple alignments via common parents.
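One way to see how "distances can be used to build trees" is UPGMA, a simple distance-based clustering method. The slides do not name a specific tree-building method, so UPGMA is chosen here only as an illustration, and the distance matrix is made up. The sketch repeatedly merges the two closest clusters and averages distances by cluster size; branch lengths are omitted.

```python
def upgma(names, dist):
    """Build a tree (nested tuples) from a symmetric distance matrix by UPGMA."""
    # clusters: id -> (subtree, number_of_leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # d: (smaller_id, larger_id) -> distance between the two clusters
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])   # closest pair
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average distance from c to the merged cluster.
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical pairwise distances between four sequences A, B, C, D.
toy_names = ["A", "B", "C", "D"]
toy_dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
toy_tree = upgma(toy_names, toy_dist)
```

For this toy matrix, A and B are joined first, then C, then D. UPGMA assumes roughly constant rates of change across lineages; other distance methods relax that assumption.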

Microarray Data Analysis Experimental Protocol

Microarray Data Analysis: What are expression arrays good for? Follow a population of (synchronized) cells over time, to see how expression changes (vs. baseline). Expose cells to different external stimuli and measure their response (vs. baseline). Take cancer cells (or other pathology) and compare to normal cells. (Also some non-expression uses, such as assessing presence/absence of sequences in the genome.)

Microarray Data Analysis: Preprocessing: merging replicate experiments; scoring differential hybridization; background correction; Cy5/Cy3 normalization; data input; duplicate spot variability; replicate experiment variability; spot quality; artifactual regions.
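A minimal sketch of two of the preprocessing steps named above, background correction and Cy5/Cy3 normalization, for a two-color array. It assumes a single global background value per channel and global median centering of the log2 ratios; real pipelines typically use per-spot backgrounds and more elaborate normalization. All names and intensity values are illustrative.

```python
import math
import statistics


def normalized_log_ratios(cy5, cy3, background5=0.0, background3=0.0):
    """Background-subtract, take log2(Cy5/Cy3), and center on the median."""
    ratios = []
    for red, green in zip(cy5, cy3):
        r, g = red - background5, green - background3
        # Spots at or below background carry no usable signal in this sketch.
        ratios.append(math.log2(r / g) if r > 0 and g > 0 else None)
    # Center so that the typical spot has a log ratio of zero.
    median = statistics.median(x for x in ratios if x is not None)
    return [None if x is None else x - median for x in ratios]


# Made-up spot intensities for four spots; the last one has no Cy5 signal.
toy_cy5 = [200.0, 400.0, 800.0, 0.0]
toy_cy3 = [100.0, 100.0, 100.0, 100.0]
toy_ratios = normalized_log_ratios(toy_cy5, toy_cy3)
```

After centering, positive values indicate spots brighter in Cy5 than the typical spot and negative values indicate spots brighter in Cy3.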

Microarray Data Analysis Convert microarray images to data

Microarray Data Analysis: Clustering: If two genes are expressed in the same way, they may be functionally related. If a gene has unknown function, but clusters with genes of known function, this is a way to assign its general function. We may be able to look at high-resolution measurements of expression and figure out which genes control which other genes, e.g. peak in cluster 1 always precedes peak in cluster 2 => cluster 1 turns cluster 2 on?
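A small sketch of the idea that genes "expressed in the same way" can be grouped: it measures similarity of expression profiles with the Pearson correlation and places genes in the same group when they are linked by a chain of correlations above a threshold (a simple single-linkage grouping). The threshold, gene names, and expression values are illustrative assumptions, and profiles are assumed to be non-constant.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpressed_groups(profiles, threshold=0.9):
    """Group genes connected by correlations at or above the threshold."""
    genes = list(profiles)
    group_of = {g: {g} for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if group_of[g] is group_of[h]:
                continue                      # already in the same group
            if pearson(profiles[g], profiles[h]) >= threshold:
                merged = group_of[g] | group_of[h]
                for member in merged:
                    group_of[member] = merged
    unique = {id(s): s for s in group_of.values()}
    return sorted(sorted(s) for s in unique.values())


# Hypothetical expression profiles over four time points.
toy_profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # same shape as geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],   # opposite trend
    "geneD": [1.0, 3.0, 2.0, 4.0],   # only loosely similar to geneA
}
toy_groups = coexpressed_groups(toy_profiles)
```

With these toy values, geneA and geneB end up in one group, while geneC and geneD each stay alone.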

Microarray Data Analysis: Classification: Uses known groups of interest (from other sources) to learn the features associated with these groups in the primary data, and to create rules for associating the data with the groups of interest. Often called “supervised machine learning.”
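To illustrate the supervised setting, here is a sketch of a nearest-centroid classifier: it learns one average expression profile per known group and assigns a new sample to the group with the closest average. This is just one simple choice of learning rule, not a method named in the slides; the labels and expression values are invented.

```python
def train_centroids(samples, labels):
    """Average the expression vectors of each labeled group."""
    sums, counts = {}, {}
    for vec, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, value in enumerate(vec):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(center, vec))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Hypothetical two-gene expression measurements with known group labels.
toy_samples = [[5.0, 1.0], [6.0, 2.0], [1.0, 5.0], [2.0, 6.0]]
toy_labels = ["tumor", "tumor", "normal", "normal"]
toy_centroids = train_centroids(toy_samples, toy_labels)
prediction = classify(toy_centroids, [5.0, 2.0])
```

The "rule" learned here is simply the pair of centroids; a new sample near the tumor centroid is labeled as tumor.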

Genetic Networks: What is a genetic network? Individual genes have a function (e.g. transforming a substance or binding to a substance). Sets of functions, when sequenced, can produce pathways (e.g. the output of one transformation is the input to another). Sets of pathways, as they interact with other pathways, create a genetic network of interactions.

Genetic Networks: Reconstructing genetic regulatory networks: Hard problem. Given N genes, the number of possible networks connecting them grows exponentially. Relationships are not generally +/- but are continuous valued. Must use knowledge about expected function and membership in pathways to prune the list of possible network interactions.
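A quick back-of-the-envelope sketch of why the search space is so large. It assumes the simplest possible model, where each directed gene-to-gene interaction is either present or absent and self-interactions are excluded; continuous-valued relationships, as noted above, only make the space larger. The function name is hypothetical.

```python
def count_possible_networks(n_genes):
    """Count directed edges and on/off networks for n_genes (no self-loops)."""
    possible_edges = n_genes * (n_genes - 1)    # ordered pairs of distinct genes
    possible_networks = 2 ** possible_edges     # each edge present or absent
    return possible_edges, possible_networks


edges_3, networks_3 = count_possible_networks(3)
edges_10, networks_10 = count_possible_networks(10)
```

Even 10 genes give 90 possible directed edges and 2^90 candidate networks under this model, which is why prior knowledge is needed to prune the possibilities.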

Comparative Genomics: Large-scale comparison of genomes to understand the biology of individual genomes and to extract general principles applying to groups of genomes. Assumption: many biological sequences, structures, and functions are shared across organisms, so the signal from these organisms can be increased by combining them in analyses.

Comparative Genomics: Important issues for comparative genomics: aligning very large sequences; comparative approaches to gene finding; comparative approaches to assigning function; comparative approaches to identifying key regulatory regions.

Comparative Genomics: Example: assigning protein functions.

Proteomics: What is PROTEOMICS? -OMICS has become the suffix to denote the study of the entire set of something. Genomics: study of all genes. Proteomics: study of all proteins. Transcriptomics: study of all mRNA transcripts. Metabolomics: study of metabolites in the cell.

Proteomics: Proteomics questions: Which proteins are made from the genome? What is their 3D structure? Where are they? What do they do? Which other proteins do they interact with? Are they modified in the cell post-translationally?

Proteomics: Key proteomic technologies: 3D structure determination (X-ray/NMR). 2D gels to assess all the proteins in a cell. Mass spectrometry to identify proteins and protein modifications. Yeast two-hybrid systems to assess protein-protein interactions. Protein arrays to assess all proteins in a cell using antibodies or other recognition technology.

Biomedical Ontologies: In order to communicate effectively we need a common language and basic knowledge. Example: metabolic pathways. Language: names of products, enzymes, substrates and pathways. Knowledge: what is a reaction, how do enzymes and substrates participate, what are the legal components of a pathway.

Biomedical Ontologies: Gene Ontology (http://www.geneontology.org/). Used to classify gene function. A controlled listing of three types of function: Molecular Function, Biological Process, Cellular Component.

Biological Text Mining: Literature in biomedicine: Much literature generated quickly. 11 million citations in MEDLINE. 400,000 added yearly. Need methods to deal with data: query, summarize, organize, understand.

Long-term challenges: Computational model of physiology. Can we give a medication to a computer before we give it to a human? Design of new compounds for medical and industrial use. Can we design a protein or nucleic acid to have a specified function? Engineering new biological pathways. Can we devise methods for designing and implementing new metabolic capabilities for treating disease? Data mining for new knowledge. Can we ask computer programs to examine data (in the context of our models) and create new knowledge?