Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,

Slides:



Advertisements
Similar presentations
Genomes and Proteomes genome: complete set of genetic information in organism gene sequence contains recipe for making proteins (genotype) proteome: complete.
Advertisements

From Genome to Proteome Juang RH (2004) BCbasics Systems Biology, Integrated Biology.
© Wiley Publishing All Rights Reserved. Using Nucleotide Sequence Databases.
How to identify peptides October 2013 Gustavo de Souza IMM, OUS.
Improving the Sensitivity of Peptide Identification for Genome Annotation Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown.
PepArML: A model-free, result-combining peptide identification arbiter via machine learning Xue Wu, Chau-Wen Tseng, Nathan Edwards University of Maryland,
Protein Sequencing and Identification by Mass Spectrometry.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Proteomics: A Challenge for Technology and Information Science CBCB Seminar, November 21, 2005 Tim Griffin Dept. Biochemistry, Molecular Biology and Biophysics.
Annotating genomes using proteomics data Andy Jones Department of Preclinical Veterinary Science.
ProReP - Protein Results Parser v3.0©
Proteomic Characterization of Alternative Splicing and Coding Polymorphism Nathan Edwards Center for Bioinformatics and Computational Biology University.
Novel Peptide Identification using ESTs and Sequence Database Compression Nathan Edwards Center for Bioinformatics and Computational Biology University.
Proteomics Josh Leung Biology 1220 April 13 th, 2010.
Proteomics Informatics (BMSC-GA 4437) Course Director David Fenyö Contact information
Evaluated Reference MS/MS Spectra Libraries Current and Future NIST Programs.
Tryptic digestion Proteomics Workflow for Gel-based and LC-coupled Mass Spectrometry Protein or peptide pre-fractionation is a prerequisite for the reduction.
Chapter 6 Gene Prediction: Finding Genes in the Human Genome.
Genome Sequencing & App. of DNA Technologies Genomics is a branch of science that focuses on the interactions of sets of genes with the environment. –
Comparison of chicken light and dark meat using LC MALDI-TOF mass spectrometry as a model system for biomarker discovery WP 651 Jie Du; Stephen J. Hattan.
PROTEIN STRUCTURE NAME: ANUSHA. INTRODUCTION Frederick Sanger was awarded his first Nobel Prize for determining the amino acid sequence of insulin, the.
Improving Genome Annotation using Proteomics Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park.
Improving the Reliability of Peptide Identification by Tandem Mass Spectrometry Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology.
Proteomic Characterization of Alternative Splicing and Coding Polymorphism Nathan Edwards Center for Bioinformatics and Computational Biology University.
Nathan Edwards Center for Bioinformatics and Computational Biology
Top-down characterization of proteins in bacteria with unsequenced genomes Nathan Edwards Georgetown University Medical Center.
Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University.
Direct Experimental Observation of Functional Protein Isoforms by Tandem Mass Spectrometry Nathan Edwards Center for Bioinformatics and Computational Biology.
Generating Peptide Candidates from Protein Sequence Databases for Protein Identification via Mass Spectrometry Nathan Edwards Informatics Research.
Laxman Yetukuri T : Modeling of Proteomics Data
Protein bioinformatics and systems biology Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center.
Improving the Sensitivity of Peptide Identification for Genome Annotation Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown.
Genome Annotation Rosana O. Babu.
Proteomic Characterization of Alternative Splicing and Coding Polymorphism Nathan Edwards Center for Bioinformatics and Computational Biology University.
Lecture 9. Functional Genomics at the Protein Level: Proteomics.
Poster produced by Faculty & Curriculum Support (FACS), Georgetown University Medical Center Application of meta-search, grid-computing, and machine-learning.
Biological databases Exercises. Discovery of distinct sequence databases using ensembl.
In-Gel Digestion Why In-Gel Digest?
Genomics II: The Proteome Using high-throughput methods to identify proteins and to understand their function.
PEAKS: De Novo Sequencing using Tandem Mass Spectrometry Bin Ma Dept. of Computer Science University of Western Ontario.
From Genomes to Genes Rui Alves.
Bioinformatics and Computational Biology
Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.
Peptide Identification via Tandem Mass Spectrometry Sorin Istrail.
Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland,
Improving the Sensitivity of Peptide Identification Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical.
Faster, more sensitive peptide identification from tandem mass spectra by sequence database compression Nathan J. Edwards Center for Bioinformatics & Computational.
Overview of Mass Spectrometry
A New Strategy of Protein Identification in Proteomics Xinmin Yin CS Dept. Ball State Univ.
Separates charged atoms or molecules according to their mass-to-charge ratio Mass Spectrometry Frequently.
Lecture-9 MS Techniques and Protein Identification Huseyin Tombuloglu, Phd GBE423 Genomics & Proteomics.
Aggressive Enumeration of Peptide Sequences for MS/MS Peptide Identification Nathan Edwards Center for Bioinformatics and Computational Biology.
Poster produced by Faculty & Curriculum Support (FACS), Georgetown University Medical Center  Peptide sequence databases, meta-search engine, machine-learning.
Improving the Sensitivity of Peptide Identification by Meta-Search, Grid-Computing, and Machine-Learning Nathan Edwards Georgetown University Medical Center.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
Improving the Sensitivity of Peptide Identification for Genome Annotation Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown.
Protein Sequence Databases for Proteomics The good, the bad & the ugly US HUPO: Bioinformatics for Proteomics Nathan Edwards – March 12, 2006.
Poster produced by Faculty & Curriculum Support (FACS), Georgetown University Medical Center Application of meta-search, grid-computing, and machine-learning.
Top-down characterization of proteins in bacteria with unsequenced genomes Colin Wynne Catherine Fenselau University of Maryland, College Park Nathan Edwards.
Proteomics: Technology and Cell Signaling Presenter: Ido Tal Advisor: Prof. Michal Linial י " ג סיון תשע " ה.
Peptide de novo sequencing Peptide de novo sequencing is the analytical process that derives a peptide’s amino acid sequence from its tandem mass spectrum.
Post translational modification n- acetylation Peptide Mass Fingerprinting (PMF) is an analytical technique for identifying unknown protein. Proteins to.
bacteria and eukaryotes
The Syllabus. The Syllabus Safety First !!! Students will not be allowed into the lab without proper attire. Proper attire is designed for your protection.
Genome organization and Bioinformatics
Introduction to Bioinformatics II
Proteomics Informatics David Fenyő
Shotgun Proteomics in Neuroscience
Proteomics Informatics David Fenyő
Presentation transcript:

Novel Peptide Identification using ESTs and Genomic Sequence Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park

2 Sample Preparation for Peptide Identification Enzymatic Digest and Fractionation

3 Mass Spectrometer Ionizer Sample + _ Mass Analyzer Detector MALDI Electro-Spray Ionization (ESI) Time-Of-Flight (TOF) Quadrapole Ion-Trap Electron Multiplier (EM)

4 Single Stage MS MS m/z

5 Tandem Mass Spectrometry (MS/MS) Precursor selection m/z

6 Tandem Mass Spectrometry (MS/MS) Precursor selection + collision induced dissociation (CID) MS/MS m/z

7 Peptide Identification For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well Peptide sequences from protein sequence databases Swiss-Prot, IPI, NCBI’s nr,... Automated, high-throughput peptide identification in complex mixtures

8 What goes missing? Known coding SNPs Novel coding mutations Alternative splicing isoforms Alternative translation start-sites Microexons Alternative translation frames

9 Why should we care? Alternative splicing is the norm! Only 20-25K human genes Each gene makes many proteins Proteins have clinical implications Biomarker discovery Evidence for SNPs and alternative splicing stops with transcription Genomic assays, ESTs, mRNA sequence. Little hard evidence for translation start site

10 Novel Splice Isoform

11 Novel Splice Isoform

12 Novel Frame

13 Novel Frame

14 Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy

15 Novel Mutation

16 Genomic Peptide Sequences Genomic DNA Exons & introns, 6 frames, large (3Gb → 6Gb) ESTs No introns, 6 frames, large (4Gb → 8Gb) Used by gene, protein, and alternative splicing annotation pipelines Highly redundant, nucleotide error rate ~ 1%

17 Compressed EST Database Six-frame translation of all ESTs Optionally, ESTs that map to a gene Eliminate ORFs < 30 amino-acids Amino-acid 30-mers Observed in at least two ESTs Represent AA 30-mers in C 3 FASTA database Complete, Correct, Compact

18 SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

19 Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI

20 Sequence Databases & CSBH-graphs Original sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI

21 Sequence Databases & CSBH-graphs All k-mers represented by an edge have the same count

22 cSBH-graphs Quickly determine those that occur twice

23 Compressed-SBH-graph ACDEFGI

24 Compressed EST Database Gene centric compressed EST peptide sequence database 20,774 sequence entries ~8Gb vs 223 Mb ~35 fold compression 22 hours becomes 15 minutes E-values improve by similar factor! Makes routine EST searching feasible Search ESTs instead of IPI?

25 Conclusions Peptides identify more than just proteins Compressed peptide sequence databases make routine EST searching feasible cSBH-graph + edge counts + C 2 /C 3 enumeration algorithms Minimal FASTA representation of k-mer sets

26 Collaborators Chau-Wen Tseng, Xue Wu Computer Science Catherine Fenselau, Crystal Harvey Biochemistry Calibrant Biosystems Thanks to PeptideAtlas, X!Tandem