Making Sense of DNA and Protein Sequences. Based on an NCBI minicourse. Presented by Jae-Hyung Lee, ISU, June 14, 2007.

Objectives
1. In this lab, we will first try to make sense of the DNA sequence by determining whether it codes for a protein.
2. If it does, we will use the protein sequence to search for any motifs or structural domains and to try to predict its function.
3. Finally, we will map the protein sequence onto the structure of a protein with a similar sequence.

Chromosome to protein
[Figure labels: chromosome, nucleosome, DNA, protein. Example protein sequence: MESARSQVQV HAPVVAARLR NIWPKFPKWL HEAPLAVAWE VTRLFMHCKV DLEDESLGLK YDPSWSTARD VTDIWKTLYRL …]

Drosophila melanogaster
–4 chromosomes
–180 Mb total sequence
–140 Mb euchromatic sequence
–12, ,000 genes
–Genome database - FlyBase
ISMB '99 tutorial: The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster

LOCUS       AE bp DNA linear INV 13-JAN-2006
DEFINITION  Drosophila melanogaster chromosome 2L, section 7 of 83 of the complete sequence.
ACCESSION   AE AE AE AE
VERSION     AE GI:
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to )
  AUTHORS   Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D., Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F., George,R.A., Lewis,S.E., Richards,S., Ashburner,M., Henderson,S.N., Sutton,G.G., Wortman,J.R., Yandell,M.D., Zhang,Q., Chen,L.X., Brandon,R.C., Rogers,Y.H., Blazej,R.G., Champe,M., Pfeiffer,B.D., Wan,K.H., Doyle,C., Baxter,E.G., Helt,G., Nelson,C.R., Gabor,G.L., Abril,J.F., Agbayani,A., An,H.J., Andrews-Pfannkoch,C., Baldwin,D., Ballew,R.M., Basu,A., Baxendale,J., Bayraktaroglu,L., Beasley,E.M., Beeson,K.Y., Benos,P.V., Berman,B.P., Bhandari,D., Bolshakov,S., Borkova,D., Botchan,M.R., Bouck,J., Brokstein,P., Brottier,P., Burtis,K.C., Busam,D.A., Butler,H., Cadieu,E., Center,A., Chandra,I., Cherry,J.M., Cawley,S., Dahlke,C., Davenport,L.B., Davies,P., de Pablos,B., Delcher,A., Deng,Z., Mays,A.D., Dew,I., Dietz,S.M., Dodson,K., Doup,L.E., Downes,M., Dugan-Rocha,S., Dunkov,B.C., Dunn,P., Durbin,K.J., Evangelista,C.C., Ferraz,C., Ferriera,S., Fleischmann,W., Fosler,C., Gabrielian,A.E., Garg,N.S., Gelbart,W.M., Glasser,K., Glodek,A., Gong,F., Gorrell,J.H., Gu,Z., Guan,P., Harris,M., Harris,N.L., Harvey,D., Heiman,T.J., Hernandez,J.R., Houck,J., Hostin,D., Houston,K.A., Howland,T.J., Wei,M.H., Ibegwam,C., Jalali,M., Kalush,F., Karpen,G.H., Ke,Z., Kennison,J.A., Ketchum,K.A., Kimmel,B.E., Kodira,C.D., Kraft,C., Kravitz,S., Kulp,D., Lai,Z., Lasko,P., Lei,Y., Levitsky,A.A., Li,J., Li,Z., Liang,Y., Lin,X., Liu,X., Mattei,B., McIntosh,T.C., McLeod,M.P., McPherson,D., Merkulov,G., Milshina,N.V., Mobarry,C., Morris,J., Moshrefi,A., Mount,S.M., Moy,M., Murphy,B., Murphy,L., Muzny,D.M., Nelson,D.L., Nelson,D.R., Nelson,K.A., Nixon,K., Nusskern,D.R., Pacleb,J.M., Palazzolo,M., Pittman,G.S., Pan,S., Pollard,J., Puri,V., Reese,M.G., Reinert,K., Remington,K., Saunders,R.D., Scheeler,F., Shen,H., Shue,B.C., Siden-Kiamos,I., Simpson,M., Skupski,M.P., Smith,T., Spier,E., Spradling,A.C., Stapleton,M., Strong,R., Sun,E., Svirskas,R., Tector,C., Turner,R., Venter,E., Wang,A.H., Wang,X., Wang,Z.Y., Wassarman,D.A., Weinstock,G.M., Weissenbach,J., Williams,S.M., WoodageT, Worley,K.C., Wu,D., Yang,S., Yao,Q.A., Ye,J., Yeh,R.F., Zaveri,J.S., Zhan,M., Zhang,G., Zhao,Q., Zheng,L., Zheng,X.H., Zhong,F.N., Zhong,W., Zhou,X., Zhu,S., Zhu,X., Smith,H.O., Gibbs,R.A., Myers,E.W., Rubin,G.M. and Venter,J.C.
  TITLE     The genome sequence of Drosophila melanogaster
  JOURNAL   Science 287 (5461), (2000)
  PUBMED
GenBank AE003584
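The header of a GenBank record like the one above can be read programmatically. The following is a minimal sketch, assuming the standard flat-file layout in which a keyword occupies the first 12 columns and continuation lines leave those columns blank. The function name `parse_genbank_header` and the shortened sample record are illustrative only; repeated keywords (such as multiple REFERENCE blocks) simply overwrite one another here.

```python
def parse_genbank_header(text):
    """Collect GenBank header keywords (DEFINITION, SOURCE, ORGANISM, ...) into a dict.

    Assumption: keywords sit in columns 1-12 and values start at column 13.
    Parsing stops at the feature table, the sequence, or the end-of-record marker.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        key = line[:12].strip()
        value = line[12:].strip()
        if line.startswith("//") or key in ("FEATURES", "ORIGIN"):
            break
        if key:
            # A new keyword (top-level or indented sub-keyword) starts a new field.
            current = key
            fields[current] = value
        elif current is not None and value:
            # Blank keyword columns mean the previous field continues.
            fields[current] = (fields[current] + " " + value).strip()
    return fields


# A shortened sample built only from fields shown in the record above.
SAMPLE = "\n".join([
    "DEFINITION".ljust(12) + "Drosophila melanogaster chromosome 2L, section 7 of 83 of the",
    " " * 12 + "complete sequence.",
    "KEYWORDS".ljust(12) + ".",
    "SOURCE".ljust(12) + "Drosophila melanogaster (fruit fly)",
    "  ORGANISM".ljust(12) + "Drosophila melanogaster",
])

header = parse_genbank_header(SAMPLE)
print(header["DEFINITION"])
print(header["ORGANISM"])
```

For real work, an established parser is a safer choice than hand-rolled column slicing; this sketch only shows how the layout maps to fields.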

Outline
How to query eukaryotic DNA sequences
–Predict coding region/exons (GenScan)
–Obtain protein product (GenScan)
–Identify motifs/sites (ScanProsite)
–Search for similar sequences (BLASTp)
–Predict function (COG)
–Perform multiple sequence alignment (Multalin)
–Obtain 3-D structure template (CDD)

GenScan
GENSCAN is a general-purpose gene identification program that can be used to analyze genomic DNA sequences from a variety of organisms, including human, other vertebrates, invertebrates and plants.
For each sequence, the program determines the most likely "parse" (i.e., gene structure) under a probabilistic model of the gene structural and compositional properties of the genomic DNA for the given organism.
This set of exons/genes is then printed to a text output file together with the corresponding predicted peptide sequences.
A graphical (PostScript) output may also be created, which displays the location and DNA strand of each predicted exon.
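To give a feel for what "finding a coding region" means at its simplest, the sketch below scans all six reading frames for runs that begin with ATG and end at a stop codon, then translates them. This is not GENSCAN's method: it ignores introns, splice sites and the probabilistic model entirely, so it is only a rough illustration. The function names and the short test sequence are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=3):
    """List (strand, frame, start, protein) for each ATG...stop run in six frames.

    `start` is the 0-based offset of the ATG on the strand being read.
    Runs with no stop codon before the end of the sequence are not reported.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start, protein = None, []
            for i in range(frame, len(seq) - 2, 3):
                aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
                if start is None:
                    if aa == "M":
                        start, protein = i, ["M"]
                elif aa == "*":
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame, start, "".join(protein)))
                    start, protein = None, []
                else:
                    protein.append(aa)
    return orfs


# Hypothetical 19-base sequence containing one short open reading frame.
for orf in find_orfs("CCATGGAAAGCGCGTAAGG"):
    print(orf)
```

On eukaryotic genomic DNA such as the Drosophila sequence used in this lab, a plain ORF scan will miss or fragment genes that contain introns, which is why a gene-structure predictor is used instead.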

ScanProsite
The ScanProsite tool can:
1) scan protein sequence(s), taken from the UniProt Knowledgebase (Swiss-Prot/TrEMBL), from PDB, or provided by the user, for the occurrence of motifs (patterns, profiles and rules) stored in the PROSITE database
2) search protein databases for "hits" by specific motifs
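A PROSITE pattern is close to a regular expression, so a pattern scan can be sketched in a few lines. The example below converts the core pattern syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends) into a Python regex. It is a simplified illustration, not the ScanProsite implementation: profiles and rules are not handled, nor are rarer syntax forms. The function names and the test peptide are made up; the pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count: x(2) or x(2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # [ABC] and single letters are already valid regex.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based position, matched text) for every hit, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(scan("MNGSAANPTQNLTK", "N-{P}-[ST]-{P}"))
```

The lookahead wrapper in `scan` is a design choice that lets overlapping hits be reported, which a plain `finditer` would skip.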

BLAST
By finding similarities between sequences, scientists can:
–infer the function of newly sequenced genes
–predict new members of gene families
–explore evolutionary relationships
Now that whole genomes are being sequenced, sequence similarity searching can be used to predict the location and function of protein-coding and regulatory regions in genomic DNA.
Basic Local Alignment Search Tool (BLAST) is the tool most frequently used for calculating sequence similarity. BLAST comes in variations for use with different query sequences against different databases. BLASTp searches the protein sequence databases for sequences similar to the user's protein query.
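The "local alignment" in BLAST's name can be illustrated with the Smith-Waterman recurrence, which finds the best-scoring pair of matching regions inside two sequences. This is a swapped-in technique for illustration: BLAST itself is a faster heuristic and scores proteins with a substitution matrix rather than the flat match/mismatch values assumed here. The function name and scoring parameters are illustrative choices.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a simple linear gap penalty and keeps only two rows of the
    dynamic-programming table, since only the score is reported.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment PLAVA contributes five matches.
print(smith_waterman_score("HEAPLAVAWE", "XXPLAVAXX"))
```

Because cells are clamped at zero, unrelated flanking residues do not penalize a well-conserved core region, which is what makes the alignment "local".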

COG
Orthologs are genes derived from a common ancestor through vertical descent.
–This is often stated as the same gene in different species.
In contrast, paralogs are genes within the same genome that have evolved by duplication.
Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages.
–Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
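One common heuristic for proposing ortholog candidates between two genomes is the reciprocal best hit: gene A's best match in the other genome is gene B, and B's best match back is A. The sketch below shows that idea on a toy score table. It is a simplification and not the COG construction procedure itself, which compares many genomes at once. All gene names and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward, backward = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical similarity scores between two fly genes and two worm genes.
fly_vs_worm = {("fly1", "worm1"): 250, ("fly1", "worm2"): 90,
               ("fly2", "worm1"): 120, ("fly2", "worm2"): 60}
worm_vs_fly = {("worm1", "fly1"): 250, ("worm1", "fly2"): 120,
               ("worm2", "fly1"): 90, ("worm2", "fly2"): 60}

print(reciprocal_best_hits(fly_vs_worm, worm_vs_fly))
```

In this toy table fly2 also matches worm1 best, but worm1 prefers fly1, so fly2 is left unpaired. A case like this is where paralogs may be involved and closer inspection is needed.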

MultAlin
Multiple sequence alignment (MSA) is the arrangement of several protein or nucleic acid sequences with postulated gaps so that similar residues are juxtaposed.
Multalin creates an MSA from a group of related sequences using progressive pairwise alignments.
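The building block of a progressive method is the pairwise alignment, where gaps are inserted so that similar residues line up. The sketch below is a plain Needleman-Wunsch global alignment with a linear gap penalty. It shows only the pairwise step, not how Multalin orders or merges alignments, and the scoring values and function name are assumptions made for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])      # residue in a faces a gap in b
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")           # residue in b faces a gap in a
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


top, bottom, total = global_align("MESARSQ", "MESRSQ")
print(top)
print(bottom)
print(total)
```

A progressive aligner repeats this kind of step, first on the most similar pair and then on growing groups of already aligned sequences.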

CDD
Conserved domain
–Recurring unit in molecular evolution, whose extents can be determined by sequence and structure analysis
–Performs a particular function
–Represented as a local MSA of proteins containing the domain
Conserved domain database (CDD) search allows you to match your protein sequence to a template in a library of conserved protein domains, generate a multiple sequence alignment based on this match, and explore 3D modeling templates for your sequence.