Making Sense of DNA and Protein Sequences Based on a NCBI minicourse Presented by Jae-Hyung Lee, ISU June 14, 2007.


1 Making Sense of DNA and Protein Sequences Based on a NCBI minicourse Presented by Jae-Hyung Lee, ISU June 14, 2007

2 Objectives
1. In this lab, we will first try to make sense of the DNA sequence by determining whether it codes for a protein.
2. If it does, we will use the protein sequence to search for the presence of any motifs or structural domains and also to try to predict its function.
3. Finally, we will map the protein sequence onto the structure of a protein with similar sequence.

3 Chromosome to protein
[Figure labels: Chromosome, nucleosome, DNA, protein]
http://www.phschool.com/science/biology_place/biocoach/translation/overview.html
MESARSQVQV HAPVVAARLR NIWPKFPKWL HEAPLAVAWE VTRLFMHCKV DLEDESLGLK YDPSWSTARD VTDIWKTLYRL …
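The DNA-to-protein step on this slide can be sketched in a few lines of Python. This is a minimal illustration using the standard genetic code. The function name `translate` and the short input sequence are made up for the example and are not taken from the minicourse.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence in frame 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are reported as "X".
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical coding sequence chosen to spell the first five residues on the slide.
print(translate("ATGGAATCTGCTCGTTAA"))
```

The sketch assumes the sequence is already spliced and in the correct reading frame, which is exactly what a genomic sequence does not give you for free.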

4 Drosophila melanogaster
– 4 chromosomes
– 180 Mb total sequence
– 140 Mb euchromatic sequence
– 12,000-14,000 genes
– Genome database: FlyBase
ISMB '99 tutorial: The challenge of annotating a complete eukaryotic genome: A case study in Drosophila melanogaster

5 GenBank AE003584
LOCUS       AE003584  346734 bp  DNA  linear  INV  13-JAN-2006
DEFINITION  Drosophila melanogaster chromosome 2L, section 7 of 83 of the complete sequence.
ACCESSION   AE003584 AE002637 AE002638 AE014134
VERSION     AE003584.5 GI:55380426
KEYWORDS    .
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 346734)
  AUTHORS   Adams,M.D., Celniker,S.E., Holt,R.A., Evans,C.A., Gocayne,J.D., Amanatides,P.G., Scherer,S.E., Li,P.W., Hoskins,R.A., Galle,R.F., George,R.A., Lewis,S.E., Richards,S., Ashburner,M., Henderson,S.N., Sutton,G.G., Wortman,J.R., Yandell,M.D., Zhang,Q., Chen,L.X., Brandon,R.C., Rogers,Y.H., Blazej,R.G., Champe,M., Pfeiffer,B.D., Wan,K.H., Doyle,C., Baxter,E.G., Helt,G., Nelson,C.R., Gabor,G.L., Abril,J.F., Agbayani,A., An,H.J., Andrews-Pfannkoch,C., Baldwin,D., Ballew,R.M., Basu,A., Baxendale,J., Bayraktaroglu,L., Beasley,E.M., Beeson,K.Y., Benos,P.V., Berman,B.P., Bhandari,D., Bolshakov,S., Borkova,D., Botchan,M.R., Bouck,J., Brokstein,P., Brottier,P., Burtis,K.C., Busam,D.A., Butler,H., Cadieu,E., Center,A., Chandra,I., Cherry,J.M., Cawley,S., Dahlke,C., Davenport,L.B., Davies,P., de Pablos,B., Delcher,A., Deng,Z., Mays,A.D., Dew,I., Dietz,S.M., Dodson,K., Doup,L.E., Downes,M., Dugan-Rocha,S., Dunkov,B.C., Dunn,P., Durbin,K.J., Evangelista,C.C., Ferraz,C., Ferriera,S., Fleischmann,W., Fosler,C., Gabrielian,A.E., Garg,N.S., Gelbart,W.M., Glasser,K., Glodek,A., Gong,F., Gorrell,J.H., Gu,Z., Guan,P., Harris,M., Harris,N.L., Harvey,D., Heiman,T.J., Hernandez,J.R., Houck,J., Hostin,D., Houston,K.A., Howland,T.J., Wei,M.H., Ibegwam,C., Jalali,M., Kalush,F., Karpen,G.H., Ke,Z., Kennison,J.A., Ketchum,K.A., Kimmel,B.E., Kodira,C.D., Kraft,C., Kravitz,S., Kulp,D., Lai,Z., Lasko,P., Lei,Y., Levitsky,A.A., Li,J., Li,Z., Liang,Y., Lin,X., Liu,X., Mattei,B., McIntosh,T.C., McLeod,M.P., McPherson,D., Merkulov,G., Milshina,N.V., Mobarry,C., Morris,J., Moshrefi,A., Mount,S.M., Moy,M., Murphy,B., Murphy,L., Muzny,D.M., Nelson,D.L., Nelson,D.R., Nelson,K.A., Nixon,K., Nusskern,D.R., Pacleb,J.M., Palazzolo,M., Pittman,G.S., Pan,S., Pollard,J., Puri,V., Reese,M.G., Reinert,K., Remington,K., Saunders,R.D., Scheeler,F., Shen,H., Shue,B.C., Siden-Kiamos,I., Simpson,M., Skupski,M.P., Smith,T., Spier,E., Spradling,A.C., Stapleton,M., Strong,R., Sun,E., Svirskas,R., Tector,C., Turner,R., Venter,E., Wang,A.H., Wang,X., Wang,Z.Y., Wassarman,D.A., Weinstock,G.M., Weissenbach,J., Williams,S.M., WoodageT, Worley,K.C., Wu,D., Yang,S., Yao,Q.A., Ye,J., Yeh,R.F., Zaveri,J.S., Zhan,M., Zhang,G., Zhao,Q., Zheng,L., Zheng,X.H., Zhong,F.N., Zhong,W., Zhou,X., Zhu,S., Zhu,X., Smith,H.O., Gibbs,R.A., Myers,E.W., Rubin,G.M. and Venter,J.C.
  TITLE     The genome sequence of Drosophila melanogaster
  JOURNAL   Science 287 (5461), 2185-2195 (2000)
   PUBMED   10731132
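To show how the header of such a record can be read programmatically, here is a small sketch that pulls the fields out of the LOCUS line. The helper `parse_locus` is hypothetical and simply splits on whitespace, which happens to work for this record. Real GenBank files are column-oriented and some LOCUS lines carry extra or missing fields, so a maintained parser (for example Biopython's `Bio.SeqIO` with the "genbank" format) is the safer choice in practice.

```python
def parse_locus(line: str) -> dict:
    """Split a LOCUS line of the form
    LOCUS <name> <length> bp <molecule> <topology> <division> <date>
    into named fields. Assumes all eight tokens are present."""
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    return {
        "name": fields[1],
        "length": int(fields[2]),   # sequence length in base pairs
        "molecule": fields[4],      # DNA, RNA, mRNA, ...
        "topology": fields[5],      # linear or circular
        "division": fields[6],      # INV = invertebrate division
        "date": fields[7],          # date of last modification
    }


record = parse_locus("LOCUS AE003584 346734 bp DNA linear INV 13-JAN-2006")
print(record["name"], record["length"], record["division"])
```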

6 http://genome.ucsc.edu

7 Outline
How to query eukaryotic DNA sequences
– Predict coding region/exons (GenScan)
– Obtain protein product (GenScan)
– Identify motifs/sites (ScanProsite)
– Search for similar sequences (BLASTp)
– Predict function (COG)
– Perform multiple sequence alignment (Multalin)
– Obtain 3-D structure template (CDD)

8 GenScan GENSCAN is a general-purpose gene identification program that can be used to analyze genomic DNA sequences from a variety of organisms, including human, other vertebrates, invertebrates and plants. For each sequence, the program determines the most likely "parse" (i.e., gene structure) under a probabilistic model of the gene structural and compositional properties of the genomic DNA for the given organism. This set of exons/genes is then printed to a text output file together with the corresponding predicted peptide sequences. A graphical (PostScript) output may also be created, which displays the location and DNA strand of each predicted exon. http://bioinfo.hku.hk/genscanintro.html
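For intuition about what "finding coding regions on either strand" involves, the sketch below scans all six reading frames for simple open reading frames (ATG to stop). This is deliberately not GENSCAN's method: it has no probabilistic model and knows nothing about introns, splice sites or organism-specific composition. The names `find_orfs` and `reverse_complement` and the test sequence are invented for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = 3):
    """Return (strand, frame, protein) for each ATG...stop ORF in six frames.

    Only ORFs closed by a stop codon and at least min_codons residues long
    are reported; an ATG inside an open ORF is treated as an internal Met.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = None  # None while we are outside an ORF
            for i in range(frame, len(seq) - 2, 3):
                aa = CODON_TABLE.get(seq[i:i + 3], "X")
                if protein is None:
                    if aa == "M":
                        protein = ["M"]
                elif aa == "*":
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame, "".join(protein)))
                    protein = None
                else:
                    protein.append(aa)
    return orfs


# Hypothetical fragment with one short ORF in forward frame 2.
print(find_orfs("CCATGGAATCTGCTCGTTAAGG"))
```

On real eukaryotic genomic DNA an ORF scan like this misses most genes, because coding exons are interrupted by introns. That is the gap a gene-structure model such as GENSCAN is designed to fill.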

9 ScanProsite The ScanProsite tool can: 1) scan protein sequence(s), taken from the UniProt Knowledgebase (Swiss-Prot/TrEMBL), from PDB, or provided by the user, for the occurrence of motifs (patterns, profiles and rules) stored in the PROSITE database; 2) search protein databases for "hits" by specific motifs.
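PROSITE patterns are written in a compact syntax that maps closely onto regular expressions, so pattern scanning can be illustrated with a short converter. The sketch below handles the common elements: `x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` for repeats, and `<` / `>` for the sequence ends. It does not cover profiles, rules, or the rarer case of `>` inside square brackets. The function names and the test sequence are hypothetical; the pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex."""
    regex = pattern.rstrip(".")                          # patterns may end with a period
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = 2 to 4 residues
    regex = regex.replace("<", "^").replace(">", "$")    # N- / C-terminal anchors
    return regex


def scan(sequence: str, pattern: str):
    """Return (1-based start, matched text) for every hit, including overlapping ones."""
    # The lookahead lets the regex engine report overlapping matches.
    lookahead = re.compile(f"(?=({prosite_to_regex(pattern)}))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(scan("MNGTANPSQNKSA", "N-{P}-[ST]-{P}"))
```

Short patterns like this one match by chance very often, so a hit is only a candidate site, not evidence that the site is functional.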

10 BLAST By finding similarities between sequences, scientists can:
–infer the function of newly sequenced genes
–predict new members of gene families
–explore evolutionary relationships
Now that whole genomes are being sequenced, sequence similarity searching can be used to predict the location and function of protein-coding and regulatory regions in genomic DNA. Basic Local Alignment Search Tool (BLAST) is the tool most frequently used for calculating sequence similarity. BLAST comes in variations for use with different query sequences against different databases. BLASTp searches protein sequence databases for sequences similar to the user's protein query sequence.
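The "local alignment" in BLAST's name can be made concrete with the classic Smith-Waterman dynamic program, sketched below. This is the exact algorithm that BLAST approximates with a faster word-seeding heuristic, not BLAST itself. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions for the example; real protein searches use a substitution matrix such as BLOSUM62 and separate gap-open and gap-extend penalties.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic-programming matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a fresh local alignment here
                prev[j - 1] + s,      # align a[i-1] with b[j-1]
                prev[j] + gap,        # gap in b
                curr[j - 1] + gap,    # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The shared segment ESAR (4 matches x 2) gives the best local score.
print(local_alignment_score("MESARSQ", "XXESARXX"))
```

Because a cell may reset to 0, poorly matching flanks do not drag the score down. That is what makes the alignment local rather than global.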

11 COG Orthologs are genes derived from a common ancestor through vertical descent. –This is often stated as the same gene in different species. In contrast, paralogs are genes within the same genome that have evolved by duplication. Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages. –Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.
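One way to picture how ortholog groups spanning at least three lineages can be assembled is to look for "triangles" of best hits between three genomes. The sketch below is a simplification: it requires a reciprocal best hit on every side of the triangle, which is stricter than the published COG procedure, and it does not merge triangles into larger clusters. The protein names, genome labels and `best_hit` table are made-up toy data. In practice the table would come from all-against-all protein comparisons.

```python
from itertools import combinations


def is_reciprocal(p, q, genome_of, best_hit):
    """True if p's best hit in q's genome is q, and q's best hit in p's genome is p."""
    return (best_hit.get((p, genome_of[q])) == q
            and best_hit.get((q, genome_of[p])) == p)


def find_triangles(genome_of, best_hit):
    """Return protein triples from three different genomes whose three pairs
    are all reciprocal best hits."""
    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[p] for p in trio}) < 3:
            continue  # need three distinct genomes (lineages)
        if all(is_reciprocal(p, q, genome_of, best_hit)
               for p, q in combinations(trio, 2)):
            triangles.append(trio)
    return triangles


# Toy data: a2 is a paralog of a1 in genome A; its hits are not reciprocated.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
best_hit = {  # (query protein, target genome) -> best-scoring protein there
    ("a1", "B"): "b1", ("b1", "A"): "a1",
    ("a1", "C"): "c1", ("c1", "A"): "a1",
    ("b1", "C"): "c1", ("c1", "B"): "b1",
    ("a2", "B"): "b1", ("a2", "C"): "c1",
}
print(find_triangles(genome_of, best_hit))
```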

12 Multalin Multiple sequence alignment (MSA) is the arrangement of several protein or nucleic acid sequences with postulated gaps so that similar residues are juxtaposed. Multalin creates an MSA from a group of related sequences using progressive pairwise alignments.
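Once sequences are arranged so that similar residues share a column, the alignment can be summarized column by column. The sketch below does not build an alignment (that is the job of a tool such as Multalin). It takes already-aligned rows, with `-` for gaps, and reports a majority consensus. The function name, the 50% threshold and the toy alignment are assumptions for illustration.

```python
from collections import Counter


def consensus(alignment, threshold: float = 0.5) -> str:
    """Majority consensus of an MSA: the residue found in more than `threshold`
    of the rows of a column, or '.' when no residue reaches that share."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    out = []
    for column in zip(*alignment):          # iterate over alignment columns
        residue, count = Counter(column).most_common(1)[0]
        if residue != "-" and count / len(column) > threshold:
            out.append(residue)
        else:
            out.append(".")
    return "".join(out)


# Hypothetical aligned rows; '-' marks a postulated gap.
rows = ["MESARSQ",
        "MES-RSQ",
        "MDSAKSQ",
        "MKT-HSQ"]
print(consensus(rows))
```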

13 CDD Conserved domain
–Recurring unit in molecular evolution, whose extents can be determined by sequence and structure analysis
–Performs a particular function
–Represented as a local MSA of proteins containing the domain
A Conserved Domain Database (CDD) search allows you to match your protein sequence to a template in a library of conserved protein domains, generate a multiple sequence alignment based on this match, and explore 3D modeling templates for your sequence.

