Genome Analysis: Databases, Sequence Formats and Visualization Tools Lecture 2. Genome Analysis: Databases, Seq Formats & Visualization.


Objectives of today's lecture
- Understand the purpose and use of bioinformatics database resources such as GenBank, UniProt/Swiss-Prot, Entrez and Ensembl.
- Be able to recognize common database data formats and sequence features, and to use sequence and genome browsers.
- Know what kinds of tools are available to visualize sequence data.
- Appreciate the issues surrounding bioinformatics database updating.

Biological Databases and Data Models

Databases in general
Also check out the annual “web-software” issue of NAR every July.

Databases
- Organized array of information
- On the WWW or local
- A place where you put things in and (if all goes well!) can get them out again
- Allows you to make discoveries

Useful Databases

Primary (archival)
- GenBank/EMBL/DDBJ (seqs)
- PDB (protein structures)
- Medline (literature)
- IMEx databases (protein interactions)

Secondary (curated)
- RefSeq (seqs)
- UniProt: Swiss-Prot (seqs)
- Taxon (taxonomy)
- PROSITE (binding sites)
- OMIM (genetics literature/reviews)
- IMEx databases (protein interactions)

Sequence Databases
- DNA
  - NCBI (National Center for Biotechnology Information): GenBank -> RefSeq
  - EBI (European Bioinformatics Institute): EMBL
- Protein
  - NCBI: GenPept
  - EBI: UniProt: TrEMBL -> UniProt: Swiss-Prot (TrEMBL = “translated EMBL”)

NCBI: GenBank -> RefSeq

Further Readings!!!

EBI: EMBL

UniProt: Swiss-Prot, TrEMBL

UniProt: Swiss-Prot: An example of curated, reviewed annotation
Incorporates:
- Function of the protein
- Subcellular localization of the protein
- Post-translational modification
- Domains and sites
- Secondary structure
- Quaternary structure
- Similarities to other proteins
- Diseases associated with deficiencies in the protein
- Sequence conflicts, variants, etc.

INSDC - International Nucleotide Sequence Database Collaboration
[Diagram: the three partner databases exchange data with one another, and each receives its own submissions and updates: GenBank (NCBI, NIH; accessed through Entrez), EMBL (EBI; accessed through SRS) and DDBJ (CIB, NIG = National Institute of Genetics; accessed through getentry).]

File Formats

GenBank Flat File
Slide callouts: Header (Title, Taxonomy, Citation), Features (AA seq), DNA Sequence.

LOCUS       AF bp DNA linear BCT 19-AUG-1999
DEFINITION  Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
ACCESSION   AF
VERSION     AF GI:
KEYWORDS    .
SOURCE      Pseudomonas fluorescens.
  ORGANISM  Pseudomonas fluorescens
            Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae; Pseudomonas.
REFERENCE   1 (bases 1 to 591)
  AUTHORS   Brinkman,F.S., Schoofs,G., Hancock,R.E. and De Mot,R.
  TITLE     Influence of a putative ECF sigma factor on expression of the major outer membrane protein, OprF, in Pseudomonas aeruginosa and Pseudomonas fluorescens
  JOURNAL   J. Bacteriol. 181 (16), (1999)
  MEDLINE
  PUBMED
REFERENCE   2 (bases 1 to 591)
  AUTHORS   De Mot,R.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-DEC-1998) F.A. Janssens Laboratory of Genetics, Applied Plant Sciences, K. Mercierlaan 92, Heverlee B-3001, Belgium
FEATURES             Location/Qualifiers
     source
                     /organism="Pseudomonas fluorescens"
                     /strain="M114"
                     /db_xref="taxon:294"
     gene
                     /gene="sigX"
     CDS
                     /gene="sigX"
                     /codon_start=1
                     /transl_table=11
                     /product="ECF sigma factor SigX"
                     /protein_id="AAD "
                     /db_xref="GI: "
                     /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQ
                     RTLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYR
                     KERRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELE
                     FQEIADIMHMGLSATKMRYKRALDKLREKFAGETET"
BASE COUNT  157 a 133 c 170 g 131 t
ORIGIN
        1 atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag
       61 ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg
      121 cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac
      181 gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag
      241 gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag
      301 tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag
      361 gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg
      421 gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa
      481 tttcaggaga tcgcagacat catgcacatg ggtttgagtg cgacaaaaat gcgttacaaa
      541 cgtgctctag ataaattgcg tgagaaattt gcaggcgaga ctgaaactta g
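As a rough illustration of how the DNA Sequence part of a GenBank flat file can be read, the sketch below collects the bases that follow the ORIGIN keyword. The function name `origin_sequence` and the toy record are made up for this example; real records should normally be read with an established parser such as those in the Open Bio toolkits.

```python
def origin_sequence(record_text):
    """Return the bases found between the ORIGIN line and the '//' terminator."""
    in_origin = False
    chunks = []
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin and line.strip():
            # Each sequence line is a position number followed by blocks of bases.
            chunks.extend(line.split()[1:])
    return "".join(chunks)


# Toy record (hypothetical, shortened lines): only the parts the function needs.
example = """LOCUS       EXAMPLE
ORIGIN
        1 atgaataaag cccaaacgct
       21 atccacgcgc
//
"""

seq = origin_sequence(example)
print(seq, len(seq))
```

Counting the letters of the returned string gives a quick consistency check against the BASE COUNT line of a record.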

EMBL Flat File
Slide callouts: Header (Title, Taxonomy, Citation), Features (AA seq), DNA Sequence.

ID   AF standard; DNA; PRO; 591 BP.
AC   AF115338;
SV   AF
DT   03-JUN-1999 (Rel. 59, Created)
DT   23-AUG-1999 (Rel. 60, Last updated, Version 2)
DE   Pseudomonas fluorescens ECF sigma factor SigX (sigX) gene, complete cds.
KW   .
OS   Pseudomonas fluorescens
OC   Bacteria; Proteobacteria; gamma subdivision; Pseudomonadaceae; Pseudomonas.
RN   [1]
RP
RX   MEDLINE;
RA   Brinkman F.S., Schoofs G., Hancock R.E., De Mot R.;
RT   "Influence of a putative ECF sigma factor on expression of the major outer
RT   membrane protein, OprF, in Pseudomonas aeruginosa and Pseudomonas
RT   fluorescens";
RL   J. Bacteriol. 181(16): (1999).
RN   [2]
RP
RA   De Mot R.;
RT   ;
RL   Submitted (04-DEC-1998) to the EMBL/GenBank/DDBJ databases.
RL   F.A. Janssens Laboratory of Genetics, Applied Plant Sciences, K.
RL   Mercierlaan 92, Heverlee B-3001, Belgium
DR   SPTREMBL; Q9X4L7; Q9X4L7.
FH   Key             Location/Qualifiers
FH
FT   source
FT                   /db_xref="taxon:294"
FT                   /organism="Pseudomonas fluorescens"
FT                   /strain="M114"
FT   CDS
FT                   /codon_start=1
FT                   /db_xref="SPTREMBL:Q9X4L7"
FT                   /transl_table=11
FT                   /gene="sigX"
FT                   /product="ECF sigma factor SigX"
FT                   /protein_id="AAD "
FT                   /translation="MNKAQTLSTRYDPRELSDEELVARSHTELFHVTRAYEELMRRYQR
FT                   TLFNVCARYLGNDRDADDVCQEVMLKVLYGLKNLEGKSKFKTWLYSITYNECITQYRKE
FT                   RRKRRLMDALSLDPLEEASEEKALQPEEKGGLDRWLVYVNPIDRGILVLRFVAELEFQE
FT                   IADIMHMGLSATKMRYKRALDKLREKFAGETET"
SQ   Sequence 591 BP; 157 A; 133 C; 170 G; 131 T; 0 other;
     atgaataaag cccaaacgct atccacgcgc tacgaccccc gcgagctctc tgatgaggag        60
     ttggtcgcgc gctcgcatac cgagcttttt cacgtaacgc gcgcctatga agaactgatg       120
     cggcgttacc agcgaacatt atttaacgtt tgtgcgagat atcttgggaa cgatcgcgac       180
     gcagacgatg tctgtcagga agtcatgttg aaggtgctgt atggcctgaa gaacctcgag       240
     gggaaatcga agttcaaaac gtggctctac agcatcacgt acaacgaatg tattacgcag       300
     tatcggaagg aacggcgaaa gcgtcgcttg atggacgcat tgagtcttga ccccctcgag       360
     gaagcgtccg aagaaaaggc gcttcaaccc gaggagaagg gcgggcttga tcgctggctg       420
     gtgtatgtga acccgattga ccgtggaatt ctggtgcttc gatttgtcgc agagctggaa       480
     tttcaggaga tcgcagacat catgcacatg ggtttgagtg cgacaaaaat gcgttacaaa       540
     cgtgctctag ataaattgcg tgagaaattt gcaggcgaga ctgaaactta g                591
//
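Because every EMBL line starts with a two-letter line code (ID, AC, DE, FT, SQ and so on), a first pass over an entry can simply group lines by that code. The sketch below assumes the conventional layout in which the code occupies the first two columns and the data starts at column six; the function name and the toy entry are hypothetical.

```python
from collections import defaultdict


def group_by_line_code(entry_text):
    """Group the lines of one EMBL-style entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):
            break  # '//' terminates the entry
        code = line[:2]
        if not code.strip():
            continue  # sequence lines start with blanks, not a line code
        fields[code].append(line[5:].strip())
    return dict(fields)


# Hypothetical toy entry, not a real database record.
example = """ID   EXAMPLE standard; DNA; PRO; 30 BP.
AC   EXAMPLE1;
DE   Hypothetical example entry,
DE   continued description.
SQ   Sequence 30 BP;
     atgaataaag cccaaacgct atccacgcgc        30
//
"""

fields = group_by_line_code(example)
print(" ".join(fields["DE"]))
```

Multi-line fields such as DE, RT or CC come back as lists, so joining them restores the full text.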

UniProt: Swiss-Prot (a curated DB)
Slide callouts on a simplified copy of the entry: the OS/OC lines are labelled TAXONOMY, the RX line CITATION, the copyright CC block Disclaimer, and the DR lines DATABASE cross-reference. The full entry follows.

ID   CYS3_YEAST STANDARD; PRT; 393 AA.
AC   P31373;
DT   01-JUL-1993 (REL. 26, CREATED)
DT   01-JUL-1993 (REL. 26, LAST SEQUENCE UPDATE)
DT   01-NOV-1995 (REL. 32, LAST ANNOTATION UPDATE)
DE   CYSTATHIONINE GAMMA-LYASE (EC ) (GAMMA-CYSTATHIONASE).
GN   CYS3 OR CYI1 OR STR1 OR YAL012W OR FUN35.
OS   SACCHAROMYCES CEREVISIAE (BAKER'S YEAST).
OC   EUKARYOTA; FUNGI; ASCOMYCOTA; HEMIASCOMYCETES; SACCHAROMYCETALES;
OC   SACCHAROMYCETACEAE; SACCHAROMYCES.
RN   [1]
RP   SEQUENCE FROM N.A., AND PARTIAL SEQUENCE.
RX   MEDLINE; [NCBI, ExPASy, Israel, Japan]
RA   ONO B.-I., TANAKA K., NAITO K., HEIKE C., SHINODA S., YAMAMOTO S.,
RA   OHMORI S., OSHIMA T., TOH-E A.;
RT   "Cloning and characterization of the CYS3 (CYI1) gene of
RT   Saccharomyces cerevisiae.";
RL   J. BACTERIOL. 174: (1992).
CC   -!- CATALYTIC ACTIVITY: L-CYSTATHIONINE + H(2)O = L-CYSTEINE +
CC       NH(3) + 2-OXOBUTANOATE.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- PATHWAY: FINAL STEP IN THE TRANS-SULFURATION PATHWAY SYNTHESIZING
CC       L-CYSTEINE FROM L-METHIONINE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBCELLULAR LOCATION: CYTOPLASMIC.
CC   -!- SIMILARITY: BELONGS TO THE TRANS-SULFURATION ENZYMES FAMILY.
CC
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between the Swiss Institute of Bioinformatics and the EMBL outstation -
CC   the European Bioinformatics Institute. There are no restrictions on its
CC   use by non-profit institutions as long as its content is in no way
CC   modified and this statement is not removed. Usage by and for commercial
CC   entities requires a license agreement (See
CC   or send an to
CC
DR   EMBL; L05146; AAC ; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; L04459; AAA ; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   EMBL; D14135; BAA ; -. [EMBL / GenBank / DDBJ] [CoDingSequence]
DR   PIR; S31228; S
DR   YEPD; 5280; -.
DR   SGD; L ; CYS3. [SGD / YPD]
DR   PFAM; PF01053; Cys_Met_Meta_PP; 1.
DR   PROSITE; PS00868; CYS_MET_METAB_PP; 1.
DR   DOMO; P
DR   PRODOM [Domain structure / List of seq. sharing at least 1 domain]
DR   PROTOMAP; P
DR   PRESAGE; P
DR   SWISS-2DPAGE; GET REGION ON 2D PAGE.
KW   CYSTEINE BIOSYNTHESIS; LYASE; PYRIDOXAL PHOSPHATE.
FT   INIT_MET 0 0
FT   BINDING PYRIDOXAL PHOSPHATE (BY SIMILARITY).
SQ   SEQUENCE 393 AA; MW; 55BA2771 CRC32;
     TLQESDKFAT KAIHAGEHVD VHGSVIEPIS LSTTFKQSSP ANPIGTYEYS RSQNPNRENL
     ERAVAALENA QYGLAFSSGS ATTATILQSL PQGSHAVSIG DVYGGTHRYF TKVANAHGVE
     TSFTNDLLND LPQLIKENTK LVWIETPTNP TLKVTDIQKV ADLIKKHAAG QDVILVVDNT
     FLSPYISNPL NFGADIVVHS ATKYINGHSD VVLGVLATNN KPLYERLQFL QNAIGAIPSP
     FDAWLTHRGL KTLHLRVRQA ALSANKIAEF LAADKENVVA VNYPGLKTHP NYDVVLKQHR
     DALGGGMISF RIKGGAEAAS KFASSTRLFT LAESLGGIES LLEVPAVMTH GGIPKEAREA
     SGVFDDLVRI SVGIEDTDDL LEDIKQALKQ ATN
//

PDB - Protein Data Bank

PDB: what does it provide?
- Protein Data Bank
- Protein and nucleic acid 3D structures
- X-ray, NMR, computationally predicted
- Sequence present

PDB record types: HEADER, COMPND, SOURCE, AUTHOR, DATE, JRNL, REMARK, SEQRES, ATOM COORDINATES

HEADER LEUCINE ZIPPER 15-JUL-93 1DGC 1DGC 2
COMPND GCN4 LEUCINE ZIPPER COMPLEXED WITH SPECIFIC 1DGC 3
COMPND 2 ATF/CREB SITE DNA 1DGC 4
SOURCE GCN4: YEAST (SACCHAROMYCES CEREVISIAE); DNA: SYNTHETIC 1DGC 5
AUTHOR T.J.RICHMOND 1DGC 6
REVDAT 1 22-JUN-94 1DGC 0 1DGC 7
JRNL AUTH P.KONIG,T.J.RICHMOND 1DGC 8
JRNL TITL THE X-RAY STRUCTURE OF THE GCN4-BZIP BOUND TO 1DGC 9
JRNL TITL 2 ATF/CREB SITE DNA SHOWS THE COMPLEX DEPENDS ON DNA 1DGC 10
JRNL TITL 3 FLEXIBILITY 1DGC 11
JRNL REF J.MOL.BIOL. V DGC 12
JRNL REFN ASTM JMOBAK UK ISSN DGC 13
REMARK 1 1DGC 14
REMARK 2 1DGC 15
REMARK 2 RESOLUTION. 3.0 ANGSTROMS. 1DGC 16
REMARK 3 1DGC 17
REMARK 3 REFINEMENT. 1DGC 18
REMARK 3 PROGRAM X-PLOR 1DGC 19
REMARK 3 AUTHORS BRUNGER 1DGC 20
REMARK 3 R VALUE DGC 21
REMARK 3 RMSD BOND DISTANCES ANGSTROMS 1DGC 22
REMARK 3 RMSD BOND ANGLES 3.86 DEGREES 1DGC 23
REMARK 3 1DGC 24
REMARK 3 NUMBER OF REFLECTIONS DGC 25
REMARK 3 RESOLUTION RANGE ANGSTROMS 1DGC 26
REMARK 3 DATA CUTOFF 3.0 SIGMA(F) 1DGC 27
REMARK 3 PERCENT COMPLETION DGC 28
REMARK 3 1DGC 29
REMARK 3 NUMBER OF PROTEIN ATOMS 456 1DGC 30
REMARK 3 NUMBER OF NUCLEIC ACID ATOMS 386 1DGC 31
REMARK 4 1DGC 32
SEQRES 1 A 62 ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG 1DGC 60
SEQRES 2 A 62 ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG 1DGC 61
SEQRES 3 A 62 LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU 1DGC 62
SEQRES 4 A 62 GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL 1DGC 63
SEQRES 5 A 62 ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG 1DGC 64
SEQRES 1 B 19 T G G A G A T G A C G T C 1DGC 65
SEQRES 2 B 19 A T C T C C 1DGC 66
HELIX 1 A ALA A 228 LYS A DGC 67
CRYST P DGC 68
ORIGX DGC 69
ORIGX DGC 70
ORIGX DGC 71
SCALE DGC 72
SCALE DGC 73
SCALE DGC 74
ATOM 1 N PRO A DGC 75
ATOM 2 CA PRO A DGC 76
ATOM 842 C5 C B DGC 916
ATOM 843 C6 C B DGC 917
TER 844 C B 9 1DGC 918
MASTER DGC 919
END 1DGC 920
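To show where the "sequence present" in a PDB file lives, here is a minimal sketch that collects residues from SEQRES records, grouped by chain. It assumes whitespace-separated fields with a non-blank chain identifier, and it ignores columns 73 to 80, where old-style files such as 1DGC carry the entry ID and a line number. The helper name `read_seqres` is invented for this example.

```python
def read_seqres(lines):
    """Collect residue names per chain from PDB SEQRES records."""
    chains = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        # Tokens: record name, row number, chain ID, residue count, residues...
        tokens = line[:72].split()
        chains.setdefault(tokens[2], []).extend(tokens[4:])
    return chains


# SEQRES rows taken from the 1DGC slide; the trailing ID/line-number columns
# are re-attached here only to mimic the old fixed-width layout.
rows = [
    "SEQRES   1 A   62  ILE VAL PRO GLU SER SER ASP PRO ALA ALA LEU LYS ARG",
    "SEQRES   2 A   62  ALA ARG ASN THR GLU ALA ALA ARG ARG SER ARG ALA ARG",
    "SEQRES   3 A   62  LYS LEU GLN ARG MET LYS GLN LEU GLU ASP LYS VAL GLU",
    "SEQRES   4 A   62  GLU LEU LEU SER LYS ASN TYR HIS LEU GLU ASN GLU VAL",
    "SEQRES   5 A   62  ALA ARG LEU LYS LYS LEU VAL GLY GLU ARG",
    "SEQRES   1 B   19    T   G   G   A   G   A   T   G   A   C   G   T   C",
    "SEQRES   2 B   19    A   T   C   T   C   C",
]
lines = [row.ljust(72) + f"1DGC{n:4d}" for n, row in enumerate(rows, start=60)]

chains = read_seqres(lines)
print({chain: len(residues) for chain, residues in chains.items()})
```

The residue counts recovered per chain can be compared with the declared totals in the fourth field (62 for chain A, 19 for chain B).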

Data Formats
- Flat files √
- Many other formats for particular uses: XML, Clustal (for multiple sequence alignments), GFF (for sequence annotation), etc.
- FASTA: simplest!
- High-throughput data file formats: BAM, etc.

FASTA
>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENLEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLEDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPESSDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER

FASTA
> Your favourite gene 1 - yfg1
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIIKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENLEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLEDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPESSDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGER
> Your favourite gene 2 - yfg2
MQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPIVIVDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYENWTITSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVLLEDNSKEWEDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIV
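FASTA really is simple enough to parse in a few lines: a line starting with ">" opens a record, and every following line up to the next ">" is sequence. The sketch below is a minimal reader under that assumption; the function name `read_fasta` and the shortened toy sequences are illustrative only.

```python
def read_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Toy multi-record file with shortened, made-up sequences.
example = """> Your favourite gene 1 - yfg1
MSEYQPSLFA
LNPMGFSPLD
> Your favourite gene 2 - yfg2
MQPSLFALNP
"""

records = read_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```

A dictionary keyed by header is convenient for small files; for genome-scale data a streaming reader from one of the Open Bio toolkits is the more practical choice.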

In GenBank, records are organized for various reasons. Understanding the rationale behind “groupings” and “numbering” systems for such databases is the key to fully taking advantage of database resources - appropriately!

LOCUS vs Accession vs PID vs protein_id: What’s the difference?
- LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases.
- ACCESSION: A unique identifier for that record (particular sequence) in GenBank/EMBL/DDBJ that does not change when the record is updated.
- Nucleotide gi: GenInfo identifier (gi), a unique integer specific to GenBank that changes every time the sequence changes.
- VERSION: System started in 1999 for GenBank/EMBL/DDBJ in which the accession and version together play the same role as the accession and gi number. Format: accession.version
- PID: Protein identifier: a gi number with a g, e or d prefix. There can be one or two on one CDS (coding sequence).
- Protein gi: GenInfo identifier (gi), a GenBank-unique integer that changes every time the sequence changes.
- protein_id: Identifier with the same structure and function as the nucleotide accession with version numbers.

LOCUS, Accession, NID, gi and PID

LOCUS       HSU bp mRNA PRI 21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U GI:
     CDS
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC "
                     /db_xref="PID:g "
                     /db_xref="GI: "

Identifiers in this record:
- LOCUS: HSU40282
- ACCESSION: U40282
- VERSION: U
- GI:
- PID: g
- Protein gi:
- protein_id: AAC

Which of these would you use to cite a sequence in a paper? Can you think of situations where you would use one over another?

Which of these would you use to cite a sequence? When would you use one over another?
- LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases.
- ACCESSION: A unique identifier for that record (particular sequence) in GenBank/EMBL/DDBJ that does not change when the record is updated.
- Nucleotide gi: GenInfo identifier (gi), a unique integer specific to GenBank that changes every time the sequence changes (and can disappear!).
- VERSION: System started in 1999 for GenBank/EMBL/DDBJ in which the accession and version together play the same role as the accession and gi number. Format: accession.version
- PID: Protein identifier: a gi number with a g, e or d prefix. There can be one or two on one CDS (coding sequence).
- Protein gi: GenInfo identifier (gi), a GenBank-unique integer that changes every time the sequence changes.
- protein_id: Identifier with the same structure and function as the nucleotide accession with version numbers.
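The accession.version format can be made concrete with a small helper that splits the two parts, which makes it easy to tell "same record" from "same sequence". This is only a sketch: the function names are invented, and the identifiers `XX000000.1` and `XX000000.2` are hypothetical placeholders rather than real database entries.

```python
def split_version(identifier):
    """Split 'accession.version' into (accession, version or None)."""
    accession, dot, version = identifier.partition(".")
    return accession, int(version) if dot else None


def same_record(a, b):
    """True when two identifiers point at the same record, whatever the version."""
    return split_version(a)[0] == split_version(b)[0]


def same_sequence(a, b):
    """True only when both accession and version match exactly."""
    return split_version(a) == split_version(b) and split_version(a)[1] is not None


old, new = "XX000000.1", "XX000000.2"  # hypothetical identifiers
print(same_record(old, new), same_sequence(old, new))
```

Citing the versioned form pins down the exact sequence that was analysed, while the bare accession keeps pointing at the record as it is updated.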

Briefly… Examples of Functional Divisions
- PAT: Patent
- EST: Expressed Sequence Tags
- STS: Sequence Tagged Site
- GSS: Genome Survey Sequence
- HTG: High Throughput Genome (unfinished)
- HTC: High Throughput cDNA (unfinished)
GenBank overview:

Other Sequence (& related) File Formats
- Historically, a number of other sequence and annotation file formats have been proposed, see:
- The demands of representing NGS data have given rise to additional file formats and data compression standards, some of which you will encounter in this course. The next few slides present an overview of a few of these emergent NGS formats and standards. See:

Other Sequence (& Annotation) File Formats
- FASTQ: FASTA with quality data
- 2bit: compressed DNA sequence format
- SAM/BAM: Sequence Alignment/Map
- GFF/GTF: General Feature Format
- BED/WIG: annotation track data formats

FASTQ
- FASTQ: FASTA “with an attitude” (embedded quality scores). Originally developed at the Sanger Institute to couple (Phred) quality data with sequence, it is now commonly used for the raw read output of NGS machines.
- Various flavors: fastq-sanger, fastq-illumina, fastq-solexa. They differ in the format of the sequence identifier and in the valid range of quality scores. See: /2009/12/16/nar.gkp1137.full
- “…the Sanger version of the FASTQ format has found the broadest acceptance, supported by many assembly and read mapping tools …Therefore, most users will do this conversion very early in their…”

Tail of an example record as shown on the slide (the “@” title line and the start of the sequence are cut off):
GTTGCTTCTGGCGTGGGTGGGGGGG
+EAS54_6_R1_2_1_443_348
*-+*''))**55CCF>>>>>>CCCC
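To make the "embedded quality scores" concrete: in the Sanger flavor each quality character encodes a Phred score as its ASCII code minus 33, and a score Q corresponds to an error probability of 10^(-Q/10). The sketch below decodes one four-line record under that assumption; the function names and the toy read are made up, and the other flavors use different offsets and score ranges that are not handled here.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Sanger offset by default)."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Convert a Phred score into the estimated base-call error probability."""
    return 10 ** (-score / 10)


def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (title, sequence, scores)."""
    title, sequence, plus, quality = (line.rstrip("\n") for line in lines)
    if not title.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return title[1:], sequence, phred_scores(quality)


# Hypothetical toy read.
record = ["@read1", "GATTACA", "+", "!5CI5C!"]
title, sequence, scores = parse_fastq_record(record)
print(title, sequence, scores)
```

Checking that sequence and quality strings have equal length is a cheap sanity test that catches many truncated or mis-paired files.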

Linux, MacOSX or Unix only

2bit File Format
A highly compressed sequence file that stores multiple DNA sequences (up to 4 Gb total) in a compact, randomly accessible format. The file contains masking information as well as the DNA itself.
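The compression idea behind the format's name can be sketched as follows: with four possible bases, each base fits in two bits, so four bases fit in one byte. The code below uses the base-to-bits mapping documented for the UCSC format (T=00, C=01, A=10, G=11, first base in the most significant bits), but it only illustrates the packing step. It does not read or write real .2bit files, which also carry a header, an index, and the N and masking blocks mentioned above.

```python
CODE = {"T": 0, "C": 1, "A": 2, "G": 3}
BASES = "TCAG"


def pack(sequence):
    """Pack an upper-case ACGT string into bytes, four bases per byte."""
    packed = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # pad a short final chunk with zero bits
        packed.append(byte)
    return bytes(packed)


def unpack(packed, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

Because every base sits at a fixed bit position, any sub-range can be decoded without reading the bytes before it, which is what makes the format randomly accessible.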

SAM/BAM
- SAM: a tab-delimited text file that contains a compact and indexable representation of nucleotide sequence alignments
- BAM: binary version of SAM (preferred by IGV)
- I/O format of several NGS tools, see:
- See also: Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25,
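Since SAM is tab-delimited text, one alignment line can be split into its eleven mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) followed by optional tags. The sketch below does just that; the `SamRecord` type and the example read are invented for illustration, and real pipelines would normally rely on SAMtools or a binding to it.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int          # 1-based leftmost mapping position
    mapq: int
    cigar: str
    seq: str
    qual: str
    tags: List[str]   # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line):
    """Parse one SAM alignment line (header lines starting with '@' are not handled)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, _rnext, _pnext, _tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, seq, qual, fields[11:])


# Hypothetical alignment of a 7-base read to the reverse strand of chr1.
line = "read1\t16\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
record = parse_sam_line(line)
is_reverse = bool(record.flag & 0x10)  # bit 0x10: read mapped to the reverse strand
is_unmapped = bool(record.flag & 0x4)  # bit 0x4: read is unmapped
print(record.rname, record.pos, record.cigar, is_reverse, is_unmapped)
```

The FLAG field packs several yes/no properties into one integer, so bitwise tests like the two above are the usual way to query it.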

Gene/General/Generic Feature Formats (GFF)
- A General Feature Format (GFF) file is a relatively simple tab-delimited text file for describing genomic features. Many genome browsers (GBrowse, IGV, etc.) take GFF as input for annotation data.
- There are several slightly but significantly different GFF file formats (GFF, GFF2, GFF3, GTF). The current primary standard is GFF3:

Excerpt of a GFF File
##gff-version 3 1
##sequence-region ctg
ctg123 . gene ID=gene00001;Name=EDEN
ctg123 . mRNA ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . exon ID=exon00001;Parent=mRNA00003
ctg123 . CDS ID=cds00001;Parent=mRNA00001;Name=edenprotein.1
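A GFF3 feature line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with the last column holding `key=value` pairs separated by semicolons. The sketch below parses one such line; the coordinates in the example are illustrative values chosen for this sketch, not recovered from the slide, and the function name is made up.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; attributes become a nested dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("a GFF3 feature line has exactly 9 columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = columns
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = unquote(value)  # GFF3 percent-encodes reserved characters
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": int(start), "end": int(end),  # 1-based, inclusive
        "score": score, "strand": strand, "phase": phase,
        "attributes": attributes,
    }


line = "ctg123\t.\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001;Name=EDEN.1"
feature = parse_gff3_line(line)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["Parent"])
```

The `ID` and `Parent` attributes are what tie exons and CDS segments to their mRNA and the mRNA to its gene, so a browser can rebuild the gene model from flat lines.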

BED File Format
- BED format provides a flexible way to define the data lines that are displayed in an annotation track in a genome browser.
- If your data set is BED-like, but it is very large and you would like to keep it on your own server, you should use the bigBed data format.
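One practical detail when moving between BED and GFF is the coordinate convention: BED starts are 0-based and ends are exclusive, while GFF uses 1-based inclusive coordinates. The sketch below parses the three required BED fields (chrom, chromStart, chromEnd) plus an optional name and converts the interval; the function names and the example feature are hypothetical.

```python
def parse_bed_line(line):
    """Parse the required BED fields and the optional name field."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, chromStart and chromEnd")
    start0, end = int(fields[1]), int(fields[2])
    return {
        "chrom": fields[0],
        "start0": start0,          # 0-based start
        "end": end,                # exclusive end
        "name": fields[3] if len(fields) > 3 else None,
        "length": end - start0,    # half-open intervals make length a simple difference
    }


def to_one_based(start0, end):
    """Convert a BED interval to 1-based inclusive (GFF-style) coordinates."""
    return start0 + 1, end


bed = parse_bed_line("ctg123\t999\t9000\tEDEN")
print(bed["length"], to_one_based(bed["start0"], bed["end"]))
```

Off-by-one errors from mixing the two conventions are a common source of silently shifted annotation tracks.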

WIGgle format
- The Wiggle format is for display of dense, continuous data such as GC percent, probability scores, and transcriptome data.
- If you need to display continuous data that is sparse or contains elements of varying size, use the BedGraph format instead.
- If you have a very large data set and you would like to keep it on your own server, you should use the bigWig data format.
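For dense data, a Wiggle `fixedStep` block states the chromosome, start and step once and then lists one value per line, which is why it is compact. The sketch below expands such a block into explicit (chrom, position, value) triples. It handles only `fixedStep` declarations with `chrom`, `start` and `step` parameters, and skips `track` and comment lines; `variableStep` blocks and the `span` parameter are left out of this illustration.

```python
def expand_fixed_step(lines):
    """Yield (chrom, position, value) for each data line of fixedStep Wiggle blocks."""
    chrom = start = step = None
    index = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "#")):
            continue
        if line.startswith("fixedStep"):
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            start = int(params["start"])  # 1-based position of the first value
            step = int(params["step"])
            index = 0
            continue
        if chrom is None:
            raise ValueError("data line found before a fixedStep declaration")
        yield chrom, start + index * step, float(line)
        index += 1


# Hypothetical GC-fraction track fragment.
wig = ["fixedStep chrom=chr1 start=100 step=10", "0.5", "0.75", "1.0"]
points = list(expand_fixed_step(wig))
print(points)
```

Expanding every value like this is exactly what BedGraph stores explicitly, which is why BedGraph suits sparse data and Wiggle suits dense data.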

EMBOSS Sequence Analysis Suite
emboss.sourceforge.net

Open Bioinformatics Foundation
bioperl / biojava / biopython / bioruby / biosql etc.

Sequence Databases: “Roll your Own”?
- GMOD BioSQL: a lightweight database schema for storing and retrieving (annotated) sequence records using OpenBio software tools.
- GMOD “Chado”: a more complex database schema for storing sequence data, genome feature annotation and a host of other related biological data (initially inspired by Drosophila genome annotation and genetics; supported by many GMOD software tools).

Retrieving Sequence Information: Using integrated database resources such as Entrez

What you may be looking for:
- You heard on CBC about a recently discovered disease gene and want to know more about it.
- You want to build a dataset of DNA sequences upstream of a set of co-expressed genes, to identify common regulatory element sequences.
- Evolutionary, functional, structural analyses, etc.

Entrez: Initial version of this “Pathway to Discovery”
[Diagram: three linked databases (MEDLINE abstracts, nucleotide sequences, protein sequences). Link types: term frequency statistics, literature citations in sequence databases, coding region features, nucleotide sequence similarity, amino acid sequence similarity.]

PubMed Text Neighboring
Example of neighboring titles: “Genetic Analysis of Cancer in Families” and “The Genetic Predisposition to Cancer”.
- Common terms could indicate similar subject matter
- Statistical method
- Weights based on term frequencies within the document and within the database as a whole
- Some terms are better than others
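The idea of weighting terms by their frequency within a document and across the whole database can be illustrated with a generic TF-IDF calculation. This is a textbook weighting scheme used here as a stand-in; it is not claimed to be the formula PubMed actually uses, and the third "document" is an invented contrast case.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {term: weight} dict per document using plain TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


documents = [
    "genetic analysis of cancer in families",
    "the genetic predisposition to cancer",
    "sequence database file formats",  # invented contrast document
]
weights = tfidf_weights(documents)
print(sorted(weights[0].items(), key=lambda item: -item[1])[:3])
```

Terms shared by many documents receive low weights, while rarer terms score higher, which is the sense in which "some terms are better than others" for finding neighbors.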

Entrez began to integrate more data…
[Diagram: linked Entrez resources, including genomes, structures (MMDB structure:function, VAST), protein sequences, GenBank nucleotide sequences, MEDLINE/PubMed with full-text online journals, expression data, and SNP data and maps, connected through links and accession numbers.]

Entrez
- Entrez Help: /books/NBK3837/
- Check out also What’s New (/books/NBK1969/) to keep up on new features added (like the Database of Genomic Structural Variation recently released)
- SFU’s Cenk Sahinalp: international leader in structural variation bioinformatics research
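Entrez records can also be retrieved programmatically. The slides do not cover this, so the following is an added sketch that assumes NCBI's E-utilities efetch endpoint with its db, id, rettype and retmode parameters. The function names are hypothetical, and the accession in the usage note is only an example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of NCBI's E-utilities (assumed here; check the current NCBI docs).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(db, record_id, rettype="fasta", retmode="text"):
    """Build an efetch URL for one record in the given Entrez database."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def fetch_fasta(accession, db="nucleotide"):
    """Download one record as FASTA text (requires network access)."""
    with urlopen(efetch_url(db, accession)) as response:
        return response.read().decode("utf-8")
```

For example, efetch_url("nucleotide", "NM_000546") builds the request URL without contacting the server, and fetch_fasta("NM_000546") would perform the download. NCBI asks heavy users to limit request rates, so bulk retrieval should be throttled.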

BLink

Other Sequence Databases and Sequence Data Visualization

The Ensembl Genomes Database: Focuses on humans and select vertebrates (but a plant version is also available…)

What is Ensembl?
- Publicly available, automated annotation of selected eukaryotic genomes (initially with mammalian focus)
- Open source software (but slightly complicated to set up…)
- Multiple different ways to access data, including programmatic (Perl API)
- Provides access to additional data from other groups (distributed annotation system or DAS)

ENSEMBL – Region in Detail
Check out the “Printable mini-course” at o/website/tutorials/index.html

Generic Model Organism Database (GMOD) Project

BioMart (EnsMart)
A powerful querying system (later: we’ll learn about Ensembl’s Perl API)

Distributed Annotation System (DAS)
- Allows third-party annotation
- Users choose the annotation they are interested in
- Good for specialized feature annotation or for comparison of different methodologies
- Allows you to view different data in a consistent user interface/display
- Open source display focused on eukaryotes: Ensembl
- Open source display for any dataset: Gbrowse
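The core idea of DAS, layering annotation from several independent sources over one set of coordinates, can be illustrated with a small in-memory sketch. This is not the DAS protocol itself, which exchanges data over HTTP. The Feature class, the source names, and the function below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    source: str  # which annotation provider supplied this feature
    kind: str    # e.g. "gene", "terminator"
    start: int   # 1-based inclusive
    end: int     # 1-based inclusive


def features_in_region(sources, chosen, start, end):
    """Collect features from the chosen sources that overlap [start, end].

    `sources` maps a source name to its list of Feature objects, and
    `chosen` is the subset of source names the user wants displayed.
    """
    hits = []
    for name in chosen:
        for feature in sources.get(name, []):
            if feature.start <= end and feature.end >= start:
                hits.append(feature)
    # One consistent ordering regardless of which source a feature came from.
    return sorted(hits, key=lambda f: (f.start, f.end, f.source))
```

Because every source uses the same coordinate system and feature structure, a viewer can draw third-party tracks next to the reference annotation without special handling for each provider.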

Gbrowse: Another genome data viewer with DAS
Tracks shown: Gene, Protein, Metabolic pathways, Regulons, 3D structures, Intergenic sequences, Terminators, DNA sequence, Translation

Gbrowse is used to display genomic data for many projects
- Mouse, Rat, Fly, C. elegans and other animals
- Rice and a number of other plants
- S. cerevisiae and other yeasts
- A number of unicellular eukaryotes
- Many, many prokaryotes
- Other types of data: HapMap, Segmental Duplications, RNA-seq, and other data-specific or type-specific data
- ** Open source package ** (slightly simpler to set up than Ensembl)

Entrez, Ensembl, Gbrowse: What’s the difference?
Entrez
- Search and retrieval system for major databases, including PubMed, Sequences (including genomes), Structures, Taxonomy, etc.
- NCBI (Maryland, USA) centrally hosts Entrez and decides what to host and maintain
- Not open source
Ensembl
- Automated annotation of selected eukaryotic genomes
- EMBL-EBI and the Sanger Institute (Cambridge/Hinxton, UK) centrally host most resources and decide what data to host and maintain
- Open source; you can obtain a local copy and access other DAS data
Gbrowse
- Genome/genomic data viewer
- Very decentralized: anyone can set it up and publicly display any data
- Open source; you can set up a local copy and access other DAS data

Entrez, Ensembl, Gbrowse: Benefits/Disadvantages of each?
Entrez
- Reputable institution: trust in the data
- Maintained by a well established group with a lot of capital
- Perceived as more consistent
- Limited to what they make available
- They make the call on how to display it, analyze it, and classify it
- Some of the analyses are definitely a black box
Ensembl
- Open source: can see how the data is analyzed/processed; lower quality data is NOT necessarily an issue, since a lot of eyes are watching you (wooahh haa haa…)
- Reputable institution: trust in the data
Gbrowse
- Easy to use and set up
- Open source: can see how the data is analyzed/processed
- Anybody can release their data to the world
- Anybody can analyze the data in any way they want and release it to the world

Local Visualization of NGS Data

How do I update or correct errors in the Databases?
- Example: gene names, citations, new protein names, sequencing errors in GenBank…
- But most people don’t bother to correct things that they notice are wrong…
- This creates an increased need for more focused community-based projects

Community Assisted Curation of Subsets of Datasets
- Core curators continually update annotation of a data subset (e.g. a genome)
- Literature review
- Input from the community
- Updates sent in batches to centralized databases -> additional review -> becomes, for example, an NCBI RefSeq
- Examples: WormBase.org, Pseudomonas.com

Ethical issues with bioinformatics databases
- How public and/or open source should biomolecular data be?
- How much should researchers be forced to release data as soon as possible?
- How much analysis of a genome can a researcher publish before the genome sequence is published?
- How do we best organize the data? BIG issue! For example, biomolecular pathway classifications can bias analyses of which pathways are found to be upregulated or downregulated by gene expression analysis

Resources