Chapter 3. THE GENBANK SEQUENCE DATABASE


1 Chapter 3. THE GENBANK SEQUENCE DATABASE

2 Introduction
GenBank, the National Institutes of Health (NIH) genetic sequence database, is an annotated collection of all publicly available nucleotide and protein sequences. GenBank, which is built by the National Center for Biotechnology Information (NCBI), is part of the International Nucleotide Sequence Database Collaboration, together with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL).
Historically, the protein databases preceded the nucleotide databases.
1960s: Dayhoff, Atlas of Protein Sequence and Structure (1965)
1982: DNA sequence databases, EMBL, GenBank, DDBJ
1988: International Nucleotide Sequence Database Collaboration

3 Introduction MAGEST: ESTs and Gene Expression Pattern Database for Halocynthia roretzi Maternal cDNA

4 Introduction NGIC, National Genome Information Center (Korea)

5 Primary and secondary databases
There is an important distinction between primary (archival) and secondary (curated) databases. The primary databases represent experimental results but are not a curated review; curated reviews are found in the secondary databases. GenBank nucleotide sequence records, for example, are derived from the sequencing of a biological molecule that exists in a test tube. Secondary databases are derived from the results held in the primary databases.

6 Format vs. Content: Computer vs. Humans
A DNA sequence record can be represented as a string of nucleotides with some tag or identifier.
FASTA (Pearson format)
>NM_007348
ttttgtccgc ctgccgccgc cgtcccagat attaatcacg gagttccagg gagaaggaac ttgtgaaatg ggggagccgg ctggggttgc cggcaccatg gagtcacctt ttagcccggg actctttcac aggctggatg aagattggga ttctgctctc tttgctgaac tcggttattt cacagacact gatgagctgc aattggaagc agcaaatgag acgtatgaaa acaattttga taatcttgat tttgatttgg atttgatgcc ttgggagtca gacatttggg acatcaacaa ccaaatctgt acagttaaag atattaaggc agaacctcag ccactttctc cagcctcctc aagttattca gtctcgtctc ctcggtcagt ggactcttat tcttcaactc agcatgttcc
>NP_031374
MGEPAGVAGTMESPFSPGLFHRLDEDWDSALFAELGYFTDTDELQLEAANETYENNFDNLDFDLDLMPWESDIWDINNQICTVKDIKAEPQPLSPASSSYSVSSPRSVDSYSSTQHVPEELDLSSSSQMSPLSLYGENSNSLSSAEPLKEDKPVTGPRNKTENGLTPKKKIQVNSKPSIQPKPLLLPAAPKTQTNSSVPAKTIIIQTVPTLMPLAKQQPIISLQPAPTKGQTVLLSQPTVVQLQAPGVLPSAQPVLAVAGGVTQLPNHVVNVVPAPSANSPVNGKLSVTKPVLQSTMRNVGSDIAVLRRQQRMIKNRESACQSRKKKKEYMLGLEARLKAALSENEQLKKENGTLKRQLDEVVSENQRLKVPSPKRRVVCVMIVLAFIILNYGPMSMLEQDSRRMNPSVSPANQRRHLLGFSAKEAQDTSDGIIQKNSYRYDHSVSNDKALMVLTEEPLLYIPPPPCQPLINTTESLRLNHELRGWVHRHEVERTKSRRMTNNQQKTRILQGALEQGSNSQLMAVQYTETTSSISRNSGSELQVYYASPRSYQDFFEAIRRRGDTFYVVSFRRDHLLLPATTHNKTTRPKMSIVLPAININENVINGQDYEVMMQIDCQVMDTR ILHIKSSSVPPYLRDQQRNQTNTFFGSPPAATEATHVVSTIPESLQ
>gi | | Homo sapiens activating transcription factor 6 (ATF6) gene, complete cds.
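To make the "identifier plus string" idea concrete, here is a minimal sketch of a FASTA reader in Python. The function name `parse_fasta` and the shortened example records are illustrative assumptions, not part of the slides; the sketch treats the first token after `>` as the identifier and concatenates all following sequence text.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token after '>'.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            # Remove internal spaces so blocked sequence text is concatenated.
            records[current] += "".join(line.split())
    return records


# Shortened, illustrative input based on the identifiers shown on the slide.
example = """>NM_007348
ttttgtccgc ctgccgccgc
cgtcccagat
>NP_031374
MGEPAGVAGT
"""
records = parse_fasta(example.splitlines())
```

Because the format carries only an identifier and residues, the same reader handles both the nucleotide and the protein record.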

7 The database
There are three important consequences of not having the correct or proper information on the nucleotide record.
1. If a coding sequence is not indicated on a nucleic acid record, it will not be represented in the protein databases.
2. The set of features usable on the nucleotide feature table that are specific to protein sequences themselves is limited.
3. If a coding feature on a nucleotide record contains incorrect information about the protein, this could be propagated to other records in both the nucleotide and protein databases on the basis of sequence similarity.

8 The GenBank flatfile : a dissection
The GenBank flatfile (GBFF) is the elementary unit of information in the GenBank database. It is one of the most commonly used formats in the representation of biological sequences. The GBFF can be separated into three parts: the header, the features, and the nucleotide sequence.
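The three-part structure can be sketched in Python as follows. This is a rough illustration that assumes the standard flatfile keywords `FEATURES` and `ORIGIN` mark the start of the feature table and the sequence, and that `//` ends the record; the slides do not name these keywords. The record text and the function name `split_flatfile` are made up for the example.

```python
from typing import Tuple


def split_flatfile(record: str) -> Tuple[str, str, str]:
    """Split one flatfile record into (header, features, sequence) text."""
    header, features, sequence = [], [], []
    section = header
    for line in record.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line.startswith("FEATURES"):
            section = features  # feature table begins here
        elif line.startswith("ORIGIN"):
            section = sequence  # sequence block begins here
        section.append(line)
    return "\n".join(header), "\n".join(features), "\n".join(sequence)


# Hypothetical, heavily abbreviated record used only for illustration.
RECORD = """\
LOCUS       EXAMPLE1       12 bp    DNA     SYN       01-JAN-2000
DEFINITION  Hypothetical record for illustration.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 acgtacgtac gt
//
"""

header, features, sequence = split_flatfile(RECORD)
```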

9 The Header

10 The Header (locus) ▣ Locus name
1. This element was historically used to represent the locus that was the subject of the record.
2. All letters are uppercase.
3. Most DNA sequence records represented only one genetic locus. Examples: HUMHBB (human β-globin locus), SV40 (simian virus 40).
4. An accession number is now used as the locus name to ensure uniqueness; the name cannot exceed 10 characters.
▣ Sequence length
1. Sequences can range from 1 to 350,000 base pairs.
2. Sequences shorter than 50 bp are seldom accepted, and submission of primer sequences is discouraged.
3. Records of greater than 350 kb are acceptable in the database if the sequence represents a single gene.

11 The Header (locus) ▣ Molecule type
1. The “mol type” usually is DNA or RNA.
2. The acceptable mol types are DNA, RNA, tRNA, rRNA, mRNA, and uRNA.
3. If the tRNA or rRNA has been sequenced directly or via some cDNA intermediate, then tRNA or rRNA is shown as the mol type.
4. If an rRNA gene sequence was obtained via PCR from genomic DNA, then DNA is the mol type.
▣ GenBank division code
1. Three letters, used for taxonomic inference or other classification purposes.
2. The codes recall the time when the various GenBank divisions were used to break up the database files into what was then a more manageable size.
3. Newer function-based divisions represent functional and definable sequence types.

12 The Header (locus) ▣ GenBank division code
EST (Expressed Sequence Tags) : contains "single-pass" cDNA sequences from a number of organisms.

13 The Header (locus) ▣ GenBank division code
GSS (Genome Survey Sequences) : similar to the EST division with the exception that most of the sequences are genomic in origin, rather than cDNA (mRNA). ① random "single pass read" genome survey sequences. ② cosmid/BAC/YAC end sequences ③ exon trapped genomic sequences ④ transposon-tagged sequences

14 The Header (locus) ▣ GenBank division code
STS (Sequence Tagged Sites) : contains sequence and mapping data on short genomic landmark sequences or Sequence Tagged Sites

15 The Header (locus) ▣ GenBank division code ▣ Date
CON (contigged) : In shotgun DNA sequencing projects, a contig (from contiguous) is a set of overlapping DNA segments derived from a single genetic source. ▣ Date 1. The date is the date the record was last made public. 2. It should be noted that none of these dates is legally binding on the promulgating organization.
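The locus-line elements covered on the preceding slides (locus name, sequence length, mol type, division code, date) can be pulled apart with a short Python sketch. It assumes a whitespace-separated line that begins with the keyword `LOCUS` and has `bp` after the length; the example line, the `Locus` tuple, and `parse_locus` are hypothetical names for illustration, and real flatfiles may carry extra tokens between the mol type and the division.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length: int
    mol_type: str
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Parse a whitespace-separated LOCUS line into its elements."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a recognizable LOCUS line")
    # Division and date are read from the end of the line so that any
    # optional token after the mol type does not shift them.
    return Locus(tokens[1], int(tokens[2]), tokens[4], tokens[-2], tokens[-1])


# Hypothetical line for illustration only.
locus = parse_locus("LOCUS       EXAMPLE1     1200 bp    mRNA    EST       01-JAN-2000")

# The slide gives 1 to 350,000 bp as the usual range, with < 50 bp seldom accepted.
typical_length = 50 <= locus.length <= 350_000
```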

16 The Header (definition)
The definition line is the line in the GenBank record that attempts to summarize the biology of the record.
▣ mRNA definition
Genus species product name (gene symbol) mRNA, complete cds.
▣ Genomic record
Genus species product name (gene symbol) gene, complete cds.
▣ Organelle sequences
DEFINITION Genus species protein X (xxx) gene, complete cds;
DEFINITION Genus species XXS ribosomal RNA gene, complete cds;
followed by one of these phrases:
nuclear gene(s) for mitochondrial product(s)
nuclear gene(s) for chloroplast product(s)
mitochondrial gene(s) for mitochondrial product(s)
chloroplast gene(s) for chloroplast product(s)
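As a small illustration of the mRNA and genomic templates, the following Python sketch assembles a definition line from its parts. The function name `definition_line` and its parameters are assumptions made for this example; the output is checked against the ATF6 definition shown on the FASTA slide.

```python
def definition_line(organism: str, product: str, gene_symbol: str,
                    molecule: str = "mRNA",
                    completeness: str = "complete cds") -> str:
    """Build 'Genus species product name (gene symbol) <molecule>, <completeness>.'"""
    return f"{organism} {product} ({gene_symbol}) {molecule}, {completeness}."


# Genomic-record form of the template, using the ATF6 example from the deck.
atf6 = definition_line("Homo sapiens", "activating transcription factor 6",
                       "ATF6", molecule="gene")
```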

17 The Header (accession)
1. The accession number represents the primary key to reference a given record in the database.
2. This is the number that is cited in publications and is always associated with the record.
3. If the sequence is updated, the accession number will not change.
4. Format: “1 + 5” (one uppercase letter followed by five digits) or “2 + 6” (two letters plus six digits).
▣ VERSION
1. The version line contains the Accession.version and the gi. These identifiers are associated with a unique nucleotide sequence.
2. If the sequence changes, the version number in the Accession.version will be incremented by one and the gi will change.
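The two accession formats and the Accession.version identifier lend themselves to a quick pattern check. The Python sketch below is an illustration only: the names `is_accession` and `split_version` and the sample accessions are made up, and it assumes the letters in the "2 + 6" form are uppercase. Identifiers in other styles, such as the NM_007348 shown on the FASTA slide, are deliberately not matched.

```python
import re
from typing import Tuple

# "1 + 5": one uppercase letter and five digits; "2 + 6": two letters and six digits.
_ACCESSION = r"[A-Z]\d{5}|[A-Z]{2}\d{6}"
ACCESSION_RE = re.compile(_ACCESSION)
VERSIONED_RE = re.compile(rf"({_ACCESSION})\.(\d+)")


def is_accession(text: str) -> bool:
    """True if text is a bare accession in the 1 + 5 or 2 + 6 format."""
    return ACCESSION_RE.fullmatch(text) is not None


def split_version(text: str) -> Tuple[str, int]:
    """Split 'Accession.version' into (accession, version number)."""
    match = VERSIONED_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not an Accession.version identifier: {text!r}")
    return match.group(1), int(match.group(2))


# Made-up identifiers for illustration.
accession, version = split_version("AF123456.2")
```

Keeping the version as a separate integer mirrors the rule on the slide: the accession stays fixed while the version is incremented when the sequence changes.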

18 The Header (keywords) ▣ Keyword line
1. The keyword line is another historical relic that is, in many cases, unfortunately misused. 2. NCBI discourages the use of keywords but will include them on request, especially if the words are not present elsewhere in the record or are used in a controlled fashion (EST, STS, GSS, HTG).

19 The Header (source) ▣ Source line
The source line will either have the common name for the organism or its scientific name.

20 The Header (references)
Each GenBank record must have at least one reference or citation.
Published paper: the PubMed identifier provides a link to the PubMed database.
Unpublished paper: can also be submitted as a reference.
Direct submission: serves as a placeholder for a publication.

21 The Header (comment)
This section includes a great variety of notes and comments that refer to the whole record. This section is optional and not found in most records in GenBank. The comment section also contains information about the history of the sequence. If the sequence of a particular record is updated, the comment will contain a pointer to the previous version of the record.

22 The Feature Table
The feature table is the most important direct representation of the biological information in the record. A full set of annotations within the record facilitates quick extraction of the relevant biological features and allows the submitter to indicate why this record was submitted to the databases.

23 The Source Feature
The source feature is the only feature that must be present on all GenBank records. All DNA sequence records have some origin, even if synthetic in the extreme case. Care should be taken to avoid adding superfluous information to the record.

24 The CDS Feature

25 The CDS Feature ▣ Database cross-reference (db_xref)
The CDS feature contains instructions to the reader on how to join two sequences together or on how to make an amino acid sequence from the indicated coordinates and the inferred genetic code.
▣ Database cross-reference (db_xref)
This controlled qualifier allows the databases to cross-reference the sequence in question to an external database with an identifier used in that database.
▣ protein_id
Each protein sequence is assigned a protein_id or protein accession number. The format of this accession number is “3 + 5”, or three letters and five digits. Because amino acid sequences represent one of the most important by-products of the nucleotide sequence database, much attention is devoted to making sure they are valid. These sequences are the starting material for the protein databases and offer the most sensitive way of making new gene discoveries.
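The "join, then translate" instruction carried by a CDS feature can be sketched in Python as below. This is an illustration under stated assumptions: coordinates are 1-based and inclusive, the standard genetic code is used (a real record may specify a different one), and the toy genomic sequence, `join_segments`, and `translate` are invented for the example. The two exons are chosen so that the product matches the first four residues of the NP_031374 protein shown earlier in the deck.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def join_segments(sequence: str, segments: List[Tuple[int, int]]) -> str:
    """Concatenate 1-based, inclusive (start, end) segments of sequence."""
    return "".join(sequence[start - 1:end] for start, end in segments)


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X for unknown codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Toy genomic sequence: flank, exon 1, intron, exon 2, flank.
genomic = "cc" + "ATGGGG" + "gtaagt" + "GAGCCGTAA" + "cc"
cds = join_segments(genomic, [(3, 8), (15, 23)])  # mirrors join(3..8,15..23)
protein = translate(cds)
```

Deriving the protein from the coordinates in this way is also why an error in a CDS feature carries over into the protein record.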

26 The Gene Feature The RNA Feature
The gene feature represents a segment of DNA that can be identified with a name or some arbitrary number, as is often used in genome sequencing projects. The gene feature allows the user to see the gene area of interest and in some cases to select it.
The RNA Feature
Although RNA sequences are presently not instantiated into separate records as protein sequences are, they are essential to our understanding of how higher genomes are organized. The RNA feature on a genomic record should represent the experimental evidence of the presence of that biological molecule.

