Sequence Databases (June 21, 2005)

Learning objectives
Understand how information is stored in GenBank.
Learn how to read a GenBank flat file.
Learn how to search GenBank for information.
Understand the difference between header, features and sequence.
Distinguish between a primary database and a secondary database.
Introduce the ENTREZ platform for biological data analysis.

BIOSEQs
Biological sequence: the central element in the NCBI data model.
Comprises a single continuous molecule of nucleic acid or protein.
Must have at least one sequence identifier (Seq-id).
Carries information on the physical type of molecule (DNA, RNA or protein).
Annotations: refer to specific locations within the Bioseq.
Descriptors: describe the entire Bioseq.

What is GenBank?
Gene sequence database.
Annotated records that each represent a single contiguous stretch of DNA or RNA; a record may have more than one coding region (limit 350 kb).
Generated from direct submissions to the DNA sequence databases by the authors.
Part of the International Nucleotide Sequence Database Collaboration.

General Comments on GBFF (GenBank flat file)
Three sections:
1) Header: information about the whole record.
2) Features: description of annotations, each represented by a key.
3) Nucleotide sequence: each record ends with // on its last line.
The format is DNA-centered: a translated sequence is a feature.
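The three-section layout can be illustrated with a minimal Python sketch. It assumes that the features section begins at a line starting with FEATURES and that the sequence section begins at BASE COUNT or ORIGIN; the function name `split_gbff` and the toy record are hypothetical and not part of GenBank or any library.

```python
def split_gbff(record: str) -> dict:
    """Split one flat-file record into header, features and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in record.splitlines():
        if line.strip() == "//":          # record terminator
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = "sequence"
        sections[current].append(line)
    return sections


# Hypothetical toy record, only for illustration.
toy_record = """\
LOCUS       TOY0001   12 bp   DNA   PLN   01-JAN-2000
DEFINITION  Hypothetical toy record.
FEATURES             Location/Qualifiers
     source          1..12
ORIGIN
        1 aaacccgggt tt
//
"""

parts = split_gbff(toy_record)
for name, lines in parts.items():
    print(name, len(lines))
```

A production parser would also handle continuation lines and several records per file; this sketch only shows where the section boundaries fall.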

Feature Keys
Purpose:
1) Indicates the biological nature of the sequence.
2) Supplies information about changes to sequences.

Feature Key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence

Feature Keys: Terminology

Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydrogenase"
                /gene="adhI"

The feature CDS is a coding sequence beginning at base 23 and ending at base 400. It has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".

Feature Keys: Terminology (cont.)

Feature Key     Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell receptor beta-chain"
                /partial

The feature CDS is a partial coding sequence formed by joining the indicated elements into one contiguous sequence that encodes a product called T-cell receptor beta-chain.
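The location strings used on these slides can be read mechanically. The following Python sketch handles only the forms that appear in this presentation (simple spans, join(...), complement(...), and the < and > partial markers); the function name `parse_location` and the dictionary keys are hypothetical, and other location forms, such as single-base positions, are not covered.

```python
import re

# One span such as 23..400 or <1..206; "<" and ">" mark partial ends.
SPAN = re.compile(r"(<?)(\d+)\.\.(>?)(\d+)")


def parse_location(location: str) -> dict:
    """Parse a simple feature location into 1-based inclusive segments."""
    text = location.replace(" ", "")
    strand = -1 if text.startswith("complement(") else 1
    segments = []
    start_partial = end_partial = False
    for lt, start, gt, end in SPAN.findall(text):
        segments.append((int(start), int(end)))
        start_partial = start_partial or lt == "<"
        end_partial = end_partial or gt == ">"
    return {
        "strand": strand,
        "segments": segments,
        "start_partial": start_partial,
        "end_partial": end_partial,
        "length": sum(end - start + 1 for start, end in segments),
    }


for loc in ["23..400", "join(544..589,688..1032)",
            "complement(3300..4037)", "<1..206"]:
    print(loc, parse_location(loc))
```

Coordinates are kept 1-based and inclusive, as in the flat file, so the length of 23..400 is 378 bases.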

Record from GenBank

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

Field notes:
LOCUS: locus name (SCU49845), followed by sequence length, molecule type, GenBank division (PLN: plant, fungal and algal) and modification date.
DEFINITION: brief description of the record; "cds" refers to a coding region.
ACCESSION: unique identifier (never changes).
VERSION: nucleotide sequence identifier in accession.version form; changes when there is a change in the sequence.
GI: GeneInfo identifier; changes whenever the sequence changes.
KEYWORDS: word or phrase describing the sequence (not based on a controlled vocabulary). Not used in newer records.
SOURCE: common name for the organism.
ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database.
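As a small illustration of how the header fields break down, the LOCUS and VERSION lines of this record can be split on whitespace. This is a sketch that assumes the older field order shown on the slide (name, length, "bp", molecule, division, date); newer records can carry extra fields, so the positions used here are assumptions, and the function names are hypothetical.

```python
def parse_locus(line: str) -> dict:
    """Split a LOCUS line laid out as on the slide into named fields."""
    fields = line.split()
    return {
        "name": fields[1],
        "length": int(fields[2]),   # fields[3] is the unit "bp"
        "molecule": fields[4],
        "division": fields[5],
        "date": fields[6],
    }


def parse_version(line: str) -> dict:
    """Split a VERSION line into accession, version number and GI."""
    fields = line.split()
    accession, version = fields[1].split(".")
    gi = int(fields[2].split(":")[1])
    return {"accession": accession, "version": int(version), "gi": gi}


locus = parse_locus("LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999")
version = parse_version("VERSION U49845.1 GI:1293613")
print(locus)
print(version)
```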

Record from GenBank (cont. 1)

REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA

Field notes:
References are listed with the oldest first.
MEDLINE: Medline UID.
The submitter of the sequence is always the last reference.

Record from GenBank (cont. 2)

FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

There are three parts to a feature: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature, each with a value).

Field notes:
<1..206: partial sequence on the 5' end; the 3' end is complete.
/codon_start: start of the open reading frame.
/product: descriptive free text must be in quotation marks.
/protein_id: protein sequence ID number.
/db_xref: database cross-references.
/translation: here only a partial protein sequence, because the CDS is partial.
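The effect of /codon_start=3 can be checked against the record itself. The sketch below translates the first 120 bases of this record (the two sequence lines shown in its ORIGIN section) starting at base 3, using the standard genetic code; the `translate` helper is a hypothetical name written for this illustration, and it assumes the input contains only a, c, g and t.

```python
# Standard genetic code, codons ordered with bases t, c, a, g at each position.
BASES = "tcag"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")


def translate(dna: str, codon_start: int = 1) -> str:
    """Translate DNA beginning at the 1-based position codon_start."""
    dna = dna.lower()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(codon_start - 1, len(dna) - 2, 3):
        b1, b2, b3 = (BASES.index(base) for base in dna[i:i + 3])
        protein.append(AMINO_ACIDS[16 * b1 + 4 * b2 + b3])
    return "".join(protein)


# Bases 1 to 120 of U49845, as shown in the ORIGIN section of the record.
first_120 = (
    "gatcctccat" "atacaacggt" "atctccacct" "caggtttaga" "tctcaacaac"
    "ggaaccattg" "ccgacatgag" "acagttaggt" "atcgtcgaga" "gttacaagct"
    "aaaacgagca" "gtagtcagct"
)

peptide = translate(first_120, codon_start=3)
print(peptide)  # SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVS
```

The printed peptide matches the beginning of the /translation qualifier of the TCP1-beta CDS, which is what reading from the third base of the record should give.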

Record from GenBank (cont. 3)

     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN..."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ..."

Field notes:
The translations are cut off here (marked by ...).
complement(3300..4037) is another kind of location: the feature lies on the complementary strand.
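A complement(...) location means the feature sequence is read from the opposite strand, that is, as the reverse complement of the stated span. The Python sketch below shows this on a hypothetical 12-base toy sequence rather than the real record; `extract_feature` is an illustrative name, and coordinates are 1-based and inclusive as in the flat file.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def extract_feature(sequence: str, segments, strand: int = 1) -> str:
    """Join the given 1-based inclusive segments; reverse-complement if strand is -1."""
    joined = "".join(sequence[start - 1:end] for start, end in segments)
    if strand == -1:
        joined = joined.translate(COMPLEMENT)[::-1]
    return joined


toy = "aaacccgggttt"  # hypothetical sequence, not taken from U49845

print(extract_feature(toy, [(1, 4)]))             # plain span 1..4
print(extract_feature(toy, [(1, 3), (10, 12)]))   # join(1..3,10..12)
print(extract_feature(toy, [(1, 4)], strand=-1))  # complement(1..4)
```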

Record from GenBank (cont. 4)

BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
...
//
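The sequence section can be read by dropping the position number at the start of each line and joining the ten-base blocks. The sketch below does this for the two sequence lines shown on the slide and also checks that the BASE COUNT totals add up to the 5028 bp given on the LOCUS line; the function names are hypothetical.

```python
def read_origin(lines) -> str:
    """Join ORIGIN sequence lines, dropping the leading position numbers."""
    blocks = []
    for line in lines:
        fields = line.split()
        blocks.extend(fields[1:])  # fields[0] is the position of the first base
    return "".join(blocks)


def parse_base_count(line: str) -> dict:
    """Turn 'BASE COUNT 1510 a 1074 c ...' into {'a': 1510, 'c': 1074, ...}."""
    fields = line.split()[2:]  # skip the words BASE and COUNT
    return {base: int(count) for count, base in zip(fields[0::2], fields[1::2])}


origin_lines = [
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]

fragment = read_origin(origin_lines)
counts = parse_base_count("BASE COUNT 1510 a 1074 c 835 g 1609 t")

print(len(fragment))         # 120 bases on the two lines shown
print(sum(counts.values()))  # 5028, the length on the LOCUS line
```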

EBI Sequence Retrieval System (SRS)
Parts of each EMBL record are parsed into separate database files:
SRS-author
SRS-accession number
SRS-title
SRS-reference
SRS-organism

Primary databases vs. secondary databases
A primary database holds information submitted by the experimenter. It is called an archival database.
A secondary database derives its information from a primary database. It is a curated database.

Types of primary databases carrying biological information
GenBank/EMBL/DDBJ
PDB: three-dimensional structure coordinates of biological molecules.
PROSITE: database of protein domain/function relationships.

Types of secondary databases carrying biological information
dbSTS: non-redundant database of sequence-tagged sites (useful for physical mapping).
Genome databases: there are over 20 genome databases that can be searched.
EPD: eukaryotic promoter database.
NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.
Vector: a subset of GenBank containing vector DNA.
ProDom
PRINTS
BLOCKS
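The idea behind NR, merging entries whose sequences are 100% identical, can be sketched in a few lines of Python. This is only an illustration of the merging rule, not a description of how NR is actually built; the identifiers and sequences are made up, and `merge_identical` is a hypothetical name.

```python
from collections import defaultdict


def merge_identical(records) -> dict:
    """Group record identifiers by exact (case-insensitive) sequence match."""
    merged = defaultdict(list)
    for identifier, sequence in records:
        merged[sequence.upper()].append(identifier)
    return dict(merged)


# Hypothetical entries from different source databases.
records = [
    ("genbank|X0001", "MKTAYIAK"),
    ("embl|Y0002", "MKTAYIAK"),   # identical to the first entry
    ("pdb|1ABC", "MKTAYIAR"),     # differs in the last residue
]

non_redundant = merge_identical(records)
for sequence, identifiers in non_redundant.items():
    print(sequence, identifiers)
```

Each unique sequence is kept once, with the identifiers of all source entries attached to it.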

Databases derived from GenBank containing data for a single gene
DNA databases: Non-redundant (nr), dbGSS, dbHTGS, dbSTS, LocusLink.
RNA (cDNA) databases: dbEST, UniGene, LocusLink.
Protein databases: Non-redundant (nr), Swiss-Prot, PIR, LocusLink.
