DNA Databanks Speaker: Yu-Chung Chang 張猷忠 Institute of Biochemistry


1 DNA Databanks Speaker: Yu-Chung Chang 張猷忠 Institute of Biochemistry
National Yang-Ming University

2 Biological Databases
DNA databanks: GenBank, DDBJ, EMBL,…
Protein databases: PIR, Swiss-Prot, PRF, GenPept, TrEMBL, PDB,…
EST databases: dbEST, DOTS, UniGene, GIs, STACK,…
Structure databases: MMDB, PDB, Swiss-3DIMAGE,…
Pathway databases: KEGG, BRITE, TRANSPATH,…
Integrated databases: SRS
Motif or cis-element databases: Prosite, Pfam, BLOCKS, TransFac, PRINTS, URLs,…
Gene, protein & disease databases: GeneCards, OMIM, OMIA,…
Taxonomy databases
Literature databases: PubMed, Medline,…
Patent databases: Apipa, CA-STN, IPN, USPTO, EPO, Beilstein,…
Others: RNA databases,…

3 DNA Databanks
cDNA resources: GenBank (NCBI), Nucleotide Sequence Database (EMBL), DDBJ, MGC,…
Genomic DNA resources: HTG, dbGSS, GOLD, ERGO,…
EST resources: dbEST, UniGene, GIs, STACK, DOTS,…
Others: dbSTS, UniSTS, dbSNP, TransFac, ISIS, Repbase,…

4 GenBank at National Center for Biotechnology Information (NCBI)
GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. As of February 2001 it holds approximately 11,720,000,000 bases in 10,897,000 sequence records. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

5 NCBI-SITEMAP

6 European Molecular Biology Laboratory (EMBL)
The EMBL Nucleotide Sequence Database constitutes Europe's primary nucleotide sequence resource. Main sources for DNA and RNA sequences are direct submissions from individual researchers, genome sequencing  projects and patent applications.

7 EBI-databases & tools

8 DNA Data Bank of Japan (DDBJ) http://www.ddbj.nig.ac.jp/
Database Search: Getentry, SFgate & WAIS, SRS, Homology Search, TXSearch, SQmatch
Data Analysis: malign, clustal w
Genome Analysis: GTOP
Protein Structure: PDB Retriever, SSThread, LIBRA I

10 Genome Projects
Whole genome sequences
EST projects
MGC projects
SNP projects
GSS projects
STS projects

11 Graphs created on 12 Dec 2000

12 Graphs created on 12 Dec 2000

13 GenBank Sequence Submission Policy
At this time the following types of submissions are NOT acceptable:
Sequences of less than 50 bp in length.
Computer-generated or otherwise predicted sequences (e.g. EST assembled sequences).
Third-party sequences downloaded from a sequence database or journal.
One genomic sequence with multiple exons joined together without the sequence of the intervening introns.
Primer-only sequences.

14 GenBank Sequence Submission Policy (cont.)
At this time the following types of submissions are NOT acceptable:
Protein-only sequences.
Non-biologically contiguous sequences containing internal unsequenced spacers.
Sequences containing a mix of genomic and mRNA sequence represented as a single sequence.
In addition:
EST submissions should be submitted through the dbEST system.
As of 1 January 2000, Genome Survey Sequences (GSSs) should not be submitted through BankIt; use the dbGSS system.
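A few of the criteria above can be checked mechanically before a sequence is submitted. The following is a minimal sketch of such a pre-check, not part of BankIt, Sequin, or any NCBI tool; the function name `presubmission_problems` and the choice to flag any character outside the IUPAC nucleotide alphabet are assumptions made for illustration. Most of the policy (third-party data, joined exons, mixed genomic/mRNA records) cannot be detected from the sequence alone.

```python
import re
from typing import List

MIN_LENGTH_BP = 50  # the policy rejects sequences of less than 50 bp

# IUPAC nucleotide codes; anything else hints at a protein-only sequence or a typo.
# (Assumption: the policy text itself does not define an alphabet.)
NON_NUCLEOTIDE = re.compile(r"[^ACGTUNRYKMSWBDHV]")

def presubmission_problems(sequence: str) -> List[str]:
    """Return the policy problems detectable from the raw sequence alone."""
    seq = re.sub(r"\s+", "", sequence).upper()  # ignore whitespace and case
    problems = []
    if len(seq) < MIN_LENGTH_BP:
        problems.append(f"shorter than {MIN_LENGTH_BP} bp ({len(seq)} bp)")
    if NON_NUCLEOTIDE.search(seq):
        problems.append("contains non-nucleotide characters")
    return problems
```

An empty list only means that none of these two checks fired; it does not mean the submission is acceptable.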

15 Data Submission WWW Bankit WebIn Sakura Sequin Diskette

16 DNA Databases at NCBI
Nucleotides, dbEST, UniGene, dbGSS, dbSTS, UniSTS, RefSeq, MGC, dbSNP, HTGs, UniVec

17 dbEST http://www.ncbi.nlm.nih.gov/dbEST/index.html
dbEST is a database of expressed sequence tags: short, single-pass reads of cDNA (mRNA) sequences. It also includes cDNA sequences from differential display experiments and RACE experiments.

18 dbGSS http://www.ncbi.nlm.nih.gov/dbGSS/index.html
Database of genome survey sequences. Short, single pass read genomic sequences. Exon trapped sequences. Cosmid/BAC/YAC ends. Alu PCR sequences. GSS sequences are available from two sources: dbGSS and the GSS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.

19 dbSTS http://www.ncbi.nlm.nih.gov/dbSTS/index.html
Database of sequence tagged sites. Short sequences that are operationally unique in the genome, used to generate mapping reagents. STS sequences are available from two sources: dbSTS and the STS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.

20 HTGs http://www.ncbi.nlm.nih.gov/HTGS/
High throughput genome sequences from large scale genome sequencing centers. Unfinished (phase 0, 1, 2) and finished (phase 3) sequences. Sequence data in this division are available for BLAST homology searches against either the "htgs" database or the "month" database, which includes all new submissions for the prior month.

21 dbSNP http://www.ncbi.nlm.nih.gov/SNP/
Database of single nucleotide polymorphisms. Small-scale insertions/deletions. Polymorphic repetitive elements. Microsatellite variation.

22 New HTC (High Throughput cDNA) division
At the May 2000 collaborative meeting DDBJ/EMBL/GenBank agreed to create a new database division HTC to represent unfinished High Throughput cDNA sequences. HTC sequences may include  5'UTR and 3'UTR regions and (part of a) coding region. Upon finishing of these sequences, they will be moved to the corresponding taxonomic division. HTC sequence entries will include the keyword 'HTC'. The keyword will be removed once the entry has been included in the taxonomic division.

23 Mammalian Gene Collection (MGC) http://www.ncbi.nlm.nih.gov/MGC/
The Mammalian Gene Collection (MGC) project is a new effort by the NIH to generate full-length complementary DNA (cDNA) resources.

24 Entrez - A search & retrieval system

25 Entrez Searching
Subject searching
Phrase searching
Searching for authors
Searching for unique identifiers
Searching by molecular weight
Range searching
Truncating searching (Wildcard searching)
Combining sets

26 Entrez -Subject searching
Text searching: hiv-1
Subject terms are automatically combined: hiv-1 protease is searched as hiv-1 AND protease

27 Entrez -Phrase searching
“hiv-1 protease”
Using quotes forces Entrez to check a phrase list against which the search terms are matched. It is not adjacency searching. If the search phrase is not in the phrase list, Entrez treats it as a subject search.

28 Entrez -Searching for authors
Chang YC: searches only the author field.
Chang: searches all fields (subject searching).
Do not use punctuation.

29 Entrez -Searching for unique identifiers
Accession numbers
GenBank/EMBL/DDBJ: U12345, AF123456
GenPept: AAA12345
SwissProt & PIR: P12345
RefSeq: NM_123456, NT_123456, NP_123456, NC_123456, XM_123456, XP_123456
Sequence identification numbers
GI numbers:
Version numbers: AF
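The accession examples on this slide follow recognizable letter/digit shapes, which can be turned into a rough classifier. This is a sketch only: the patterns are inferred from the example identifiers shown here (they are not an official specification of accession formats), and the name `classify_accession` is hypothetical. Note that a one-letter, five-digit identifier such as P12345 or U12345 fits both the GenBank/EMBL/DDBJ and the SwissProt/PIR shapes, so the function returns every matching source rather than guessing.

```python
import re

# Patterns inferred from the slide's examples only (assumption, not a formal spec).
ACCESSION_PATTERNS = [
    ("RefSeq", re.compile(r"^(NM|NT|NP|NC|XM|XP)_\d{6}$")),
    ("GenPept", re.compile(r"^[A-Z]{3}\d{5}$")),
    ("GenBank/EMBL/DDBJ", re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})$")),
    ("SwissProt/PIR", re.compile(r"^[A-Z]\d{5}$")),
]

def classify_accession(identifier):
    """Return the names of all sources whose example pattern the identifier fits."""
    ident = identifier.strip().upper()
    return [name for name, pattern in ACCESSION_PATTERNS if pattern.match(ident)]
```

For instance, `classify_accession("NM_123456")` yields only RefSeq, while `classify_accession("P12345")` yields two candidates, which is why the database context matters when searching by identifier.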

30 Entrez -Searching by molecular weight
012345[MOLWT]
010000:050000[MOLWT]
002000:010000[MOLWT] AND human[Organism]
[field name] feature table

31 Entrez -Range searching
Range searching is available for accession numbers [ACCN], sequence length [SLEN], and molecular weight [MOLWT].
AF114696:AF114714[ACCN] (not for GI and Version numbers)
3000:4000[SLEN]
002002:002100[MOLWT]
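The numeric range terms above share one shape, `low:high[FIELD]`, and the molecular weight examples are zero-padded to six digits. A small helper can build such terms consistently. This is a sketch under assumptions: the name `range_term` and its `width` parameter are hypothetical, and the six-digit padding for [MOLWT] is taken from the slide's examples rather than from Entrez documentation.

```python
def range_term(low, high, field, width=0):
    """Build a numeric Entrez range term such as 3000:4000[SLEN].

    width: zero-pad both bounds to this many digits (the slide's MOLWT
    examples use six digits); 0 means no padding.
    """
    if low < 0 or high < 0:
        raise ValueError("bounds must be non-negative")
    if low > high:
        raise ValueError("low bound must not exceed high bound")
    return f"{str(low).zfill(width)}:{str(high).zfill(width)}[{field}]"
```

Called as `range_term(2002, 2100, "MOLWT", width=6)`, it reproduces the slide's `002002:002100[MOLWT]`. Accession ranges such as AF114696:AF114714[ACCN] are not numeric and are simply written out by hand.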

32 Entrez -Truncating searching
Wildcard searching: root word plus *
bacte*, retroviru*
Only the first 150 variations of a truncated term are retrieved.
Left-handed truncation is not possible: *ology
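Right-hand truncation amounts to prefix matching against the terms in an index, capped at 150 expansions. The sketch below imitates that behavior on a local word list. It is an illustration, not Entrez's implementation: the function name `expand_truncated` is hypothetical, and taking the first 150 matches in alphabetical order is an assumption, since the slide does not say how the first 150 variations are chosen.

```python
def expand_truncated(term, index_terms, limit=150):
    """Expand a right-truncated term (e.g. 'bacte*') against a list of index terms."""
    if not term.endswith("*") or "*" in term[:-1]:
        # Left-hand truncation such as '*ology' is rejected, as on the slide.
        raise ValueError("only a single trailing * is supported")
    root = term[:-1].lower()
    if not root:
        raise ValueError("a root word is required before *")
    # Assumption: matches are taken in alphabetical order before applying the cap.
    matches = sorted(t for t in set(index_terms) if t.lower().startswith(root))
    return matches[:limit]
```

The cap matters for short roots: a root with more than 150 variations silently loses the rest, so a longer root gives more predictable results.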

33 Entrez -Combining sets
Use your search History to combine documents: #1 AND #4

34 Entrez -Boolean operators
AND, OR, NOT
bacteria AND virus NOT phage is processed as (bacteria AND virus) NOT phage
hiv-1 OR bacterial protease is processed as hiv-1 OR (bacterial AND protease)
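The first pair of expressions on this slide suggests that explicit Boolean operators are applied from left to right. That reading can be illustrated with Python sets standing in for the record lists of each term. This is a toy model, not Entrez itself: the function `evaluate_left_to_right` and the record IDs are made up, and automatic combination of adjacent subject terms (as in "bacterial protease") is not modeled.

```python
def evaluate_left_to_right(tokens, postings):
    """Evaluate [term, op, term, op, term, ...] strictly left to right.

    postings maps each term to the set of record IDs that contain it.
    """
    if len(tokens) % 2 == 0:
        raise ValueError("expected alternating terms and operators")
    result = set(postings[tokens[0]])
    for op, term in zip(tokens[1::2], tokens[2::2]):
        records = postings[term]
        if op == "AND":
            result &= records   # keep records present in both
        elif op == "OR":
            result |= records   # keep records present in either
        elif op == "NOT":
            result -= records   # drop records containing the term
        else:
            raise ValueError(f"unknown operator: {op}")
    return result

# Hypothetical record IDs for three terms.
POSTINGS = {"bacteria": {1, 2, 3}, "virus": {2, 3, 4}, "phage": {3}}
```

With these sets, `bacteria AND virus NOT phage` first narrows to records 2 and 3 and then removes record 3, which is the same result as the parenthesized form.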

35 Entrez -Boolean operators

36 Entrez -Using limits

37 Entrez -Limit a search to a particular database field
You are only interested in nucleotide sequences from the mouse. Select the Nucleotide database from the black menu bar or the Search pull-down menu. Select Limits. In the "Limited To:" section, select Organism from the Search Field pull-down menu. Type "mouse" without quotes in the query box and select Go.

38 Entrez -Limit a search to a particular database field
You are only interested in protein sequences that are less than 50 amino acids in length. Select the Protein database from the black menu bar or the Search pull-down menu. Select Limits. In the "Limited To:" section, select Sequence Length from the Search Field pull-down menu. Type "0:50" without quotes in the query box and select Go.

39 Entrez -Exclude certain kinds of sequences
You are interested in mitochondrial carriers but you do not want the EST sequences. Select the Nucleotide database from the black menu bar or the Search pull-down menu. Type "mitochondrial carrier" without quotes in the query box. Select Limits. In the "Limited To:" section, check the box next to "Exclude ESTs" and select Go.

40 Entrez -Limit the search to a particular molecule type
You are only interested in Cryptosporidium ribosomal RNA sequences. Select the Nucleotide database from the black menu bar or the Search pull-down menu. Type "cryptosporidium" without quotes in the query box. Select Limits. In the "Limited To:" section, select the "Molecule" pull-down menu and choose rRNA and select Go.

41 Entrez -Limit the search to a particular gene location
You are interested in the genes in the chloroplast of flowering plants. Select the Nucleotide database from the black menu bar or the Search pull-down menu. Type "flowering plants" without quotes in the query box. Select Limits. In the "Limited To:" section, select the "Gene Location" pull down menu and choose chloroplast and select Go.

42 Entrez -Limit the search to records from a particular sequence database
You are interested only in cysteine phosphatase protein sequences submitted directly to PIR. Select the Protein database from the black menu bar or the Search pull-down menu. Type "cysteine phosphatase" without quotes in the query box. Select Limits. In the "Limited To:" section, select the "Only From" pull-down menu and choose PIR and select Go.

43 Entrez -Limit the search by date
You want to see any nucleotide sequences from pigs added to the database (or updated) in the last 30 days. Select the Nucleotide database from the black menu bar or the Search pull-down menu. Type "pigs" without quotes in the query box. Select Limits. In the "Limited To:" section, select Organism from the Search Field pull-down menu. And in the "Limited To:" section, select the "Modification Date" pull down menu and choose 30 days and select Go.

44 Entrez -Limit the search by date
You want to retrieve all mouse or human nucleotide sequences added to the database (or updated) during 1997. Select the Nucleotide database from the black menu bar or the Search pull-down menu. Type "mouse OR human" without quotes in the query box. Select Limits. In the "Limited To:" section, select Organism from the Search Field pull-down menu. And in the "Limited To:" section, select the "Modification Date" pull down menu and choose Modification Date. In the date boxes, type the dates in the format YYYY/MM/DD. You can tab from box to box in the date fields. Select Go.
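The date boxes expect the YYYY/MM/DD format, and getting the zero padding right by hand is error-prone. The sketch below formats a start and end date that way and joins them into a typed range term. Treat it as an assumption-laden illustration: the helper name `mdat_range` is hypothetical, and writing the range as `start:end[MDAT]` combines the colon range syntax and the [MDAT] field tag that appear elsewhere in this transcript rather than anything stated on this slide.

```python
from datetime import date

def mdat_range(start, end):
    """Format two dates as a YYYY/MM/DD modification-date range term."""
    if start > end:
        raise ValueError("start date must not be after end date")
    fmt = "%Y/%m/%d"  # the format requested by the Limits date boxes
    return f"{start.strftime(fmt)}:{end.strftime(fmt)}[MDAT]"
```

For the 1997 example, `mdat_range(date(1997, 1, 1), date(1997, 12, 31))` gives the two values to type into the date boxes, already padded.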

45 Entrez -Using more than one limit at a time
You are interested in the protein translations of human GenBank nucleotide sequences added to the protein database (or updated) in the last 30 days. You do not want patent records. Select the Protein database from the black menu bar or the Search pull-down menu. Type "human" without quotes in the query box. Select Limits. In the "Limited To:" section, select Organism from the Search Field pull-down menu. On the same screen, select the exclude patents check box, select GenBank from the Only From pull-down menu, and finally select 30 days from the Modification Date pull-down menu and select Go.

46 Entrez -Writing advanced search statements
Find all human nucleotide sequences with LTR annotations. In the Nucleotide database use the following expression: LTR[FKEY] AND human[ORGN]
Find drosophila population studies published in the Journal of Molecular Evolution. In the PopSet database use the following expression: j mol evol[JOUR] AND drosophila[ORGN]

47 Entrez -Writing advanced search statements
Find all human protein sequences with lengths between 50 and 60 amino acids and that were entered into the database during 1999. In the Protein database use the following expression - human[ORGN] AND 50[SLEN]:60[SLEN] AND 1999[MDAT]
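Search statements like these can also be sent programmatically. The slides describe only the Entrez web interface; the sketch below assumes NCBI's E-utilities `esearch` endpoint and its `db`, `term`, and `retmax` parameters, which are not mentioned in this presentation. It only builds the request URL and does not contact the server; the helper name `esearch_url` and the lowercase database names are assumptions.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities esearch endpoint (not part of the original slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(database, term, retmax=20):
    """Build (but do not send) an esearch request for an Entrez search statement."""
    params = {"db": database, "term": term, "retmax": retmax}
    # urlencode escapes the square brackets and spaces in the statement.
    return f"{ESEARCH_URL}?{urlencode(params)}"

# The slide's protein example, passed through unchanged.
PROTEIN_QUERY = "human[ORGN] AND 50[SLEN]:60[SLEN] AND 1999[MDAT]"
```

Keeping the search statement as a plain string means the same text can be pasted into the web query box to check the result by hand before automating it.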

49 Feature key or descriptor line
Feature qualifiers

50 Feature Key Name (partial list)
allele, attenuator, CAAT_signal, CDS, enhancer, exon, gene, GC_signal, iDNA, intron, J_region, LTR, misc_binding, misc_feature, mRNA, polyA_signal, polyA_site, STS, 3’UTR, 5’clip
ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt

51 Feature Qualifiers (partial list)
/anticodon /bound_moiety /citation /codon /codon_start /cons_splice /db_xref /direction /EC_number /evidence /function /gene /map /note /organism /phenotype /rpt_family /translation
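In a GenBank flat file, each feature key opens a descriptor line with its location, and the qualifiers follow on indented lines beginning with a slash. The sketch below reads a short, made-up feature-table fragment into Python dictionaries to show how keys and qualifiers relate. It is a simplified illustration: the name `parse_features`, the sample record, and the indentation threshold are assumptions, and multi-line qualifier values (such as a long /translation) are not handled.

```python
QUALIFIER_COLUMN = 21  # assumption: qualifier lines are indented at least this far

def parse_features(lines):
    """Group /qualifier lines under the feature key line that precedes them."""
    features = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        indent = len(line) - len(line.lstrip())
        if text.startswith("/"):
            if features:
                name, _, value = text[1:].partition("=")
                features[-1]["qualifiers"][name] = value.strip('"')
        elif indent < QUALIFIER_COLUMN:
            key, _, location = text.partition(" ")
            features.append({"key": key, "location": location.strip(), "qualifiers": {}})
        # Other deeply indented lines are continuation lines; ignored in this sketch.
    return features

# Hypothetical fragment, not taken from a real record.
SAMPLE = [
    "     CDS             1..206",
    '                     /gene="abc"',
    "                     /codon_start=1",
    "     polyA_signal    250..255",
]
```

Running `parse_features(SAMPLE)` gives one entry per feature key, which mirrors how a search on a feature key field selects records by the keys present in their feature tables.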

53 GenBank EST format

54 GenBank GSS format

55 GenBank GSS format

56 GOLD: Genomes OnLine Database http://wit.integratedgenomics.com/GOLD/
GOLD, the Genomes OnLine Database, is a World Wide Web resource for comprehensive access to information regarding complete and ongoing genome projects around the world.

57 SRS: Sequence Retrieval System http://srs.ebi.ac.uk/

58 SRS: Sequence Retrieval System http://srs.ebi.ac.uk/

59 SRS: Sequence Retrieval System http://srs.ebi.ac.uk/

60 Deambulum http://www.infobiogen.fr/services/deambulum/english/

61 Deambulum http://www.infobiogen.fr/services/deambulum/english/

62 Deambulum: READSEQ http://www. infobiogen

63 NCGR: National Center for Genome Resources GSDB: Genome Sequence Database

64 NCGR: National Center for Genome Resources GSDB: Genome Sequence Database

65 NCGR: National Center for Genome Resources GSDB: Genome Sequence Database

66 BIOBASE http://www.gene-regulation.com/pub/databases.html

67 Repbase http://charon.girinst.org/index.html

68 Minisatellite Database http://minisatellites.u-psud.fr/

69 STRBase http://www.cstl.nist.gov/div831/strbase/index.htm

70 ASDB http://devnull.lbl.gov:8888/alt/

71 ASDB http://devnull.lbl.gov:8888/alt/

72 ASDB http://devnull.lbl.gov:8888/alt/

73 ISIS http://www.introns.com/

74 ExInt: An Exon-Intron Database of Eukaryotic Organisms http://intron

75 ExInt: An Exon-Intron Database of Eukaryotic Organisms http://intron

76 ExInt: An Exon-Intron Database of Eukaryotic Organisms http://intron

77 MethDB http://www.methdb.de./
The purpose of this database is to provide the scientific community with a resource to:
store DNA methylation data
search for methylation patterns and profiles
correlate methylation and expression data of genes

78 Small RNA Database http://mbcr.bcm.tmc.edu/smallRNA/smallrna.html

79 UTRs http://igs-server.cnrs-mrs.fr/~gauthere/UTR/index.html

80 UTRdb & UTRsite http://bigarea.area.ba.cnr.it:8000/EmbIT/UTRHome/

81 TARD http://wwwicg.bionet.nsc.ru/SRCG/Translation/
Gene expression is often regulated at the level of mRNA translation. The structural characteristics of mRNA correlate with translation efficiency and specificity. Determination of "active elements" could be very useful for prediction of the gene expression pattern under both normal and stress conditions, because not all mRNAs can be translated under stress. Prediction of the gene expression pattern might be useful for biotechnology and cDNA analysis.

82 RAD: RNA Abundance Database http://www.cbil.upenn.edu/RAD2/
RAD (RNA Abundance Database) is a public gene expression database designed to hold data from array-based (microarrays, high-density oligo arrays, macroarrays) and nonarray-based (SAGE) experiments. The ultimate goal is to allow comparative analysis of experiments performed by different laboratories using different platforms and investigating different biological systems.

83 RAD: RNA Abundance Database http://www.cbil.upenn.edu/RAD2/

84 Cook your food by yourself
Farms: Sequencing centers
Markets: Nucleotide databases
Restaurants: Value-added databases
Cooking skills: Bioinformatics

85 Exercise
Please try to write a search statement for finding all mouse nucleotide sequences with CDS annotations.

