National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov (Note: put this into the databases section, and put lecture 5 at the end.) Page 24.


1 National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov (Note: put this into the databases section, and put lecture 5 at the end.) Page 24

2 www.ncbi.nlm.nih.gov Fig. 2.5 Page 25

3 Fig. 2.5 Page 25

4

5 PubMed is… the National Library of Medicine's search service; 16 million citations in MEDLINE; links to participating online journals; PubMed tutorial (via “Education” on the sidebar) Page 24

6 Entrez integrates… the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24

7 Entrez is a search and retrieval system that integrates NCBI databases Page 24

8 BLAST is… the Basic Local Alignment Search Tool; NCBI's sequence similarity search tool; supports analysis of DNA and protein databases; 100,000 searches per day Page 25

9 OMIM is… Online Mendelian Inheritance in Man; a catalog of human genes and genetic disorders; edited by Dr. Victor McKusick and others at JHU Page 25

10 Cancer Chromosomes: integrates cytogenetic, clinical, and reference information from the NCI Mitelman Database of Chromosome Aberrations in Cancer, the NCI Recurrent Aberrations in Cancer database, and the NCI/NCBI SKY/M-FISH & CGH Database.

11 CDD: the Conserved Domain Database, a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. Select 'Domains' from the Entrez pull-down menu.

12 CoreNucleotide: contains all nucleotide sequences not included in the EST or GSS subsets.
3D Domains: contains protein domains from the Entrez Structure database.
EST: a Nucleotide database subset that contains only Expressed Sequence Tag records.
Gene: genes and associated information for a number of organisms, including human.

13 Genome: genomes of over 1,200 organisms can be found in this database, representing both completely sequenced organisms and those for which sequencing is in progress.
Genome Project: a searchable collection of complete and incomplete (in-progress) large-scale sequencing, assembly, annotation, and mapping projects for cellular organisms.
dbGaP: associated genotype and phenotype data.
GENSAT: gene expression atlas of the mouse central nervous system.

14 GEO Datasets: curated gene expression and molecular abundance DataSets from NCBI's Gene Expression Omnibus, a gene expression and hybridization array repository.
GEO Profiles: individual gene expression and molecular abundance profiles assembled from the GEO repository.
Source: http://www.ncbi.nlm.nih.gov/About/tools/restable_mol.html

15 Books is… a searchable resource of online books Page 26

16 TaxBrowser is… a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses); taxonomy information such as genetic codes; molecular data on extinct organisms Page 26

17 Structure site includes… the Molecular Modeling Database (MMDB); biopolymer structures obtained from the Protein Data Bank (PDB); Cn3D (a 3D-structure viewer); the Vector Alignment Search Tool (VAST) Page 26

18 Accessing information on molecular sequences Page 26

19 Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

20 What is an accession number? An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig
rs7079946    dbSNP (single nucleotide polymorphism)
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)
NP_006735    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record
(The slide groups these examples under the labels DNA, RNA, and protein.) Page 27

21 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 27 Note: LocusLink at NCBI was recently retired. The third printing of the book has updated these sections (pages 27-31).

22 Four ways to access protein and DNA sequences [1] Entrez Gene with RefSeq. Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735). Page 27

23 From the NCBI home page, type “rbp4” and hit “Go” revised Fig. 2.7 Page 29

24 revised Fig. 2.7 Page 29

25

26

27 After applying limits, just two entries remain

28 revised Fig. 2.8 Page 30 Entrez Gene (top of page) Note that links to many other RBP4 database entries are available

29 Entrez Gene (middle of page)

30 Entrez Gene (bottom of page)

31 Fig. 2.9 Page 32

32 Fig. 2.9 Page 32

33 Fig. 2.9 Page 32

34 FASTA format Fig. 2.10 Page 32
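A FASTA record consists of a header line that starts with ">" followed by one or more lines of sequence. The following is a minimal Python sketch of a reader for that layout. The function name `parse_fasta` and the sample records are illustrative assumptions and do not come from the slides or from Fig. 2.10.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example records (not real sequences).
example = """>example_1 hypothetical protein
MKWVWALLLL
AALGSGRA
>example_2 second record
ACGT
"""
records = parse_fasta(example)
```

Each tuple pairs the header text (without the leading ">") with the sequence joined into a single string.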

35 What is an accession number? An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig
rs7079946    dbSNP (single nucleotide polymorphism)
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)
NP_006735    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record
(The slide groups these examples under the labels DNA, RNA, and protein.) Page 27

36 NCBI’s important RefSeq project: best representative sequences. RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats:
Complete genome        NC_######
Complete chromosome    NC_######
Genomic contig         NT_######
mRNA (DNA format)      NM_######    e.g. NM_006744
Protein                NP_######    e.g. NP_006735
Pages 29-30

37 NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
Accession          Molecule   Method            Note
AC_123456          Genomic    Mixed             Alternate complete genomic
AP_123456          Protein    Mixed             Protein products; alternate
NC_123456          Genomic    Mixed             Complete genomic molecules
NG_123456          Genomic    Mixed             Incomplete genomic regions
NM_123456          mRNA       Mixed             Transcript products; mRNA
NM_123456789       mRNA       Mixed             Transcript products; 9-digit
NP_123456          Protein    Mixed             Protein products
NP_123456789       Protein    Curation          Protein products; 9-digit
NR_123456          RNA        Mixed             Non-coding transcripts
NT_123456          Genomic    Automated         Genomic assemblies
NW_123456          Genomic    Automated         Genomic assemblies
NZ_ABCD12345678    Genomic    Automated         Whole genome shotgun data
XM_123456          mRNA       Automated         Transcript products
XP_123456          Protein    Automated         Protein products
XR_123456          RNA        Automated         Transcript products
YP_123456          Protein    Auto. & Curated   Protein products
ZP_12345678        Protein    Automated         Protein products
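The two-letter prefix of a RefSeq accession indicates the molecule type, so a simple lookup can classify an identifier. The Python sketch below transcribes only the prefix and molecule columns from the slide's table. The function name `describe_refseq` and the handling of an optional ".version" suffix (as in the EST example N91759.1) are assumptions made for illustration; this is not an NCBI tool.

```python
# Prefix -> molecule type, transcribed from the slide's RefSeq table.
REFSEQ_MOLECULE = {
    "AC": "Genomic", "AP": "Protein", "NC": "Genomic", "NG": "Genomic",
    "NM": "mRNA", "NP": "Protein", "NR": "RNA", "NT": "Genomic",
    "NW": "Genomic", "NZ": "Genomic", "XM": "mRNA", "XP": "Protein",
    "XR": "RNA", "YP": "Protein", "ZP": "Protein",
}


def describe_refseq(accession):
    """Return (prefix, molecule, version) for a RefSeq-style accession.

    Returns None when the identifier has no recognized RefSeq prefix.
    The version is None when the accession carries no ".N" suffix.
    """
    base, _, version = accession.partition(".")
    prefix, sep, _ = base.partition("_")
    if not sep or prefix not in REFSEQ_MOLECULE:
        return None
    return prefix, REFSEQ_MOLECULE[prefix], int(version) if version.isdigit() else None


print(describe_refseq("NM_006744"))    # RefSeq mRNA for RBP4 (from the slides)
print(describe_refseq("X02775"))       # GenBank accession, not RefSeq
print(describe_refseq("XP_123456.1"))  # placeholder pattern with a made-up version
```

Identifiers without an underscore-delimited prefix, such as the GenBank accession X02775, fall through to None rather than being guessed at.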

38 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 31

39 Diagram labels: DNA, RNA, complementary DNA (cDNA), protein, UniGene. Fig. 2.3 Page 23

40 UniGene: unique genes via ESTs Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library. UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution. Pages 20-21

41 Cluster sizes in UniGene This is a gene with 1 EST associated; the cluster size is 1 Fig. 2.3 Page 23

42 Cluster sizes in UniGene This is a gene with 10 ESTs associated; the cluster size is 10

43 Cluster sizes in UniGene (human)
Cluster size (ESTs)    Number of clusters
1                      42,800
2                      6,500
3-4                    6,500
5-8                    5,400
9-16                   4,100
17-32                  3,300
500-1000               2,128
2000-4000              233
8000-16,000            21
16,000-30,000          8
UniGene build 194, 8/06
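Cluster size is simply the number of ESTs assigned to a UniGene cluster, and the slide tallies clusters into bins that roughly double in width (1, 2, 3-4, 5-8, ...). The Python sketch below shows one way such a tally could be computed. The cluster names, EST names, and the helper `size_bin` are hypothetical; the strict power-of-two binning is an assumption that matches only the smaller bins on the slide.

```python
from collections import Counter


def size_bin(n):
    """Return the label of the doubling bin (1, 2, 3-4, 5-8, ...) that contains n."""
    if n < 1:
        raise ValueError("cluster size must be at least 1")
    upper = 1
    while upper < n:
        upper *= 2
    lower = upper // 2 + 1
    return str(upper) if lower >= upper else f"{lower}-{upper}"


# Hypothetical clusters: cluster ID -> list of EST names (all made up).
clusters = {
    "cluster_a": ["est1"],
    "cluster_b": ["est2", "est3", "est4"],
    "cluster_c": [f"est{i}" for i in range(5, 15)],  # 10 ESTs
}

# Count how many clusters fall into each size bin.
histogram = Counter(size_bin(len(ests)) for ests in clusters.values())
```

With real data, the resulting counts per bin would correspond to the "Number of clusters" column.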

44 UniGene: unique genes via ESTs Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further later (gene expression). Page 31

45 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 31

46 Ensembl to access protein and DNA sequences Try Ensembl at www.ensembl.org for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.

47 click human

48 enter RBP4

49

50 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 33

51 ExPASy to access protein and DNA sequences ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System) Visit http://www.expasy.ch/ Page 33

52 Fig. 2.11 Page 33

53

54 Example of how to access sequence data: HIV-1 pol There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol Page 34

55

56 Searching for HIV-1 pol: Following the “genome” link yields a manageable three results

57 Example of how to access sequence data: HIV-1 pol. The Entrez query hiv-1 pol returns about 40,000 nucleotide or protein records (and a search for “hiv-1” returns >100,000 records), but these can be reduced in two easy steps: --specify the organism, e.g. hiv-1[organism] --limit the output to RefSeq! Page 34
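The two narrowing steps can also be written as a single query string. The Python sketch below assembles such a term and wraps it in a URL for NCBI's E-utilities ESearch endpoint without sending any request. The helper name `build_esearch_url` is hypothetical, and the use of `refseq[filter]` as the RefSeq limit and of `nucleotide` as the database name are assumptions that should be checked against the current Entrez help pages.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nucleotide", organism=None, refseq_only=False):
    """Build an ESearch URL, optionally restricted by organism and to RefSeq."""
    parts = [term] if term else []
    if organism:
        parts.append(f"{organism}[organism]")  # field tag used on the slide
    if refseq_only:
        parts.append("refseq[filter]")  # assumed RefSeq limit
    query = " AND ".join(parts)
    return ESEARCH + "?" + urlencode({"db": db, "term": query})


# Broad search versus the narrowed search described on the slide.
broad_url = build_esearch_url("hiv-1 pol")
narrow_url = build_esearch_url("pol", organism="hiv-1", refseq_only=True)
print(narrow_url)
```

Keeping the restrictions as separate arguments makes it easy to compare result counts before and after each limit is applied.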

58 Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq entry

59 Examples of how to access sequence data: histone
Query for “histone”         # results
protein records             21847
RefSeq entries              7544
RefSeq (limit to human)     1108
NOT deacetylase             697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein. 8-12-06
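To see how much each restriction narrows the histone search, the counts reported on the slide can be compared step by step. The Python sketch below uses only the four counts from the slide; the helper name `retention` is an illustrative assumption.

```python
# (step, number of records) as reported on the slide, dated 8-12-06.
steps = [
    ("protein records", 21847),
    ("RefSeq entries", 7544),
    ("RefSeq (limit to human)", 1108),
    ("NOT deacetylase", 697),
]


def retention(steps):
    """Return (label, fraction of the previous step's records that remain)."""
    return [
        (label, count / prev_count)
        for (_, prev_count), (label, count) in zip(steps, steps[1:])
    ]


fractions = retention(steps)
for label, frac in fractions:
    print(f"{label}: {frac:.1%} of the previous step retained")
```

The same pattern applies to any chain of limits, since each fraction depends only on two consecutive counts.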

60

61 Access to Biomedical Literature Page 35

62 PubMed at NCBI to find literature information

63 PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to 1966. Page 35

64 MeSH is the acronym for "Medical Subject Headings." MeSH is the list of vocabulary terms used for subject analysis of biomedical literature at NLM. The MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature. Page 35

65

66

67 PubMed search strategies: try the tutorial (“Education” on the left sidebar); use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease; try using “limits”; try “Links” to find Entrez information and external resources; obtain articles online via Welch Medical Library (and download PDF files): http://www.welch.jhu.edu/ Page 35

68 Boolean operators in PubMed, where 1 = lipocalin and 2 = disease (Fig. 2.12, Page 34; counts as of 8/04):
1 AND 2    lipocalin AND disease    60 results
1 OR 2     lipocalin OR disease     1,650,000 results
1 NOT 2    lipocalin NOT disease    530 results
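The three Boolean operators behave like set operations on the articles matched by each term. The Python sketch below illustrates this with made-up article IDs; the numbers are hypothetical and unrelated to the result counts on the slide.

```python
# Hypothetical article IDs matched by each search term.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

and_hits = lipocalin & disease  # lipocalin AND disease: articles matching both
or_hits = lipocalin | disease   # lipocalin OR disease: articles matching either
not_hits = lipocalin - disease  # lipocalin NOT disease: lipocalin-only articles

print(sorted(and_hits), sorted(or_hits), sorted(not_hits))
```

Because AND and NOT split the lipocalin set into two disjoint parts, their sizes add up to the total number of lipocalin articles, which is why OR yields the largest result and AND the smallest.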

69 Search results versus article contents for the query “globin” (8/06):
“globin” is found, “globin” is present: true positive
“globin” is found, “globin” is absent: false positive (article does not discuss globins)
“globin” is not found, “globin” is present: false negative (article discusses globins)
“globin” is not found, “globin” is absent: true negative
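The four outcomes follow from two yes/no questions: did the search return the article, and does the article actually discuss the topic? The Python sketch below encodes that logic; the function name `classify` and its argument names are illustrative assumptions.

```python
def classify(found, present):
    """Label a search outcome.

    found:   True if the search returned the article.
    present: True if the article actually discusses the topic.
    """
    if found and present:
        return "true positive"
    if found and not present:
        return "false positive"
    if not found and present:
        return "false negative"
    return "true negative"


# All four combinations of search result and article contents.
outcomes = {
    (found, present): classify(found, present)
    for found in (True, False)
    for present in (True, False)
}
```

A false negative is an article that discusses globins but is missed by the search, while a false positive is a returned article that does not discuss them.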

70 WelchWeb is available at http://www.welch.jhu.edu

71 Brian Brown (bbrown20@jhmi.edu) and Carrie Iwema (iwema@jhmi.edu) are the Welch Medical Library liaisons to the basic sciences http://www.welch.jhu.edu

