
Genome Bioinformatics DNA and protein Databases I.


1 Genome Bioinformatics DNA and protein Databases I

2 There are three major public DNA databases: GenBank, housed at NCBI (National Center for Biotechnology Information); EMBL, housed at EBI (European Bioinformatics Institute); and DDBJ, housed in Japan. The underlying raw DNA sequences are identical.

3 National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov

4

5 Tool-users; tool-makers; bioinformatics; public health informatics; medical informatics; infrastructure; databases; algorithms

6 The most sequenced organisms in GenBank
Homo sapiens (human): 12.3 billion bases
Mus musculus (mouse): 8.0 billion
Rattus norvegicus (rat): 5.7 billion
Bos taurus (bovine): 3.5 billion
Danio rerio (zebrafish): 2.5 billion
Zea mays (maize): 1.8 billion
Oryza sativa (rice): 1.5 billion
Strongylocentrotus purpuratus (sea urchin): 1.2 billion
Xenopus tropicalis: 1.0 billion

7 Page 24

8 PubMed is… the National Library of Medicine's search service; 16 million citations in MEDLINE; links to participating online journals; PubMed tutorial (via “Education” on the side bar). Page 24

9 PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published worldwide. MeSH (Medical Subject Headings) is the list of vocabulary terms used for subject analysis of biomedical literature at NLM. The MeSH vocabulary is used for indexing journal articles for MEDLINE. This controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.

10

11 BLAST is… the Basic Local Alignment Search Tool; NCBI's sequence similarity search tool; supports analysis of DNA and protein databases; 100,000 searches per day

12 OMIM is… Online Mendelian Inheritance in Man; a catalog of human genes and genetic disorders; edited by Dr. Victor McKusick and others at JHU. Example: hair/eye color, DRD4. Practice: sickle cell anemia or another disease of interest. Page 25

13 TaxBrowser is… a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses); taxonomy information such as genetic codes; molecular data on extinct organisms. Page 26

14 Structure site includes… the Molecular Modeling Database (MMDB); biopolymer structures obtained from the Protein Data Bank (PDB); Cn3D (a 3D-structure viewer); the Vector Alignment Search Tool (VAST)

15 Entrez integrates… the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes. Everything. Practice: hemoglobin or another gene of interest

16 Entrez is a search and retrieval system that integrates NCBI databases
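A minimal sketch of how one query term can be pointed at several Entrez databases programmatically. It assumes NCBI's E-utilities ESearch endpoint, which the slides do not mention; the helper name `esearch_url` and the chosen database list are illustrative. The code only builds URLs and makes no network request.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities (an assumption: not named in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for one Entrez database (no network access)."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


# The same term searched against literature, sequence, and structure databases.
for db in ("pubmed", "nuccore", "protein", "structure"):
    print(esearch_url(db, "hemoglobin"))
```

Opening one of the printed URLs in a browser should return an XML list of matching record identifiers for that database.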

17 Accessing information on molecular sequences

18 Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data.

19 What is an accession number? An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
DNA: X02775 (GenBank genomic DNA sequence); NT_030059 (genomic contig); rs7079946 (dbSNP, single nucleotide polymorphism)
RNA: N91759.1 (an expressed sequence tag, 1 of 170); NM_006744 (RefSeq DNA sequence, from a transcript)
Protein: NP_006735 (RefSeq protein); AAC02945 (GenBank protein); Q28369 (SwissProt protein); 1KT7 (Protein Data Bank structure record)

20 NCBI’s important RefSeq project: best representative sequences. RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats:
Complete genome: NC_######
Complete chromosome: NC_######
Genomic contig: NT_######
mRNA (DNA format): NM_###### (e.g. NM_006744)
Protein: NP_###### (e.g. NP_006735)

21 NCBI’s RefSeq project: accessions for genomic, mRNA, and protein sequences
Accession         Molecule   Method            Note
AC_123456         Genomic    Mixed             Alternate complete genomic
AP_123456         Protein    Mixed             Protein products; alternate
NC_123456         Genomic    Mixed             Complete genomic molecules
NG_123456         Genomic    Mixed             Incomplete genomic regions
NM_123456         mRNA       Mixed             Transcript products; mRNA
NM_123456789      mRNA       Mixed             Transcript products; 9-digit
NP_123456         Protein    Mixed             Protein products
NP_123456789      Protein    Curation          Protein products; 9-digit
NR_123456         RNA        Mixed             Non-coding transcripts
NT_123456         Genomic    Automated         Genomic assemblies
NW_123456         Genomic    Automated         Genomic assemblies
NZ_ABCD12345678   Genomic    Automated         Whole genome shotgun data
XM_123456         mRNA       Automated         Transcript products
XP_123456         Protein    Automated         Protein products
XR_123456         RNA        Automated         Transcript products
YP_123456         Protein    Auto. & Curated   Protein products
ZP_12345678       Protein    Automated         Protein products
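The two-letter prefix of a RefSeq accession is enough to tell what kind of molecule the record describes. The sketch below encodes only the prefix-to-molecule column of the table on this slide; the function name `classify_refseq` and the regular expression are illustrative assumptions, not an NCBI-provided tool.

```python
import re
from typing import Optional

# Prefix -> molecule type, copied from the RefSeq table on this slide.
REFSEQ_MOLECULE = {
    "AC": "Genomic", "NC": "Genomic", "NG": "Genomic",
    "NT": "Genomic", "NW": "Genomic", "NZ": "Genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "RNA", "XR": "RNA",
    "AP": "Protein", "NP": "Protein", "XP": "Protein",
    "YP": "Protein", "ZP": "Protein",
}

# Two letters, an underscore, optional letters (as in NZ_ABCD12345678),
# digits, and an optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_[A-Z]*\d+(?:\.\d+)?$")


def classify_refseq(accession: str) -> Optional[str]:
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_MOLECULE.get(match.group(1))


for acc in ("NM_006744", "NP_006735", "NT_030059", "X02775"):
    print(acc, classify_refseq(acc))
```

X02775 is a GenBank accession rather than a RefSeq one, so the function returns None for it.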

22 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq/Mapviewer [2] UniGene/SNP [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI)

23 Four ways to access protein and DNA sequences [1] Entrez Gene with RefSeq. Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735).

24 Themes throughout the course: gene/protein families We will use Human serum albumin (HSA) as a model gene/protein. We will study it in a variety of contexts including --sequence alignment --gene expression --protein structure --phylogeny --homologs in various species We will also use other examples, such as the globins and the pol protein of HIV-1

25 The gene for albumin is located on chromosome 4, and mutations in this gene can result in various anomalous proteins. The human albumin gene is 16,961 nucleotides long from the putative 'cap' site to the first poly(A) addition site. It is split into 15 exons, which are symmetrically placed within the 3 domains that are thought to have arisen by triplication of a single primordial domain. The reference range for albumin concentrations in blood is 30 to 50 g/L. It has a serum half-life of approximately 20 days. It has a molecular mass of 67 kDa. Human serum albumin is the most abundant protein in human blood plasma. It is produced in the liver. Albumin comprises about half of the blood serum protein. It is soluble and monomeric.

26

27

28

29

30 By applying limits, there are now just 17 entries

31

32 FASTA format
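A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. The sketch below is a minimal reader under that definition; the function name `parse_fasta` and the two example records (including their sequences) are made up for illustration and are not real database entries.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its joined sequence."""
    records: Dict[str, list] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical records, invented for this example.
example = """>example_protein_1 hypothetical description
MKWVTFISLL
FLFSSAYS
>example_dna_1
ATGAAGTGGGTA
"""

for name, seq in parse_fasta(example).items():
    print(name, len(seq))
```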

33

34

35

36

37

38 ALB

39 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq/Mapviewer [2] UniGene/SNP [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI)

40 DNA; RNA; complementary DNA (cDNA); protein; UniGene

41 UniGene: unique genes via ESTs Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library. UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.

42 Cluster sizes in UniGene This is a gene with 1 EST associated; the cluster size is 1

43 Cluster sizes in UniGene This is a gene with 10 ESTs associated; the cluster size is 10

44 Cluster sizes in UniGene (human), UniGene build 194, 8/06
Cluster size (ESTs)   Number of clusters
1                     42,800
2                     6,500
3-4                   6,500
5-8                   5,400
9-16                  4,100
17-32                 3,300
500-1000              2,128
2000-4000             233
8000-16,000           21
16,000-30,000         8
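The low end of this table groups clusters into bins that double in width (1, 2, 3-4, 5-8, 9-16, 17-32). A rough sketch of that binning follows; the function name `size_bin` and the list of cluster sizes are hypothetical, and the larger bins on the slide (500-1000 and up) are rounded ranges that this exact power-of-two rule only approximates.

```python
from collections import Counter


def size_bin(n_ests: int) -> str:
    """Label the doubling-width bin that a cluster of n_ests ESTs falls in."""
    if n_ests < 1:
        raise ValueError("a cluster contains at least one EST")
    upper = 1 << (n_ests - 1).bit_length()  # smallest power of two >= n_ests
    lower = upper // 2 + 1
    return str(upper) if lower == upper else f"{lower}-{upper}"


# Hypothetical cluster sizes (number of ESTs per cluster), for illustration.
cluster_sizes = [1, 1, 1, 2, 3, 4, 7, 10, 10, 25]

histogram = Counter(size_bin(n) for n in cluster_sizes)
for label, count in histogram.items():
    print(label, count)
```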

45 UniGene: unique genes via ESTs Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further later.

46

47

48

49

50

51

52 Practice: SNP for sickle cell anemia

53 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI)

54 Ensembl to access protein and DNA sequences Try Ensembl at www.ensembl.org for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.

55

56

57

58

59 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System

60 http://www.expasy.org Expert Protein Analysis System

61 http://www.expasy.org/tools/

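62 Hands-on practice: find the DNA sequence and the protein sequence of a gene of interest; find the DNA and protein sequences of 5 homologous genes and save each in FASTA format; find the gene's location on its chromosome; describe the gene's expression pattern and function; find and read 3-5 related papers; give a report presenting the gene.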

63 lipocalin AND disease (60 results); lipocalin OR disease (1,650,000 results); lipocalin NOT disease (530 results). Venn diagram panels, with circles 1 and 2: 1 AND 2; 1 OR 2; 1 NOT 2. 8/04
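The three Boolean operators behave like set intersection, union, and difference. The sketch below shows this on made-up article IDs, then applies the same counting identities to the result counts on this slide. The derived totals are only as exact as the slide's figures, which look rounded.

```python
# Hypothetical article IDs, invented for illustration only.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease            # lipocalin AND disease
either = lipocalin | disease          # lipocalin OR disease
only_lipocalin = lipocalin - disease  # lipocalin NOT disease

# Counting identities that hold for any two sets.
assert len(either) == len(lipocalin) + len(disease) - len(both)
assert len(only_lipocalin) == len(lipocalin) - len(both)

# Applying them to the result counts quoted on the slide.
n_and, n_or, n_not = 60, 1_650_000, 530
n_lipocalin = n_and + n_not  # every lipocalin hit is either AND or NOT
n_disease = n_or - n_not     # OR minus the lipocalin-only hits
print(n_lipocalin, n_disease)
```

Read this way, nearly all of the OR results come from the very common term "disease", which is why AND narrows the search so sharply.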

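64 Search result versus article contents (8/06):
true positive: “globin” is found, and “globin” is present (article discusses globins)
false positive: “globin” is found, but “globin” is absent (article does not discuss globins)
false negative: “globin” is not found, but “globin” is present (article discusses globins)
true negative: “globin” is not found, and “globin” is absent (article does not discuss globins)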
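The four outcomes on this slide are usually summarized as rates. The sketch below computes three standard ones; the function name `search_metrics` and the counts passed to it are hypothetical and do not come from the slides.

```python
from typing import Dict


def search_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Summarize a literature search from its four outcome counts."""
    return {
        # Of the articles that discuss globins, the fraction the search found.
        "sensitivity": tp / (tp + fn),
        # Of the articles that do not, the fraction correctly left out.
        "specificity": tn / (tn + fp),
        # Of the articles returned, the fraction that really discuss globins.
        "precision": tp / (tp + fp),
    }


# Hypothetical counts for a search on "globin".
metrics = search_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```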

