Molecular biology databases


Molecular biology databases
Based on Chapter 2 of Post-genome Informatics by Minoru Kanehisa, Oxford University Press, 2000
2.1 History
2.2 Information Technology
2.3 New generation databases

Evolution of molecular biology databases

The addresses for the major databases

New generation of molecular biology databases

Example of sequence database entry for GenBank

LOCUS       DRODPPC      4001 bp    INV    15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryota; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera;
            Muscomorpha; Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /allele=""
     CDS             1188..2954
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ……………………
                     LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
          ………………………….
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//
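The flat-file layout of this entry (a keyword in the first column, an ORIGIN block of numbered 10-base groups, and a closing //) can be read with very little code. The following is a minimal sketch, not a full GenBank parser: the function name parse_genbank is hypothetical, the sample is an abbreviated copy of the entry on this slide, and continuation lines, sub-keywords such as ORGANISM, and the FEATURES table are deliberately ignored.

```python
import re

# Abbreviated copy of the entry on this slide (first two ORIGIN lines only).
RECORD = """\
LOCUS       DRODPPC      4001 bp    INV    15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
//
"""


def parse_genbank(text):
    """Return (top-level keyword fields, ORIGIN sequence) for one record."""
    fields = {}
    sequence = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):        # record terminator
            break
        if in_origin:
            # Drop the leading base number and the spaces between 10-mers.
            sequence.append(re.sub(r"[^a-zA-Z]", "", line))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line[:1].isalpha():         # top-level keywords start in column 1
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
        # Indented lines (sub-keywords, continuations, features) are skipped.
    return fields, "".join(sequence)


fields, seq = parse_genbank(RECORD)
print(fields["ACCESSION"], len(seq))
```

For real work an established parser is the safer choice; the sketch only shows how the record structure maps onto fields.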

Example of sequence database entry for SWISS-PROT

ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR
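Each line of a SWISS-PROT entry begins with a two-letter line code (ID, AC, DR, FT, ...), which makes the format easy to split by code. The sketch below is illustrative only: parse_swissprot is a hypothetical helper, the sample keeps a handful of lines from the entry on this slide, and the closing // is added here as an assumed record terminator. It collects the DR lines into a small cross-reference table, showing how one entry links to EMBL/GenBank (M30116) and PROSITE.

```python
from collections import defaultdict

# A few lines copied from the entry on this slide; "//" is an assumed terminator.
SAMPLE = """\
ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DR   EMBL; M30116; DMDPPC.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
//
"""


def parse_swissprot(text):
    """Group the content of each line under its two-letter line code."""
    by_code = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            break
        code, content = line[:2], line[5:].strip()
        by_code[code].append(content)
    return by_code


entry = parse_swissprot(SAMPLE)

# DR lines have the form "DATABASE; PRIMARY_ID; SECONDARY_ID."
xrefs = {}
for dr in entry["DR"]:
    database, primary_id = [part.strip() for part in dr.rstrip(".").split(";")[:2]]
    xrefs[database] = primary_id

print(xrefs)
```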

Functional classification of E. coli genes according to Monica Riley

Relational database. A table (relation) is a set, and the three basic table operations shown here (SELECT, PROJECT, JOIN) are extensions of the standard set operations.
Figure: a paper table with columns MUID, Journal, Volume, Pages, Year (rows Paper 1, Paper 2, Paper 3, Paper 4, ...); an author table with columns MUID, Author (rows Author 1-1, Author 1-2, Author 2-1, Author 2-2, Author 2-3, Author 3-1, ...); and a joined table with columns MUID, Journal, Volume, Pages, Author, Year.
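The three operations can be tried directly with an in-memory SQLite database. This is a minimal sketch under stated assumptions: the table and column names (paper, author, muid, ...) are hypothetical, and the rows reuse the two references cited in the SWISS-PROT entry earlier in this deck, with only the first author of the second paper entered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE paper  (muid INTEGER PRIMARY KEY, journal TEXT,
                         volume INTEGER, pages TEXT, year INTEGER);
    CREATE TABLE author (muid INTEGER, author TEXT);
    INSERT INTO paper VALUES
        (87090408, 'Nature', 325, '81-84', 1987),
        (90258853, 'Mol. Cell. Biol.', 10, '2669-2677', 1990);
    INSERT INTO author VALUES
        (87090408, 'Padgett R.W.'),
        (87090408, 'St Johnston R.D.'),
        (87090408, 'Gelbart W.M.'),
        (90258853, 'Panganiban G.E.F.');
""")

# SELECT: keep only the rows that satisfy a condition.
selected = conn.execute("SELECT * FROM paper WHERE year = 1987").fetchall()

# PROJECT: keep only some of the columns.
projected = conn.execute("SELECT muid, journal FROM paper ORDER BY muid").fetchall()

# JOIN: combine rows of two tables that share the same MUID.
joined = conn.execute(
    "SELECT paper.muid, journal, author FROM paper "
    "JOIN author ON paper.muid = author.muid "
    "ORDER BY paper.muid, author"
).fetchall()

for row in joined:
    print(row)
```

Note that SQL uses the keyword SELECT for all three; the relational SELECT corresponds to the WHERE clause and PROJECT to the column list.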

A history of database technology development
Relational database (Codd, 1970)
Object-oriented programming (Kay, 1972)
Logic programming (Kowalski, 1972)
Object-oriented database (1986)
Deductive database (1977)
Deductive, object-oriented database (1989)

Multimedia in GenomeNet

Pancreatic trypsin inhibitor (PDB: 4PTI): ribbon model, and a variant with a cylinder for the alpha helix (figures from PDB)

The periodic table of chemical elements where the shaded elements are those normally found in biology.

Biologically important classes of organic compounds derived from the six basic elements

The 20 common amino acids

BLOSUM: BLOcks SUbstitution Matrix (Henikoff & Henikoff 1992). Derived from a set of about 2000 aligned, ungapped regions (blocks) from protein families; it emphasizes chemical similarity between residues more than how easily one residue mutates into another. BLOSUMx is derived from blocks in which segments at least x% identical are clustered and counted as a single sequence. Figure: BLOSUM62 matrix, log-odds representation.
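"Log-odds representation" means that each score compares how often two residues are seen aligned with how often they would be aligned by chance. The sketch below illustrates that calculation on an invented two-letter alphabet; the pair probabilities are toy numbers, not values from the BLOCKS database, and the function name log_odds_matrix is hypothetical. The scale of 2 gives half-bit units, the convention used for BLOSUM62.

```python
import math


def log_odds_matrix(pair_prob, scale=2.0):
    """Rounded scores s(a, b) = scale * log2(q_ab / (p_a * p_b)).

    pair_prob[(a, b)] is the probability of seeing a aligned with b
    (ordered pairs, symmetric, summing to 1).
    """
    letters = sorted({a for a, _ in pair_prob})
    # Background frequency of a letter = marginal of the pair distribution.
    background = {a: sum(pair_prob[(a, b)] for b in letters) for a in letters}
    return {
        (a, b): round(scale * math.log2(pair_prob[(a, b)] / (background[a] * background[b])))
        for a in letters
        for b in letters
    }


# Toy data: identical pairs are observed more often than chance predicts.
toy = {("L", "L"): 0.40, ("L", "K"): 0.10, ("K", "L"): 0.10, ("K", "K"): 0.40}
scores = log_odds_matrix(toy)
print(scores)
```

Pairs seen more often than expected get positive scores and pairs seen less often get negative scores, which is the sign pattern visible in the matrix on the slide.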

Substitution/scoring matrices. PAM matrices (Dayhoff et al. 1978) are phylogeny-based. PAM1 corresponds to an expected 1% of residues mutated (one accepted point mutation per 100 residues). Figure: PAM250 matrix, log-odds representation.
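In the Dayhoff model, matrices for larger distances are obtained by multiplying the PAM1 mutation probability matrix by itself, so PAM250 corresponds to 250 successive PAM1 steps. The following sketch shows the extrapolation on a made-up two-letter alphabet with a 1% change per step; the real PAM1 is a 20 x 20 matrix estimated from phylogenetic trees, and the helper names mat_mul and mat_pow are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy "PAM1": each residue has a 1% chance of changing in one step.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)

for row in pam250:
    print([round(p, 4) for p in row])
```

Because repeated and reverse changes hit the same position, 250 PAM steps do not mean that every residue has changed; in this toy model a residue still has roughly an even chance of being found in its original state.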

A hidden Markov model for sequence analysis. Figure labels: Start, End; m = match state (output), i = insert state (output), d = delete state (no output).
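The difference between the state types is whether a state outputs a residue: match and insert states emit one symbol each, while delete states (and Start/End) are silent. The sketch below makes this concrete by scoring a given state path against a sequence. All state names, the two-letter alphabet, and every probability are invented for illustration, and path_log_prob is a hypothetical helper; a real profile HMM would estimate these numbers from a multiple alignment.

```python
import math

# Hypothetical two-column profile: M1/M2 match, I1 insert, D1/D2 delete.
EMIT = {
    "M1": {"A": 0.8, "C": 0.2},
    "M2": {"A": 0.1, "C": 0.9},
    "I1": {"A": 0.5, "C": 0.5},
}
TRANS = {
    ("Start", "M1"): 0.9, ("Start", "D1"): 0.1,
    ("M1", "M2"): 0.8, ("M1", "I1"): 0.1, ("M1", "D2"): 0.1,
    ("I1", "I1"): 0.3, ("I1", "M2"): 0.7,
    ("D1", "M2"): 0.9, ("D1", "D2"): 0.1,
    ("M2", "End"): 1.0, ("D2", "End"): 1.0,
}


def path_log_prob(path, sequence):
    """Log-probability that `path` (Start ... End) generates `sequence`."""
    logp = 0.0
    pos = 0
    for prev, state in zip(path, path[1:]):
        logp += math.log(TRANS[(prev, state)])
        if state in EMIT:                 # match/insert states output one symbol
            if pos >= len(sequence):
                raise ValueError("path emits more symbols than the sequence has")
            logp += math.log(EMIT[state][sequence[pos]])
            pos += 1
        # Delete states, Start and End are silent: no symbol is consumed.
    if pos != len(sequence):
        raise ValueError("path does not emit the whole sequence")
    return logp


full = path_log_prob(["Start", "M1", "M2", "End"], "AC")
skipped = path_log_prob(["Start", "D1", "M2", "End"], "C")
print(math.exp(full), math.exp(skipped))
```

Finding the best path for a sequence (Viterbi) or summing over all paths (forward algorithm) builds on exactly this per-path score.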

Globin fold: α protein, myoglobin (PDB: 1MBN)

β sandwich: β protein, immunoglobulin (PDB: 7FAB)

TIM barrel: α/β protein, Triose phosphate IsoMerase (PDB: 1TIM)

A fold in an α+β protein: ribonuclease A (PDB: 7RSA)

434 Cro protein complex (phage) PDB: 3CRO

Zinc finger DNA recognition (Drosophila) PDB: 2DRP ..YRCKVCSRVY THISNFCRHY VTSH...
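Sequence motifs like this one are what pattern databases encode. As an illustration, the sketch below searches the fragment shown on this slide with a regular expression for the classical C2H2 zinc finger spacing (two Cys, then two His). The pattern is an assumption written in the style of a PROSITE pattern, C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H, and is not taken from the slides.

```python
import re

# Assumed C2H2 pattern, translated from PROSITE-style syntax to a regex.
C2H2 = re.compile(r"C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H")

# Fragment from the slide, with the spacing between 10-residue groups removed.
fragment = "YRCKVCSRVY THISNFCRHY VTSH".replace(" ", "")

match = C2H2.search(fragment)
if match:
    print(match.start(), match.group())
```

The two cysteines and two histidines picked out by the pattern are the residues that coordinate the zinc ion in this class of finger.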

Leucine zipper (yeast) PDB: 1YSA ..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...

The orthologue group table for F1-F0 ATP synthase (upper) and V-type ATP synthase (lower).

Reactions and interactions; biochemical pathways; genome diversity. Note the notion of the Enzyme Commission (EC) number.

The tree of life showing the relationship of archaea, bacteria, and eukaryotes, as well as the relationship of fungi, plants, and animals. Figure labels: Animals, Fungi, Plants, Protists, Eukaryotes, Archaea, Bacteria.