Molecular biology databases Based on Chapter 2 of Post-genome Informatics by Minoru Kanehisa, Oxford University Press, 2000 2.1 History 2.2 Information.


Molecular biology databases. Based on Chapter 2 of Post-genome Informatics by Minoru Kanehisa, Oxford University Press, 2000. 2.1 History 2.2 Information Technology 2.3 New generation databases

Evolution of molecular biology databases

The addresses for the major databases

New generation of molecular biology databases

Example of sequence database entry for GenBank

LOCUS       DRODPPC      4001 bp    INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, (1987)
  MEDLINE
COMMENT     The initiation codon could be at either or
FEATURES             Location/Qualifiers
     source
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn "
     gene
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn "
     CDS
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn "
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ……………………
                     LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT     1170 a   1078 c    956 g    797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
          …………………………
          aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//
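As a rough illustration of how the fields of such an entry can be checked by program, the sketch below reads the LOCUS and BASE COUNT lines and confirms that the four base counts add up to the stated sequence length. The helper names `locus_length` and `base_counts` and the explicitly spaced two-line input are assumptions made for this example; they are not part of any GenBank toolkit.

```python
import re

# Two header lines from the entry above, written with explicit spacing
# (assumed layout for illustration; real GenBank files use fixed columns).
ENTRY = """\
LOCUS       DRODPPC      4001 bp    INV       15-MAR-1990
BASE COUNT     1170 a   1078 c    956 g    797 t
"""


def locus_length(text):
    """Return the sequence length in bp stated on the LOCUS line."""
    match = re.search(r"^LOCUS\s+\S+\s+(\d+) bp", text, re.MULTILINE)
    return int(match.group(1))


def base_counts(text):
    """Return a dict such as {'a': 1170, ...} read from the BASE COUNT line."""
    line = next(l for l in text.splitlines() if l.startswith("BASE COUNT"))
    return {base: int(n) for n, base in re.findall(r"(\d+) ([acgt])\b", line)}


counts = base_counts(ENTRY)
# The counts of a, c, g and t should add up to the length on the LOCUS line.
print(counts, sum(counts.values()) == locus_length(ENTRY))
```

A consistency check of this kind is a cheap way to detect a truncated or corrupted entry before the sequence is used.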

Example of sequence database entry for SWISS-PROT

ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF
RM
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10: (1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN ; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR
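Each line of a SWISS-PROT entry starts with a two-letter line code (ID, AC, DR, FT and so on), which makes the format straightforward to process. The sketch below groups lines by code and collects the DR cross-references. It assumes the code is separated from its content by spaces; `group_by_code` and `cross_references` are hypothetical helper names, not functions of an existing library.

```python
from collections import defaultdict

# A few lines of the entry above, with the line code separated from its
# content by spaces (assumed spacing for illustration).
ENTRY = """\
ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DR   EMBL; M30116; DMDPPC.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
"""


def group_by_code(text):
    """Map each two-letter line code to the list of its content strings."""
    groups = defaultdict(list)
    for line in text.splitlines():
        code, _, content = line.partition(" ")
        groups[code].append(content.strip())
    return groups


def cross_references(groups):
    """Return {database: primary identifier} taken from the DR lines."""
    refs = {}
    for content in groups["DR"]:
        fields = [field.strip() for field in content.rstrip(".").split(";")]
        database, identifier = fields[:2]
        refs[database] = identifier
    return refs


refs = cross_references(group_by_code(ENTRY))
print(refs)
```

The EMBL identifier recovered this way, M30116, is the same accession that appears in the GenBank entry for this gene, which is how the protein and nucleotide records are linked.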

Functional classification of E. coli genes according to Monica Riley

Relational database. A table (relation) is a set, and the three basic table operations shown here are extensions of the standard set operations.
Tables in the figure: Paper (MUID, Journal, Volume, Pages, Year) with rows Paper 1, Paper 2, Paper 3, Paper 4, ...; Author (MUID, Author) with rows Author 1-1, Author 1-2, Author 2-1, Author 2-2, Author 2-3, ...
Operations in the figure: SELECT, PROJECT, JOIN. The JOIN result has columns MUID, Journal, Volume, Pages, Year, Author.
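The three operations can be sketched on small in-memory tables. In the example below, SELECT keeps rows that meet a condition, PROJECT keeps named columns, and JOIN combines rows from two tables that share a MUID. The row values are invented placeholders and the function names are chosen for this illustration only; a real system would express the same operations in SQL.

```python
# Toy tables as lists of rows; all values below are invented placeholders.
papers = [
    {"MUID": 1, "Journal": "J1", "Volume": 10, "Pages": "1-5", "Year": 1990},
    {"MUID": 2, "Journal": "J2", "Volume": 20, "Pages": "6-9", "Year": 1995},
]
authors = [
    {"MUID": 1, "Author": "Author 1-1"},
    {"MUID": 1, "Author": "Author 1-2"},
    {"MUID": 2, "Author": "Author 2-1"},
]


def select(table, predicate):
    """SELECT: keep the rows that satisfy a condition."""
    return [row for row in table if predicate(row)]


def project(table, columns):
    """PROJECT: keep only the named columns, dropping duplicate rows."""
    seen, result = set(), []
    for row in table:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(columns, key)))
    return result


def join(left, right, column):
    """JOIN: combine rows of two tables that agree on a shared column."""
    return [{**l, **r} for l in left for r in right if l[column] == r[column]]


recent = select(papers, lambda row: row["Year"] > 1992)
journals = project(papers, ["MUID", "Journal"])
combined = join(papers, authors, "MUID")
print(len(recent), journals[0], len(combined))
```

Because a relation is a set, `project` removes duplicate rows; that is the sense in which these operations extend the standard set operations.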

A history of database technology development
Relational database (Codd, 1970)
Object-oriented programming (Kay, 1972)
Logic programming (Kowalski, 1972)
Deductive database (1977)
Object-oriented database (1986)
Deductive, object-oriented database (1989)

Multimedia in GenomeNet

Pancreatic trypsin inhibitor PDB: 4PTI ribbon model and variant with cylinder for alpha helix (figures from PDB)

The periodic table of chemical elements where the shaded elements are those normally found in biology.

Biologically important classes of organic compounds derived from the six basic elements

The 20 common amino acids

BLOSUM: BLOcks SUbstitution Matrix (Henikoff & Henikoff 1992). Derived from a set of about 2000 aligned, ungapped regions (blocks) from protein families. It emphasizes chemical similarity between residues rather than how easily one residue mutates into another. BLOSUMx is derived from blocks in which segments sharing at least x% identity are clustered and counted as one. BLOSUM62 matrix, log-odds representation.
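The log-odds representation can be illustrated on a toy block. The sketch below counts residue pairs within each aligned column, compares the observed pair frequency with the frequency expected from the background residue frequencies, and reports 2 * log2(observed / expected) rounded to an integer (half-bit units). The three columns and the two-letter alphabet are invented, and the clustering of segments by percent identity is left out, so this is a simplified sketch rather than the full published procedure.

```python
from collections import Counter
from itertools import combinations
from math import log2

# Toy ungapped block: each string is one aligned column (invented data).
columns = ["AAAS", "AASS", "SSSS"]


def log_odds_scores(columns):
    """Return {(a, b): score} in half-bit units, with a <= b alphabetically."""
    # Count every unordered pair of residues seen in the same column.
    pair_counts = Counter()
    for column in columns:
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue implied by the observed pairs.
    p = Counter()
    for (a, b), freq in q.items():
        p[a] += freq / 2
        p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Unordered mismatched pairs can arise in two orders, hence the 2.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


scores = log_odds_scores(columns)
print(scores)
```

A positive score means the pair is seen more often than chance would predict, a negative score less often.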

Substitution/Scoring Matrices. PAM matrices (Dayhoff et al.): phylogeny-based. PAM250 matrix, log-odds representation. PAM1: the expected number of accepted point mutations is 1%, that is, one per 100 residues.
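Matrices for larger distances are obtained by multiplying the PAM1 mutation probability matrix by itself, so PAM250 corresponds to the 250th power of PAM1. The sketch below shows this extrapolation on a hypothetical two-residue alphabet in which each residue changes with probability 0.01 per step; the real matrix is 20 x 20 and its entries come from observed substitutions, which are not reproduced here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power."""
    size = len(matrix)
    # Start from the identity matrix and use repeated squaring.
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    while power > 0:
        if power & 1:
            result = mat_mul(result, matrix)
        matrix = mat_mul(matrix, matrix)
        power >>= 1
    return result


# Hypothetical PAM1-like matrix for a two-residue alphabet:
# each residue is unchanged with probability 0.99 (1% expected change).
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam250 = mat_pow(pam1, 250)
print(pam250[0])
```

Note that 250 PAM does not mean every residue has changed: repeated and reverse mutations at the same site mean a fraction of positions is still identical, and in this toy alphabet the probabilities simply approach the 50/50 equilibrium.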

A hidden Markov model for sequence analysis. States in the figure: Start, End; match states m0 to m5; insert states I0 to I4; delete states d1 to d4. m = match state (output), I = insert state (output), d = delete state (no output).

Globin fold: α protein myoglobin. PDB: 1MBN

β sandwich: β protein immunoglobulin. PDB: 7FAB

TIM barrel: α/β protein triose phosphate isomerase. PDB: 1TIM

A fold in an α + β protein: ribonuclease A. PDB: 7RSA

434 Cro protein complex (phage) PDB: 3CRO

Zinc finger DNA recognition (Drosophila) PDB: 2DRP ..YRCKVCSRVY THISNFCRHY VTSH...

Leucine zipper (yeast) PDB: 1YSA ..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...

The orthologue group table for F1-F0 ATP synthase (upper) and V-type ATP synthase (lower).

Reactions and interactions. Biochemical pathways. Genome diversity. Note the notion of the Enzyme Commission (EC) number.
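An EC number is a four-level hierarchical code: class, subclass, sub-subclass and serial number. The sketch below splits such a number into its levels; `parse_ec` is a hypothetical helper, and the class table lists only the six classes of the original scheme.

```python
# Top-level classes 1 to 6 of the original Enzyme Commission scheme.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def parse_ec(ec_number):
    """Split 'EC a.b.c.d' into class, subclass, sub-subclass and serial."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got: {ec_number!r}")
    main, sub, subsub, serial = (int(part) for part in parts)
    return {
        "class": main,
        "class_name": EC_CLASSES.get(main, "unknown"),
        "subclass": sub,
        "sub_subclass": subsub,
        "serial": serial,
    }


# Triose phosphate isomerase, the TIM barrel enzyme, is EC 5.3.1.1.
tim = parse_ec("EC 5.3.1.1")
print(tim["class_name"])
```

Because the code is hierarchical, enzymes that catalyse related reactions share a prefix, which makes the EC number a convenient key for linking genes to reactions in pathway databases.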

The tree of life, showing the relationship of archaea, bacteria, and eukaryotes, as well as the relationship of fungi, plants, and animals. Labels in the figure: Animals, Fungi, Plants, Protists (Eukaryotes); Archaea; Bacteria.