COT 6930 HPC and Bioinformatics Bioinformatics Resources and Databases Xingquan Zhu Dept. of Computer Science and Engineering.

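[Figure: the flow DNA -> RNA -> protein -> phenotype (transcription, then translation), annotated with the databases that hold each kind of data: genomic DNA databases; cDNA, ESTs and UniGene; gene expression databases; protein sequence databases; protein structure databases.]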

Gene

Different transcripts can be related to the same gene!

EST: Expressed Sequence Tags
- Partial copies of mRNA found within a particular cell
- Can be used to identify genetic regions, splicing patterns of genes, etc.

Outline
- Bioinformatics Databases
  - Primary databases
  - Derived databases
- Nucleotide databases
  - GenBank (P), EMBL-Bank (P)
- Protein databases
  - Swiss-Prot (D), PIR-PSD (D)
  - GenPept (D), TrEMBL (D)
  - Protein Data Bank (P)
- Other Examples
  - RefSeq
  - UniGene
  - PubMed
  - SNP
  - OMIM

Bioinformatics Databases
Information
- DNA sequences
- Conserved DNA domains
- Genomes
- Gene expression (ESTs, microarrays)
- Protein sequences
- Protein 3D structure
- Protein families
- Mutations / polymorphisms / SNPs
- Metabolic pathways
- Chemical compounds (ligands)
- Biomedical literature (journal papers, online books…)

Primary public domain bioinformatics servers
Public Domain Bioinformatics Facilities
- European Bioinformatics Institute (EBI), United Kingdom: databases, analysis tools
- National Center for Biotechnology Information (NCBI), United States: databases, analysis tools
- GenomeNet (KEGG & DDBJ), Japan: databases, analysis tools

Major Databases
- DNA sequences: GenBank, RefSeq, UniGene
- Protein sequences: Swiss-Prot, PIR-PSD, GenPept, TrEMBL, RefSeq
- Protein structure: Protein Data Bank (PDB)
- Gene expression: Gene Expression Omnibus (GEO)
- Biomedical publications: PubMed / MEDLINE

Bioinformatics Data Sources
Primary databases
- Original submissions by researchers
- Staff organizes information only
- Generally sequence oriented
- Examples: GenBank, PDB (Protein Data Bank)

Bioinformatics Data Sources
Derived databases
- Compiled from data in primary databases
- Manually curated (human selection & correction)
  - Advantages: high quality
  - Disadvantages: high expense, low volume
  - Examples: Swiss-Prot, PIR-PSD, RefSeq
- Computational derivation (automatically generated)
  - Advantages: inexpensive, up-to-date
  - Disadvantages: lower quality
  - Examples: GenPept, TrEMBL, UniGene

Bioinformatic Databases – GenBank
“GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences”
- Database type
  - Nucleotide sequences
  - Primary database
- Current size (as of Aug. 2006)
  - 65,369,091,950 base pairs (bps)
  - 61,132,599 sequences
- Access to GenBank
  - Available for searching at NCBI via several methods, such as BLAST search

Bioinformatic Databases – GenBank
Types of submissions to the database
- Genomic DNA
  - High quality complete DNA sequence
- mRNA / cDNA
  - Partial or complete mRNA (or cDNA)
- Expressed sequence tag (EST)
  - A short sub-sequence of a transcribed spliced nucleotide sequence (mRNA) ( bps)
  - May represent portions of expressed genes
  - Either protein-coding or not
  - About 43 million ESTs are now available
- Sequence tagged sites (STS)
  - Short DNA sequences unique in genome
- Genomic survey sequence (GSS)
  - Single-pass genomic DNA
- Third-party annotations of GenBank sequences

Bioinformatic Databases – EMBL-Bank
- Europe's primary nucleotide sequence resource
- Database type
  - Nucleotide sequences
  - Primary database

Bioinformatic Databases – Proteins
Protein sequence databases
- Once derived from laboratory experiments
- Now mostly based on predicted ORFs from DNA
- Manual curation
  - Swiss-Prot
  - PIR-PSD
- Computational derivation
  - GenPept
  - TrEMBL

Bioinformatic Databases – Swiss-Prot, PIR-PSD
- Database type
  - Protein sequences
  - Derived database
  - Manually curated (non-redundant, annotated)
- Many annotations
  - Functions of the protein
  - Domains and sites
  - Secondary & quaternary structure
  - Similarities to other proteins
  - Variants
- Swiss-Prot
- PIR-PSD

Bioinformatic Databases – GenPept, TrEMBL
- Database type
  - Protein sequences
  - Computationally derived database
  - Predicted by translating coding sequences (CDS) from GenBank, EMBL (i.e., gene product)
- GenPept
  - Download: ftp://ftp.ncifcrf.gov/pub/genpept/
  - Release 163 (as of 12/26/2007)
  - 4,970,178 loci containing 1,517,599,916 residues
- TrEMBL
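The computational derivation described above comes down to translating an annotated coding sequence codon by codon. A minimal sketch in Python follows, assuming the standard genetic code; the function name `translate_cds` and the short input are illustrative assumptions and are not taken from GenPept or TrEMBL themselves.

```python
from itertools import product

# Standard genetic code in the usual compact layout: codons are enumerated
# with bases in the order T, C, A, G and '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds: str) -> str:
    """Translate a coding sequence that starts at its start codon.

    Translation stops at the first stop codon; trailing bases that do not
    form a complete codon are ignored.
    """
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    # Hypothetical mini-CDS: the first ten codons of beta globin plus a stop.
    print(translate_cds("ATGGTGCATCTGACTCCTGAGGAGAAGTCTTAA"))  # MVHLTPEEKS
```

Real pipelines also have to handle alternative genetic codes, ambiguous bases and partial CDS annotations, which this sketch leaves out.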

Structure Databases
- 3-dimensional structures of proteins, nucleic acids, molecular complexes, etc.
- 3-D data is available due to techniques such as NMR and X-ray crystallography
- Protein Data Bank
  - Protein 3D structures
  - Primary database

Protein Data Bank: PDB

Bioinformatic Databases – Connections

Bioinformatic Databases – RefSeq
The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products.
- Information derived from GenBank records
- Database type
  - Nucleotide & protein sequences
  - Derived database
  - Human curated (non-redundant, cross-linked)
- Data in RefSeq
  - Genomic DNA
  - mRNAs & proteins for known genes, gene models
  - Entire chromosomes
  - Multiple organisms
- Example

Bioinformatic Databases – UniGene
UniGene is an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location.
- Database type
  - Nucleotide sequences
  - Computationally derived database
  - Partitioned into non-redundant gene-oriented clusters
  - Gene-oriented view
- Data in UniGene
  - Clusters of genomic DNA & ESTs
  - Multiple organisms
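To make the idea of partitioning sequences into gene-oriented clusters concrete, here is a toy sketch in Python. It is not UniGene's actual procedure: it simply assumes that two sequences belong together when they share at least one exact k-mer, and merges such pairs transitively with a union-find structure. The sequence names, the sequences and the parameters `k` and `min_shared` are all hypothetical.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs: dict[str, str], k: int = 8,
                            min_shared: int = 1) -> list[list[str]]:
    """Group sequence names into clusters linked by shared k-mers.

    Two sequences are linked when they share at least min_shared k-mers;
    clusters are the connected components of that link graph.
    """
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(seqs)
    kmer_sets = {name: kmers(seqs[name], k) for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if len(kmer_sets[a] & kmer_sets[b]) >= min_shared:
                parent[find(a)] = find(b)

    clusters: dict[str, list[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Hypothetical data: one mRNA, two ESTs taken from it, one unrelated EST.
    example = {
        "mRNA_1": "ATGGTGCATCTGACTCCTGAGGAGAAGTCT",
        "EST_a": "CATCTGACTCCTGAGG",
        "EST_b": "GAGGAGAAGTCT",
        "EST_c": "TTTTCCCCAAAAGGGGTTTTCCCC",
    }
    print(cluster_by_shared_kmers(example))
    # [['EST_a', 'EST_b', 'mRNA_1'], ['EST_c']]
```

EST_a and EST_b share no 8-mer with each other, yet they land in the same cluster because both overlap mRNA_1. That transitive grouping is what gives a gene-oriented view rather than a pairwise one.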

Bioinformatic Databases – PubMed
- Database type
  - Biomedical papers
  - Manually curated database
- Service of the National Library of Medicine
- MEDLINE publication database
  - Over 17,000 journals
  - 15 million citations since

Bioinformatic Databases – Others
- Gene expression: ArrayExpress, Gene Expression Omnibus (GEO)
- Multi-organism genomes: Entrez Genome, HomoloGene, COGs, TIGR
- Genetic variation & genetic diseases: dbSNP, OMIM, CGAP
- Metabolic pathways: WIT, KEGG
- Many more…
  - Listed in the journal “Nucleic Acids Research” each January

Bioinformatic Databases: SNP Database
Single Nucleotide Polymorphisms (SNPs)
- A single-base difference at one position between two individuals of the same species
- Play an important role in differentiation and disease

Sickle Cell Anemia
Caused by a single A-to-T substitution in the beta-globin gene, so that valine is incorporated instead of glutamic acid in hemoglobin

Healthy Individual
>gi| |ref|NM_ | Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTG A
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi| |ref|NP_ | beta globin [Homo sapiens]
MVHLTP E EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH

Diseased Individual
>gi| |ref|NM_ | Homo sapiens hemoglobin, beta (HBB), mRNA
ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTG T
GGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC
AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG
CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC
TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT
CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA
CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA
CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT
GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC
>gi| |ref|NP_ | beta globin [Homo sapiens]
MVHLTP V EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
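The healthy and sickle-cell records can be compared programmatically. The sketch below, in Python, assumes two short, equal-length excerpts of the HBB coding sequence that begin at the ATG start codon; the sickle-cell excerpt carries the A-to-T change that turns codon 7 (counting the initiator methionine) from GAG into GTG. The function names are illustrative, and the code only handles substitutions, not insertions or deletions.

```python
from itertools import product

# Standard genetic code, compact layout (bases ordered T, C, A, G; '*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Illustrative excerpts: the first ten codons of the HBB coding sequence.
HEALTHY = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
SICKLE = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"


def find_snps(ref: str, alt: str) -> list[tuple[int, str, str]]:
    """Return (0-based position, ref base, alt base) for every mismatch."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]


def codon_change(ref: str, alt: str, pos: int) -> tuple[int, str, str, str, str]:
    """Describe the codon containing pos, for sequences starting at ATG.

    Returns (1-based codon number, ref codon, ref amino acid,
    alt codon, alt amino acid).
    """
    start = pos - pos % 3
    ref_codon, alt_codon = ref[start:start + 3], alt[start:start + 3]
    return (start // 3 + 1, ref_codon, CODON_TABLE[ref_codon],
            alt_codon, CODON_TABLE[alt_codon])


if __name__ == "__main__":
    for pos, ref_base, alt_base in find_snps(HEALTHY, SICKLE):
        number, rc, raa, ac, aaa = codon_change(HEALTHY, SICKLE, pos)
        print(f"base {pos + 1}: {ref_base}>{alt_base}, "
              f"codon {number}: {rc} ({raa}) -> {ac} ({aaa})")
    # base 20: A>T, codon 7: GAG (E) -> GTG (V)
```

The E-to-V change reported here matches the highlighted residue in the two protein sequences above.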

Disease Databases
- Genes are involved in disease
- Many diseases are well studied
- Descriptions of diseases and what is known about them are stored
- A good place to start when you want to know about a certain disease
- Linked to PubMed, the OMIM Morbid Map
- OMIM - Online Mendelian Inheritance in Man
  - “A catalog of human genes and genetic disorders maintained by Johns Hopkins University”

Putting it All Together
- Each database contains specific information
- Like other biological systems, these databases are also interrelated

- GENOMIC DATA: GenBank, DDBJ, EMBL
- ASSEMBLED GENOMES: GoldenPath, WormBase, TIGR
- PROTEIN: PIR, SWISS-PROT
- STRUCTURE: PDB, MMDB, SCOP
- LITERATURE: PubMed
- PATHWAY: KEGG, COG
- DISEASE: LocusLink, OMIM, OMIA
- GENES: RefSeq, AllGenes, GDB
- SNPs: dbSNP
- ESTs: dbEST, UniGene
- MOTIFS: BLOCKS, Pfam, Prosite
- GENE EXPRESSION: Stanford MGDB, NetAffx, ArrayExpress

Where to get started?
NCBI ENTREZ
- A search engine that provides access and links between various databases
- ENTREZ covers: PubMed, GenBank, protein databases, genomes, SNP, taxonomy, OMIM
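Entrez can also be queried from a program through NCBI's E-utilities web service, which the slides do not cover; it is brought in here as an assumption for illustration. The Python sketch below only builds an `esearch` request URL and sends nothing over the network. The helper name `esearch_url` and the example query are hypothetical; `db`, `term`, `retmax` and `retmode` are E-utilities parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service (not mentioned in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that looks up record IDs in one Entrez database.

    db     -- Entrez database name, e.g. "pubmed", "nucleotide", "protein"
    term   -- the search expression, as typed into the Entrez search box
    retmax -- maximum number of IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical query: PubMed citations about sickle cell anemia.
    print(esearch_url("pubmed", "sickle cell anemia", retmax=5))
```

Fetching the URL (for example with `urllib.request.urlopen`) returns the matching record IDs; NCBI asks clients to respect its usage guidelines and rate limits.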
