Bioinformatics Workshops 1 & 2

Presentation transcript:

Bioinformatics Workshops 1 & 2
1. Use of public database/search sites
   - range of data and access methods
   - interpretation of search results
   - understanding the meaning and effect of search (e.g. BLAST) parameters
2. Functional analysis of single sequences
   - i.e. how to work out what your unknown protein might be doing
   - complex searches for (e.g.) patterns of motifs and secondary structure elements

Workshop 1. Overall survey of data
- Mutation between species -> orthologs
- Mutation between duplications -> domains
- Search methods: 2D vs. 3D
- Search methods: similarity vs. models vs. comparative
- Main data axes
- Main portals
- Database searches vs. genome browsers
- Finding similar sequences
- BLAST, et al.
- E-values!
- Biological origin of sequences
- Genes vs. loci
- Random sequences

Using Public Data Resources
- There is (are!) data out there
- There are methods out there
- Quite often they are combined
  - BLAST searches of sequence databases

Notes…
- Sequence databases
  - Entrez queries…
- Genome browsers/databases
- Regulatory Elements
- SNPs
- Functional Sequence Models (Pfam domains, etc.)
- Expression Data
  - Array data
  - in situ data

Notes II
- BLAST parameters
  - Low complexity: frameshifted cDNA
  - miRNAs vs genome
  - morpholinos for other genes
  - -q-2 for EST vs EST alignments
  - Entrez queries

What have we got…
[Diagram labels: genome, locus ~ gene, gene model, primary transcript, mRNA, protein]

Derivative Sequences
- mRNA -> clone into cDNA library
- 3' EST, 5' EST: single pass sequence from each end of the clone
- cDNA sequence: multiple pass sequencing over the whole length of the clone

Initial Growth of Databases
- Lots of ESTs were generated
- Some clones were selected for full-insert sequencing -> cDNAs
- cDNAs were translated to yield presumed protein sequences
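
The translation step mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline any database actually used: it assumes the standard genetic code, reads frame 1 only, and stops at the first stop codon. The names `CODON_TABLE` and `translate` are made up for this sketch, and the example input is the short ATG-to-TAG stretch from the Gene Predictions slide.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cdna: str) -> str:
    """Translate a cDNA in frame 1, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        # Unknown codons (e.g. containing N) become X
        aa = CODON_TABLE.get(cdna[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCTTCGATGTAG"))  # MASM
```

A "presumed" protein of this kind is only as good as the cDNA and the chosen reading frame, which is why the slides treat such translations with caution.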

Then Came Genomes
- With increasingly large fragments of genomic sequence came the ability to align cDNAs to create gene models
- And then to apply our understanding of exon/intron structure to predict theoretical genes…

Introns and Exons
mRNA:   CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGTAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
genome: CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAGCATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA
Splice sites: donor GTAAG…, acceptor …TTTCAG
Gene model: exon - intron - exon - intron - exon
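
As a quick check of the slide's example, the sketch below removes the two introns from the genomic sequence and compares the result with the mRNA. The intron boundaries are inferred here by comparing the two sequences on the slide; the function name `splice` and the GT…AG sanity check are illustrative assumptions, not part of the original material.

```python
GENOME = (
    "CTACCATCCATGCTAACCATTCTACGTAAGTCATCTATATCAATATTATTTCAG"
    "CATTTTATACTCATGCAACGGACCGTGTCAGTATTACAGAGCGTAGTCGCTTAGC"
    "ATCCTTTATAACTGGCTA"
)
MRNA = (
    "CTACCATCCATGCTAACCATTCTACCATTTTATACTCATGCAACGGACCGT"
    "AGCGTAGTCGCTTAGCATCCTTTATAACTGGCTA"
)
# Introns inferred by aligning the mRNA against the genome by eye
INTRONS = ["GTAAGTCATCTATATCAATATTATTTCAG", "GTCAGTATTACAG"]


def splice(genome: str, introns: list[str]) -> str:
    """Remove each intron once, checking the canonical GT...AG ends."""
    spliced = genome
    for intron in introns:
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"not a GT...AG intron: {intron}")
        if intron not in spliced:
            raise ValueError(f"intron not found: {intron}")
        spliced = spliced.replace(intron, "", 1)
    return spliced


print(splice(GENOME, INTRONS) == MRNA)  # True
```

Real alignment of cDNAs to genomes is done with spliced aligners rather than exact string matching; the point here is only that the three exons, joined end to end, reproduce the mRNA.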

Gene Predictions
Given:
- coding sequence must run from ATG to STOP codon, in-frame
- introns (GT…AG) can be spliced out
Also take a statistical approach:
- coding and non-coding sequence are slightly different in composition
- some 'possible' splice sites are more likely than others
Scan genomic sequence:
...CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA...
Candidate spliced forms:
...CGTCGTATGGCTTCGATTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA...
...CGTCGTATGGCTTCGATGTAGTACATCGGATCGTCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA...
-> most likely gene model
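
The two rule-based constraints above (in-frame ATG to STOP, and removable GT…AG introns) can be sketched as follows. This is a toy enumeration over the slide's sequence, not a real gene finder: it ignores the statistical scoring the slide mentions, and the function names and the `min_len` cutoff are assumptions made for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
SEQ = (
    "CGTCGTATGGCTTCGATGTAGTACATCGGATCGGTATGGAATCATTTCAG"
    "TCGCTAGCTAGCCTAACGTATATAGCTAGGTAAGACTA"
)


def find_orf(seq):
    """Return the ORF from the first ATG to the first in-frame stop, or None."""
    start = seq.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return seq[start:i + 3]
    return None


def candidate_introns(seq, min_len=10):
    """List (start, end) spans that begin with GT and end with AG."""
    spans = []
    for donor in range(len(seq) - 1):
        if seq[donor:donor + 2] != "GT":
            continue
        for acceptor in range(donor + min_len - 2, len(seq) - 1):
            if seq[acceptor:acceptor + 2] == "AG":
                spans.append((donor, acceptor + 2))
    return spans


def splice_out(seq, span):
    """Remove one candidate intron from the sequence."""
    return seq[:span[0]] + seq[span[1]:]


print(find_orf(SEQ))  # ATGGCTTCGATGTAG (unspliced reading)
for span in candidate_introns(SEQ):
    print(span, find_orf(splice_out(SEQ, span)))
```

Even on this short sequence the GT…AG rule alone yields many candidate introns, which is why predictors add composition and splice-site likelihoods to rank the resulting gene models.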

Supporting Evidence!
[Diagram: EST evidence aligned to the genome, with the resulting gene model and its exons]
We note that even though there is good evidence for the existence of all four exons, there is no evidence that all the exons would appear together on a single real transcript. An alternative transcript, skipping exon 3, would be plausible, if a little unlikely. This becomes less ambiguous as more ESTs become available, as clones are sequenced at both ends (which helps put distant exons into the same transcript), and eventually as full-length transcript sequences become available.

So What’s in the Databases Now?
At NCBI:
- 15,000,000 EST sequences
- 3,329,110 non-redundant DNA sequences (excluding ESTs, etc.)
- 2,693,904 non-redundant translated coding sequences
- 954,378 Protein Reference Sequences (RefSeq)
But the majority of RefSeq may be translations of theoretical transcripts…

Main Data Axes
- Europe: EBI/EMBL
  - Swiss-Prot/TrEMBL/Ensembl/UniProt
- US: NIH/NCBI
  - GenBank/UniGene/RefSeq/Entrez
- Japan: DNA Data Bank of Japan
  - National Institute of Genetics

Synchronisation…
[Diagram: you submit a sequence (BC ATCGATCGATCATAGTATGCTAGCTGCTA) to one of GenBank, DDBJ or EMBL, and the same record then appears in all three]

Sequences, Accession Numbers and Genes
NM_ gi= GATCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA
BC gi= GTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAA
NM_ gi= GACCGTTCGATTAGCTAGGGACACCACCGATCGATATGACCACAAAAA

Main Data Portals
- NCBI Entrez Databases
- ExPASy Proteomics Server
- DNA Data Bank of Japan (DDBJ)
- EBI Ensembl Genome Browser
- Santa Cruz Genome Browser