Part I: Identifying sequences with … Speaker : S. Gaj Date 11-01-2005.

Background: Annotation
Best possible description available for a given sequence at the current time.
How to annotate? Combining:
– Alignment tools
– Databases
– Datamining (scripts)

Microarrays

Background: Introduction
Global alignment
– Optimal alignment between two sequences containing as many characters of the query as possible.
– Ex: predicting evolutionary relationships between genes, …
Local alignment
– Optimal alignment between two sequences identifying identical area(s).
– Ex: identifying key molecular structures (S-bonds, α-helices, …)
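
To make the difference between the two alignment types concrete, here is a minimal sketch that is not part of the original slides. It assumes a simple scoring scheme (match +1, mismatch -1, linear gap penalty -1) and uses the textbook Needleman-Wunsch recurrence for the global case and the Smith-Waterman variant for the local case. The function name `alignment_score` and the example sequences are illustrative only.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch), both sequences end to end.
    local=True:  local alignment (Smith-Waterman), best-matching region only.
    The scoring values are assumptions chosen for illustration.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment starts free.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            score[i][j] = cell
    return best if local else score[-1][-1]


query, subject = "ACGT", "TTACGTTT"
print("global:", alignment_score(query, subject))
print("local: ", alignment_score(query, subject, local=True))
```

With these assumed scores, the local alignment rewards the shared ACGT stretch in full, while the global alignment is also charged for the four unmatched flanking bases.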

BLAST: Introduction
Basic Local Alignment Search Tool
– Aligns an unknown sequence (query) against all sequences present in a chosen database, based on a score value.
– Aim: obtaining structural or functional information on the unknown sequence.

BLAST: Programs
Different BLAST programs available:
                   Database
Query        Nucleic    Protein
Nucleic      BlastN     BlastX
Protein      -          BlastP
Usable criteria: E-value, Gap Opening Penalty (GOP), Gap Extension Penalty (GEP), …
Terms:
– Query: sequence which will be aligned
– Subject: sequence present in database
– Hit: alignment result
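
The program table on this slide can be read as a lookup from query type and database type to a program name. The sketch below is an illustration only. The names `BLAST_PROGRAMS` and `choose_blast_program` are hypothetical, and only the three combinations listed on the slide are included.

```python
# (query type, database type) -> program, as listed on the slide.
BLAST_PROGRAMS = {
    ("nucleic", "nucleic"): "BlastN",
    ("nucleic", "protein"): "BlastX",
    ("protein", "protein"): "BlastP",
}


def choose_blast_program(query_type, database_type):
    """Return the program named on the slide, or None where the slide lists none."""
    return BLAST_PROGRAMS.get((query_type.lower(), database_type.lower()))


print(choose_blast_program("nucleic", "protein"))
```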

BLAST: Common BLAST problems (BlastN)
Sequencing error:
Clone seq   CGATAGCCCGCCAGGATATA
            ||||||||| ||||||||||
mRNA        CGATAGCCC-CCAGGATATA
Solution: low penalty for GOP and GEP = 1
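
The effect of the gap penalties on this example can be sketched as follows. This is not from the slides. It assumes match +1, mismatch -1, and the convention that a gap run of length k costs GOP + GEP * (k - 1); other tools define the gap cost differently. The function name `score_gapped_alignment` is hypothetical.

```python
def score_gapped_alignment(top, bottom, match=1, mismatch=-1, gop=1, gep=1):
    """Score two equal-length aligned strings in which '-' marks a gap.

    Assumed convention: the first column of a gap run costs gop and every
    further column costs gep. For simplicity, adjacent gap columns count as
    one run regardless of which sequence carries the gap.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score -= gep if in_gap else gop
            in_gap = True
        else:
            score += match if x == y else mismatch
            in_gap = False
    return score


# The two sequences shown on the slide: one extra base in the first sequence.
first = "CGATAGCCCGCCAGGATATA"
second = "CGATAGCCC-CCAGGATATA"
print("low gap penalty: ", score_gapped_alignment(first, second, gop=1, gep=1))
print("high gap penalty:", score_gapped_alignment(first, second, gop=10, gep=1))
```

With a low penalty the single-base gap barely lowers the score, so one sequencing error does not prevent an otherwise perfect hit from being found.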

BLAST: Translation problems (6-frame translation)
>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct
Frame 1: L A L * P S S Q H E G S H C S G A
Frame 2: * H S D L A V N M K A L I V L G
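
A six-frame translation can be sketched in a few lines. The code below is an illustration, not the speaker's implementation. It assumes the standard genetic code and translates every complete codon, printing `*` for stop codons. The helper names are hypothetical, and the input is the sequence fragment shown on the slide.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return the three forward (+1..+3) and three reverse (-1..-3) translations."""
    reverse = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


fragment = "ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct"
for name, protein in six_frame_translation(fragment).items():
    print(name, protein)
```

Frames +1 and +2 reproduce the two translations shown on the slide; the methionine (M) in frame +2 is where the `atg` start codon falls in that frame.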

BLAST: Common BLAST problems
[Diagram labels: Gene X (intron, exon); Splicing; full mRNA; mRNA; Translation]

BLAST: Common BLAST problems
[Diagram: mRNA with coding region and non-coding region; clones derived from mRNA]
BlastX against protein sequence: 3 possible hit situations

BLAST: Common BLAST problems
– Yields no protein hit
– Aligns with protein in 1 of the 6 frames
– Part perfect alignment
[Diagram labels: coding region, non-coding region]

Part II: Databases and annotation

Databases: Introduction
Primary database:
– DNA sequence (EMBL, GenBank, …)
– Amino acid sequence (SwissProt, PIR, …)
– Protein structure (PDB, …)
Secondary database:
– Derived from primary DB
– DNA sequence (UniGene, RefSeq, …)
– Combination of all (LocusLink, ENSEMBL, …)
Structure:
– Flat file databases

Databases: Primary Databases
EMBL:
– DNA sequence
– Human: nucleotides in entries
– Clones, mRNA, (Riken) cDNA, …
– New sequences can be submitted by anyone.
– No curation check before admission.

Databases: Primary Databases
SwissProt:
– Amino acid sequence
– Human:
– Contains protein information
– SwissProt (EU) ↔ PIR (USA)
– Crosslinks to most informative DB (PDB, OMIM)
– Part of UniProt consortium.
– Each addition needs validation by appointed curators.
– Highly curated

Databases: Secondary Databases
TrEMBL:
– Translated EMBL
– Hypothetical proteins
– After careful assessment → SpTrEMBL → SwissProt

Databases: Secondary Databases
UniGene:
– Automated clustering of sequences with high similarity
– Derived from GenBank / EMBL
– 1 consensus sequence
– Species-specific

Databases: Secondary Databases
LocusLink:
– Curated sequences
– Descriptive information about genetic loci
RefSeq:
– Non-redundant set of sequences.
– Genomic DNA, mRNA, protein
– Stable reference for gene identification and characterization.
– High curation

Databases: Database Quality?
[Diagram: EMBL (DNA, mRNA): Submitter, Database Manager. SwissProt (Protein): Submitter, Database Manager, Curators.]

Databases: How to Annotate?
BlastN against random nucleotide DB
– ESTs
BlastN against structured nucleotide DB (UniGene, RefSeq)
– mRNA hits
– Sometimes not annotated at all
– Best information

Microarrays

Part III: Annotation Techniques

Annotation: What do we have?
– Probe sequence
– Alignment tools (e.g. BLAST)
– Databases
What to choose?

Annotation: Possibilities?
1. Do it like everyone else does.
2. Make use of the curation of certain databases.
Goal: annotate as many genes with as much information as possible (e.g. SwissProt ID).

Annotation Techniques: 1st Approach - General
“Done by most array manufacturers”
Step-by-step approach:
– BLAST sequences against a nucleic database (preferably UniGene)
– Extract high-quality (HQ) hits (>95%)
– For each HQ hit, search crosslinks
– Find a well-described (SwissProt) ID for each sequence
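
The step-by-step approach can be sketched as a filter over BLAST tabular output followed by a crosslink lookup. This is an illustration under assumptions, not the manufacturers' pipeline. It assumes the 12 default columns of BLAST+ tabular output (`-outfmt 6`), reads the slide's ">95%" as percent identity, and keeps the highest bit score per reporter. The reporter, cluster and SwissProt IDs are made-up placeholders.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def high_quality_hits(tabular_text, min_identity=95.0):
    """Keep, per query, the best-scoring hit whose identity exceeds min_identity."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text), fieldnames=COLUMNS,
                            delimiter="\t")
    for row in reader:
        identity = float(row["pident"])
        bitscore = float(row["bitscore"])
        if identity <= min_identity:
            continue  # not a high-quality hit under the assumed threshold
        query = row["qseqid"]
        if query not in best or bitscore > best[query][1]:
            best[query] = (row["sseqid"], bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def follow_crosslinks(hits, unigene_to_swissprot):
    """Map each reporter's hit to a SwissProt ID, or None if no crosslink exists."""
    return {query: unigene_to_swissprot.get(subject)
            for query, subject in hits.items()}


# Hypothetical BLAST results and crosslinks, for illustration only.
example = (
    "rep1\tUG_A\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t350\n"
    "rep1\tUG_B\t96.0\t180\t7\t0\t1\t180\t1\t180\t1e-80\t300\n"
    "rep2\tUG_C\t90.0\t150\t15\t0\t1\t150\t1\t150\t1e-40\t180\n"
)
hits = high_quality_hits(example)
print(hits)
print(follow_crosslinks(hits, {"UG_A": "SP_A"}))
```

Reporters whose best hit falls below the threshold, like `rep2` here, simply drop out and stay unannotated in this approach.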

Annotation Techniques: 1st Approach - Concept

Annotation Techniques: 2nd Approach - General
“Make use of present database curation”
Other way around:
– Use SwissProt to clean out EMBL
– Result: “cleaned” EMBL database (cEMBL) with direct SP crosslinks
– BLAST against cEMBL
– Extract high-quality alignment hits (>95%)
– Convert EMBL ID to SP ID
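
One possible reading of the cleaning step is sketched below. It assumes that "cleaning" means keeping only those EMBL entries that a SwissProt entry cross-references, so that every remaining EMBL ID maps directly to a SwissProt ID. The slides do not spell out the procedure, so this is an assumption, and all function names and IDs are hypothetical.

```python
def build_embl_to_swissprot(swissprot_crosslinks):
    """Invert SwissProt -> [EMBL IDs] cross-references into EMBL -> SwissProt.

    If several SwissProt entries reference the same EMBL ID, the first one
    seen is kept (an arbitrary choice for this sketch).
    """
    embl_to_sp = {}
    for sp_id, embl_ids in swissprot_crosslinks.items():
        for embl_id in embl_ids:
            embl_to_sp.setdefault(embl_id, sp_id)
    return embl_to_sp


def clean_embl(embl_records, embl_to_sp):
    """Keep only EMBL records that a SwissProt entry points to (the cEMBL set)."""
    return {embl_id: seq for embl_id, seq in embl_records.items()
            if embl_id in embl_to_sp}


def convert_hits(hits, embl_to_sp):
    """Convert reporter -> EMBL ID hits into reporter -> SwissProt ID."""
    return {reporter: embl_to_sp[embl_id]
            for reporter, embl_id in hits.items() if embl_id in embl_to_sp}


# Hypothetical data, for illustration only.
crosslinks = {"SP_A": ["EMBL_1", "EMBL_2"], "SP_B": ["EMBL_3"]}
embl = {"EMBL_1": "ACGT", "EMBL_2": "ACGA", "EMBL_3": "TTGA", "EMBL_9": "GGCC"}

embl_to_sp = build_embl_to_swissprot(crosslinks)
cembl = clean_embl(embl, embl_to_sp)
print(sorted(cembl))
print(convert_hits({"rep1": "EMBL_2", "rep2": "EMBL_3"}, embl_to_sp))
```

Because every sequence left in cEMBL already carries a SwissProt crosslink, a high-quality hit converts to an SP-ID in a single lookup, with no intermediate UniGene step.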

Annotation Techniques: 2nd Approach - Concept

Results: Annotating Incyte Reporters
Total:
cEMBL approach: (21.47%) SP-IDs
DM approach: (74.18%) UG-IDs, in which
– M = (34.9%) SP-IDs
– MR = (38.1%) SP-IDs
– MRH = (49.2%) SP-IDs

Results: Annotating Incyte Reporters
All reporters present on “Incyte Mouse UniGene 1” converted
Total: reporters
Old annotation: (97.6%) UG-IDs, in which
– Non-existing UG-IDs = (59.5%)
– M = (20.2%) SP-IDs
– MR = (21.8%) SP-IDs
– MRH = (26.9%) SP-IDs
Datamining approach: (88.9%) UG-IDs, in which
– M = (43.2%) SP-IDs
– MR = (38.1%) SP-IDs
– MRH = (60.1%) SP-IDs
Custom EMBL approach: (30.2%) SP-IDs

Results: Annotating Incyte Reporters
Combined methods, “Incyte Mouse UniGene 1” reporters
Total: reporters
No annotation: (11%) reporters
Annotated with SP-ID: (61.3%) reporters, of which
– (22.7%) identical SP-IDs
– 532 (5%) reporters with improved SP-IDs by EMBL method
– 174 (1.8%) reporters with different mouse SP-IDs
– 5 reporters found only by EMBL method

Conclusions
Annotation is much needed:
– Array sequences can point to different genes
Direct translation into protein is not the best option:
– Sequencing errors
– Addition or deletion of nucleotides
– 6-frame window
Public nucleotide databases are redundant:
– Sequencing errors
– Differences in sequence length
– Attachment of vector sequence

Questions? End