Basic Overview of Bioinformatics Tools and Biocomputing Applications II Dr Tan Tin Wee Director Bioinformatics Centre.

Common Computational Analyses
- Sequence assembly
- Simple sequence analysis
  – Translation and reverse complement, ORF
  – Composition statistics (protein & DNA)
  – Molecular mass
  – Total charge and pI; local hydropathy
  – Simple determination of secondary structures
  – Restriction site analysis
  – Internal repeat analysis
- Detection of active sites, functional residues, characteristic structures, substrates, and processing signals
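Three of the simple sequence analyses listed on this slide (reverse complement, translation, and ORF detection) can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular package: it assumes the standard genetic code, uppercase unambiguous DNA, and ORFs defined as ATG to the first in-frame stop on the forward strand only. The function names and the `min_codons` parameter are hypothetical.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs for ATG...stop ORFs in the 3 forward frames.

    `end` is exclusive and includes the stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs
```

To cover the reverse strand as well, the same scan can be run on `reverse_complement(dna)`, with the coordinates mapped back to the original sequence.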

Common Computational Analyses
- Database sequence search
- Multiple alignment
- 2° and 3° structure prediction; transmembrane helix detection
- Structure modeling
- Docking prediction and design
- Hidden Markov model searches

Database Searching
- Text-based database searching: using a text string to match an annotation in a sequence database record, i.e. a keyword search
- Sequence-based database searching: using a biological sequence to match all or part of that sequence against the sequence in every record of a sequence database

Text-Based Database Searching
- Examples: Entrez, SRS, DBGET, AceDB (common integrated database systems)
- Search concepts
  – Boolean search: AND, OR, NOT
  – Broadening the search
  – Narrowing the search
  – Proximity searching, soundex
  – Wild card, stemming, e.g. Thala* for thalasemia, thalassemia, thalassemic
- Use standard string search algorithms and Boolean operations, vocabulary matches
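The Boolean and wild-card concepts on this slide can be illustrated with a toy keyword search. The sketch below is not how Entrez, SRS, DBGET, or AceDB are implemented; it assumes annotations are plain strings held in a dictionary, and the function names and the `all_of` / `any_of` / `none_of` parameters are hypothetical stand-ins for AND, OR, and NOT.

```python
import re


def tokens(text):
    """Split an annotation into lowercase words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches(term, words):
    """Match one search term; a trailing '*' is a prefix wild card."""
    if term.endswith("*"):
        prefix = term[:-1].lower()
        return any(w.startswith(prefix) for w in words)
    return term.lower() in words


def search(records, all_of=(), any_of=(), none_of=()):
    """Boolean keyword search: AND over all_of, OR over any_of, NOT over none_of."""
    hits = []
    for rec_id, text in records.items():
        words = tokens(text)
        if (all(matches(t, words) for t in all_of)
                and (not any_of or any(matches(t, words) for t in any_of))
                and not any(matches(t, words) for t in none_of)):
            hits.append(rec_id)
    return sorted(hits)


# Hypothetical annotation records, invented for illustration only.
records = {
    "r1": "Human beta-globin mutation in thalassemia",
    "r2": "Thalassemic mouse model",
    "r3": "Human period homolog",
    "r4": "thalasemia case report, human",
}
```

Adding a term to `all_of` or `none_of` narrows the result set, while adding alternatives to `any_of` or replacing a word with a wild-card stem broadens it.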

Text-based Database Searching
- Example: to find the human homolog of the Drosophila per gene
- Procedure
  – Web to Entrez
  – All Fields: enter "human" "per"
  – The hits returned are irrelevant, so broaden the search
  – "human" "period" gives more hits
  – Check every one; find the human RIGUI gene
- Hit and miss, clever guesswork; free-form or controlled vocabulary (MeSH terms)? Use Boolean searches?

Sequence-based Database Searching
- Homology search
- Global or local sequence alignment
- Needleman-Wunsch algorithm
- Smith-Waterman algorithm
- Lipman-Pearson FASTA
- Altschul's BLAST
- Take a sequence and compare it pairwise with each sequence in the database

Sequence-based Database Searching
Basic assumptions:
- Sequences of homologous genes/proteins diverge over time even though structure and/or function change little
- Significant sequence similarity is taken to imply potential structural/functional similarity or a common evolutionary origin
- Based on a well-characterised protein, infer the function of an unknown sequence at the gene or protein sequence level

Sequence-based Database Searching
- Global alignment forces the two input sequences to be aligned over their full length
- Local alignment looks for local stretches of similarity and tries to align the most similar segments
- The algorithms used may be similar, but the output differs; statistics are needed to assess the results

Sequence-based Database Searching
Alignment scoring:
- Substitution score and substitution matrix: PAM, BLOSUM
- Affine gap costs/gap penalty and gap scores
- Optimal alignments, dynamic programming: Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH)
- Additional heuristics to speed up the search: FASTA, BLAST
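The relationship between the two dynamic programming algorithms can be shown with one scoring routine that switches between global (Needleman-Wunsch) and local (Smith-Waterman) behaviour. This is a simplified sketch under stated assumptions: it uses a flat match/mismatch score in place of a PAM or BLOSUM matrix and a linear gap penalty in place of affine gap costs, and it returns only the optimal score, not the alignment itself. The function name and default scores are hypothetical.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style): cells are floored
                 at zero and the best cell anywhere in the table is returned.
    """
    # First row: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                       prev[j] + gap,     # gap in b
                       curr[j - 1] + gap) # gap in a
            if local:
                cell = max(cell, 0)
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]
```

The only differences between the two modes are the zero floor, the zero-cost borders, and where the final score is read from, which is why the slide notes that the algorithms may be similar while the output differs.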

Some definitions
- Affine gap costs: a scoring system for gaps within alignments which charges a penalty for gap formation and an additional per-residue penalty proportional to the size of the gap
- Alignment score: a numerical value indicating the overall quality of an alignment; the higher the score, the better the alignment
- Algorithm: a fixed procedure embodied in a computer program
- Heuristics: a computer science term referring to guesses made by the program to approximate results, usually based on arbitrary or predefined rules
- Gapped alignment: an alignment of sequences in which gaps are permitted
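The affine gap cost defined above can be written as a one-line function. The sketch assumes the convention in which a gap of length k costs the opening penalty plus k times the extension penalty; some programs instead count the first residue as part of the opening penalty. The penalty values below are arbitrary illustrative numbers, not the defaults of any specific tool.

```python
def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of a single gap: one opening charge plus a per-residue charge."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of three residues is cheaper than three separate one-residue gaps,
# so affine costs favour a few long gaps over many scattered short ones.
one_long_gap = affine_gap_cost(3)          # 10 + 3 * 1
three_short_gaps = 3 * affine_gap_cost(1)  # 3 * (10 + 1)
```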

Computational Genefinding
- Major challenge in genome projects
- Given a DNA sequence, where does a gene begin and stop? (ORF)
- Where are the exons and introns?
- Where are the transcription elements?
- Gene structure and other regulatory elements?

Genomic Elements
- Intron-exon splice sites
- Start and stop codons
- Branch points
- Promoters and terminators of transcription
- Polyadenylation sites
- Ribosomal binding sites
- Topoisomerase II binding sites
- Topoisomerase I cleavage sites
- Transcription factor binding sites

Detecting Genomic Elements
- Local sites and motifs/patterns for such elements: signals and signal sensors
- Extended variable-length regions, e.g. exons and introns: content and content sensors
- Linguistic technique: gene structure described in a formal grammar (GeneLang genefinding program)

Signal sensors
- Simple consensus sequence
- Use of pattern matching algorithms
- Weight matrices, which allow the weighted scores from each weight matrix sensor to be summed
- Use of artificial neural networks (ANN)
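A weight matrix sensor can be sketched as follows: build a position-specific log-odds matrix from aligned example sites, then slide it along a sequence and sum the per-position weights in each window. The sketch assumes a uniform background of 0.25 per base and a pseudocount of 1; the example sites are invented for illustration and the function names are hypothetical.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds weight matrix from equal-length aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({base: math.log2(((column.count(base) + pseudocount) / total)
                                    / background)
                    for base in "ACGT"})
    return pwm


def score_window(pwm, window):
    """Sum the weight of the observed base at every matrix position."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, seq):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score_window(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))


# Hypothetical aligned sites resembling a TATA-like signal.
sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
pwm = build_pwm(sites)
score, position = best_hit(pwm, "GCGCGTATAAAGCGC")
```

Unlike a simple consensus match, the summed score degrades gradually as a window departs from the preferred bases, so a threshold on the score controls how permissive the sensor is.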

Content Sensors
- Long ORF for bacteria
- Statistical models, e.g. Markov models: GeneMark uses statistical models of nucleotide frequencies and dependencies in codon structure
- Neural nets, e.g. Grail: exon detection by neural network combined with signal sensors for exon-intron splice sites
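The idea behind a Markov-model content sensor can be sketched with two first-order Markov chains, one trained on coding sequence and one on non-coding sequence, compared by log-odds. This is a deliberately simplified stand-in and not GeneMark's actual model, which accounts for codon position; the training strings below are invented toy data and the function names are hypothetical.

```python
import math


def train_markov(seqs, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next base | base)."""
    counts = {a: {b: pseudocount for b in "ACGT"} for a in "ACGT"}
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in "ACGT"}
            for a in "ACGT"}


def log_odds(seq, coding, noncoding):
    """Positive values favour the coding model, negative the non-coding one."""
    return sum(math.log2(coding[a][b] / noncoding[a][b])
               for a, b in zip(seq, seq[1:]))


# Toy training data (assumption): GC-rich "coding", AT-rich "non-coding".
coding_model = train_markov(["GCGCGCGGCGCC" * 3])
noncoding_model = train_markov(["ATATTAATATTA" * 3])
```

In practice the score would be computed over a sliding window or over each candidate ORF, and combined with signal sensors such as start codons and splice sites.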

Some Definitions
- Artificial neural nets: a statistical pattern recognition method; a type of nonlinear regression
- Markov models: statistical models for sequences in which the probability of each residue depends on the residues preceding it
- Dynamic programming: a type of algorithm widely used for constructing sequence alignments and for evaluating all possible candidate gene structures

Other Genefinding methods
- Use of dynamic programming
- Linguistic rules for functional features
- Parameters of a Markov process on hidden variables: hidden Markov models (HMM)
- HMM genefinders: EcoParse, Xpound, GeneMark HMM, Veil, HMMgene, GenScan
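How an HMM labels a sequence with hidden states can be illustrated with the Viterbi algorithm, which is itself a dynamic programming method. The sketch below uses a toy two-state model and does not reproduce any of the genefinders named above; the state names "E" and "I" and all probabilities are assumptions chosen only to make the example readable, and the observation sequence is assumed to be non-empty.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for obs, computed in log space."""
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    back = []
    for symbol in obs[1:]:
        row, pointers = {}, {}
        for s in states:
            # Best predecessor state for reaching s at this position.
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            row[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][symbol]))
            pointers[s] = prev
        scores.append(row)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy model (assumption): "E" emits mostly G/C, "I" emits mostly A/T.
states = ["E", "I"]
start_p = {"E": 0.5, "I": 0.5}
trans_p = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
emit_p = {"E": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "I": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
path = viterbi("GCGCGCATATAT", states, start_p, trans_p, emit_p)
```

A real genefinder uses many more states (for example separate states for codon positions, splice sites, and intergenic regions), but the decoding step follows the same recurrence.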