Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas.


Other databases
NCBI BLAST
–Basic Local Alignment Search Tool
–Multiple programs for sequence searching and comparisons
Gene Expression Omnibus (GEO)
–Maintained by NCBI
–Contains output of gene expression experiments

Links
GenBank, ExPASy, SwissProt, GO, PubMed, MeSH browser, NCBI BLAST, NCBI GEO, Human Protein Atlas

Assignment
Search the above databases for information on a gene/protein of your choice
Briefly report your findings (90 seconds) next Tuesday, September 30
Examples: interleukin-N (e.g., 3), elastase, thrombin, creatine kinase, myosin-N (e.g., 2)

Sequences
Sequences of symbols are central to bioinformatics
–DNA
–RNA
–Proteins
Fixed alphabet (size 4 for DNA/RNA, 20 for proteins)

Sequence similarity
Important for many biological problems
Examples
–Similar primary structure in proteins implies similar form and function
–Similar sequences in genes / proteins imply homologues across organisms
–Similar short sequences lead to motif finding
–Similarities between gene regions can be used for phylogenetic classification

How to measure similarity
Given two sequences S and T, we look into ways to derive T from S using elementary operations
–Substitution (change a letter)
–Deletion (remove a letter)
–Insertion (add a letter)
The process is reversible (S→T and T→S)
There are many ways to derive T from S, some obviously more efficient than others

Edit distance
Each elementary operation is assigned a cost
The overall cost is the sum of the costs of the operations taken (linear model)
The edit distance between two strings is the minimum total cost among all possible sequences of operations that transform S into T
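To make the definition concrete, here is a minimal sketch of an edit-distance computation. It rests on assumptions not stated on the slide: every substitution, insertion, and deletion costs 1 by default, and the minimum is found with the standard dynamic-programming recurrence, which these slides have not introduced at this point. The function and parameter names are hypothetical.

```python
from functools import lru_cache


def edit_distance(s, t, sub_cost=1, indel_cost=1):
    """Minimum total cost of transforming s into t (assumed unit costs)."""

    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) = cheapest way to turn the first i letters of s
        # into the first j letters of t.
        if i == 0:
            return j * indel_cost      # insert the j remaining letters
        if j == 0:
            return i * indel_cost      # delete the i remaining letters
        same = s[i - 1] == t[j - 1]
        return min(
            d(i - 1, j - 1) + (0 if same else sub_cost),  # keep or substitute
            d(i - 1, j) + indel_cost,                     # delete s[i-1]
            d(i, j - 1) + indel_cost,                     # insert t[j-1]
        )

    return d(len(s), len(t))


print(edit_distance("kitten", "sitting"))  # → 3
```

Because the model is linear, changing sub_cost or indel_cost only changes the weights being summed, not the structure of the search.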

Alignment
An equivalent way to measure edit distance is to align the two sequences
An alignment extends the sequences S and T into S′ and T′ using the same alphabet plus “-” (the space character), and matches S′[i] with T′[i]

Definitions
A string is a finite sequence of characters from a finite alphabet Σ
The length of a string S, denoted |S|, is the number of characters it contains (can be 0)
S[i] is the i-th character of S
A subsequence of a string S is the string formed by omitting a number of characters from S (the order of the remaining characters does not change)

Defining alignment formally
An alignment is the mapping of two strings S and T from alphabet Σ into strings S′ and T′ where
–The alphabet of S′ and T′ is Σ plus “-”
–S is a subsequence of S′. All characters in S′ not in this subsequence must be “-”.
–T is a subsequence of T′. All characters in T′ not in this subsequence must be “-”.
–|S′| = |T′|
–There is no i for which S′[i] = T′[i] = “-”
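The conditions above can be checked mechanically. The sketch below is one possible validator, assuming that “-” never occurs in the original strings (it is not part of Σ), so that "S is a subsequence of S′ and everything else is a gap" is the same as "removing the gaps from S′ gives back S". The name is_alignment is hypothetical.

```python
GAP = "-"


def is_alignment(s, t, s_prime, t_prime):
    """Return True if (s_prime, t_prime) is a valid alignment of s and t."""
    # |S'| = |T'|
    if len(s_prime) != len(t_prime):
        return False
    # Removing the gaps must recover the original strings exactly.
    if s_prime.replace(GAP, "") != s or t_prime.replace(GAP, "") != t:
        return False
    # No position may hold a gap in both strings.
    return all(not (a == GAP and b == GAP) for a, b in zip(s_prime, t_prime))


print(is_alignment("AC", "A", "AC", "A-"))    # → True
print(is_alignment("AC", "A", "A-C", "A--"))  # → False (both gaps at position 2)
```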

Example alignment
Sequences:
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

Alignment operations
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements:
–Perfect matches
–Mismatches
–Insertions & deletions (indels)
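As an illustration, the three kinds of positions can be counted by walking the two aligned strings in parallel. This is a small sketch with a hypothetical function name; it assumes its inputs already form a valid alignment, so no position holds a gap in both strings.

```python
def count_columns(s_prime, t_prime):
    """Return (matches, mismatches, indels) for two aligned strings."""
    matches = mismatches = indels = 0
    for a, b in zip(s_prime, t_prime):
        if a == "-" or b == "-":
            indels += 1        # a letter facing a gap
        elif a == b:
            matches += 1       # identical letters
        else:
            mismatches += 1    # two different letters
    return matches, mismatches, indels


print(count_columns("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # → (13, 2, 4)
```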

Alignments are not unique
For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Measuring alignment quality
For each position i in the alignment, calculate the scoring function σ(S′[i], T′[i])
The scoring function depends only on the symbols S′[i] and T′[i], not on the position i
A very simple scoring function might be
–σ(x, x) = +1 for x a letter
–σ(x, y) = -2 for x, y different letters
–σ(x, “-”) = σ(“-”, x) = -1 for an indel

Overall alignment score
Defined as the sum of the applicable values of the scoring function
As with our definition of edit distance, this is a linear model
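A minimal sketch of this sum, using the very simple scoring function from the previous slide (+1 for a match, -2 for a mismatch, -1 for an indel). The names sigma and alignment_score are hypothetical, and passing the scoring function as a parameter is a design choice made here so that other scoring schemes can be swapped in.

```python
def sigma(x, y):
    """The simple scoring function: +1 match, -2 mismatch, -1 indel."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -2


def alignment_score(s_prime, t_prime, score_fn=sigma):
    """Sum the scoring function over all positions of the alignment."""
    if len(s_prime) != len(t_prime):
        raise ValueError("aligned strings must have equal length")
    return sum(score_fn(a, b) for a, b in zip(s_prime, t_prime))


print(alignment_score("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))  # → 5
```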

Scoring functions
Usually based on how similar the two symbols are
Derived from confusion probabilities
In biology, chemically similar amino acids have lower penalties for substitution
In speech recognition, “p”→“b” costs less than “p”→“r”
The cost of indels depends on the application

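Comparing alignments
Using the simple scoring function (+1 match, -2 mismatch, -1 indel):
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
4 indels, 13 matches, 2 mismatches; score: 13(+1) + 2(-2) + 4(-1) = 5
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
12 indels, 5 matches, 6 mismatches; score: 5(+1) + 6(-2) + 12(-1) = -19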

Optimal alignment
An alignment which maximizes the overall alignment score is called optimal
Often, there is more than one optimal alignment for two strings
–Depends on the sophistication of the scoring function
The optimal alignment score can be used as a similarity value

Finding the optimal alignment
Simple algorithm: construct all possible alignments, score them, and pick the best
How many alignments are there for two strings of length n and m?
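The simple algorithm can be sketched directly for very short strings. The code below is an illustration under assumptions: it uses the simple scoring function (+1 match, -2 mismatch, -1 indel), it builds alignments position by position so that no position ever holds a gap in both strings, and all function names are hypothetical. Printing the number of alignments for a few small inputs gives a feel for how quickly the count grows.

```python
def all_alignments(s, t):
    """Yield every alignment (s_prime, t_prime) of s and t."""
    if not s and not t:
        yield "", ""
        return
    if s and t:  # align the first letters with each other
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:        # first letter of s faces a gap
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, "-" + b
    if t:        # first letter of t faces a gap
        for a, b in all_alignments(s, t[1:]):
            yield "-" + a, t[0] + b


def sigma(x, y):
    """Simple scoring function: +1 match, -2 mismatch, -1 indel."""
    if x == "-" or y == "-":
        return -1
    return 1 if x == y else -2


def score(s_prime, t_prime):
    return sum(sigma(a, b) for a, b in zip(s_prime, t_prime))


def best_alignment(s, t):
    """Brute force: score every alignment and keep one with the highest score."""
    return max(all_alignments(s, t), key=lambda pair: score(*pair))


print(best_alignment("GAT", "GT"))               # → ('GAT', 'G-T')
print(len(list(all_alignments("AB", "CD"))))     # → 13
```

This is only practical for toy inputs, which is the point of the question on the slide.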