# Sequence similarity.

## Presentation on theme: "Sequence similarity."— Presentation transcript:

Sequence similarity

Sequence Comparison

Much of bioinformatics involves sequences:

- DNA sequences
- RNA sequences
- Protein sequences

We can think of these sequences as strings of letters:

- DNA & RNA: alphabet of 4 letters
- Protein: alphabet of 20 letters

Sequence Comparison - Motivation
Nucleotide:

- Learn about evolutionary relationships
- Find genes, domains, signals …

Protein:

- Classify protein families (function, structure)
- Identify common domains (function, structure)

Calculation of an alignment score

How do we align two sequences?
The two sequences to align are ATTGCAGTGATCG and ATTGCGTCGATCG.

Solution 1 (10 matches, 3 mismatches):

    ATTGCAGTGATCG
    |||||   |||||
    ATTGCGTCGATCG

Solution 2 (12 matches, 2 gaps):

    ATTGCAGT-GATCG
    ||||| || |||||
    ATTGC-GTCGATCG

Which alignment is better?
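We will use a scoring scheme. Which alignment wins depends on the scheme chosen:

| Scoring scheme | Solution 1 (10 matches, 3 mismatches) | Solution 2 (12 matches, 2 gaps) |
|---|---|---|
| Match 1, mismatch -1, indel (gap) -2 | 10×1 + 3×(-1) = 7 | 12×1 + 2×(-2) = 8 |
| Match 1, mismatch 0, indel (gap) -2 | 10×1 + 3×0 = 10 | 12×1 + 2×(-2) = 8 |

With a mismatch score of -1, Solution 2 scores higher; with a mismatch score of 0, Solution 1 does.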
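The arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original deck: the function name `score_alignment` is hypothetical, and the default values (match 1, gap -2) are inferred from the sums shown on the slide.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2):
    """Score two equal-length aligned strings column by column ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap          # a letter paired with a null
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


solution_1 = ("ATTGCAGTGATCG", "ATTGCGTCGATCG")
solution_2 = ("ATTGCAGT-GATCG", "ATTGC-GTCGATCG")

print(score_alignment(*solution_1))              # 10*1 + 3*(-1) = 7
print(score_alignment(*solution_2))              # 12*1 + 2*(-2) = 8
print(score_alignment(*solution_1, mismatch=0))  # 10*1 + 3*0    = 10
print(score_alignment(*solution_2, mismatch=0))  # 12*1 + 2*(-2) = 8
```

Changing only the mismatch score flips which solution is preferred, which is the point of the slide.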

Scoring Alignments - intuition
Similar sequences evolved from a common ancestor. Evolution changed the sequences from this ancestral sequence by mutations:

- Replacement: one letter is replaced by another
- Deletion: a letter is deleted
- Insertion: a letter is inserted

Scoring of sequence similarity should examine how many such operations took place.

Causes for sequence (dis)similarity
- Mutation: a nucleotide at a certain location is replaced by another nucleotide (e.g., ATA → AGA)
- Insertion: at a certain location, one new nucleotide is inserted between two existing nucleotides (e.g., AA → AGA)
- Deletion: at a certain location, one existing nucleotide is deleted (e.g., ACTG → AC-G)
- Indel: an insertion or a deletion

Gaps

- Positions at which a letter is paired with a null are called gaps.
- Gap scores are typically negative.
- Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap.

Gap Opening

The gap-opening penalty defines the cost for opening a gap in one of the sequences. If you raise the gap-opening penalty above the default, local alignments that contain gaps may be split into several shorter alignments.

Affine Gap Penalties

In nature, a series of indels often comes as a single event rather than as a series of single-nucleotide events.

One gap of length 2 (this is more likely):

    ATA__GC
    ATATTGC

Two gaps of length 1 (this is less likely):

    ATAG_GC
    AT_GTGC

Normal scoring, with a fixed cost per gap position, would give the same score for both alignments. An affine gap penalty charges each gap an opening cost plus an extension cost that grows with its length:

Gap = Gapopen + Len * Gapextend
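The difference between per-position and affine gap scoring can be checked on the two alignments from this slide. The sketch below applies the formula Gap = Gapopen + Len * Gapextend to every run of gap characters. The penalty values (-4 to open, -1 to extend, -2 per position for the linear case) and the function names are illustrative assumptions, not values from the slides.

```python
import re


def gap_runs(aligned):
    """Lengths of the consecutive gap runs ('-' or '_') in one aligned row."""
    return [len(run) for run in re.findall(r"[-_]+", aligned)]


def linear_gap_penalty(top, bottom, per_position=-2):
    """Every gap position costs the same, regardless of how gaps are grouped."""
    return per_position * (sum(gap_runs(top)) + sum(gap_runs(bottom)))


def affine_gap_penalty(top, bottom, gap_open=-4, gap_extend=-1):
    """Apply Gapopen + Len * Gapextend to each gap run in both rows."""
    total = 0
    for row in (top, bottom):
        for length in gap_runs(row):
            total += gap_open + length * gap_extend
    return total


one_long_gap = ("ATA__GC", "ATATTGC")
two_short_gaps = ("ATAG_GC", "AT_GTGC")

print(linear_gap_penalty(*one_long_gap), linear_gap_penalty(*two_short_gaps))  # -4 -4
print(affine_gap_penalty(*one_long_gap), affine_gap_penalty(*two_short_gaps))  # -6 -10
```

Under these assumed values, the single two-residue gap is penalized less than two separate one-residue gaps, matching the intuition that one event is more likely than two.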

Gap penalties lead to:

- Increasing the penalties for gap opening and extension: the alignment will contain fewer gaps and more mismatches.
- Decreasing the penalties for gap opening and extension: the alignment will contain more gaps (of varied lengths) and fewer mismatches.
- Keeping the gap-opening penalty the same and increasing the gap-extension penalty: very long gaps will not be tolerated; they will be replaced with additional gaps of medium length and with mismatches.


Global alignment

A global alignment between two sequences is an alignment in which all the characters in both sequences participate. Because sequences that are similar over their whole length are also easily identified by local alignment methods, global alignment is now somewhat deprecated as a technique.

[Figure: schematic comparing global and local alignment]
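A global alignment score is commonly computed by dynamic programming. The sketch below uses the Needleman-Wunsch recurrence, which the slides do not name, with a simple per-position gap cost; the function name and default scores are assumptions chosen to match the earlier scoring example. It returns only the best score, not the alignment itself.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score of an end-to-end alignment of a and b (linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    return score[-1][-1]


print(global_alignment_score("ATTGCAGTGATCG", "ATTGCGTCGATCG"))              # 8
print(global_alignment_score("ATTGCAGTGATCG", "ATTGCGTCGATCG", mismatch=0))  # 10
```

For the two example sequences, the optimum found this way equals the better of the two hand-made solutions under each scoring scheme.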

Local alignment

Local alignment methods find related regions within sequences; the alignment can consist of a subset of the characters within each sequence. For example, a stretch of positions in sequence A might be aligned with positions 50-70 of sequence B. This is a more flexible technique than global alignment and has the advantage that related regions which appear in a different order in the two proteins can be identified as being related.
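Local alignment can be sketched with the same kind of dynamic programming, with one change: a cell is never allowed to drop below zero, so an alignment may start and end anywhere. This is the Smith-Waterman recurrence, which the slides do not name; the function name, the scores, and the test sequences below are illustrative assumptions.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over all alignments of a substring of a with a substring of b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh local alignment here
                prev[j - 1] + pair,  # extend diagonally
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# A shared 7-letter region (GATTACA) embedded in unrelated flanks.
print(local_alignment_score("CCCCGATTACACCCC", "TTTTGATTACATTTT"))  # 7
```

The unrelated flanking letters do not reduce the score, which is why local alignment suits sequences that share only a conserved segment.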

Global vs. Local [Figure: global and local alignment compared; credit: Jack Leunissen]

Global vs. Local

Use global alignment if:

- You expect, based on some biological information, that your sequences will match over their entire length.
- Your sequences are of similar length.

Use local alignment if:

- You expect that only certain parts of the two sequences will match (as in the case of a conserved segment that can be found in many different proteins).
- Your sequences are very different in length.
- You want to search a sequence database (we will talk about this in detail later).

Emboss [best solution] vs. Lalign (Embnet) [several solutions]

If two proteins share more than one common region, for example when one has a single copy of a particular domain while the other has two copies, a local alignment tool that presents only the best-scoring alignment may "miss" one of the two copies. Emboss reports the single best solution, whereas Lalign (Embnet) reports several solutions.

Comparing nucleotides

- Every match got the same score.
- Every mismatch got the same score.
- Gap penalties were ours to decide, but the defaults are usually good.

However, in the case of amino acids (aa):

- Not all matches are the same.
- Different mismatches get different scores.

Amino acid properties

- Serine (S) and Threonine (T) have similar physicochemical properties.
- Aspartic acid (D) and Glutamic acid (E) have similar properties.

=> Substitution of S/T or E/D occurs relatively often during evolution.

=> Substitution of S/T or E/D should result in scores that are only moderately lower than identities.
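A substitution matrix makes this concrete: each residue pair gets its own score. The sketch below uses a small subset of entries as commonly published for BLOSUM62 (only S, T, D and E); treat the values as illustrative and check them against the matrix file you actually use. The names `BLOSUM62_SUBSET`, `pair_score` and `score_ungapped` are hypothetical.

```python
# Subset of BLOSUM62 entries for S, T, D and E (illustrative; verify before use).
BLOSUM62_SUBSET = {
    ("S", "S"): 4, ("T", "T"): 5, ("D", "D"): 6, ("E", "E"): 5,
    ("S", "T"): 1, ("D", "E"): 2,
    ("S", "D"): 0, ("S", "E"): 0, ("T", "D"): -1, ("T", "E"): -1,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up the score for residues a and b; the matrix is symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def score_ungapped(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Sum pair scores over two equal-length sequences aligned without gaps."""
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(score_ungapped("STDE", "STDE"))  # identities:            4 + 5 + 6 + 5 = 20
print(score_ungapped("STDE", "TSED"))  # similar residues:      1 + 1 + 2 + 2 = 6
print(score_ungapped("STDE", "DEST"))  # dissimilar residues:   0 - 1 + 0 - 1 = -2
```

Note that identities themselves differ (S/S scores 4 while D/D scores 6), and that S/T and D/E substitutions score above the dissimilar pairs but below identities.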

So how can we score matches and mismatches?

Each aa is characterized by a combination of features (size, charge, etc.). The relative importance of each feature may vary according to the role of the aa in the 3-D structure and function of the protein.

Amino Acid Substitution Matrices

The PAM and BLOSUM substitution matrices describe the likelihood that two residue types would mutate to each other. These matrices are based on biological sequence information: the substitutions observed in structural (BLOSUM) or evolutionary (PAM) alignments of well-studied protein families. These scoring systems have a probabilistic foundation.
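The probabilistic foundation is usually expressed as a log-odds score: how much more often a pair is observed in trusted alignments than expected by chance. The sketch below assumes the half-bit scaling (2 × log2) associated with BLOSUM62; the frequencies are made-up numbers chosen for easy arithmetic, and `log_odds_score` is a hypothetical helper name.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds: scale * log2(observed pair frequency / chance expectation)."""
    return round(scale * math.log2(q_ij / (p_i * p_j)))


# Hypothetical background frequency of 1/16 for both residues,
# so the chance expectation for the pair is 1/256.
p = 1 / 16

print(log_odds_score(1 / 64, p, p))    # seen 4x more often than chance  ->  4
print(log_odds_score(1 / 256, p, p))   # seen exactly as often as chance ->  0
print(log_odds_score(1 / 1024, p, p))  # seen 4x less often than chance  -> -4
```

Positive scores therefore mark substitutions that evolution tolerates more often than chance would predict, and negative scores mark those it tends to reject.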

PAM series - Percent Accepted Mutation (accepted by natural selection)

- All the PAM data come from alignments of closely related proteins (>85% amino acid identity) from 71 protein families (total of protein sequences).
- PAM matrices are based on global sequence alignments; these include both highly conserved and highly mutable regions.
- Some of the protein families are: Ig kappa chain, Kappa casein, Lactalbumin, Hemoglobin a, Myoglobin, Insulin, Histone H4, Ubiquitin.

Various degrees of conservation

- PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, an average of one accepted change has occurred per 100 amino acids.
- Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred between two proteins per 100 amino acids. Because the same position can change more than once, two sequences at this distance still share some identical residues.
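Extrapolation from PAM1 is usually described as multiplying the PAM1 mutation probability matrix by itself, so that PAM250 corresponds to the 250th power. The sketch below shows the idea on a hypothetical 3-letter alphabet with made-up probabilities (99% of positions unchanged per PAM unit); real PAM matrices are 20×20 and are then converted to log-odds scores.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy "PAM1": row i holds P(letter i becomes letter j) after one PAM unit.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print(round(toy_pam250[0][0], 3))  # chance a letter is unchanged: about 0.35 here
```

Even after 250 units the diagonal stays above the chance level of 1/3 for this toy alphabet, which illustrates why distant sequences can retain detectable similarity.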


The BLOSUM Family of Matrices

Blocks Substitution Matrices (Henikoff and Henikoff, 1992)

- Blocks are short conserved patterns, 3-60 aa long.
- Proteins can be divided into families by common blocks.
- Different BLOSUM matrices emerge by looking at sequences with different identity percentages.
- Example: BLOSUM62 is derived from blocks in which sequences sharing 62% identity or more are clustered and weighted as a single sequence.

[Figure: blocks A, B, C and D]

The Blocks Database Gapless alignment blocks

Blosum62 scoring matrix

Summary:

- BLOSUM matrices are based on the replacement patterns found in the more highly conserved regions of the sequences, without gaps.
- PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions.

PAM versus BLOSUM

| PAM | BLOSUM |
|---|---|
| Based on an explicit evolutionary model | Based on empirical frequencies |
| Derived from small, closely related proteins with ~15% divergence | Uses a much larger, more diverse set of protein sequences (30-90% ID) |
| Higher PAM numbers to detect more remote sequence similarities | Lower BLOSUM numbers to detect more remote sequence similarities |
| Errors in PAM 1 are scaled 250X in PAM 250 | Errors in BLOSUM arise from errors in alignment |

Guidelines

- Lower PAMs and higher BLOSUMs find short local alignments of highly similar sequences.
- Higher PAMs and lower BLOSUMs find longer, weaker local alignments.
- No single matrix answers all questions.

Guidelines

- BLOSUM is generally better than PAM for local alignments.
- The default matrix is often the identity matrix for DNA and BLOSUM62 for proteins.
- When using BLOSUM80 instead of BLOSUM45, local alignments tend to be shorter.
- Low PAMs have the same effects as high BLOSUMs.
- The BLOSUM number indicates a percent identity, while the PAM number is proportional to the percent of accepted mutations.