Sequence Alignment. Abhishek Niroula, Department of Experimental Medical Science, Lund University. 2015-12-09

Sequence alignment
A way of arranging two or more sequences to identify regions of similarity
Shows the locations of similarities and differences between the sequences
An 'optimal' alignment exhibits the most similarities and the fewest differences
Aligned residues correspond to the same original residue in the common ancestor of the sequences
Insertions and deletions are represented by gaps in the alignment
Examples (* marks identical columns)
Protein sequence alignment
  MSTGAVLIY--TSILIKECHAMPAGNE-----
  ---GGILLFHRTHELIKESHAMANDEGGSNNS
     *  *    *  **** ***
Nucleotide sequence alignment
  attcgttggcaaatcgcccctatccggccttaa
  att---tggcggatcg-cctctacgggcc----
  ***   ****  **** **    * ****

Sequence alignment: Purpose
Reveal structural, functional and evolutionary relationships between biological sequences
Similar sequences may have similar structure and function
Similar sequences are likely to share a common ancestral sequence
Annotation of new sequences
Modelling of protein structures
Design and analysis of gene expression experiments

Sequence alignment: Types
Global alignment
–Aligns each residue in each sequence by introducing gaps
–Example: Needleman-Wunsch algorithm
  LGPSSKQTGKGS-SRIWDN
  LN-ITKSAGKGAIMRLGDA

Sequence alignment: Types
Local alignment
–Finds regions with the highest density of matches locally
–Example: Smith-Waterman algorithm
  -------TGKG--------
  -------AGKG--------

Sequence alignment: Scoring
Scoring matrices assign a score to each comparison of a pair of characters
Identities and substitutions by similar amino acids are assigned positive scores
Mismatches, or matches that are unlikely to have been a result of evolution, are given negative scores
Example: identities score +5 and mismatches score -5
  ACDEFGHIK
  ACYEFGRIK
Alternative gapped alignments of the same two sequences
  Option 1
  TACGGGCAG
  -AC-GGC-G
  Option 2
  TACGGGCAG
  -ACGG-C-G
  Option 3
  TACGGGCAG
  -ACG-GC-G
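The scoring example above can be sketched in a few lines. This is a minimal illustration that assumes a flat scheme (every identity +5, every mismatch -5) rather than a full substitution matrix; the function name `score_ungapped` is made up for this sketch.

```python
def score_ungapped(seq_a, seq_b, match=5, mismatch=-5):
    """Sum per-column scores for two equal-length sequences without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(match if a == b else mismatch for a, b in zip(seq_a, seq_b))


# 7 identities and 2 mismatches (D/Y and H/R): 7*5 + 2*(-5) = 25
total = score_ungapped("ACDEFGHIK", "ACYEFGRIK")
print(total)
```

A real substitution matrix would replace the two constants with a lookup per residue pair.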

Sequence alignment: Scoring
PAM matrices
–PAM: Point Accepted Mutation (also expanded as Percent Accepted Mutations)
–A PAM matrix gives the probability that a given amino acid will be replaced by any other amino acid
–An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection
–Derived from global alignments of closely related sequences
–The number after PAM (PAM40, PAM100) refers to the evolutionary distance (greater numbers mean greater distances)
–The 1-PAM matrix corresponds to the amount of evolution that would change 1% of the residues/bases (on average)
–The 2-PAM matrix does NOT correspond to a change in 2% of the residues
  It applies 1-PAM twice
  Some variations may change back to the original residue
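The point about 2-PAM can be checked numerically. The sketch below uses a toy 4-state matrix in which 1% of positions change per step, spread evenly over the other three states; this uniform matrix is an assumption for illustration only, not the real 20-amino-acid PAM-1 matrix.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def toy_pam1(n_states=4, change=0.01):
    """Toy 1-PAM matrix: each state changes with probability `change`."""
    off = change / (n_states - 1)
    return [[1.0 - change if i == j else off for j in range(n_states)]
            for i in range(n_states)]


pam1 = toy_pam1()
pam2 = matmul(pam1, pam1)          # 2-PAM = 1-PAM applied twice
changed_after_2 = 1.0 - pam2[0][0]
print(f"{changed_after_2:.6f}")    # slightly below 0.02
```

The result falls just short of 2% because some positions that changed in the first step change back in the second.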

PAM-1 [figure]

Sequence alignment: Scoring
BLOSUM matrices
–BLOSUM: Blocks Substitution Matrix
–The score for each pair of residues is derived from the observed frequencies of substitutions in blocks of local alignments of protein sequences [Henikoff & Henikoff]
–For example, BLOSUM62 is derived from sequence alignments with no more than 62% identity
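As a worked formula, BLOSUM scores are usually described as rounded log-odds ratios. The symbols below are introduced here for illustration and do not appear on the slide: p_ij is the observed frequency of residues i and j aligned in the blocks, and q_i, q_j are the background frequencies of each residue. The factor 2 reflects the half-bit units commonly quoted for BLOSUM62.

```latex
s_{ij} = \operatorname{round}\!\left( 2 \log_{2} \frac{p_{ij}}{q_i \, q_j} \right)
```

A pair seen more often than expected by chance (p_ij > q_i q_j) gets a positive score; a pair seen less often gets a negative score.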

BLOSUM62 [figure]

Which scoring matrix to use?
For global alignments, use PAM matrices
–Lower PAM matrices tend to find short alignments of highly similar regions
–Higher PAM matrices find weaker, longer alignments
For local alignments, use BLOSUM matrices
–BLOSUM matrices with a HIGH number are better for similar sequences
–BLOSUM matrices with a LOW number are better for distant sequences

Sequence alignment: Methods
Pairwise alignment
–Finding the best alignment of two sequences
–Often used to search sequence databases for the most similar sequences
–Methods: dot matrix analysis, dynamic programming (DP), short word matching
Multiple Sequence Alignment (MSA)
–Alignment of more than two sequences
–Often used to find conserved domains, regions or sites among many sequences
–Methods: dynamic programming, progressive methods, iterative methods
Structural alignments
–Alignments based on structure

Dot matrix
Method for comparing two amino acid or nucleotide sequences
Let's align two sequences using a dot matrix
  A: A G C T A G G A
  B: G A C T A G G C
–Sequence A is placed along the X-axis and sequence B along the Y-axis
    A G C T A G G A   (Sequence A)
  G
  A
  C
  T
  A
  G
  G
  C
  (Sequence B)

Dot matrix
–Starting from the first nucleotide in B, move along the first row, placing a dot in each column with a matching nucleotide
–Repeat the procedure for all the nucleotides in B
–A region of similarity is revealed by a diagonal row of dots
–Other, isolated dots represent random matches
    A G C T A G G A   (Sequence A)
  G . ● . . . ● ● .
  A
  C
  T
  A
  G
  G
  C
  (Sequence B)

Dot matrix (first two rows filled)
    A G C T A G G A   (Sequence A)
  G . ● . . . ● ● .
  A ● . . . ● . . ●
  C
  T
  A
  G
  G
  C
  (Sequence B)

Dot matrix (all rows filled)
    A G C T A G G A   (Sequence A)
  G . ● . . . ● ● .
  A ● . . . ● . . ●
  C . . ● . . . . .
  T . . . ● . . . .
  A ● . . . ● . . ●
  G . ● . . . ● ● .
  G . ● . . . ● ● .
  C . . ● . . . . .
  (Sequence B)
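The dot matrix procedure can be sketched directly: one row per residue of sequence B, one column per residue of sequence A, and a dot wherever the two residues are identical. The function names `dot_matrix` and `render` are illustrative, and no window or threshold filtering is applied.

```python
def dot_matrix(seq_a, seq_b):
    """Rows follow seq_b, columns follow seq_a; True marks an identical pair."""
    return [[b == a for a in seq_a] for b in seq_b]


def render(seq_a, seq_b, dot="●", empty="."):
    """Format the dot matrix as text with sequence A across the top."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + " ".join(dot if hit else empty for hit in row))
    return "\n".join(lines)


grid = dot_matrix("AGCTAGGA", "GACTAGGC")
print(render("AGCTAGGA", "GACTAGGC"))
```

For these two sequences the run of dots on the main diagonal (C, T, A, G, G) is the region of similarity; the remaining dots are chance matches.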

Dot matrix [figure panels: two similar, but not identical, sequences; an insertion or deletion; a tandem duplication]

Dot matrix [figure panels: an inversion; joining sequences]

Limitations of dot matrix
Sequences with low-complexity regions give false diagonals
–Low-complexity regions are sequence regions with little diversity
Noisy and space inefficient
Limited to 2 sequences

Dotplot exercise
Use the following three tools to generate dot plots for the given two sequences
YASS: genomic similarity search tool
–http://bioinfo.lifl.fr/yass/yass.php
Lalign/Palign
–http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
multi-zPicture
–http://zpicture.dcode.org/

Dynamic programming
Breaks down the alignment problem into smaller problems
Examples
–Needleman-Wunsch algorithm: global alignment
–Smith-Waterman algorithm: local alignment
Three steps
–Initialization
–Scoring
–Traceback
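The three steps can be sketched for global alignment as follows. This is a minimal Needleman-Wunsch sketch with a linear gap penalty, not a production implementation; the default scores (match +2, mismatch -1, gap -1) are the values used for the Needleman-Wunsch comparison elsewhere in these slides, and the tie-breaking order in the traceback (diagonal, then up, then left) is an arbitrary choice made here.

```python
def needleman_wunsch(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1

    # Initialization: first row and column hold cumulative gap penalties.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Scoring: each cell is the best of diagonal, up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback: walk from the bottom-right cell back to the top-left cell.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("AGTGCA", "AGTTA")
print(best)
print(row_a)
print(row_b)
```

Several alignments can share the optimal score; which one is printed depends on the tie-breaking order.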

Gap penalties
Gaps are inserted into the alignment
Gaps should be penalized
Gap opening should be penalized more heavily than gap extension (or at least equally)
With BLOSUM62
–Gap opening score = -11
–Gap extension score = -1
Gap extension (one gap of length 2)
  AAAGAGAAA
  AAA--AAAA
Gap initiation (two gaps of length 1)
  AAAGAGAAA
  AAA-A-AAA
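The difference between the two gapped examples can be made concrete with an affine gap score. The sketch assumes one particular convention, which the slide does not specify: each run of gap characters is charged the opening score once plus the extension score for every position in the run. Other tools charge the opening score for the first position and extension only for the rest, which shifts the totals but not the ranking.

```python
import re


def gap_penalty(aligned_row, gap_open=-11, gap_extend=-1):
    """Total affine gap score for one aligned row.

    Assumed convention: gap_open once per run of '-', plus gap_extend
    for each position in that run.
    """
    runs = re.findall(r"-+", aligned_row)
    return sum(gap_open + gap_extend * len(run) for run in runs)


one_long_gap = gap_penalty("AAA--AAAA")     # one gap of length 2
two_short_gaps = gap_penalty("AAA-A-AAA")   # two gaps of length 1
print(one_long_gap, two_short_gaps)         # -13 -24
```

Under this scheme a single longer gap is cheaper than two separate gaps, which is the behaviour the slide asks for.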

Needleman-Wunsch vs Smith-Waterman
Needleman-Wunsch
–Match = +2
–Mismatch = -1
–Gap = -1
       -   A   G   T   T   A
   -   0  -1  -2  -3  -4  -5
   A  -1   2
   G  -2
   T  -3
   G  -4
   C  -5
   A  -6
Smith-Waterman
–Match = +2
–Mismatch = -1
–Gap = -1
–All negative values are replaced by 0
–Traceback starts at the highest value and ends at 0
       -   A   G   T   T   A
   -   0   0   0   0   0   0
   A   0   2
   G   0
   T   0
   G   0
   C   0
   A   0
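The two Smith-Waterman rules (negative values replaced by 0, traceback from the highest value until a 0 is reached) can be sketched as below, using the same scores and the same two sequences as the matrices above. The function name and the choice to keep the first cell that reaches the maximum are assumptions of this sketch.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (matrix, best_score, aligned_a, aligned_b) for a local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            # Negative values are replaced by 0.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Traceback starts at the highest value and stops at the first 0.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score, best, "".join(reversed(out_a)), "".join(reversed(out_b))


matrix, best_local, local_a, local_b = smith_waterman("AGTGCA", "AGTTA")
print(best_local, local_a, local_b)   # 6 AGT AGT
```

Unlike the global alignment, the result covers only the best-matching region and leaves the rest of both sequences unaligned.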

Needleman-Wunsch vs Smith-Waterman
Sequence alignment teacher (http://melolab.org/websoftware/web/?sid=3)

Dynamic programming: example
http://www.avatar.se/molbioinfo2001/dynprog/dynamic.html
Scoring
–Match = +2
–Mismatch = -2
–Gap = -1

Dynamic programming exercise
Generate a scoring matrix for nucleotides (A, C, G, and T)
Align two sequences using dynamic programming
Align two sequences using the following tools
–EMBOSS Needle http://www.ebi.ac.uk/Tools/psa/emboss_needle/
–EMBOSS Water http://www.ebi.ac.uk/Tools/psa/emboss_water/

Multiple sequence alignment
A multiple sequence alignment (MSA) is an alignment of three or more sequences
Why MSA?
–To identify patterns of conservation across more than 2 sequences
–To characterize protein families and generate profiles of protein families
–To infer relationships within and among gene families
–To predict secondary and tertiary structures of new sequences
–To perform phylogenetic studies

Recall: dynamic programming [figures: 2 sequences; 3 sequences]
Source: http://ai.stanford.edu/~serafim/CS262_2005/LectureNotes/Lecture17.pdf

MSA methods
Dynamic programming
–Align each pair of sequences
–Sum scores for each pair at each position
Progressive sequence alignment
–Hierarchical or tree-based method
–E.g. ClustalW, T-Coffee
Iterative sequence alignment
–Improved progressive alignment
–Realigns the sequences repeatedly
–E.g. MUSCLE

Tools for MSA [figure]

ClustalW
Progressive sequence alignment
Basic steps
–Calculate pairwise distances based on pairwise alignments between the sequences
–Build a guide tree, which is an inferred phylogeny for the sequences
–Align the sequences
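The first step, pairwise distances, can be sketched with a simple identity-based distance. This is a simplified illustration and not ClustalW's own procedure: it assumes the rows are already aligned to a common length, uses the fraction of differing residues over ungapped columns as the distance, and the three toy sequences are invented for the example.

```python
from itertools import combinations


def p_distance(row_a, row_b):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_table(aligned):
    """Distance for every pair of named, pre-aligned sequences."""
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m, n in combinations(sorted(aligned), 2)}


toy = {"seq1": "ACGTACGT", "seq2": "ACGTACGA", "seq3": "TCGAAC-T"}  # invented data
table = distance_table(toy)
closest = min(table, key=table.get)
print(closest, table[closest])   # ('seq1', 'seq2') 0.125
```

The closest pair is the natural first join when building the guide tree, and the remaining sequences are added in order of increasing distance.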

Progressive MSA [figure: sequences 1 to 5 joined step by step along a guide tree]

MUSCLE
Iterative sequence alignment
Follows 3 steps
–Progressive alignment
–Second progressive alignment
–Refinement

Phylogenetic tree
A phylogenetic tree shows evolutionary relationships between the sequences
Types:
–Rooted
  Nodes represent the most recent common ancestor
  Edge lengths represent time estimates
–Unrooted
  No ancestry and no time estimates
Algorithms to generate a phylogenetic tree
–Neighbor-joining
–Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
–Maximum parsimony
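Of the algorithms listed, UPGMA is the simplest to sketch: repeatedly join the two closest clusters and replace them with one cluster whose distance to every other cluster is the size-weighted average of the two. The sketch below returns only the tree topology as nested tuples and omits branch lengths; the four-taxon distance matrix is invented for illustration.

```python
def upgma(names, matrix):
    """Build a rooted tree (nested tuples) from a symmetric distance matrix."""
    clusters = [(name, 1) for name in names]    # (subtree, number of leaves)
    dist = [row[:] for row in matrix]
    while len(clusters) > 1:
        n = len(clusters)
        # Find the closest pair of clusters (first minimum wins on ties).
        i, j = min(((p, q) for p in range(n) for q in range(p + 1, n)),
                   key=lambda pq: dist[pq[0]][pq[1]])
        (tree_i, size_i), (tree_j, size_j) = clusters[i], clusters[j]
        keep = [k for k in range(n) if k not in (i, j)]
        # Size-weighted average distance from the merged cluster to the others.
        new_row = [(dist[i][k] * size_i + dist[j][k] * size_j) / (size_i + size_j)
                   for k in keep]
        dist = [[dist[r][c] for c in keep] for r in keep]
        for row, value in zip(dist, new_row):
            row.append(value)
        dist.append(new_row + [0.0])
        clusters = [clusters[k] for k in keep] + [((tree_i, tree_j), size_i + size_j)]
    return clusters[0][0]


names = ["A", "B", "C", "D"]
matrix = [[0, 2, 6, 10],      # invented distances
          [2, 0, 6, 10],
          [6, 6, 0, 10],
          [10, 10, 10, 0]]
tree = upgma(names, matrix)
print(tree)   # ('D', ('C', ('A', 'B')))
```

UPGMA assumes a constant rate of change along all branches; neighbor-joining drops that assumption and produces an unrooted tree.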

Neighbor joining method
http://en.wikipedia.org/wiki/Neighbor_joining

MSA exercise
Align the protein sequences SET 1 and SET 2 using MSA tools and compare the alignments
Clustalw2
–http://www.ebi.ac.uk/Tools/msa/clustalw2/
MUSCLE
–http://www.ebi.ac.uk/Tools/msa/muscle/

What to align: DNA or protein sequence?
Many mismatches in DNA sequences are synonymous
DNA sequences contain non-coding regions, which should be avoided in homology searching
Matches are more reliable in protein sequences
–Probability of a match occurring randomly at any position in a sequence
  Amino acids: 1/20 = 0.05
  Nucleotides: 1/4 = 0.25
Searching at protein level: in case of frameshifts, the alignment score for the protein sequences may be very low even though the DNA sequences are similar
  ACTTTTCATGGG...   ThrPheHisGly...
  ACTTTTTCATGGG..   ThrPheSerTrp
If an ORF exists, then always align at protein level
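The frameshift example can be reproduced with a small translation sketch. The codon table below is deliberately partial and covers only the six codons that occur in the two example sequences; a real translation would need the full genetic code.

```python
# Partial codon table: only the codons used in the example above.
CODONS = {"ACT": "Thr", "TTT": "Phe", "CAT": "His",
          "GGG": "Gly", "TCA": "Ser", "TGG": "Trp"}


def translate(dna):
    """Translate complete codons in reading frame 1; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return [CODONS[dna[i:i + 3]] for i in range(0, usable, 3)]


original = translate("ACTTTTCATGGG")
shifted = translate("ACTTTTTCATGGG")   # one extra T inserted
print("".join(original))   # ThrPheHisGly
print("".join(shifted))    # ThrPheSerTrp
```

The two DNA sequences differ by a single inserted base, yet every amino acid after the insertion point changes.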

Searching bioinformatics databases using keywords and sequences

Search strategy
Keyword search
–Find information related to specific keywords
–Each bioinformatics database has its own search tool
–Some search tools cover a wide spectrum: they access multiple databases and gather the results together
–E.g. Gquery, EBI search
Sequence search
–Use a sequence of interest to find more information about the sequence
–E.g. BLAST, FASTA

Keyword search
Find information related to specific keywords
Gquery
–A central search tool to find information in NCBI databases
–Searches a large number of NCBI databases and shows the results on one page
–http://www.ncbi.nlm.nih.gov/gquery
EBI search
–Search tool to find information in databases developed, managed and hosted by EMBL-EBI
–http://www.ebi.ac.uk/services

Gquery [figure]

EBI search [figure]

Limitations
Synonyms
Misspellings
Old and new names/terms
NOTES:
–Use different synonyms and read the literature to find more appropriate keywords
–Use Boolean operators to combine different keywords
–Do not expect to find all the information using keyword search alone
–Note the database version, or the version of the entries, in the databases you used
[Table residue: hit counts in PubMed and ClinVar for ELA2 vs ELANE and for HIV 1 vs HIV-1; values as extracted: 8 64110 59 20]
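Combining synonyms with Boolean operators can be sketched as plain string building. The helper `or_query` is hypothetical, and the quoting rule (wrap terms that contain a space or hyphen in double quotes) is an assumption; each search tool documents its own query syntax.

```python
def or_query(*synonyms):
    """Join alternative names for the same thing with OR, quoting multi-part terms."""
    quoted = [f'"{term}"' if " " in term or "-" in term else term
              for term in synonyms]
    return "(" + " OR ".join(quoted) + ")"


gene_query = or_query("ELA2", "ELANE")       # old and new gene symbol
virus_query = or_query("HIV 1", "HIV-1")     # spelling variants
print(gene_query)    # (ELA2 OR ELANE)
print(virus_query)   # ("HIV 1" OR "HIV-1")
```

Searching with the combined query helps to avoid missing entries that use only the older symbol or only one spelling.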

Gene nomenclature
HUGO Gene Nomenclature Committee (HGNC)
–Assigns standardized nomenclature to human genes
–Each symbol is unique and each gene is given only one name
Species-specific nomenclature committees
–Mouse Genome Informatics Database http://www.informatics.jax.org/mgihome/nomen/
–Rat Genome Database http://rgd.mcw.edu/nomen/nomen.shtml

HGNC symbol report
Approved symbol
Approved name
Synonyms
–Terms used in the literature to indicate the gene
–HGNC, Ensembl, Entrez Gene, OMIM
Previous symbols and names
–Previous HGNC approved symbol
NOTE: HGNC does not approve protein names. Usually genes and proteins have the same name, and gene names are written in italics.

HGNC search [figure]

Keyword search exercise
