MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.

1 MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram

2 Outline
- Pairwise alignment
  - Global/local, scoring
  - BLAST, BLAT, SIM, LALIGN, Dotlet, Ublast
- Multiple sequence alignment
  - ClustalW, Kalign, MAFFT, Muscle, T-Coffee, MSA, DIALIGN, Match-Box, Multalin, MUSCA
- Phylogenetic analysis and tree construction
  - BIONJ, DendroUPGMA, PHYLIP, PhyML, Phylogeny.fr, POWER, BlastO, TraceSuite II
- HMM
  - Protein family profiles
http://expasy.org/tools/

3 Alignment
- Insert gaps (spaces) at arbitrary positions so that both sequences end up with the same length and no column holds a gap in both sequences.
- Find the arrangement of the two sequences that best identifies regions of similarity.

4 Alignment methods: Dot plots

5 Global vs local alignment
- Global alignment: an alignment that assumes the two sequences are similar over their entire length.
- Local alignment: an alignment that searches for segments of the two sequences that match well.
- It may seem that one should always use local alignments. However, each has its application.
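The difference between the two modes can be sketched with a small dynamic-programming scorer. This is only an illustration: the match, mismatch, and gap values are assumptions, not the scoring used in the slides, and a real tool would use a substitution matrix.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) score.

    The scoring values are illustrative assumptions.
    """
    # prev[j] is the best score for the previous row (a[:i-1] versus b[:j]).
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            curr.append(cell)
        prev = curr
    return best if local else prev[-1]


motif = "GCQSQCGG"
longer = "AAAA" + motif + "TTTT"
print("local :", alignment_score(motif, longer, local=True))
print("global:", alignment_score(motif, longer))
```

When a short motif is embedded in a longer, otherwise unrelated sequence, the local score rewards the shared segment, while the global score is dragged down by the gaps needed to cover the full length.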

6 Substitution matrices http://www.russelllab.org/aas/

7 Scoring an alignment

8 Global alignment
S1=HGSAQVKGHG
S2=KTEAEMKASEDLKKHGT

9
KTEAEMKAESEDLKKHGT
--HG--SA--Q-VKGHG-
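A gapped global alignment like the one above can be produced with the Needleman-Wunsch algorithm. The sketch below uses the two sequences from the previous slide; the simple match/mismatch/gap scores are assumptions chosen for brevity, so the gap placement it reports may differ from the alignment shown on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not a substitution matrix.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


s1, s2 = "HGSAQVKGHG", "KTEAEMKASEDLKKHGT"
total, aligned_s2, aligned_s1 = needleman_wunsch(s2, s1)
print(aligned_s2)
print(aligned_s1)
print("score:", total)
```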

10 Local alignment

11 How BLAST works
- BLAST uses pre-indexed databases
  - It remembers the location of every 'word' of each database entry
- Identify high-scoring segment pairs (HSPs)
  - Default word lengths: 11 bp or 3 aa
  - When two non-overlapping words within a certain distance of each other in the query are matched against a database entry, the region of the two sequences is called a segment pair.
  - Slide query and target sequences across each other until the maximum number of HSPs for that target is found
  - Each segment pair is extended until the score drops by X below its maximum value
- Score the alignment
  - A scoring matrix is used
  - Gaps introduced between HSPs during sliding get a negative score
  - A match gets a positive score
  - The total alignment score is subjected to statistical analysis to calculate the significance of the score versus chance
- Repeat for every sequence in the database
- Return total results
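The word-indexing and extension steps above can be sketched as follows. This is a heavily simplified illustration, not BLAST itself: it seeds on single exact word matches, scores with an assumed +1/-1 identity scheme instead of a substitution matrix, extends without gaps, and skips the statistics. The function names and parameters are hypothetical.

```python
from collections import defaultdict


def index_words(seq, k):
    """Record the start position of every k-letter word in a database entry."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def extend_hit(query, subject, q, s, k, match=1, mismatch=-1, x_drop=3):
    """Extend a word hit in both directions, without gaps, until the running
    score falls x_drop below the best score seen so far."""
    def walk(qpos, spos, step):
        score = best = best_len = steps = 0
        while 0 <= qpos < len(query) and 0 <= spos < len(subject):
            score += match if query[qpos] == subject[spos] else mismatch
            steps += 1
            if score > best:
                best, best_len = score, steps
            elif best - score >= x_drop:
                break
            qpos += step
            spos += step
        return best, best_len

    right, right_len = walk(q + k, s + k, 1)
    left, left_len = walk(q - 1, s - 1, -1)
    # (score, query start, subject start, segment length)
    return k * match + left + right, q - left_len, s - left_len, k + left_len + right_len


def best_hsp(query, subject, k=3, **scoring):
    """Seed on every shared word and return the highest-scoring extension."""
    index = index_words(subject, k)
    hsps = [extend_hit(query, subject, q, s, k, **scoring)
            for q in range(len(query) - k + 1)
            for s in index.get(query[q:q + k], [])]
    return max(hsps, default=None)


query = "ILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGG"
subject = "VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG"
score, q_start, s_start, length = best_hsp(query, subject, k=3)
print(score, query[q_start:q_start + length])
print(score, subject[s_start:s_start + length])
```

With a small `x_drop` the extension stops at the first run of mismatches, so the reported segment is shorter than the gapped alignment BLAST returns for the same pair.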

12 How BLAST works
Query:              MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG
Subject (database): VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG
Common 3-mer, extended: GCQSQCGG

HSP
Score = 66.6 bits (161), Expect = 3e-12, Method: Compositional matrix adjust.
Identities = 32/53 (60%), Positives = 39/53 (74%), Gaps = 0/53 (0%)

Query  6   ILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGG  58
           ++   L   SY  QCG++AGGALC   LCCS++G+CGST  YCG GCQSQCGG
Sbjct  15  VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG  67

13 Types of BLAST
Nucleic sequence: atcgatatatatagactgactgact
Protein sequence: MTAVYHILRALRARARVARARVH

Program   Query                                Database
blastn    nucleic acid                         nucleic acid sequences
blastp    protein                              protein sequences
blastx    nucleic acid (6-frame translation)   protein sequences
tblastn   protein                              nucleic acid sequences (6-frame translation)
tblastx   nucleic acid (6-frame translation)   nucleic acid sequences (6-frame translation)
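The 6-frame translation used by blastx, tblastn, and tblastx can be sketched as below, using the nucleic sequence from the slide. This assumes the standard genetic code and unambiguous A/C/G/T input; the function names are hypothetical, and codons containing other letters are translated to "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate complete codons starting at the first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def reverse_complement(dna):
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Three forward frames (+1..+3) and three reverse-strand frames (-1..-3)."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("atcgatatatatagactgactgact").items():
    print(name, protein)
```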

15 Exact multiple alignment by dynamic programming
- Complexity = $O(N^S \cdot 2^S \cdot S^2)$
- N: length of sequences
- S: number of sequences
- Only feasible for 4-5 sequences max.
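A quick calculation shows why the exact method stops being practical so early. The sketch below simply evaluates N^S * 2^S * S^2 as an operation count, ignoring constant factors; the sequence length of 100 is an assumed example value.

```python
def exact_msa_cost(n, s):
    """Rough operation count for exact multiple alignment: N^S * 2^S * S^2."""
    return n ** s * 2 ** s * s ** 2


for s in range(2, 7):
    print(f"S={s}: about {exact_msa_cost(100, s):.1e} operations")
```

Each additional sequence multiplies the cost by more than 2N, so even short sequences become intractable within a handful of inputs.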

17 Neighbor Joining

18 Unrooted NJ tree

19 Comparison of multiple sequence alignment programs

20 Primary sequence changes:

21 Profiles
Sequence: CGGSV
0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031
ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
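The same calculation can be written as a short function. The profile below is hypothetical: it contains only the five residue probabilities quoted on the slide, and the `floor` value for residues missing from a column is an assumption added so the logarithm stays defined.

```python
import math

# Hypothetical 5-column profile holding only the probabilities from the slide.
profile = [{"C": 0.8}, {"G": 0.4}, {"G": 0.8}, {"S": 0.6}, {"V": 0.2}]


def profile_scores(seq, profile, floor=1e-6):
    """Return (probability, log-probability) of seq under a position profile."""
    probs = [column.get(residue, floor) for residue, column in zip(seq, profile)]
    return math.prod(probs), sum(math.log(p) for p in probs)


prob, log_prob = profile_scores("CGGSV", profile)
print(f"probability = {prob:.3f}, log score = {log_prob:.2f}")
```

Summing logarithms avoids the numerical underflow that multiplying many small probabilities causes for longer sequences.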

22 Hidden Markov Models
Assumptions:
- Observations are ordered
- The random process can be represented by a stochastic finite state machine with emitting states
Probabilistic parameters of a hidden Markov model:
- x: states
- y: possible observations
- a: state transition probabilities
- b: output/emission probabilities

23 HMM estimation, usage & applications
Training/Estimation
- Feed an architecture (given in advance) a set of observation sequences
- The training process will iteratively alter its parameters to fit the training set
- The trained model will assign the training sequences high probabilities
Usage
- Evaluate the probability of an observation sequence given the model (Forward)
- Find the most likely path through the model for a given observation sequence (Viterbi)
Applications
- Gene finding
- Protein family modeling
- …
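The two usage modes can be sketched on a toy model. Everything about the model below is an assumption made for illustration: a two-state HMM ("H" for GC-rich, "L" for AT-rich) with invented start, transition, and emission probabilities. Probabilities are multiplied directly, which is fine for short sequences; longer ones would need log space or scaling.

```python
def forward(obs, states, start, trans, emit):
    """Probability of the observations given the model, summed over all paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())


def viterbi(obs, states, start, trans, emit):
    """Probability and state sequence of the single most likely path."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new = {}
        for s in states:
            prob, path = max(((best[r][0] * trans[r][s], best[r][1]) for r in states),
                             key=lambda item: item[0])
            new[s] = (prob * emit[s][o], path + [s])
        best = new
    return max(best.values(), key=lambda item: item[0])


# Toy parameters, invented for this example.
states = ("H", "L")
start = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.5, "L": 0.5}, "L": {"H": 0.4, "L": 0.6}}
emit = {"H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

observed = "GGCACTGAA"
print("P(observations) =", forward(observed, states, start, trans, emit))
print("best path       =", "".join(viterbi(observed, states, start, trans, emit)[1]))
```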

24 Profile HMMs
- Families of functional biological sequences
  - Primary sequences have diverged due to evolution, while maintaining structure/function.
- Questions:
  - Does a biological sequence belong to a certain protein family? For example, is a given protein (sequence) a globin?
  - Given a set of sequences, find more sequences of the same family

26 Trade-offs
Advantages        Disadvantages
Statistics        State independence
Modularity        Over-fitting
Transparency      Local maxima
Prior knowledge   Speed

27 Questions?

