Sequence Comparison. Intragenic: self to self; find internal repeating units. Intergenic: compare two different sequences. Dotplot: visual alignment.

1 Sequence Comparison
• Intragenic: self to self; find internal repeating units.
• Intergenic: compare two different sequences.
• Dotplot: visual alignment of two sequences.
• Multiple Sequence Alignment: two or more sequences.

2 Overview
• Why compare sequences
• Homology vs. identity/similarity
• DotPlots
• Scoring
  - Match
  - Mismatch
  - Gap penalty
• Global vs. local alignment
• Do the results make biological sense?

8 Why Align Sequences
• Identify conserved sequences
  - Identify elements that repeat in a single sequence.
  - Identify elements conserved between genes.
  - Identify elements conserved between species.
    - Regulatory elements
    - Functional elements

9 Underlying Hypothesis?

11 EVOLUTION: Based upon conservation of sequence during evolution, we can infer function.

12 Basic terms:
• Similarity: a measurable quantity.
  - Similarity: applied to proteins using the concept of conservative substitutions.
  - Identity: a percentage.
• Homology: a specific term indicating relationship by evolution.
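
Because identity and similarity are measurable quantities, they can be computed directly from a pair of aligned sequences. The following is a minimal sketch, not a standard tool: the function names and the `CONSERVATIVE_GROUPS` table are assumptions made for illustration, and it uses one common convention of counting only columns where neither sequence has a gap.

```python
# Illustrative groups of amino acids treated as conservative substitutions.
# This grouping is an assumption for the sketch, not taken from the slides.
CONSERVATIVE_GROUPS = ("ILVM", "FYW", "KRH", "DE", "ST", "NQ")


def _compared_columns(aligned_a, aligned_b):
    """Return residue pairs from aligned columns where neither side is a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    return [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]


def percent_identity(aligned_a, aligned_b):
    """Percentage of compared columns holding exactly the same residue."""
    pairs = _compared_columns(aligned_a, aligned_b)
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def is_similar(x, y):
    """True for identical residues or residues in the same illustrative group."""
    return x == y or any(x in group and y in group for group in CONSERVATIVE_GROUPS)


def percent_similarity(aligned_a, aligned_b):
    """Percentage of compared columns that are identical or conservative."""
    pairs = _compared_columns(aligned_a, aligned_b)
    if not pairs:
        return 0.0
    return 100.0 * sum(is_similar(x, y) for x, y in pairs) / len(pairs)


print(percent_identity("GATC", "GATT"))   # 3 of 4 columns identical -> 75.0
print(percent_similarity("LKD", "IRE"))   # no identities, all conservative -> 100.0
```

Neither number says anything about homology by itself; homology remains an inference about evolutionary relationship.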

14 Basic terms:
• Orthologs: homologous sequences found in two or more species that have the same function (e.g. alpha-hemoglobin).
• Paralogs: homologous sequences found in the same species that arose by gene duplication (e.g. alpha- and beta-hemoglobin).

18 Pairwise comparison
• Dotplot
  - All-against-all comparison.
  - Every position is compared with every other position.
  - Nucleic acids and proteins have polarity.
  - Typically only one direction makes biological sense.
    - 5’ to 3’, or amino terminus to carboxyl terminus.

19 DotPlot
• Dotplot: a matrix with one sequence across the top and the other down the side. Put a dot, or 1, wherever there is identity.

20-25 [Figure: dotplot of GATCT (across the top) against GATCT (down the side), filled in one step at a time; the completed plot has seven dots.]
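
The dotplot construction described above can be sketched in a few lines. This is a minimal illustration; the function names `dotplot` and `render` are made up for this example. It compares every position of one sequence with every position of the other and marks identities.

```python
def dotplot(top, side):
    """Return a matrix (list of rows) with 1 where side[i] == top[j], else 0."""
    return [[1 if s == t else 0 for t in top] for s in side]


def render(top, side):
    """Draw the matrix as text: `top` across the top, `side` down the side."""
    lines = ["  " + " ".join(top)]
    for residue, row in zip(side, dotplot(top, side)):
        lines.append(residue + " " + " ".join("." if cell else " " for cell in row))
    return "\n".join(lines)


print(render("GATCT", "GATCT"))
# Total number of dots in the GATCT against GATCT plot
print(sum(sum(row) for row in dotplot("GATCT", "GATCT")))  # 7
```

The main diagonal marks the self-match; the off-diagonal dots come from the two T residues matching each other.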

27 Simple plot
• Window: size of the sequence block used for comparison. In the previous example: window = 1.
• Stringency: number of matches required to score positive. In the previous example: stringency = 1 (exact match required).

31 Dot Plot
• Compare two sequences in every register.
• Vary the size of the window and the stringency depending upon the sequences being compared.
• For nucleotide sequences, typically start with window = 21; stringency = 14.

32 DotPlot: WINDOW = 4; STRINGENCY = 2
[Figure: the window GATC compared against GATCGTACCATGGAATCGTCCAGATCA in successive registers, scoring + (4/4), - (0/4), and + (2/4).]

33 This “match” comes from the G and the C: two matches out of the four positions.

34 Top 3 Rows
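
The window = 4, stringency = 2 example above can be reproduced with a short sketch. The function names are assumptions for illustration; the defaults of 21 and 14 simply mirror the suggested starting point for nucleotide sequences.

```python
def window_matches(window_a, window_b):
    """Count identical positions between two equal-length windows."""
    return sum(1 for x, y in zip(window_a, window_b) if x == y)


def windowed_dotplot(seq_a, seq_b, window=21, stringency=14):
    """Return (i, j) start positions where the windows score at least `stringency`."""
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            if window_matches(seq_a[i:i + window], seq_b[j:j + window]) >= stringency:
                hits.append((i, j))
    return hits


sequence = "GATCGTACCATGGAATCGTCCAGATCA"
probe = "GATC"

print(window_matches(probe, sequence[0:4]))  # GATC vs GATC -> 4
print(window_matches(probe, sequence[1:5]))  # GATC vs ATCG -> 0
print(window_matches(probe, sequence[4:8]))  # GATC vs GTAC -> 2 (the G and the C)

hits = windowed_dotplot(probe, sequence, window=4, stringency=2)
print(hits)
```

Raising the stringency removes the weaker 2/4 hits; lowering it adds more background dots.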

35 Intragenic Comparison
• Rat Groucho gene

39 Intergenic Comparison
• Rat and Drosophila Groucho gene

44 Intergenic comparison
• Nucleotide sequence contains three domains.
• 50 - 350: strong conservation
  - Indel places comparison out of register.
• 450 - 1300: slightly weaker conservation
• 1300 - 2400: strong conservation

45 Groucho
• These three coding regions correspond to apparent functional domains of the encoded protein.

47 Scoring Alignments
• Quality Score:
  - Score x for a match, -y for a mismatch.
  - Penalty for:
    - Creating a gap
    - Extending a gap

51 Scoring Alignments
• Quality Score:
$$\text{Quality} = 10\,(\text{matches}) + (-1)\,(\text{mismatches}) - \left[(\text{gap creation penalty})(\text{number of gaps}) + (\text{gap extension penalty})(\text{total length of gaps})\right]$$
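
The quality formula above can be applied to an existing gapped alignment as follows. This is a sketch under stated assumptions: gaps are written as `-`, the function names are invented for this example, and the gap penalties of 5 and 1 used in the demonstration are hypothetical values, since the slides do not give any.

```python
def count_gaps(aligned):
    """Return (number of gap runs, total gap length) for one aligned sequence."""
    runs = 0
    total = 0
    previous = ""
    for ch in aligned:
        if ch == "-":
            total += 1
            if previous != "-":
                runs += 1  # a new gap starts here
        previous = ch
    return runs, total


def quality_score(aligned_a, aligned_b, gap_creation, gap_extension,
                  match=10, mismatch=-1):
    """Quality = match and mismatch scores minus the gap penalties."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    mismatches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap columns are charged through the penalties below
        if x == y:
            matches += 1
        else:
            mismatches += 1
    runs_a, length_a = count_gaps(aligned_a)
    runs_b, length_b = count_gaps(aligned_b)
    gap_cost = (gap_creation * (runs_a + runs_b)
                + gap_extension * (length_a + length_b))
    return match * matches + mismatch * mismatches - gap_cost


# 5 matches, 1 mismatch, one gap of length 2:
# 10*5 + (-1)*1 - (5*1 + 1*2) = 42
print(quality_score("GATCGTAC", "GA--GTTC", gap_creation=5, gap_extension=1))
```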

52 Z Score (standardized score)
$$Z = \frac{\text{Score}_{\text{alignment}} - \text{Average Score}_{\text{random}}}{\text{Standard Deviation}_{\text{random}}}$$

53 Quality Score: Randomization
• The program takes a sequence and randomizes it X times (user-selected).
• It determines the average quality score and standard deviation for the randomized sequences.
• Compare the randomized scores with the quality score to help determine whether the alignment is potentially significant.
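
The randomization procedure and the standardized score can be sketched together. This is an illustration only: `ungapped_score` is a deliberately simple stand-in for a real alignment program's quality score, and the names `z_score`, `shuffles`, and `seed` are assumptions for this example.

```python
import random
import statistics


def ungapped_score(a, b, match=10, mismatch=-1):
    """Stand-in scorer: compare position by position, with no gaps allowed."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def z_score(seq_a, seq_b, score_fn=ungapped_score, shuffles=100, seed=0):
    """Z = (observed score - mean of random scores) / std. dev. of random scores."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = score_fn(seq_a, seq_b)
    letters = list(seq_b)
    random_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # randomize the second sequence, keeping composition
        random_scores.append(score_fn(seq_a, "".join(letters)))
    mean = statistics.mean(random_scores)
    spread = statistics.stdev(random_scores)
    if spread == 0:
        raise ValueError("randomized scores do not vary; Z is undefined")
    return (observed - mean) / spread


z = z_score("GATCGTACCATGGAATCGTC", "GATCGTACCATGGAATCGTC")
print(z)
```

A large positive Z means the real alignment scores well above what shuffled sequences of the same composition achieve.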

54 Randomization
• It has become clear that:
  - Sequences appear to evolve in a “word”-like fashion.
    - The 26 letters of the alphabet are combined to make words.
    - Words actually communicate information.
  - Randomization should actually occur at the level of strings of nucleotides (2-4).
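
One possible reading of word-level randomization is to cut the sequence into short blocks and shuffle the blocks rather than the individual letters. The sketch below takes that reading as an assumption; `shuffle_words` and `word_size` are hypothetical names, and other schemes (for example, ones that preserve dinucleotide frequencies) would also fit the description.

```python
import random


def shuffle_words(seq, word_size=3, seed=None):
    """Shuffle non-overlapping blocks of `word_size` letters, not single letters."""
    if word_size < 1:
        raise ValueError("word_size must be at least 1")
    rng = random.Random(seed)
    words = [seq[i:i + word_size] for i in range(0, len(seq), word_size)]
    rng.shuffle(words)
    return "".join(words)


original = "GATCGTACCATGGAATCGTC"
shuffled = shuffle_words(original, word_size=3, seed=1)
print(shuffled)
```

Each block stays intact, so short local patterns survive the shuffle while their order is randomized.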

58 Global Alignment
• Global: compares all possible alignments of two sequences and presents the one with the greatest number of matches and the fewest gaps.
• The alignment will “run” from one end of the longer sequence to the other end.
• Best for closely related sequences.
• Can miss short regions of strongly conserved sequence.
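
Global alignment is commonly computed with dynamic programming in the style of Needleman and Wunsch, which considers all alignments without enumerating them one by one. The sketch below is a simplified illustration, not the method of any particular program: it uses a single linear gap penalty (`gap=-5`, a hypothetical value) instead of separate creation and extension penalties.

```python
def global_align(a, b, match=10, mismatch=-1, gap=-5):
    """Return (score, aligned_a, aligned_b) for an end-to-end alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalized, which forces the alignment to span both ends.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = global_align("GATTC", "GATC")
print(best)    # four matches and one gap: 40 - 5 = 35
print(top)
print(bottom)
```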

61 Local Alignment
• Identifies segments of alignment with the highest possible score.
• Aligns sequences, then extends aligned regions in both directions until the score falls to zero.
• Best for comparing sequences whose relationship is unknown.
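
One common way to find the highest-scoring segment is dynamic programming in the style of Smith and Waterman, where cell scores are never allowed to drop below zero and the traceback stops when it reaches a zero. The sketch below is illustrative only; it uses a single linear gap penalty (`gap=-5`, a hypothetical value), and the function name is an assumption.

```python
def local_align(a, b, match=10, mismatch=-1, gap=-5):
    """Return (score, segment_a, segment_b) for the best-scoring local segment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new segment
                              score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the running score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Only the shared GATC core aligns; the unrelated flanks are left out.
print(local_align("TTTGATCGGG", "CCCGATCAAA"))  # (40, 'GATC', 'GATC')
```

Unlike a global alignment, the result ignores the unrelated ends, which is why this approach suits sequences whose relationship is unknown.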

62 Global Alignment: Local Alignment:

63 Blast 2: Basic Local Alignment Search Tool
• E (expect) value: number of hits expected by random chance in a database of the same size.
• Larger numerical value = lower significance.
HIV sequence

68
• Both Global (Gap) and Local (Bestfit) tools will (almost) always give a match.
• It is important to determine if the match is biologically relevant.
• Not necessarily relevant: low complexity regions.
  - Sequence repeats (glutamine runs)
  - Transmembrane regions (high in hydrophobic residues)
• If working with coding regions, you are typically better off comparing protein sequences: greater information content.
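
Low complexity regions such as glutamine runs can be flagged before trusting a match. The sketch below uses Shannon entropy over a sliding window as one simple measure; this choice, the function names, and the `window` and `max_entropy` values are all assumptions for illustration, and dedicated filtering programs use their own methods.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of `window` (0 = one letter only)."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, window=12, max_entropy=1.0):
    """Return start positions of windows whose entropy is at or below the cutoff."""
    return [i for i in range(len(seq) - window + 1)
            if shannon_entropy(seq[i:i + window]) <= max_entropy]


# A hypothetical protein fragment with a run of 12 glutamines (Q).
protein = "MKVL" + "Q" * 12 + "AERT"
print(low_complexity_windows(protein, window=8, max_entropy=0.0))  # [4, 5, 6, 7, 8]
```

A match that falls entirely inside flagged windows deserves extra scrutiny before it is treated as biologically meaningful.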

