Sequence analysis: how to locate rare/important subsequences.

Sequence Analysis Tasks: representing sequence features, and finding sequence features using consensus sequences and frequency matrices. Sequence features include features following an exact pattern (restriction enzyme recognition sites) and features with approximate patterns (promoters, transcription initiation sites, transcription termination sites, polyadenylation sites, ribosome binding sites, protein features).

Representing uncertainty in nucleotide sequences: It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is “possible” at a given position: to express ambiguity during sequencing; to express variation at a position in a gene during evolution; or to express the ability of an enzyme to tolerate more than one base at a given position of a recognition site.

Representing uncertainty in nucleotide sequences (continued): To do this for nucleotides, we use a set of single-character codes that represent all possible combinations of bases. This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code. Given the size of the amino acid “alphabet”, it is not practical to design a set of codes for ambiguity in protein sequences.

The I.U.B. Code
A, C, G, T, U (the unambiguous bases)
R = A, G (puRine)
Y = C, T (pYrimidine)
S = G, C (Strong hydrogen bonds)
W = A, T (Weak hydrogen bonds)
M = A, C (aMino group)
K = G, T (Keto group)
B = C, G, T (not A)
D = A, G, T (not C)
H = A, C, T (not G)
V = A, C, G (not T/U)
N = A, C, G, T/U (iNdeterminate)
X or - are sometimes used.

Definitions: A sequence feature is a pattern that is observed to occur in more than one sequence and (usually) to be correlated with some function. A consensus sequence is a sequence that summarizes or approximates the pattern observed in a group of aligned sequences containing a sequence feature. Consensus sequences are regular expressions.

Finding occurrences of consensus sequences. Example: recognition site for a restriction enzyme. EcoRI recognizes GAATTC; AccI recognizes GTMKAC. Basic algorithm: (1) Start with the first character of the sequence to be searched. (2) See if the enzyme site matches starting at that position. (3) Advance to the next character of the sequence to be searched. (4) Repeat steps 2 and 3 until all positions have been tested.
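The basic algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `find_sites` and the dictionary name `IUB` are chosen here for the example, and positions are reported 1-based.

```python
# IUB ambiguity codes mapped to the bases each code allows
# (U is treated as T so that RNA and DNA can be searched alike).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_sites(sequence, consensus):
    """Return the 1-based start positions where `consensus` matches."""
    sequence = sequence.upper().replace("U", "T")
    hits = []
    # Test every start position in turn, as in the basic algorithm.
    for start in range(len(sequence) - len(consensus) + 1):
        window = sequence[start:start + len(consensus)]
        if all(base in IUB[code] for base, code in zip(window, consensus)):
            hits.append(start + 1)
    return hits

ecori_hits = find_sites("TTGAATTCAA", "GAATTC")
acci_hits = find_sites("AAGTCGACGTATACTT", "GTMKAC")
```

In the second call, AccI's GTMKAC matches both GTCGAC and GTATAC because M allows A or C and K allows G or T.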

Block Diagram for Search with a Consensus Sequence: the inputs to the search engine are the sequence to be searched and the consensus sequence (in IUB codes); the output is a list of positions where matches occur.

Statistics of pattern appearance. Goal: determine the significance of observing a feature (pattern). Method: estimate the probability that the pattern would occur randomly in a given sequence. Three different methods: (1) assume all nucleotides are equally frequent; (2) use measured frequencies of each nucleotide (mononucleotide frequencies); (3) use measured frequencies with which a given nucleotide follows another (dinucleotide frequencies).

Determining mononucleotide frequencies: Count how many times each nucleotide appears in the sequence. Divide (normalize) by the total number of nucleotides. Result: $f_A$ = mononucleotide frequency of A (frequency that A is observed). Define: $p_A$ = mononucleotide probability that a nucleotide will be an A; $p_A$ is assumed to equal $f_A$.

Determining dinucleotide frequencies: Make a 4 x 4 matrix, one element for each ordered pair of nucleotides. Zero all elements. Go through the sequence linearly, adding one to the matrix entry corresponding to the pair of sequence elements observed at that position. Divide by the total number of dinucleotides. Result: $f_{AC}$ = dinucleotide frequency of AC (frequency that AC is observed out of all dinucleotides).

Determining conditional dinucleotide probabilities: Divide each dinucleotide frequency by the mononucleotide frequency of the first nucleotide. Result: $p^*_{AC}$ = conditional dinucleotide probability of observing a C given an A; $p^*_{AC} = f_{AC} / f_A$.

Illustration of probability calculation: What is the probability of observing the sequence feature ART, i.e., A followed by a purine (either A or G), followed by a T? Using equal mononucleotide frequencies, $p_A = p_C = p_G = p_T = 1/4$, so $p_{ART} = 1/4 \times (1/4 + 1/4) \times 1/4 = 1/32$.

Illustration (continued). Using observed mononucleotide frequencies: $p_{ART} = p_A (p_A + p_G) p_T$. Using dinucleotide frequencies: $p_{ART} = p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$.
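The two formulas generalize to any pattern written in IUB codes. The sketch below is one way to evaluate them; the function names `prob_mono` and `prob_dinuc` and the nested-dictionary layout `p_cond[x][y]` (probability of y given that the previous base is x) are assumptions made for this example.

```python
from fractions import Fraction

# Subset of the IUB codes needed for patterns such as ART.
IUB = {"A": "A", "C": "C", "G": "G", "T": "T",
       "R": "AG", "Y": "CT", "N": "ACGT"}

def prob_mono(pattern, p):
    """Pattern probability assuming independent positions."""
    total = 1
    for code in pattern:
        total *= sum(p[base] for base in IUB[code])
    return total

def prob_dinuc(pattern, p, p_cond):
    """Pattern probability when each base depends on the previous one."""
    # ending[b]: probability of matching the prefix so far and ending in b.
    ending = {b: p[b] for b in IUB[pattern[0]]}
    for code in pattern[1:]:
        ending = {y: sum(prob * p_cond[x][y] for x, prob in ending.items())
                  for y in IUB[code]}
    return sum(ending.values())

# Equal frequencies: both models should reproduce 1/32 for ART.
quarter = Fraction(1, 4)
uniform = {b: quarter for b in "ACGT"}
uniform_cond = {x: dict(uniform) for x in "ACGT"}
p_art_mono = prob_mono("ART", uniform)
p_art_dinuc = prob_dinuc("ART", uniform, uniform_cond)
```

For ART the dinucleotide version expands to exactly $p_A (p^*_{AA} p^*_{AT} + p^*_{AG} p^*_{GT})$, since the running dictionary keeps one term per allowed purine.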

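Another illustration: What is $p_{ACT}$ in the sequence TTTAACTGGG? $f_A = 2/10$ and $f_C = 1/10$, so $p_A = 0.2$. $f_{AC} = 1/10$ and $f_{CT} = 1/10$ (the sequence has 9 dinucleotides; 10 is used here as a round approximation), so $p^*_{AC} = 0.1/0.2 = 0.5$ and $p^*_{CT} = 0.1/0.1 = 1$. Then $p_{ACT} = p_A \, p^*_{AC} \, p^*_{CT} = 0.2 \times 0.5 \times 1 = 0.1$ (it would have been $1/5 \times 1/10 \times 4/10 = 0.008$ using mononucleotide frequencies).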
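The same numbers can be reproduced from the raw sequence. In this sketch the conditional probability is estimated as count(xy) divided by the number of times x appears as the first member of a pair, which is an assumption of the example; it yields the values 0.5 and 1 used above without needing the round figure of 10 dinucleotides. Function names are illustrative.

```python
from collections import Counter

def mono_freq(seq):
    """Mononucleotide frequencies: counts divided by sequence length."""
    counts = Counter(seq)
    return {b: counts[b] / len(seq) for b in "ACGT"}

def cond_dinuc_prob(seq):
    """Estimate P(next = y | current = x) from adjacent pairs."""
    pair_counts = Counter(zip(seq, seq[1:]))
    first_counts = Counter(seq[:-1])   # bases that start a dinucleotide
    return {
        x: {y: (pair_counts[(x, y)] / first_counts[x]) if first_counts[x] else 0.0
            for y in "ACGT"}
        for x in "ACGT"
    }

seq = "TTTAACTGGG"
p = mono_freq(seq)
p_cond = cond_dinuc_prob(seq)

p_act_dinuc = p["A"] * p_cond["A"]["C"] * p_cond["C"]["T"]
p_act_mono = p["A"] * p["C"] * p["T"]
```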

Expected number and spacing: Probabilities are per nucleotide. How do we calculate the number of expected features in a sequence of length L? Expected number (for large L) $\approx Lp$. How do we calculate the expected spacing between features? Expected spacing between ART features $= 1/p_{ART}$.

Renewals: For greatest accuracy in calculating the spacing of features, we need to consider renewals of a feature (taking into account whether a feature can overlap with a neighboring copy of that feature). For example, what is the frequency of GCGC in ACTGCATGCGCGCATGCGCATATGACGA?

Renewals: We define a renewal as the end of a non-overlapping occurrence of a motif. For example, the renewals of GCGC in ACTGCATGCGCGCATGCGCATATGCGCGCGC are at 11, 19, 27, 31. The clump sizes are: 2, 1, 2, 1.
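A short sketch makes the definition concrete. It scans left to right, treats an occurrence as a renewal when it does not overlap the previous renewal, and counts every overlapping occurrence toward the clump of the renewal it overlaps. That counting rule is an interpretation chosen because it reproduces the sizes listed in the example; the function name is illustrative.

```python
def renewals(seq, motif):
    """Return (renewal end positions, clump sizes); positions are 1-based."""
    m = len(motif)
    ends, sizes = [], []
    last_end = 0                      # end of the most recent renewal
    for start in range(1, len(seq) - m + 2):
        if seq[start - 1:start - 1 + m] != motif:
            continue
        if start > last_end:          # does not overlap: a new renewal
            last_end = start + m - 1
            ends.append(last_end)
            sizes.append(1)
        else:                         # overlaps the current renewal
            sizes[-1] += 1
    return ends, sizes

ends, sizes = renewals("ACTGCATGCGCGCATGCGCATATGCGCGCGC", "GCGC")
```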

Renewals and clump size: Let R be a general pattern, $R = (r_1, \ldots, r_m)$. Let us denote the length-$i$ prefix by $R_{(i)} = (r_1, \ldots, r_i)$ and the length-$i$ suffix by $R^{(i)} = (r_{m-i+1}, \ldots, r_m)$. The clump size is:

Clump frequency: Let us assume that the clumps are distributed randomly. Their frequency, and the interval between any two clumps, would be:

Statistical tests: To test whether the motif is over- or under-represented, or non-uniformly distributed, we must test the clump distribution. To test motif frequency, we can test whether the clump count has a mean and a variance both equal to n times the clump frequency. To test the distribution, we can divide the entire sequence into k subsequences of size T, with m < T << 1/(clump frequency), and test that S has a $\chi^2$ distribution, where $T_i$ is the clump frequency in subsequence i and S is:

Frequency of simple motifs

Statistics of AT- or GC-rich regions: What is the probability of observing a “run” of the same nucleotide (e.g., 25 A’s)? Let $p_x$ be the mononucleotide probability of nucleotide x. The per-nucleotide probability of a run of N consecutive x’s is $p_x^N$. The probability of occurrence in a sequence of length L much longer than N is $\approx L \, p_x^N$.

Statistics of AT- or GC-rich regions: What if J “mismatches” are allowed? Let $p_y$ be the probability of observing a different nucleotide (normally $p_y = 1 - p_x$). The probability of observing N-J of nucleotide x and J of nucleotide y in a region of length N is the binomial term $\binom{N}{J} \, p_x^{N-J} \, p_y^{J}$.

Statistics of AT- or GC-rich regions: As before, we can multiply by L to approximate the probability of observing that combination in a sequence of length L. Note that this is the probability of observing exactly N-J matches and exactly J mismatches. We may also wish to know the probability of finding at least N-J matches, which requires summing the probability for I = 0 to I = J.
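These run statistics are easy to tabulate. The sketch below assumes the binomial form for the exact-mismatch probability with $p_y = 1 - p_x$; the function names are chosen for the example, and the final multiplication by L is the same large-L approximation used in the text, so it is only meaningful when the result is small.

```python
from math import comb

def p_exact(n, j, p_x):
    """Probability of exactly n-j matches and j mismatches in n positions."""
    p_y = 1.0 - p_x
    return comb(n, j) * p_x ** (n - j) * p_y ** j

def p_at_most(n, j_max, p_x):
    """Probability of at least n-j_max matches: sum over 0..j_max mismatches."""
    return sum(p_exact(n, i, p_x) for i in range(j_max + 1))

def approx_in_sequence(length, n, j_max, p_x):
    """Approximate chance of such a region in a sequence of the given length."""
    return length * p_at_most(n, j_max, p_x)

# A perfect run of 25 A's with p_A = 0.25, in a sequence of one million bases.
perfect_run = approx_in_sequence(1_000_000, 25, 0, 0.25)
```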

Frequency matrices

Goal: Describe a sequence feature (or motif) more quantitatively than is possible using consensus sequences. Definition: For a feature of length m using an alphabet of n characters, a frequency matrix is an n by m matrix in which each element contains the frequency at which a given member of the alphabet is observed at a given position in an aligned set of sequences containing the feature.

Weight matrix. Probabilistic model: How likely is each letter at each motif position?
Position    1    2    3    4    5    6    7    8    9
A         .89  .02  .38  .34  .22  .27  .02  .03  .02
C         .04  .91  .20  .17  .28  .31  .30  .04  .02
G         .04  .05  .41  .18  .29  .16  .07  .92  .18
T         .03  .02  .01  .31  .21  .26  .61  .01  .78

Nomenclature: Weight matrices are also known as position-specific scoring matrices, position-specific probability matrices, or position-specific weight matrices.

Scoring a motif model: A motif is interesting if it is very different from the background distribution. (Figure: the nine-position weight matrix, annotated from "more interesting" to "less interesting".)

Relative entropy: A motif is interesting if it is very different from the background distribution. Use relative entropy*: $\sum_i \sum_\beta p_{i,\beta} \log_2 \frac{p_{i,\beta}}{b_\beta}$, where $p_{i,\beta}$ = probability of $\beta$ in matrix position i and $b_\beta$ = background frequency (in non-motif sequence). * Relative entropy is sometimes called information content.

Scoring motif instances: A motif instance matches if it looks like it was generated by the weight matrix. (Figure: the nine-position weight matrix with candidate instances, including “A C G G C G C C T”, labeled "Matches weight matrix", "Hard to tell", and "Not likely!".)

Log likelihood ratio: A motif instance matches if it looks like it was generated by the weight matrix. Use the log likelihood ratio $\sum_i \log_2 \frac{p_{i,\beta_i}}{b_{\beta_i}}$, which measures how much more the instance looks like the weight matrix than like the background. $\beta_i$: the character at position i of the instance.
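Both scores can be computed directly from the nine-position weight matrix shown on the earlier slide. The sketch below assumes a uniform background of 0.25 per base and base-2 logarithms; neither is stated in the slides, and the names `PWM`, `relative_entropy` and `log_likelihood_ratio` are chosen for the example.

```python
from math import log2

# The nine-position weight matrix from the "Weight matrix" slide.
PWM = {
    "A": [.89, .02, .38, .34, .22, .27, .02, .03, .02],
    "C": [.04, .91, .20, .17, .28, .31, .30, .04, .02],
    "G": [.04, .05, .41, .18, .29, .16, .07, .92, .18],
    "T": [.03, .02, .01, .31, .21, .26, .61, .01, .78],
}
BACKGROUND = {b: 0.25 for b in "ACGT"}   # assumed uniform background

def relative_entropy(pwm, background):
    """Sum over positions and letters of p * log2(p / b)."""
    width = len(pwm["A"])
    return sum(pwm[b][i] * log2(pwm[b][i] / background[b])
               for i in range(width) for b in "ACGT" if pwm[b][i] > 0)

def log_likelihood_ratio(instance, pwm, background):
    """Sum over positions of log2(p[i, letter] / b[letter])."""
    return sum(log2(pwm[c][i] / background[c]) for i, c in enumerate(instance))

# Most probable letter in each column, and the instance quoted on the slide.
consensus = "".join(max("ACGT", key=lambda b: PWM[b][i]) for i in range(9))
llr_consensus = log_likelihood_ratio(consensus, PWM, BACKGROUND)
llr_slide = log_likelihood_ratio("ACGGCGCCT", PWM, BACKGROUND)
```

A positive log likelihood ratio means the instance is better explained by the matrix than by the background; a threshold on this score is what separates matches from non-matches.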

Alternating approach: 1. Guess an initial weight matrix. 2. Use the weight matrix to predict instances in the input sequences. 3. Use the instances to predict a weight matrix. 4. Repeat steps 2 and 3 until satisfied. Examples: Gibbs sampler (Lawrence et al.), MEME (expectation maximization; Bailey, Elkan), ANN-Spec (neural net; Workman, Stormo).

Expectation-maximization
foreach subsequence of width W
    convert subsequence to a matrix
    do {
        re-estimate motif occurrences from matrix
        re-estimate matrix model from motif occurrences
    } until (matrix model stops changing)
end
select matrix with highest score

Sample DNA sequences
>ce1cg
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC
AAAAATGGAAGTCCACAGTCTTGACAG
>ara
GACAAAAACGCGTAACAAAAGTGTCTATAATCACGGCAG
AAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTG
CTATGCCATAGCATTTTTATCCATAAG
>bglr1
ACAAATCCCAATAACTTAATTATTGGGATTTGTTATATA
TAACTTTATAAATTCCTAAAATTACACAAAGTTAATAAC
TGTGAGCATGGTCATATTTTTATCAAT
>crp
CACAAAGCGAAAGCTATGCTAAAACAGTCAGGATGCTAC
AGTAATACATTGATGTACTGCATGTATGCAAAGGACGTC
ACATTACCGTGCAGTACAGTTGATAGC

Motif occurrences (shown in upper case)
>ce1cg
taatgtttgtgctggtttttgtggcatcgggcgagaata
gcgcgtggtgtgaaagactgttttTTTGATCGTTTTCAC
aaaaatggaagtccacagtcttgacag
>ara
gacaaaaacgcgtaacaaaagtgtctataatcacggcag
aaaagtccacattgattaTTTGCACGGCGTCACactttg
ctatgccatagcatttttatccataag
>bglr1
acaaatcccaataacttaattattgggatttgttatata
taactttataaattcctaaaattacacaaagttaataac
TGTGAGCATGGTCATatttttatcaat
>crp
cacaaagcgaaagctatgctaaaacagtcaggatgctac
agtaatacattgatgtactgcatgtaTGCAAAGGACGTC
ACattaccgtgcagtacagttgatagc

Starting point: …gactgttttTTTGATCGTTTTCACaaaaatgg…
      T     T     T     G     A    T C G T T
A   0.17  0.17  0.17  0.17  0.50   ...
C   0.17  0.17  0.17  0.17  0.17
G   0.17  0.17  0.17  0.50  0.17
T   0.50  0.50  0.50  0.17  0.17

Re-estimating motif occurrences: TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
      T     T     T     G     A    T C G T T
A   0.17  0.17  0.17  0.17  0.50   ...
C   0.17  0.17  0.17  0.17  0.17
G   0.17  0.17  0.17  0.50  0.17
T   0.50  0.50  0.50  0.17  0.17
Score = 0.50 + 0.17 + 0.17 + 0.17 + 0.17 + ...

Scoring each subsequence. Sequence: TGTGCTGGTTTTTGTGGCATCGGGCGAGAATA
Subsequence        Score
TGTGCTGGTTTTTGT    2.95
GTGCTGGTTTTTGTG    4.62
TGCTGGTTTTTGTGG    2.31
GCTGGTTTTTGTGGC    ...
Select from each sequence the subsequence with maximal score.
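The scoring step can be sketched as follows. The starting matrix gives 0.50 to the seed letter and splits the rest evenly (1/6, shown as 0.17 on the slides), and a window's score is the sum of the matrix entries for its letters, as in the "Score = 0.50 + 0.17 + ..." line. The function names are illustrative, and the example does not attempt to reproduce the particular scores in the table above.

```python
SEED = "TTTGATCGTTTTCAC"

def seed_matrix(seed, hit=0.5):
    """One column per seed letter: `hit` for that letter, the rest shared."""
    miss = (1.0 - hit) / 3
    return [{b: (hit if b == s else miss) for b in "ACGT"} for s in seed]

def score(window, matrix):
    """Additive score: sum of the matrix entries selected by the window."""
    return sum(column[base] for column, base in zip(matrix, window))

def best_window(sequence, matrix):
    """Return the highest-scoring subsequence of motif width."""
    width = len(matrix)
    windows = [sequence[i:i + width] for i in range(len(sequence) - width + 1)]
    return max(windows, key=lambda w: score(w, matrix))

matrix = seed_matrix(SEED)
line = "GCGCGTGGTGTGAAAGACTGTTTTTTTGATCGTTTTCAC"   # second line of >ce1cg
best = best_window(line, matrix)
```

In a full expectation-maximization run this selection is repeated for every input sequence, and the chosen windows feed the matrix re-estimation step.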

Re-estimating motif matrix
Occurrences:
TTTGATCGTTTTCAC
TTTGCACGGCGTCAC
TGTGAGCATGGTCAT
TGCAAAGGACGTCAC
Counts (one digit per motif position):
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001

Adding pseudocounts
Counts:
A 000132011000040
C 001010300200403
G 020301131130000
T 423001002114001
Counts + Pseudocounts:
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112

Converting to frequencies
Counts + Pseudocounts:
A 111243122111151
C 112121411311514
G 131412242241111
T 534112113225112
Frequencies:
      T     T     T     G     A    T C G T T
A   0.13  0.13  0.13  0.25  0.50   ...
C   0.13  0.13  0.25  0.13  0.25
G   0.13  0.38  0.13  0.50  0.13
T   0.63  0.38  0.50  0.13  0.13
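The three steps (count, add pseudocounts, normalize) fit in one small function. This sketch uses a pseudocount of 1 per letter, which is what the tables above show; the function name and the dictionary-of-lists layout are choices made for the example.

```python
OCCURRENCES = [
    "TTTGATCGTTTTCAC",
    "TTTGCACGGCGTCAC",
    "TGTGAGCATGGTCAT",
    "TGCAAAGGACGTCAC",
]

def estimate_matrix(occurrences, pseudocount=1):
    """Return {letter: [frequency at each motif position]}."""
    width = len(occurrences[0])
    total = len(occurrences) + 4 * pseudocount   # column total after pseudocounts
    matrix = {b: [] for b in "ACGT"}
    for i in range(width):
        column = [occ[i] for occ in occurrences]
        for b in "ACGT":
            matrix[b].append((column.count(b) + pseudocount) / total)
    return matrix

freqs = estimate_matrix(OCCURRENCES)
```

With four occurrences and four pseudocounts every column total is 8, so the entries are multiples of 1/8 (0.125, 0.25, 0.375, ...), which the slides round to two decimals.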

Amino acid weight matrices: A sequence logo is a scaled position-specific amino acid distribution. Scaling is by a measure of a position’s information content.

Sequence logos: A visual representation of a position-specific distribution. Easy for nucleotides, but we need colour to depict up to 20 amino acid proportions. Idea: the overall height at position l is proportional to the information content ($2 - H_l$ bits for nucleotides, where $H_l$ is the entropy at position l); the proportions of each nucleotide (or amino acid) are in relation to their observed frequency at that position, with the most frequent on top, the next most frequent below, and so on.
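The logo heights follow directly from the column frequencies. The sketch below generalizes the nucleotide formula to $\log_2(\text{alphabet size}) - H_l$, which is an assumption for alphabets other than DNA; function names are illustrative.

```python
from math import log2

def information_content(column, alphabet_size=4):
    """log2(alphabet size) minus the entropy of the column, in bits."""
    entropy = -sum(p * log2(p) for p in column.values() if p > 0)
    return log2(alphabet_size) - entropy

def logo_heights(column, alphabet_size=4):
    """Letter heights for one logo position, listed from bottom to top."""
    total_height = information_content(column, alphabet_size)
    stack = [(letter, p * total_height) for letter, p in column.items()]
    return sorted(stack, key=lambda item: item[1])

# First column of the nine-position weight matrix shown earlier.
column = {"A": 0.89, "C": 0.04, "G": 0.04, "T": 0.03}
stack = logo_heights(column)
```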

Summary of motif detection

Block Diagram for Searching with a PSSM: the PSSM search takes as inputs a PSSM, a set of sequences to search, and a threshold; its outputs are the sequences that match above threshold, and the positions and scores of matches.

Block Diagram for Searching for sequences related to a family with a PSSM: a PSSM builder takes a set of aligned sequence features and the expected frequencies of each sequence element and produces the PSSM. The PSSM search then takes that PSSM, a set of sequences to search, and a threshold; its outputs are the sequences that match above threshold, and the positions and scores of matches.

Consensus sequences vs. frequency matrices: Should I use a consensus sequence or a frequency matrix to describe my site? If all allowed characters at a given position are equally "good", use IUB codes to create a consensus sequence (example: restriction enzyme recognition sites). If some allowed characters are "better" than others, use a frequency matrix (example: promoter sequences). Advantages of consensus sequences: smaller description, quicker comparison. Disadvantage: quantitative information on preferences at certain locations is lost.

Similarity Functions: Used to facilitate comparison of two sequence elements. Logical valued (true or false, 1 or 0): test whether the first argument matches (or could match) the second argument. Numerical valued: test the degree to which the first argument matches the second.

Logical valued similarity functions: Let Search(I) = ‘A’ and Sequence(J) = ‘R’. A function to test for an exact match: MatchExact(Search(I),Sequence(J)) would return FALSE since A is not R. A function to test for the possibility of a match using IUB codes for incompletely specified bases: MatchWild(Search(I),Sequence(J)) would return TRUE since R can be either A or G.

Numerical valued similarity functions: The return value could be a probability (for DNA). Let Search(I) = 'A' and Sequence(J) = 'R'; SimilarNuc(Search(I),Sequence(J)) could return 0.5 since chances are 1 out of 2 that a purine is adenine. The return value could be a similarity (for protein). Let Seq1(I) = 'K' (lysine) and Seq2(J) = 'R' (arginine); SimilarProt(Seq1(I),Seq2(J)) could return 0.8 since lysine is similar to arginine. Integer values are usually used for efficiency.
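The nucleotide functions named on these slides could look like the sketch below. The Python names mirror MatchExact, MatchWild and SimilarNuc, but the bodies are assumptions: in particular, `similar_nuc` assumes each base allowed by an ambiguity code is equally likely. SimilarProt is omitted because it would need a substitution table that the slides do not give.

```python
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def match_exact(search_char, sequence_char):
    """True only when the two characters are identical."""
    return search_char == sequence_char

def match_wild(search_char, sequence_char):
    """True when the two IUB codes share at least one possible base."""
    return bool(set(IUB[search_char]) & set(IUB[sequence_char]))

def similar_nuc(search_char, sequence_char):
    """Chance that a base drawn from sequence_char is allowed by search_char."""
    allowed_search = set(IUB[search_char])
    allowed_sequence = set(IUB[sequence_char])
    return len(allowed_search & allowed_sequence) / len(allowed_sequence)

exact = match_exact("A", "R")
wild = match_wild("A", "R")
similarity = similar_nuc("A", "R")
```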

Concluding Notes: Protein detection. Given a DNA or RNA sequence, find those regions that code for protein(s). Direct approach:

Genetic codes: The set of tRNAs that an organism possesses defines its genetic code(s). The universal genetic code is shared by most organisms. Prokaryotes, mitochondria and chloroplasts often use slightly different genetic codes. More than one tRNA may be present for a given codon, allowing more than one possible translation product.

Genetic codes: Differences in genetic codes occur mainly in start and stop codons. Alternate initiation codons: codons that encode amino acids but can also be used to start translation (GTG, TTG, ATA, TTA, CTG). Suppressor tRNA codons: codons that normally stop translation but are translated as amino acids (TAG, TGA, TAA).

Reading Frames: Since nucleotide sequences are “read” three bases at a time, there are three possible “frames” in which a given nucleotide sequence can be “read” (in the forward direction). Taking the complement of the sequence and reading in the reverse direction gives three more reading frames.

Reading frames
Sequence:   TTC TCA TGT TTG ACA GCT
RF1:        Phe Ser Cys Leu Thr Ala>
RF2:        Ser His Val *** Gln Leu>
RF3:        Leu Met Phe Asp Ser>
Complement: AAG AGT ACA AAC TGT CGA
RF4:        <Glu *** Thr Gln Cys Ser
RF5:        <Glu His Lys Val Ala
RF6:        <Arg Met Asn Ser Leu
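The six translations can be checked with a short script. This sketch hard-codes the standard genetic code as a 64-letter string in TCAG order and uses one-letter amino acid codes; the function names are illustrative. Reverse frames are returned in reading order (right to left on the slide), and incomplete trailing codons are dropped, so the final Leu shown for RF2 (from the partial codon CT) does not appear.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' = stop).
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate complete codons only, starting at the first base."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frames(dna):
    """Return translations for RF1-RF3 (forward) and RF4-RF6 (reverse)."""
    rc = reverse_complement(dna)
    frames = {f"RF{k + 1}": translate(dna[k:]) for k in range(3)}
    frames.update({f"RF{k + 4}": translate(rc[k:]) for k in range(3)})
    return frames

frames = six_frames("TTCTCATGTTTGACAGCT")
```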

Reading frames: To find which reading frame a region is in, take the nucleotide number of the lower bound of the region, divide by 3 and take the remainder (modulus 3): 1 = RF1, 2 = RF2, 0 = RF3. For reverse reading frames, take the nucleotide number of the upper bound of the region, subtract it from the total number of nucleotides, divide by 3 and take the remainder (modulus 3): 0 = RF4, 1 = RF5, 2 = RF6. This is because the convention MacVector uses is that RF4 starts with the last nucleotide and reads backwards.

Open Reading Frames (ORF). Concept: a region of DNA or RNA sequence that could be translated into a peptide sequence (open refers to the absence of stop codons). Prerequisite: a specific genetic code. Definition: (start codon)(amino-acid-coding codon)$^n$(stop codon). Note: not all ORFs are actually used.

Block Diagram for Direct Search for ORFs: the inputs to the search engine are the sequence to be searched and the genetic code, with options for whether to search both strands and whether the sequence ends count as start/stop; the output is a list of ORF positions.
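A direct ORF search on the forward strand can be sketched as below. It assumes ATG as the only start codon and TAA, TAG, TGA as stops, ignores ORFs that run off the end of the sequence, and reports 1-based inclusive coordinates with the frame numbered by the remainder rule given earlier (1 = RF1, 2 = RF2, 0 = RF3). Searching the other strand would mean repeating the scan on the reverse complement.

```python
START_CODONS = {"ATG"}                 # assumption: standard start only
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=1):
    """Return (start, end, frame) for each start...stop ORF, forward strand."""
    orfs = []
    for offset in range(3):
        start = None
        for i in range(offset, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i                       # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                coding_codons = (i - start) // 3
                if coding_codons >= min_codons:
                    orfs.append((start + 1, i + 3, offset + 1))
                start = None                    # look for the next ORF
    return orfs

orfs = find_orfs("CCATGAAATTTTAGGG")
```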

Statistical Approaches

Calculation Windows: Many sequence analyses require calculating some statistic over a long sequence, looking for regions where the statistic is unusually high or low. To do this, we define a window size to be the width of the region over which each calculation is to be done. Example: %AT.
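As a concrete instance of a calculation window, the sketch below computes %AT in every window of a given width, moving one base at a time. The function name and the step of one base are choices made for the example.

```python
def percent_at(seq, window):
    """%AT for each window of the given width, sliding one base at a time."""
    seq = seq.upper()
    return [
        100.0 * sum(base in "AT" for base in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]

profile = percent_at("AATTGGCC", 4)
```

Plotting such a profile against position is the usual way to spot regions where the statistic is unusually high or low.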

Base Composition Bias: For a protein with a roughly “normal” amino acid composition, the first 2 positions of all codons will be about 50% GC. If an organism has a high GC content overall, the third position of all codons must be mostly GC. Useful for prokaryotes. Not useful for eukaryotes due to the large amount of noncoding DNA.

Fickett’s statistic: Also called TestCode analysis. Looks for asymmetry of base composition. Strong statistical basis for calculations. Method: For each window on the sequence, calculate the base composition of nucleotides 1, 4, 7..., then of 2, 5, 8..., and then of 3, 6, 9.... Calculate the statistic from the resulting three numbers.
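The first step of the method, base composition at each of the three codon positions within a window, can be sketched as follows. This covers only that step: how the TestCode statistic is then derived from these compositions is not given in the slides and is not attempted here. The function name is illustrative.

```python
from collections import Counter

def position_composition(window):
    """Base composition of positions 1,4,7..., 2,5,8..., and 3,6,9..."""
    compositions = []
    for offset in range(3):
        bases = window[offset::3]
        counts = Counter(bases)
        compositions.append({b: counts[b] / len(bases) for b in "ACGT"})
    return compositions

first, second, third = position_composition("ACGACGACG")
```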

Codon Bias (Codon Preference). Principle: Different levels of expression of different tRNAs for a given amino acid lead to pressure on coding regions to “conform” to the preferred codon usage. Non-coding regions, on the other hand, feel no selective pressure and can drift.

Codon Bias (Codon Preference). Starting point: a table of observed codon frequencies in known genes from a given organism (best to use highly expressed genes). Method: calculate the “coding potential” within a moving window for all three reading frames, then look for ORFs with high scores.

Codon Bias (Codon Preference): Works best for prokaryotes or unicellular eukaryotes because, for multicellular eukaryotes, different pools of tRNA may be expressed at different stages of development in different tissues, so genes may have to be grouped into sets. Codon bias can also be used to estimate protein expression level.

Portion of D. melanogaster codon frequency table

Comparison of Glycine codon frequencies
