Lecture 6. Sequence Motif Models and Counting
CSCI3220 Algorithms for Bioinformatics | Kevin Yip | The Chinese University of Hong Kong | Fall 2016
Lecture outline
1. Sequence motifs
   – Biological motivations
   – Representations
   – k-mer counting
2. Introduction to statistical modeling
   – Motivating examples
   – Generative and discriminative models
   – Classification and regression
   – Example: Naïve Bayes classifier
Part 1: Sequence Motifs
Sequence motifs
Many biological activities are facilitated by particular sequence patterns:
– The restriction enzyme EcoRI recognizes the DNA pattern GAATTC and cuts the DNA to leave sticky ends:
    5'-G     AATTC-3'
    3'-CTTAA     G-5'
– The human protein GATA3 binds DNA at regions that exhibit the pattern AGTAAGA, where the G at position 6 can also be A, and the A at position 7 can also be G or C
Sequence motifs
In general, small recurrent patterns on biological sequences with particular functions are called sequence motifs
We need models to represent the motifs, usually based on some examples. Goals:
– These models do not miss true occurrences (i.e., have a low false negative rate) and do not include false occurrences (i.e., have a low false positive rate)
– These models should take uncertainty into account
– These models should be as simple as possible
Related concerns: computability, interpretability and generality
Motif representations
Suppose we have the following sequences known to be bound by a protein:
– CACAAAA
– CACAAAT
– CGCAAAA
– CACAAAA
Consensus sequence:
– CACAAAA
– Problem: information loss
Degenerate sequence in IUPAC (International Union of Pure and Applied Chemistry) code (see http://www.bio-soft.net/sms/iupac.html):
– CRCAAAW
IUPAC nucleotide codes: A (adenine), C (cytosine), G (guanine), T/U (thymine/uracil); R = A or G, Y = C or T, S = G or C, W = A or T, K = G or T, M = A or C; B = C/G/T, D = A/G/T, H = A/C/T, V = A/C/G; N = any base; . or - = gap (not used in motifs)
Example source: http://conferences.computer.org/bioinformatics/CSB2003/NOTES/Liu_Color.pdf
Motif representations
Suppose we have the following TFBS sequences:
– CACAAAAA
– CACAAA_T
– CGCAAAAA
– CACAAA_A
Regular expression (see http://en.wikipedia.org/wiki/Regular_expression for syntax):
– E.g., C[AG]CA{3,4}[AT]
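As a minimal sketch (not from the original slides; the function name is illustrative), the regular expression above can be tested against candidate sequences with Python's re module:

```python
import re

# Motif from the slide: C, then A or G, then C, then 3-4 A's, then A or T.
# re.fullmatch requires the whole string to match the pattern.
MOTIF = re.compile(r"C[AG]CA{3,4}[AT]")

def matches_motif(seq: str) -> bool:
    """Return True if the whole sequence matches the motif."""
    return MOTIF.fullmatch(seq) is not None

# The four training sequences (gaps removed) all match; a negative control does not.
for seq in ["CACAAAAA", "CACAAAT", "CGCAAAAA", "CACAAAA", "GGGGGGG"]:
    print(seq, matches_motif(seq))
```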
Motif representations
Suppose we have these 10 aligned sequences: ATGGCATG, AGGGTGCG, ATCGCATG, TTGCCACG, ATGGTATT, ATTGCACG, AGGGCGTT, ATGACATG, ATGGCATG, ACTGGATG
Position weight matrix (relative frequency of each nucleotide at each position):

  Position:  1    2    3    4    5    6    7    8
  A         0.9  0.0  0.0  0.1  0.0  0.8  0.0  0.0
  C         0.0  0.1  0.1  0.1  0.7  0.0  0.3  0.0
  G         0.0  0.2  0.7  0.8  0.1  0.2  0.0  0.8
  T         0.1  0.7  0.2  0.0  0.2  0.0  0.7  0.2

Pseudo-counts: add a small number to each count to alleviate problems due to small sample size. Here, adding 1 to each count makes every column sum to 10 + 4 = 14:

  Position:   1      2      3      4      5      6      7      8
  A         10/14   1/14   1/14   2/14   1/14   9/14   1/14   1/14
  C          1/14   2/14   2/14   2/14   8/14   1/14   4/14   1/14
  G          1/14   3/14   8/14   9/14   2/14   3/14   1/14   9/14
  T          2/14   8/14   3/14   1/14   3/14   1/14   8/14   3/14

Example source: http://conferences.computer.org/bioinformatics/CSB2003/NOTES/Liu_Color.pdf
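A short sketch of building the two matrices above from the aligned sequences (the function and variable names are illustrative, not from the slides):

```python
from collections import Counter

SEQS = ["ATGGCATG", "AGGGTGCG", "ATCGCATG", "TTGCCACG", "ATGGTATT",
        "ATTGCACG", "AGGGCGTT", "ATGACATG", "ATGGCATG", "ACTGGATG"]

def pwm(seqs, pseudocount=0.0):
    """Column-wise nucleotide probabilities, optionally with pseudo-counts."""
    n, length = len(seqs), len(seqs[0])
    matrix = []
    for i in range(length):
        counts = Counter(seq[i] for seq in seqs)
        total = n + 4 * pseudocount  # each of the 4 nucleotides gets the pseudo-count
        matrix.append({x: (counts[x] + pseudocount) / total for x in "ACGT"})
    return matrix

raw = pwm(SEQS)            # e.g., raw[0]['A'] == 0.9
smoothed = pwm(SEQS, 1.0)  # e.g., smoothed[0]['A'] == 10/14
```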
Motif representations
Sequence logo (drawn from the position weight matrix above):
– Nucleotide with the highest probability on top
– Total height of the nucleotides at the i-th position (p_i,x: probability of character x at position i; n: number of sequences):
  h_i = 2 - H_i, where H_i = -Σ_x p_i,x log2 p_i,x is the entropy at position i
  (a small-sample correction term that depends on n is often subtracted as well)
– Height of nucleotide x at position i = p_i,x × h_i
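A tiny sketch of the height calculation, continuing from the pwm() example above and ignoring the small-sample correction (an assumption for simplicity):

```python
import math

def column_height(probs):
    """Information content (in bits) of one PWM column: h = 2 - entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2 - entropy

# heights = [column_height(col) for col in smoothed]
# letter height of nucleotide x at position i = smoothed[i][x] * heights[i]
```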
Using a motif
Consensus sequence:
– Predict "Yes" if a sequence matches the consensus sequence; "No" otherwise
Regular expression:
– Predict "Yes" if a sequence can be generated by the regular expression; "No" otherwise
Position weight matrix:
– Compute a matching score for a sequence, and consider a sequence more likely to belong to the class if it has a higher score
PWM matching score
Suppose the PWM of the binding sites of a protein is the frequency matrix shown earlier. Then:
1. For the sequence ATGGGGTG, the likelihood is 0.9 × 0.7 × 0.7 × 0.8 × 0.1 × 0.2 × 0.7 × 0.8 = 0.00395136
2. Compute the odds against the background probabilities of the four nucleotides: 0.00395136 / (p_A × p_G^5 × p_T^2), since ATGGGGTG contains one A, five G's and two T's
3. Usually take the log of the odds as the final score
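A sketch of the log-odds score, building on the pwm() example above; the uniform background probabilities are an illustrative assumption, and the smoothed matrix is used so that no entry is zero:

```python
import math

# Background nucleotide probabilities; uniform here as an assumption.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def log_odds_score(seq, matrix, background=BACKGROUND):
    """Sum of per-position log-odds: log2(p_i,x / bg_x).
    `matrix` is a list of dicts as produced by pwm() above."""
    return sum(math.log2(matrix[i][x] / background[x])
               for i, x in enumerate(seq))

# score = log_odds_score("ATGGGGTG", smoothed)  # higher score = better match
```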
k-mers
Another way to represent sequence motifs: k-mers
Training examples:
– ACCGCT
– TACCGG
– TTACCA
– AACCTG
2-mer counts:
  AA:1  AC:4  AG:0  AT:0
  CA:1  CC:4  CG:2  CT:2
  GA:0  GC:1  GG:1  GT:0
  TA:2  TC:0  TG:1  TT:1
One vague way to summarize: "This motif is AC- and CC-rich"
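A minimal k-mer counting sketch (function name illustrative) that reproduces the table above:

```python
from collections import Counter

def kmer_counts(seqs, k):
    """Count all overlapping k-mers across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

print(kmer_counts(["ACCGCT", "TACCGG", "TTACCA", "AACCTG"], 2))
# Counter({'AC': 4, 'CC': 4, 'CG': 2, 'CT': 2, 'TA': 2, ...})
```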
k-mers
Considerations:
– Value of k
  Too small: captures only local patterns
  Too large: too restrictive, and too many possible k-mers (computationally difficult)
– Allowing gaps or not
  g-gapped k-mer: among the g+k positions, only k of them are considered and the remaining g gap positions are ignored
– Representation and final use of the k-mers
Problem to study here
Using g-gapped k-mer counts as features, compute the similarity of two sequences as their inner product
Example (k=2, g=1):
– Full set of g-gapped k-mers (* means any one nucleotide):
  *AA, *AC, *AG, ..., *TT
  A*A, A*C, A*G, ..., T*T
  AA*, AC*, AG*, ..., TT*
– Number of possible g-gapped k-mers = C(k+g, k) × 4^k = C(3, 2) × 4^2 = 48
Problem to study here
Example (k=2, g=1) (cont'd):
– Sequence s1 = ACCGCT, with g-gapped k-mer counts:
  *AA:0  *AC:0  *AG:0  *AT:0  *CA:0  *CC:1  *CG:1  ...  TT*:0
– Sequence s2 = TACCGG, with g-gapped k-mer counts:
  *AA:0  *AC:1  *AG:0  *AT:0  *CA:0  *CC:1  *CG:1  ...  TT*:0
– Similarity between s1 and s2: sim(s1, s2) = 0×0 + 0×1 + 0×0 + 0×0 + 0×0 + 1×1 + 1×1 + ... = 8 (verify by yourself)
These similarity values can help separate sequences that belong to a class from those that do not
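A brute-force sketch of this similarity, using a sparse counter over the gapped k-mer patterns (names are illustrative, not from the slides):

```python
from collections import Counter
from itertools import combinations

def gapped_kmer_counts(seq, k, g):
    """Count g-gapped k-mers: for every (k+g)-mer window, every way of
    keeping k positions and replacing the other g by the wildcard '*'."""
    counts = Counter()
    w = k + g
    for i in range(len(seq) - w + 1):
        window = seq[i:i + w]
        for kept in combinations(range(w), k):
            pattern = "".join(c if j in kept else "*" for j, c in enumerate(window))
            counts[pattern] += 1
    return counts

def sim(s1, s2, k, g):
    """Inner product of the two (sparse) count vectors."""
    c1, c2 = gapped_kmer_counts(s1, k, g), gapped_kmer_counts(s2, k, g)
    return sum(v * c2[key] for key, v in c1.items())

print(sim("ACCGCT", "TACCGG", k=2, g=1))  # 8
```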
Time complexity analysis
For two sequences each with n characters, using the brute-force way of calculation:
– Filling each table takes (n-g-k+1) × C(g+k, k) additions, which is 3n-6 when k=2 and g=1
  Linear w.r.t. sequence length, but C(g+k, k) can be large when g is large
– Computing the inner product takes C(k+g, k) × 4^k = 48 multiplications, followed by 47 additions
  Exponential w.r.t. k
Speeding up the calculations
Ideas:
– The exponential time complexity can be avoided only if sim(s1, s2) can be computed without filling in the two whole tables
– When k is large, the tables contain many zeroes that can be ignored
Using the ideas
Example (k=2, g=1) (cont'd):
– Sequence s1 = ACCGCT
  New representation: {*CC:1, *CG:1, *CT:1, *GC:1, A*C:1, C*C:1, C*G:1, G*T:1, AC*:1, CC*:1, CG*:1, GC*:1}
– Sequence s2 = TACCGG
  New representation: {*AC:1, *CC:1, *CG:1, *GG:1, A*C:1, C*G:2, T*C:1, AC*:1, CC*:1, CG*:1, TA*:1}
– Looking for common g-gapped k-mers and multiplying the corresponding counts:
  sim(s1, s2) = 1 (due to *CC) + 1 (*CG) + 1 (A*C) + 2 (C*G) + 1 (AC*) + 1 (CC*) + 1 (CG*) = 8
Time complexity analysis
Suppose the new representations can be produced with the help of hash tables. The final calculation then involves a linear scan of the two lists, each with at most (n-g-k+1) × C(g+k, k) entries
– (6-1-2+1) × C(3, 2) = 12 entries when n=6, k=2, g=1
Can still be slow when g and k are large
– For example, with k=6 and g=8, C(g+k, k) = C(14, 6) = 3003
Speeding up further
Another idea:
– Some g-gapped k-mers are related, and their corresponding calculations can be grouped
For example, s1[3-5] = CGC and s2[4-6] = CGG
– g-gapped k-mers involved:
  s1[3-5]: {*GC, C*C, CG*}
  s2[4-6]: {*GG, C*G, CG*}
– Similarity between s1 and s2 due to these sub-sequences: 1 (due to the common k-mer CG*)
Speeding up further
Given two length-(k+g) sub-sequences from s1 and s2 (e.g., CGC and CGG), how much do they contribute to sim(s1, s2)?
Important observation: the answer depends only on their number of mismatches
– In this case, there is one mismatch between CGC and CGG, and the corresponding contribution to the similarity between s1 and s2 is 1
– In the same way, between s1[2-4] = CCG and s2[4-6] = CGG, since they have one mismatch, the contribution is also 1
Computing the contribution
For any two length-(k+g) sub-sequences s1[i1-j1] and s2[i2-j2] with m mismatches:
– There are in total C(k+g, k) ways to generate g-gapped k-mers from each of them, by choosing k non-gapped positions
– For a particular choice of the k positions, if they do not involve any of the mismatch positions, their contribution to sim(s1, s2) is 1; otherwise, their contribution is 0
– Therefore, their total contribution to sim(s1, s2) is the number of ways to choose the k positions such that none of them is a mismatch position
  This number is C(k+g-m, k) if m ≤ g (i.e., k+g-m ≥ k); 0 otherwise
Computing the contribution
A bigger example: suppose k=2, g=2
– s1[2-5] = CCGC
– s2[3-6] = CCGG
Previous way of calculating their contribution to sim(s1, s2):
– g-gapped k-mers of s1[2-5]: {**GC, *C*C, *CG*, C**C, C*G*, CC**}
– g-gapped k-mers of s2[3-6]: {**GG, *C*G, *CG*, C**G, C*G*, CC**}
– Contribution (number of common g-gapped k-mers): 3
New way of calculating their contribution to sim(s1, s2):
– Number of mismatches between s1[2-5] and s2[3-6]: m = 1
– Contribution: C(k+g-m, k) = C(3, 2) = 3
Complete algorithm
1. Extract all (k+g)-mers from s1 and s2
2. For each pair of (k+g)-mers taken from s1 and s2 respectively, compute their contribution to sim(s1, s2)
3. Sum all these contributions to get the final value of sim(s1, s2)
Complete example
Back to k=2, g=1:
– Sequence s1 = ACCGCT
– Sequence s2 = TACCGG
1. Extract all 3-mers:
– s1: {ACC, CCG, CGC, GCT}
– s2: {TAC, ACC, CCG, CGG}
2. For each pair of 3-mers, compute their contribution to sim(s1, s2)

Number of mismatches:
        TAC  ACC  CCG  CGG
  ACC    2    0    2    3
  CCG    3    2    0    1
  CGC    2    2    2    1
  GCT    3    2    2    3

Contributions to sim(s1, s2), i.e., C(3-m, 2) when m ≤ 1 and 0 otherwise:
        TAC  ACC  CCG  CGG
  ACC    0    3    0    0
  CCG    0    0    3    1
  CGC    0    0    0    1
  GCT    0    0    0    0

3. Therefore, sim(s1, s2) = 3+3+1+1 = 8
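A sketch of the complete algorithm above (function name illustrative); it reproduces the brute-force result without ever enumerating the 4^k-sized feature space:

```python
from math import comb

def gkm_similarity(s1, s2, k, g):
    """Gapped k-mer similarity via mismatch counting: every pair of
    (k+g)-mers with m <= g mismatches contributes C(k+g-m, k)."""
    w = k + g
    total = 0
    for i in range(len(s1) - w + 1):
        for j in range(len(s2) - w + 1):
            m = sum(a != b for a, b in zip(s1[i:i + w], s2[j:j + w]))
            if m <= g:
                total += comb(k + g - m, k)
    return total

print(gkm_similarity("ACCGCT", "TACCGG", k=2, g=1))  # 8
```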
Time complexity analysis
For each length-n sequence, there are n-k-g+1 sub-sequences of length k+g
Therefore, there are (n-k-g+1)^2 pairs of (k+g)-mers from the two sequences
For each pair, the number of mismatches can be computed by scanning the two (k+g)-mers once
– Can be sped up using bitwise XOR operations
The total amount of time required is O((k+g)(n-k-g+1)^2)
– Depends mainly on n, and much less on k and g
Speeding up even further?
It is possible to avoid considering all (k+g)-mer pairs from the two sequences and only consider those with at most g mismatches, since all other pairs contribute nothing
– We won't go into the details here (see further readings)
Image credit: Ghandi et al., PLOS Computational Biology 10(7):e1003711 (2014)
Part 2: Introduction to Statistical Modeling
Statistical modeling
We have studied many biological concepts in this course
– Genes, exons, introns, ...
We want to provide a description of a concept by means of some observable features
– Sometimes it can be (more or less) an exact rule: the enzyme EcoRI cuts the DNA if and only if it sees the sequence GAATTC
– In most cases it is not exact:
  If a sequence (1) starts with ATG, (2) ends with TAA, TAG or TGA, and (3) has a length of about 1,500 that is a multiple of 3, it could be the protein coding sequence of a yeast gene
  If the BRCA1 or BRCA2 gene is mutated, one may develop breast cancer
The examples
Reasons for the descriptions to be inexact:
– Incomplete information: what mutations on BRCA1/BRCA2? Any mutations on other genes?
– Exceptions: "If one has fever, he/she has flu"
  Not everyone with flu has fever, and not everyone with fever has it due to flu
– Intrinsic randomness

Concept/class: DNA recognized by the enzyme EcoRI
  Features observable from data: the DNA sequence (the string)
Concept/class: protein coding sequence of a yeast gene
  Features observable from data: raw – the DNA sequence; derived – the first three characters, the last three characters, the length
Concept/class: developing breast cancer
  Features observable from data: mutations at the BRCA1 gene, mutations at the BRCA2 gene
Features known, concept unsure
In many cases, we are interested in the situation where the features are observed but whether the concept is true is unknown
– We know the sequence of a DNA region, but we do not know whether it corresponds to a protein coding sequence
– We know whether the BRCA1 and BRCA2 genes of a subject are mutated (and in which ways), but we do not know whether the subject has developed/will develop breast cancer
– We know a subject is having fever, but we do not know whether he/she has a flu infection or not
Statistical models
Statistical models provide a principled way to specify these inexact descriptions
For the flu example, using some symbols:
– X: a set of features
  In this example, a single binary feature, with X=1 if a subject has fever and X=0 if not
– Y: the target concept
  In this example, a binary concept, with Y=1 if a subject has flu and Y=0 if not
– A model is a function that predicts values of Y based on observed values of X and parameters θ
Parameters
Some details of a statistical model are provided by its parameters θ
– Suppose whether a person with flu has fever can be modeled as a Bernoulli (i.e., coin-flipping) event with probability q1
  That is, for each person with flu, the probability of having fever is q1 and the probability of not having fever is 1-q1. Different people are assumed to be statistically independent.
– Similarly, suppose whether a person without flu has fever can be modeled as a Bernoulli event with probability q2
– Finally, the probability for a person to have flu is p
– Then the whole set of parameters is θ = {p, q1, q2}
Basic probabilities
Pr(X) Pr(Y|X) = Pr(X and Y)
– If there is a 20% chance of rain tomorrow, and whenever it rains there is a 60% chance that the temperature will drop, then there is a 0.2 × 0.6 = 0.12 chance that tomorrow it will both rain and have a temperature drop
– Capital letters mean the statement is true for all values of X and Y
– Can also write Pr(X=x) Pr(Y=y|X=x) = Pr(X=x and Y=y) for particular values of X and Y
Law of total probability: Pr(X) = Σ_y Pr(X and Y=y), where the summation considers all possible values of Y
– If there is a 0.12 chance that it will both rain and have a temperature drop tomorrow, and a 0.08 chance that it will both rain and not have a temperature drop tomorrow, then there is a 0.12 + 0.08 = 0.2 chance that it will rain tomorrow
Bayes' rule: Pr(X|Y) = Pr(Y|X) Pr(X) / Pr(Y) when Pr(Y) ≠ 0
– Because Pr(X|Y) Pr(Y) = Pr(Y|X) Pr(X) = Pr(X and Y)
– Similarly, Pr(X|Y,Z) = Pr(Y|X,Z) Pr(X|Z) / Pr(Y|Z) when Pr(Y|Z) ≠ 0
A complete numeric example
Assume the following parameters (X: has fever or not; Y: has flu or not):
– 70% of people with flu have fever: Pr(X=1|Y=1) = 0.7
– 10% of people without flu have fever: Pr(X=1|Y=0) = 0.1
– 20% of people have flu: Pr(Y=1) = 0.2
We have a simple model to predict Y from X:
– Probability that someone has fever:
  Pr(X=1) = Pr(X=1,Y=1) + Pr(X=1,Y=0)
          = Pr(X=1|Y=1)Pr(Y=1) + Pr(X=1|Y=0)Pr(Y=0)
          = (0.7)(0.2) + (0.1)(1-0.2) = 0.22
– Probability that someone has flu, given that he/she has fever:
  Pr(Y=1|X=1) = Pr(X=1|Y=1)Pr(Y=1) / Pr(X=1) = (0.7)(0.2) / 0.22 ≈ 0.64
– Probability that someone does not have flu, given that he/she has fever:
  Pr(Y=0|X=1) = 1 - Pr(Y=1|X=1) ≈ 0.36
– Probability that someone has flu, given that he/she does not have fever:
  Pr(Y=1|X=0) = Pr(X=0|Y=1)Pr(Y=1) / Pr(X=0)
              = [1 - Pr(X=1|Y=1)]Pr(Y=1) / [1 - Pr(X=1)]
              = (1-0.7)(0.2) / (1-0.22) ≈ 0.08
– Probability that someone does not have flu, given that he/she does not have fever:
  Pr(Y=0|X=0) = 1 - Pr(Y=1|X=0) ≈ 0.92
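The same calculation as a minimal sketch (variable names are illustrative):

```python
# Parameters from the example above.
p_fever_given_flu = 0.7      # Pr(X=1 | Y=1)
p_fever_given_no_flu = 0.1   # Pr(X=1 | Y=0)
p_flu = 0.2                  # Pr(Y=1)

# Law of total probability: Pr(X=1)
p_fever = p_fever_given_flu * p_flu + p_fever_given_no_flu * (1 - p_flu)

# Bayes' rule: Pr(Y=1 | X=1)
p_flu_given_fever = p_fever_given_flu * p_flu / p_fever

print(p_fever)            # 0.22
print(p_flu_given_fever)  # 0.636..., i.e., about 0.64
```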
Statistical estimation
Questions we can ask:
– Given a model, what is the likelihood of the observation? Pr(X|Y, θ)
  (On the previous slide, θ was omitted for simplicity)
  If a person has flu, how likely is he/she to have fever?
– Given an observation, what is the probability that a concept is true? Pr(Y|X, θ)
  If a person has fever, what is the probability that he/she has flu?
– Given some observations, what is the likelihood of a parameter value? L(θ|X), or L(θ|X,Y) if whether the concept is true is also known
  Suppose we have observed that among 100 people with flu, 70 have fever. What is the likelihood that q1 is equal to 0.7?
Statistical estimation
Questions we can ask (cont'd):
– Maximum likelihood estimation: given a model with unknown parameter values, what parameter values maximize the data likelihood?
  θ* = argmax_θ Pr(X|θ), or θ* = argmax_θ Pr(X,Y|θ) if the class labels are also observed
– Prediction of concept: given a model and an observation, which concept is most likely to be true?
  y* = argmax_y Pr(Y=y|X, θ)
Generative vs. discriminative modeling
If a model predicts Y by providing information about Pr(X,Y), it is called a generative model
– Because we can use the model to generate data
– Example: Naïve Bayes
If a model predicts Y by providing information about Pr(Y|X) directly, without providing information about Pr(X,Y), it is called a discriminative model
– Example: logistic regression
Classification vs. regression
If there is a finite number of discrete, mutually exclusive concepts, and we want to find out which one is true for an observation, it is a classification problem and the model is called a classifier
– Given that the BRCA1 gene of a subject has a deleted exon 2, we want to predict whether the subject will develop breast cancer in his/her lifetime
  Y=1: the subject will develop breast cancer; Y=0: the subject will not
If Y takes on continuous values, it is a regression problem and the model is called an estimator
– Given that the BRCA1 gene of a subject has a deleted exon 2, we want to estimate the lifespan of the subject
  Y: lifespan of the subject
Bayes classifiers
In the example of flu (Y) and fever (X), we have seen that if we know Pr(X|Y) and Pr(Y), we can determine Pr(Y|X) by using Bayes' rule:
  Pr(Y|X) = Pr(X|Y) Pr(Y) / Pr(X)
– We use capital letters to represent variables (single-valued or vector), and small letters to represent values
  When we do not specify the value, the statement is true for all values. For example, all the following are true according to Bayes' rule:
  Pr(Y=1|X=1) = Pr(X=1|Y=1) Pr(Y=1) / Pr(X=1)
  Pr(Y=1|X=0) = Pr(X=0|Y=1) Pr(Y=1) / Pr(X=0)
  Pr(Y=0|X=1) = Pr(X=1|Y=0) Pr(Y=0) / Pr(X=1)
  Pr(Y=0|X=0) = Pr(X=0|Y=0) Pr(Y=0) / Pr(X=0)
Terminology
– Pr(Y) is called the prior probability
  E.g., Pr(Y=1) is the probability of having flu, without considering any evidence such as fever
  Can be considered the prior guess that the concept is true before seeing any evidence
– Pr(X|Y) is called the likelihood
  E.g., Pr(X=1|Y=1) is the probability of having fever if we know one has flu
– Pr(Y|X) is called the posterior probability
  E.g., Pr(Y=1|X=1) is the probability of having flu, after knowing that one has fever
Generalizations
In general, the above is true even if:
– X involves a set of features X = {X^(1), X^(2), ..., X^(m)} instead of a single feature
  Example: predict whether one has flu after knowing whether he/she has fever, headache and runny nose
– X can take on continuous values
  In that case, Pr(X) is the probability density of X
  Examples: predict whether a person has flu after knowing his/her body temperature; predict whether a gene is involved in a biological pathway given its expression values in several conditions
Parameter estimation
Let's consider the discrete case first
Suppose we want to estimate the parameters of our flu model by learning from a set of known examples, (X1, Y1), (X2, Y2), ..., (Xn, Yn) – the training set
How many parameters are there in the model?
– We need to know the prior probabilities, Pr(Y)
  Two parameters, Pr(Y=1) and Pr(Y=0); since Pr(Y=1) = 1 - Pr(Y=0), only one independent parameter
– We need to know the likelihoods, Pr(X|Y)
  Suppose we have m binary features: fever, headache, runny nose, ...
  2^(m+1) parameters for all X and Y value combinations
  2(2^m - 1) independent parameters, since for each value y of Y, the sum of Pr(X=x|Y=y) over all x is one
– Total: 2(2^m - 1) + 1 independent parameters
How large should n be in order to estimate these parameters accurately?
– Very large, given the exponential number of parameters
List of all the parameters
Let Y be having flu (Y=1) or not (Y=0)
Let X^(1) be having fever (X^(1)=1) or not (X^(1)=0)
Let X^(2) be having headache (X^(2)=1) or not (X^(2)=0)
Let X^(3) be having runny nose (X^(3)=1) or not (X^(3)=0)
Then the complete list of parameters for a generative model is (the last parameter in each group is not independent, since each group sums to one):
– Pr(Y=0), Pr(Y=1)
– Pr(X^(1)=0, X^(2)=0, X^(3)=0 | Y=0), Pr(X^(1)=0, X^(2)=0, X^(3)=1 | Y=0), Pr(X^(1)=0, X^(2)=1, X^(3)=0 | Y=0), Pr(X^(1)=0, X^(2)=1, X^(3)=1 | Y=0), Pr(X^(1)=1, X^(2)=0, X^(3)=0 | Y=0), Pr(X^(1)=1, X^(2)=0, X^(3)=1 | Y=0), Pr(X^(1)=1, X^(2)=1, X^(3)=0 | Y=0), Pr(X^(1)=1, X^(2)=1, X^(3)=1 | Y=0)
– Pr(X^(1)=0, X^(2)=0, X^(3)=0 | Y=1), ..., Pr(X^(1)=1, X^(2)=1, X^(3)=1 | Y=1) (the analogous eight terms for Y=1)
Why is having many parameters a problem?
Statistically, we would need a lot of data to accurately estimate the values of the parameters
– Imagine estimating the 15 independent parameters on the last slide with data from only 20 people
Computationally, estimating the values of an exponential number of parameters could take a long time
Conditional independence
One way to reduce the number of parameters is to assume conditional independence: if X^(1) and X^(2) are two features, then
– Pr(X^(1), X^(2) | Y) = Pr(X^(1) | Y, X^(2)) Pr(X^(2) | Y)   [standard probability]
                       = Pr(X^(1) | Y) Pr(X^(2) | Y)          [conditional independence assumption]
– E.g., the probability for a flu patient to have fever is independent of whether he/she has a runny nose
– Important: this does not imply unconditional independence, i.e., X^(1) and X^(2) are not assumed independent, and thus we cannot say Pr(X^(1), X^(2)) = Pr(X^(1)) Pr(X^(2))
  Without knowing whether a person has flu, having fever and having a runny nose are definitely correlated
Conditional independence and Naïve Bayes
Number of parameters after making the conditional independence assumption:
– 2 prior probabilities, Pr(Y=0) and Pr(Y=1)
  Only 1 independent parameter, as Pr(Y=1) = 1 - Pr(Y=0)
– 4m likelihoods Pr(X^(j)=x | Y=y), for all possible values of j, x and y
  Only 2m independent parameters, as Pr(X^(j)=1 | Y=y) = 1 - Pr(X^(j)=0 | Y=y) for all possible values of j and y
– Total: 2m+1 independent parameters, which is much smaller than 2(2^m - 1) + 1!
The resulting model is usually called a Naïve Bayes model
Estimating the parameters
Now, suppose we have the known examples (X1, Y1), (X2, Y2), ..., (Xn, Yn) in the training set
The prior probabilities can be estimated in this way:
– Pr(Y=y) = [Σ_i 1(Y_i = y)] / n, where 1(·) is the indicator function, with 1(true) = 1 and 1(false) = 0
– That is, the fraction of examples with class label y
Similarly, for any particular feature X^(j), its likelihoods can be estimated in this way:
– Pr(X^(j)=x | Y=y) = [Σ_i 1(X_i^(j) = x and Y_i = y)] / [Σ_i 1(Y_i = y)]
– That is, the fraction of class-y examples having value x at feature X^(j)
– To avoid zeros, we can add pseudo-counts:
  Pr(X^(j)=x | Y=y) = [Σ_i 1(X_i^(j) = x and Y_i = y) + c] / [Σ_i 1(Y_i = y) + 2c], where c has a small value (2c in the denominator because each binary feature has two possible values)
Example
Suppose we have a training set of 8 subjects, with features X^(1) (has fever?) and X^(2) (has headache?) and class Y (has flu?). Among the 3 subjects with flu, 2 have fever and 1 has headache; among the 5 subjects without flu, 2 have fever and 1 has headache.
How many parameters does the Naïve Bayes model have? (2m+1 = 5, with m = 2 features)
Estimated parameter values using the formulas on the last slide:
– Pr(Y=1) = 3/8
– Pr(X^(1)=1 | Y=1) = 2/3
– Pr(X^(1)=1 | Y=0) = 2/5
– Pr(X^(2)=1 | Y=1) = 1/3
– Pr(X^(2)=1 | Y=0) = 1/5
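A sketch of the estimation step. The per-subject rows of the original table were not preserved in this transcript, so the data below is a hypothetical training set consistent with the counts above; the function and variable names are illustrative:

```python
# Hypothetical training set consistent with the counts above.
# Each row: (fever, headache, flu), with 1 = yes, 0 = no.
DATA = [
    (1, 1, 1), (1, 0, 1), (0, 0, 1),                          # 3 flu subjects
    (1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 0), (0, 0, 0),    # 5 non-flu subjects
]

def estimate(data):
    """Maximum likelihood estimates for a two-feature Naive Bayes model."""
    n = len(data)
    flu = [row for row in data if row[2] == 1]
    no_flu = [row for row in data if row[2] == 0]
    p_y1 = len(flu) / n
    lik = {  # Pr(X^(j)=1 | Y=y) for j in {fever, headache} and y in {1, 0}
        ("fever", 1): sum(r[0] for r in flu) / len(flu),
        ("fever", 0): sum(r[0] for r in no_flu) / len(no_flu),
        ("headache", 1): sum(r[1] for r in flu) / len(flu),
        ("headache", 0): sum(r[1] for r in no_flu) / len(no_flu),
    }
    return p_y1, lik

p_y1, lik = estimate(DATA)
print(p_y1)  # 0.375 = 3/8
print(lik)   # fever: 2/3 and 2/5; headache: 1/3 and 1/5
```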
Meaning of the estimations
The formulas for estimating the parameters are intuitive
In fact they are also the maximum likelihood estimators, i.e., the values that maximize the likelihood if we assume the data were generated by independent Bernoulli trials:
– Let q = Pr(X^(j)=1 | Y=1) be the probability for a flu patient to have fever, and suppose n1 of the flu patients have fever and n0 do not
– The likelihood can be expressed as L(q) = q^n1 (1-q)^n0
  That is, if a flu patient has fever, we include a factor q in the product; if a flu patient does not have fever, we include a factor 1-q
– Finding the value of q that maximizes the likelihood is equivalent to finding the q that maximizes its logarithm, since logarithm is an increasing function (a > b if and only if ln a > ln b)
– This value can be found by differentiating the log likelihood and equating it to zero:
  d/dq [n1 ln q + n0 ln(1-q)] = n1/q - n0/(1-q) = 0, which gives q = n1/(n1+n0)
– The formula for estimating the prior probabilities Pr(Y) can be derived similarly
Short summary
So far, we have got the formulas for estimating the parameters of a Naïve Bayes model, which correspond to the parameter values, among all possible values, that maximize the data likelihood
The parameter estimates:
– Prior probabilities: Pr(Y=y) = [Σ_i 1(Y_i = y)] / n
– Likelihoods: Pr(X^(j)=x | Y=y) = [Σ_i 1(X_i^(j) = x and Y_i = y)] / [Σ_i 1(Y_i = y)]
Using the model
Now, with Pr(Y=y) and Pr(X^(j)=x | Y=y) estimated for all features j and all values x and y, the model can be applied to estimate Pr(Y=y|X) for any X, either in the training set or not
– Recall that, with the conditional independence assumption,
  Pr(Y=y|X) = Pr(Y=y) Π_j Pr(X^(j)=x^(j) | Y=y) / Σ_y' [Pr(Y=y') Π_j Pr(X^(j)=x^(j) | Y=y')]
– For classification, we can compare Pr(Y=1|X) and Pr(Y=0|X), and
  predict X to be of class 1 if the former is larger;
  predict X to be of class 0 if the latter is larger
Example
Using the same training data and the Naïve Bayes parameter values we previously estimated:
– Pr(Y=1) = 3/8
– Pr(X^(1)=1 | Y=1) = 2/3
– Pr(X^(1)=1 | Y=0) = 2/5
– Pr(X^(2)=1 | Y=1) = 1/3
– Pr(X^(2)=1 | Y=0) = 1/5
Now, for a new subject with fever but no headache, we would predict the probability of having flu as
  Pr(Y=1 | X^(1)=1, X^(2)=0)
  = Pr(X^(1)=1|Y=1) Pr(X^(2)=0|Y=1) Pr(Y=1) / [Pr(X^(1)=1|Y=1) Pr(X^(2)=0|Y=1) Pr(Y=1) + Pr(X^(1)=1|Y=0) Pr(X^(2)=0|Y=0) Pr(Y=0)]
  = (2/3)(1-1/3)(3/8) / [(2/3)(1-1/3)(3/8) + (2/5)(1-1/5)(1-3/8)]
  = 5/11
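Continuing the sketch from the estimation step above, a prediction function (names illustrative) that reproduces the 5/11 result:

```python
def predict_flu_prob(fever, headache, p_y1, lik):
    """Posterior Pr(Y=1 | X) under the Naive Bayes assumption,
    using the estimates returned by estimate() above."""
    def feature_prob(name, value, y):
        p1 = lik[(name, y)]
        return p1 if value == 1 else 1 - p1

    joint = {}
    for y, prior in ((1, p_y1), (0, 1 - p_y1)):
        joint[y] = (prior * feature_prob("fever", fever, y)
                          * feature_prob("headache", headache, y))
    return joint[1] / (joint[1] + joint[0])

print(predict_flu_prob(1, 0, p_y1, lik))  # 0.4545... = 5/11
```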
Numeric features
If X^(j) can take on continuous values, we need a continuous distribution instead of a discrete one
– Fever is a feature with binary values: 1 means "has fever", 0 means "does not have fever"
– Body temperature is a feature with continuous values
For features with binary values, we have assumed that each feature X^(j) has a Bernoulli distribution conditioned on Y, i.e., Pr(X^(j)=1 | Y=y) = q, with the value of the parameter q to be estimated
For continuous values, we can similarly estimate Pr(X^(j)=x | Y=y) based on an assumed distribution
Gaussian distribution
Suppose the body temperatures of flu patients follow a Gaussian distribution:
  f(x) = (1 / sqrt(2πσ²)) exp(-(x-μ)² / (2σ²))
– There are two parameters to estimate:
  The mean (center) of the distribution, μ
  The variance (spread) of the distribution, σ²
Estimating the parameters
Maximum likelihood estimation [optional]: given training values x1, x2, ..., xn, maximize the log likelihood
  ln L(μ, σ²) = Σ_i ln f(x_i) = -(n/2) ln(2πσ²) - Σ_i (x_i - μ)² / (2σ²)
Setting the partial derivatives with respect to μ and σ² to zero gives the estimates on the next slide
Estimating the parameters
Results:
– The formulas:
  μ̂ = (1/n) Σ_i x_i
  σ̂² = (1/n) Σ_i (x_i - μ̂)²
– Meanings: the mean and variance of the training data
The above formula for the variance is a biased estimator: if you have many sets of training data and each time you estimate the variance by this formula, the average of the estimates does not converge to the actual variance of the Gaussian distribution. One may instead use the sample variance (with denominator n-1), which is the minimum variance unbiased estimator – see further readings.
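A brief sketch of both estimators; the body temperatures are made-up illustrative numbers, and the function names are not from the slides:

```python
import math

def gaussian_mle(xs):
    """MLE mean and (biased) variance; the sample variance uses n-1 instead."""
    n = len(xs)
    mu = sum(xs) / n
    var_mle = sum((x - mu) ** 2 for x in xs) / n         # biased MLE
    var_sample = sum((x - mu) ** 2 for x in xs) / (n - 1)  # unbiased
    return mu, var_mle, var_sample

def gaussian_pdf(x, mu, var):
    """Density used as the likelihood Pr(X=x | Y=y) for a numeric feature."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Illustrative body temperatures of flu patients:
temps = [38.2, 39.0, 38.6, 38.9, 38.4]
mu, var_mle, var_sample = gaussian_mle(temps)
print(mu, var_mle, var_sample, gaussian_pdf(38.5, mu, var_mle))
```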
Epilogue: Case Study, Summary and Further Readings
Case study: fallacies related to statistics
"According to this gene model, this DNA sequence has a data likelihood of 0.6, while according to this model for intergenic regions, this DNA sequence has a data likelihood of 0.1. Therefore the sequence is more likely to be a gene."
– Right or wrong?
Case study: fallacies related to statistics
Likelihood vs. posterior:
– If Y represents whether the sequence is a gene (Y=1) or not (Y=0), and X is the sequence features, then the above statement is comparing the likelihoods Pr(X|Y=1) and Pr(X|Y=0); but what we actually want is the posterior Pr(Y|X) = Pr(X|Y) Pr(Y) / Pr(X), and the prior Pr(Y=1) << Pr(Y=0), since genes cover only a small fraction of the genome
Another famous example: "This cancer test has a 99% accuracy, and is therefore highly reliable."
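To see why high accuracy alone can mislead, here is a sketch applying Bayes' rule from earlier; the sensitivity, specificity and prevalence figures are illustrative assumptions, not from the slides:

```python
# Assumed numbers: 99% sensitivity, 99% specificity, 0.5% prevalence.
sensitivity = 0.99   # Pr(test positive | cancer)
specificity = 0.99   # Pr(test negative | no cancer)
prevalence = 0.005   # prior Pr(cancer)

# Law of total probability, then Bayes' rule.
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_cancer_given_positive = sensitivity * prevalence / p_positive

print(p_cancer_given_positive)  # about 0.33: most positives are false alarms
```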
Case study: fallacies related to statistics
"Drug A is more effective than drug B for our male patients. Drug A is also more effective than drug B for our female patients. Therefore drug A is a better drug than drug B in general."
– Right or wrong?
Case study: fallacies related to statistics
Simpson's paradox:
– Consider this situation:

              Drug A                    Drug B
              Effective  Ineffective   Effective  Ineffective
  Male           60          40            5           5
  Female          7           3           65          35
  Total          67          43           70          40

– Drug A is more effective for males (60% vs. 50%) and for females (70% vs. 65%), yet drug B is more effective overall (70/110 ≈ 64% vs. 67/110 ≈ 61%)
– Again, it is related to the different priors Pr(Gender) for the two drugs. You may argue that more females can be recruited to test drug A and more males can be recruited to test drug B.
– How about "the rate of a disease is higher for both males and females in population A than in population B"?
Summary
Sequence motifs:
– Representations: strengths and weaknesses
– k-mer counting
Statistical modeling allows us to predict the class Y (e.g., has flu) of an object by combining some observed features X (e.g., body temperature, fever and runny nose) and some parameters θ
– Generative models: predict Pr(Y|X) by modeling Pr(Y) and Pr(X|Y)
  Example: Naïve Bayes classifier
– Discriminative models: predict Pr(Y|X) by modeling it directly
Further readings
Ghandi et al., PLOS Computational Biology 10(7):e1003711 (2014)
– More details about k-mer counting algorithms
A book chapter written by Tom Mitchell, Generative and Discriminative Classifiers: Naïve Bayes and Logistic Regression
– Available at http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
Note: in both cases, some notations are different from what we use here