# Introduction to Bioinformatics: Probability Calculations in Bioinformatics


Introduction to Bioinformatics: Probability Calculations in Bioinformatics. http://www.people.vcu.edu/~elhaij/bnfo301-12/ If you’re in the middle of bioinformatics, you are undoubtedly surrounded by a large number of things: nucleotides, genes, metabolites. Large numbers mean that your usual idiocy-checking facilities often don’t work, and without them, large numbers often lead to large, embarrassing mistakes. Probability calculations are often your best defense against foolishness.

Probability Calculations in Bioinformatics. Topics: utility of probability calculations in bioinformatics; the Rule of Multiplication; the Rule of Addition; the Rule of Subtraction; the Rule of Everything; final thoughts.

Utility of Probability Calculations Such calculations are useful in a large variety of circumstances. Here are a few, chosen to illustrate certain tools useful in calculation.

Utility of Probability Calculations How frequently would a DNA sequence appear by chance? This question arises in many disguises. I’ll consider some you’re familiar with.

Utility of Probability Calculations How frequently would a DNA sequence appear by chance? You’ll recall this question from Problem Set 1. I emphasized that question because similar questions so commonly arise when considering sequences.

Utility of Probability Calculations How frequently would a DNA sequence appear by chance? Given the proposed binding site, how frequently would you expect RNA polymerase to bind to random DNA? Here’s another instance of the same sort of question. You have some idea about what a certain protein is looking for when it binds DNA. If you’re right, then you might expect the probability of random binding to be relatively low.

Utility of Probability Calculations How frequently would a DNA sequence appear by chance? How specific does the DNA binding site need to be to prevent unwanted repression? The same question approached from the opposite end. A probability calculation tells you how specific you should expect a biologically relevant binding site to be.

Utility of Probability Calculations How frequently would a DNA sequence appear by chance? How much overlap is required to ensure a meaningful sequence assembly? GAATATGAGCCTCTTCCTGA GAAGTTTTCGCATAAAT In sequence assembly, a simple probability calculation helps you judge whether an overlap is worth your attention.

Utility of Probability Calculations How frequently would a DNA sequence appear by chance? What’s the probability that an arginine encoded by AGA will mutate to a hydrophilic amino acid? This kind of question has some theoretical importance. Is the genetic code arranged in such a way that the potential for harmful mutations is minimized?

Utility of Probability Calculations How frequently would a DNA sequence appear by chance? What’s the probability that an arginine encoded by AGA will mutate to a hydrophilic amino acid? How many nucleotides will be missing if a genome sequencing project is taken to 6x coverage? (Oversampling and completeness.) Sequencing a large genome can be expensive! How can you calculate whether the amount of sequencing is enough to produce a reasonably complete genome?

Tools of Probability Calculations How to calculate these probabilities? Probability calculations can be hideously complex, but fortunately, most of the calculations you’ll run across in bioinformatics are of the simple variety, requiring only a few simple tools.

Tools of Probability Calculations: Rule of multiplication (intersection); Rule of addition (union); Rule of subtraction (complementation); Rule of everything. Probability calculations often boil down to creative counting: how many ways are there that satisfy your criteria? However, I won’t go into that much in this presentation. Instead, I’ll go through three rules, each considered first within the context of a simple calculation and then within one of bioinformatic relevance.

Rule of multiplication (Intersection of possibilities). What’s the probability that Coin#1 AND Coin#2 come up tails? A seemingly simple question…

If you’re certain that the four possible outcomes are all equally likely, then you can just count… 1 desired outcome in 4 possible… 1/4.

Or you can calculate the probability of two simple events both occurring. The probability that the first coin lands tails should be 1/2… P(gets T from first) = 1/2

…and the probability the second lands tails should be the same. How do you get from the two individual probabilities the probability that both occur? P(gets T from first) = 1/2 AND P(gets T from second) = 1/2

The probability that both occur is the product of the two individual probabilities: P(TT) = 1/2 x 1/2 = 1/4. Why?

…well, in the universe of possibilities, half the time the first coin lands tails, and in half of those possibilities, the second coin lands tails. Half of half… 1/2 x 1/2.
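The two routes to 1/4 (counting equally likely outcomes, or multiplying the individual probabilities) can be checked against each other. The following is only a minimal sketch, assuming a fair coin and independent flips; the variable names are illustrative and not part of the original slides.

```python
from fractions import Fraction
from itertools import product

# Route 1: count. List all outcomes of two flips and pick out tails-tails.
outcomes = list(product("HT", repeat=2))  # ('H','H'), ('H','T'), ('T','H'), ('T','T')
favorable = [o for o in outcomes if o == ("T", "T")]
p_by_counting = Fraction(len(favorable), len(outcomes))

# Route 2: multiply. Assumes a fair coin and independent flips.
p_tails = Fraction(1, 2)
p_by_multiplying = p_tails * p_tails

print(p_by_counting, p_by_multiplying)  # 1/4 1/4
```

Counting works here only because all four outcomes are equally likely; multiplying works only because the flips are taken to be independent.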

Rule of multiplication (Intersection of possibilities). P(TT) = 1/2 x 1/2 = 1/4. When can you resort to multiplying the probabilities of events to get the joint probability of both events occurring?

The Rule of Multiplication may apply if you’re looking for the intersection of two possibilities: both the first coin lands tails AND the second coin lands tails.

But it is also necessary that the two events be independent of one another. What is meant by independent?

What does independent mean? (To illustrate…) I wanted to find how likely it is for there to be a series of nice days in a row here in Richmond. So I went on the web…

What does independent mean? …and found that historically, one out of three days in February had some amount of rain. That’s an average, of course.

What does independent mean? …but when I went to the weather prediction for the week, I found that there were seven consecutive days for which no rain was predicted. Is that credible? Are we looking at a remarkable occurrence, perhaps an effect of global warming?

What does independent mean? What’s the probability of 7 non-rainy days in a row? P(no rain on days 1 through 7) = ?

I know the probability of no rain on the first day. So long as the historical average is pertinent, the probability should be 2/3: P(no rain on day 1) = 2/3

…and similarly for days 2 through 7: P(no rain on day 2) = 2/3, and likewise for days 3, 4, 5, 6, and 7. The Rule of Multiplication tells me how I might combine these probabilities. How?

I certainly want the intersection of all seven events, i.e. I’m asking for the joint probability of all seven events occurring. According to the rule, I should therefore be able to multiply the individual probabilities: P(no rain on days 1 through 7) = 2/3 x 2/3 x 2/3 x 2/3 x 2/3 x 2/3 x 2/3 = (2/3)^7

You can reach for your calculator (or computer), but before you do this calculation, indeed before you do any calculation, you should have an estimate in mind of what you expect the answer to be. Otherwise you are selling your soul to the machine.

Imagine your estimate one day at a time. Where on the number line (running from 0.0 to 1.0) would you place the probability of just one non-rainy day?

What does independent mean? Yes, 2/3, about 0.67. What about two non-rainy days? What’s 2/3 of 2/3? Mentally divide the interval between 0 and 0.67 into thirds, and move the arrow down to the 2/3 of 2/3 mark.

That’s about 0.45. Now divide the new interval into thirds as before to reach 2/3 of 2/3 of 2/3, and move the arrow down again. Notice that to calculate the amount to move down, all you have to do is divide the number by 3.

Down to 0.30, the calculated probability of three non-rainy days in a row. Again, divide the new interval into thirds, and move the arrow down.

Down to about 0.20 for four non-rainy days in a row. Again,…

Maybe 0.14. Again for six non-rainy days in a row,…

To about 0.09. Last time for the seventh day…

Close to 0.06. That should be what the calculator/computer gives us for (2/3)^7, the calculated probability for seven non-rainy days in a row. Actually, no need for the machine.

(2/3)^7 = ~0.06. That doesn’t sound very likely! Are we in an unusual stretch of weather?
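Once you have your own estimate in hand, the machine can confirm it. The following is a minimal sketch of the day-by-day walk down the number line, assuming (as the calculation so far does) that each day is independent of the others; the function name is illustrative.

```python
from fractions import Fraction

P_DRY = Fraction(2, 3)  # historical chance of a day with no rain (from the slides)

def dry_streak_probabilities(days, p_dry=P_DRY):
    """Return [P(1 dry day), P(2 dry days in a row), ...], ASSUMING independent days."""
    probs = []
    running = Fraction(1)
    for _ in range(days):
        running *= p_dry  # Rule of Multiplication: one more dry day
        probs.append(running)
    return probs

streak = dry_streak_probabilities(7)
for n, p in enumerate(streak, start=1):
    print(f"{n} dry day(s) in a row: {float(p):.3f}")
```

The last value printed is (2/3)^7 = 128/2187, a little under 0.06, which agrees with the mental estimate.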

What does independent mean? Or say today is Thursday, and suppose it indeed rains. What’s the probability that it will rain tomorrow? The historical record seems to say there’s a one in three chance…

…but of course that’s absurd! Knowing that it rained today makes it much more likely that it will rain tomorrow. The two events are not independent.

Whenever the outcome of one event biases the outcome of another, those two events are not independent. If two events are not independent, then the Rule of Multiplication cannot be applied to obtain a joint probability.
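To get a feel for how much dependence can matter, here is a hedged sketch with a toy weather model in which today's weather biases tomorrow's. The two conditional probabilities below are invented for illustration (they are not from the slides or from any weather record); they were chosen only so that the long-run fraction of dry days still comes out to 2/3.

```python
from fractions import Fraction

P_DRY = Fraction(2, 3)             # long-run fraction of dry days (from the slides)
P_DRY_AFTER_DRY = Fraction(4, 5)   # ASSUMED: a dry day tends to follow a dry day
P_DRY_AFTER_RAIN = Fraction(2, 5)  # ASSUMED: chosen so the long-run average stays 2/3

# Sanity check: the assumed numbers reproduce the 2/3 historical average.
long_run_dry = P_DRY * P_DRY_AFTER_DRY + (1 - P_DRY) * P_DRY_AFTER_RAIN

def naive_streak(days):
    """Rule of Multiplication applied as if the days were independent."""
    return P_DRY ** days

def dependent_streak(days):
    """First day dry, then each later day dry GIVEN the previous day was dry."""
    return P_DRY * P_DRY_AFTER_DRY ** (days - 1)

print(f"independent days: {float(naive_streak(7)):.3f}")
print(f"dependent days:   {float(dependent_streak(7)):.3f}")
```

Under these made-up numbers a dry week is roughly three times as likely as (2/3)^7 suggests, even though the historical average is unchanged. The multiplication still happens, but with conditional probabilities rather than the simple historical one.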

Rule of multiplication (Intersection of possibilities). P(TT) = 1/2 x 1/2 = 1/4. Are the results of two coin flips independent of one another?

Probably,… but maybe not. Maybe the coins are magnets and subtly influence each other’s flight.

Rule of multiplication (Intersection of possibilities). GTATAC. Back to DNA… Is this question related to the coin-flip question? Can you reword the question so that it is?

Maybe: What’s the probability that Nucleotide#1 is G AND Nucleotide#2 is T AND Nucleotide#3 is A AND Nucleotide#4 is T AND Nucleotide#5 is A AND Nucleotide#6 is C? …but the question asks about a random piece of DNA, not specific nucleotides.

OK, back to coins for a moment. How do the following two questions differ from one another? How often would you expect to find [image] in a coin flip? How often would you expect to find [image] in a random series of coin flips?

The answer to the first is surely 50%. The answer to the second is… the same, no?

What about these two questions? How often would you expect to find [image] in 3 coin flips? How often would you expect to find [image] in a random series of coin flips?

Again, the answer is the same for both. (And while in the area, what is the answer, presuming a fair coin?)

And in the context of DNA… How do these two questions differ? How often would you expect to find GTATAC in 6 random nucleotides? How often would you expect to find GTATAC in a random piece of DNA?

Answer: They don’t. So long as the question calls for a frequency, you can answer either form.

So choose what seems to be the simpler form… How often would you expect to find GTATAC in 6 random nucleotides?

Rule of multiplication (Intersection of possibilities). How often would you expect to find GTATAC in 6 random nucleotides? Now how to proceed? If you’re ever stuck on a problem, simplify it until you reach a problem that you can answer. For example,…

What’s the probability that one random nucleotide is a G? P(G in position 1) = p1. By now you realize that this question depends on the organism. The answer might be 25%, but more likely it is significantly higher or lower. For now, just call the answer p1.

And the same is true for the nucleotide in the other 5 positions. Each probability is some number, usually easy to obtain: P(G in position 1) = p1 AND P(T in position 2) = p2 AND P(A in position 3) = p3 AND P(T in position 4) = p4 AND P(A in position 5) = p5 AND P(C in position 6) = p6. But how do we combine these six numbers for a single expected frequency for GTATAC? Are we looking for an intersection of events? Are the events independent?

If so, then you can use the Rule of Multiplication.
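As a sketch of how p1 through p6 might be combined, the following multiplies per-position base frequencies, assuming the positions are independent. The function name and both frequency tables are illustrative; in particular the AT-rich table is a hypothetical composition, not data from any real organism.

```python
from math import prod

def expected_frequency(site, base_freqs):
    """Product of the per-position base probabilities (ASSUMES independent positions)."""
    return prod(base_freqs[base] for base in site)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
at_rich = {"A": 0.32, "C": 0.18, "G": 0.18, "T": 0.32}  # HYPOTHETICAL composition

for label, freqs in (("uniform", uniform), ("AT-rich", at_rich)):
    print(f"{label}: {expected_frequency('GTATAC', freqs):.6f}")
```

With uniform base frequencies the answer is (1/4)^6 = 1/4096. With the hypothetical AT-rich composition the same site is expected more often, which is why the answer depends on the organism.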

Tools of Probability Calculations: Rule of multiplication (intersection); Rule of addition (union); Rule of subtraction (complementation); Rule of everything. On to another tool, introduced through another simple calculation and a problem of bioinformatic relevance.

Rule of addition (Union of possibilities). Probability that Coin#1 OR Coin#2 comes up tails. Another seemingly simple question… (note that OR always includes the possibility that both events occur).

In this problem, it’s easy to count the number of (equally likely?) events to get the answer, 3/4.

We can also calculate the result, noting the probability of each desired outcome. P(gets T from 1st but not 2nd) = 1/4

You’ll notice that each new event increases the likelihood. P(gets T from 1st but not 2nd) + P(gets T from 2nd but not 1st) = 1/4 + 1/4

But the likelihood can never be greater than 100%. P(gets T from 1st but not 2nd) + P(gets T from 2nd but not 1st) + P(gets T from both) = 1/4 + 1/4 + 1/4

The probability that one of the outcomes occurs is 3/4, the sum of the individual probabilities: P(at least one T) = 1/4 + 1/4 + 1/4 = 3/4
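The sum can be written out as a small sketch. It assumes fair, independent coins, so that each of the four two-coin outcomes has probability 1/4; the dictionary keys are just labels for the outcomes named above.

```python
from fractions import Fraction

p_each = Fraction(1, 4)  # every two-coin outcome is equally likely (fair coins assumed)

# The three desired outcomes cannot happen together, so their probabilities add.
exclusive_outcomes = {
    "T from 1st but not 2nd": p_each,
    "T from 2nd but not 1st": p_each,
    "T from both": p_each,
}
p_at_least_one_tail = sum(exclusive_outcomes.values())

# Together with the one remaining outcome (both heads), everything adds to 1.
p_total = p_at_least_one_tail + p_each

print(p_at_least_one_tail, p_total)  # 3/4 1
```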

Rule of addition (Union of possibilities). P(at least one T) = 1/4 + 1/4 + 1/4 = 3/4. When can you resort to summing the probabilities of outcomes to get the probability of at least one of the outcomes occurring?

The Rule of Addition may apply if you’re looking for the union of multiple possibilities: Either TH OR HT OR TT has occurred.

But it is also necessary that the events be mutually exclusive of one another. What is meant by mutually exclusive?

What does mutually exclusive mean? Probability that it rains Thursday OR Friday OR Saturday. (To illustrate…) The plants in my garden are drying up. If it doesn’t rain in one of the next three days, they’ll die… unless I shake myself out of my lethargy and go outside and water them. I’m not quite prepared to do that,… I’d rather calculate how likely it is to rain.

No problem… I recall from 51 slides ago that the historical frequency of rain is 1 in 3, and presuming that’s true for all three days… P(rain Thursday) = P(rain Friday) = P(rain Saturday) = 1/3

Add up the three possibilities, and… Hey! No need to water! P(rain Thursday OR Friday OR Saturday) = 1/3 + 1/3 + 1/3 = 3/3 = 100% ???

What’s wrong with this scenario? (absolutely nothing in my opinion, but I’m talking about the Rule of Addition)

Well, using the Rule of Addition implies that I can add outcomes like slices of a pie. The slices can’t overlap. [Pie chart: It will rain… Thu, Fri, Sat, each slice 1/3]

But that's ridiculous! It's not true that raining on Thursday makes raining on Friday or Saturday impossible! I could fix this…

Now the slices separate things that should be separated. Raining all day excludes sunning all day. [Pie chart: The weather on Thursday will be… rain all day, sun all day, or some rain, some sun]

Or I could list mutually exclusive outcomes, e.g., on Thu-Fri-Sat it rained-rained-rained, or it rained-rained-sunned, etc.

The problem with my addition of probabilities before was that the events were not mutually exclusive, so the rule doesn't work.

What is the probability that it will rain Thu, Fri, or Sat? You can count, but we'll soon consider a better strategy.
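Here is what the counting might look like, as a sketch only. It lists the eight mutually exclusive Thu-Fri-Sat outcomes and adds up those with rain on at least one day. To put a number on each outcome it treats the three days as independent, which the earlier slides warn is not really true of weather, so take the result as an illustration of the bookkeeping rather than a forecast.

```python
from fractions import Fraction
from itertools import product

P_RAIN = Fraction(1, 3)       # historical frequency of rain (from the slides)
DAYS = ("Thu", "Fri", "Sat")

def outcome_probability(outcome):
    """Probability of one rain/no-rain pattern, ASSUMING independent days."""
    p = Fraction(1)
    for rained in outcome:
        p *= P_RAIN if rained else 1 - P_RAIN
    return p

# Eight mutually exclusive outcomes: exactly one of them must happen.
outcomes = list(product((True, False), repeat=len(DAYS)))
p_all = sum(outcome_probability(o) for o in outcomes)
p_some_rain = sum(outcome_probability(o) for o in outcomes if any(o))

print(p_all, p_some_rain)  # 1 19/27
```

Because the eight patterns cannot overlap, the Rule of Addition applies to them, and the answer stays safely below 100%.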

Rule of addition (Union of possibilities). What’s the probability that an arginine encoded by AGA will mutate to a hydrophilic amino acid? P(AGA → Gly) = 1/9, P(AGA → Lys) = 1/9, P(AGA → Ser) = 2/9, P(AGA → Thr) = 1/9. Sum? Here's a molecular biology problem I could solve using the Rule of Addition as shown (and the presumption that all nucleotide mutations are equally likely). But is this a valid use of the rule?

It is, so long as the events are mutually exclusive. Does the mutation of arginine to glycine exclude the possibility of that amino acid mutating to lysine? Certainly! It can't become two different things!
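The ninths can be reproduced by listing every single-nucleotide change to AGA. This is a minimal sketch: the lookup table holds only the nine neighboring codons (written with T for U, per the standard genetic code), it assumes all nine changes are equally likely as the slide does, and the target set is simply the four amino acids listed above.

```python
from fractions import Fraction

# Standard genetic code, restricted to the nine single-nucleotide neighbors of AGA.
NEIGHBOR_CODONS = {
    "CGA": "Arg", "GGA": "Gly", "TGA": "Stop",
    "AAA": "Lys", "ACA": "Thr", "ATA": "Ile",
    "AGC": "Ser", "AGG": "Arg", "AGT": "Ser",
}
TARGETS = {"Gly", "Lys", "Ser", "Thr"}  # the amino acids listed on the slide

def single_mutants(codon):
    """Yield every codon that differs from `codon` at exactly one position."""
    for pos, original in enumerate(codon):
        for base in "ACGT":
            if base != original:
                yield codon[:pos] + base + codon[pos + 1:]

mutants = list(single_mutants("AGA"))
p_each = Fraction(1, len(mutants))  # ASSUMES all nine mutations are equally likely
p_target = sum(p_each for m in mutants if NEIGHBOR_CODONS[m] in TARGETS)

print(len(mutants), p_target)  # 9 5/9
```

Each mutant codon is a distinct, mutually exclusive outcome, so adding their probabilities is a legitimate use of the Rule of Addition.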

Tools of Probability Calculations: Rule of multiplication (intersection); Rule of addition (union); Rule of subtraction (complementation); Rule of everything. On to another tool, introduced through another simple calculation.

Rule of subtraction (Complementation of possibilities). Probability that at least one coin comes up tails. We solved this problem before, but can we do so using only the probabilities of the individual events, i.e. P(T) = 1/2?

How about P(at least 1 T) = P(T1) x P(T2) = 1/2 x 1/2?

That doesn't work. The events are independent, as required by the Rule of Multiplication, but we're not looking for 1st coin tails AND 2nd coin tails.

How about P(at least 1 T) = P(T1) + P(T2) = 1/2 + 1/2?

This is somewhat better, since we are looking for either the first coin falling tails OR the second doing so, but...

…the events are not mutually exclusive, as required by the Rule of Addition. It's possible that both Coin#1 and Coin#2 land tails.

The problem becomes simpler if I change the wording… Probability that it is not true that both coins come up heads.

This says the same thing, but it's easier to calculate, at least the both-heads part. We've seen that before, and solved it with the Rule of Multiplication. P(HH) = P(H) x P(H) = 1/2 x 1/2 = 1/4

But we don't want the probability of both heads. Rather, we want the probability of NOT both-heads. What would that be? P(NOT HH) = ???

We can make use of the fact that P(HH) + P(NOT HH) = 1, i.e. there's a 100% chance that two heads either occur or do not occur. P(NOT HH) = 1 - P(HH)

…and from this we can solve the problem. P(NOT HH) = 1 - 1/4 = 3/4 = P(at least one T)

This trick of simplifying the question worked because the simplification went from a statement to its complement: If one were true the other must be false. Rule of Subtraction (NOT): Go from yin to yang. Probabilities add to 1.
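The yin-to-yang step can be packaged as a small helper. This is a sketch under the same assumptions as the coin example (independent tries, each with the same probability); the function name is illustrative.

```python
from fractions import Fraction

def p_at_least_one(p_single, trials):
    """P(event happens at least once in `trials` tries), ASSUMING independent tries."""
    p_never = (1 - p_single) ** trials  # Rule of Multiplication on the complement
    return 1 - p_never                  # Rule of Subtraction

p_two_coins = p_at_least_one(Fraction(1, 2), 2)
print(p_two_coins)  # 3/4
```

For two fair coins this reproduces 1 - 1/4 = 3/4, the same answer obtained earlier by counting.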

Tools of Probability Calculations: Rule of multiplication (intersection); Rule of addition (union); Rule of subtraction (complementation); Rule of everything. Now on to the main event…

Rule of everything (Do the right thing). Rule of Everything: Don’t apply rules mindlessly. There is no rule so good and so general that it can’t be mangled and abused. Rules cannot replace thought. Visualize what you’re trying to calculate. Estimate what the final number ought to be. Maintain control over the proceedings. Don’t rely on a dumb rule to lead you to success.

How many nucleotides will be missing if a genome sequencing project is taken to 6x coverage? Oversampling Completeness Here we are… You're sequencing the 120 Mb Drosophila genome. Some damn fool budgeted for only 6x coverage. Is that enough? At 6x coverage, what fraction of the genome will be sequenced? How many nucleotides will be missed? Rule of everything (Do the right thing)

How many nucleotides will be missing if a genome sequencing project is taken to 6x coverage? Oversampling Completeness This isn't an easy question to answer, but it's essentially identical to another question you've seen, one that might be a bit easier to think about… Rule of everything (Do the right thing)

Oversampling Completeness How many nucleotides will be missing if a genome sequencing project is taken to 6x coverage? You may remember this one… You're painting a wall by throwing sponges randomly. After a while, almost every sponge you toss overlaps the paint from a previous toss. At 1x coverage, you've tossed 1000 1 sq" sponges, but it doesn't come close to painting the entire 1000 sq" wall. What about 6x coverage? How much surface area will be missing if a wall- painting project is taken to 6x coverage? Rule of everything (Do the right thing)

If the proper course still seems mysterious, consider a lesson from the past… How much surface area will be missing if a wall- painting project is taken to 6x coverage? Rule of everything (Do the right thing)

If the proper course still seems mysterious, consider a lesson from the past… A general problem may be equivalent to a more specific problem that’s easier to visualize How much surface area will be missing if a wall- painting project is taken to 6x coverage? Rule of everything (Do the right thing) How often would you expect to find in 3 coin flips? How often would you expect to find in a random series of coin flips?

A strategy might become clearer if you focus on just one spot on the wall (shown in red). What is the probability that this spot remains unpainted? How much surface area will be missing if a wall- painting project is taken to 6x coverage? Rule of everything (Do the right thing)

Still no strategy? Then consider another lesson from the past…

If you’re ever stuck on a problem, simplify it until you reach a problem that you can answer. For example,… P(G in position 1) = p1, for a sequence with positions 1 to 6.

OK, instead of 6x coverage, try one-sponge coverage… What is the probability that this spot remains unpainted, if you throw just one sponge against the wall? (Figure: a 40" x 25" wall and a single 1 sq" sponge.)

…probability that a spot remains unpainted… Still not bursting with meaning perhaps, but recall the Rule of Subtraction (NOT): go from yin to yang, because probabilities add to 1. Applying that to the question (canceling double negatives) gives…

P(spot unpainted with one sponge) = 1 - P(spot painted with one sponge). Here’s a problem you can do! What fraction of the wall is covered by a single sponge?

P(spot unpainted with one sponge) = 1 - P(spot painted with one sponge) = 1 - (1 sq" / 1000 sq") = 0.999. So we have an answer for one sponge. What about two sponges?

P(spot unpainted with Sponge#1 AND unpainted with Sponge#2): sounds like the Rule of Multiplication. We’re surely looking for the intersection of the two events, but are the events independent of each other? If they are, the probability is 0.999 x 0.999.

Independent? Not if the sponge thrower is learning from previous results, but the scenario claimed that the tosses were random, so OK. What about a large number of sponges (call that number n)?

The Rule of Multiplication doesn’t care how many sponges you throw. How would you write the product for n sponges? P(spot unpainted with Sponge#1 AND Sponge#2 AND Sponge#3 AND Sponge#4 AND…) = ???

P(spot unpainted with Sponge#1 AND Sponge#2 AND Sponge#3 AND Sponge#4 AND…) = (0.999)(0.999)(0.999)… = (0.999)^n. All that’s left is to determine what n is. How many sponges are implied by 6x coverage? (If that throws you, consider the definition of 1x coverage.)
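
If you would rather let a computer do the multiplying, here is a minimal Python sketch of the calculation so far. The function name `fraction_unpainted` and its parameters are made up for illustration. The only numbers taken from the slides are the 1000 sq" wall, the 1 sq" sponge, and the fact that 1x coverage means 1000 sponges.

```python
def fraction_unpainted(wall_area, sponge_area, coverage):
    """Probability that one spot is still unpainted after random sponge tosses."""
    # 1x coverage means the total sponge area thrown equals the wall area,
    # so n = coverage * wall_area / sponge_area.
    n_sponges = round(coverage * wall_area / sponge_area)
    # Rule of Subtraction: P(one sponge misses the spot) = 1 - P(it hits).
    p_miss_one = 1 - sponge_area / wall_area
    # Rule of Multiplication for n independent tosses.
    return p_miss_one ** n_sponges


for coverage in (1, 6):
    frac = fraction_unpainted(wall_area=1000, sponge_area=1, coverage=coverage)
    print(f"{coverage}x coverage: {frac:.4f} of the wall unpainted")
```

At 1x coverage roughly 37% of the wall is still bare. At 6x the bare fraction drops to about a quarter of one percent.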

Sponges, reads… what’s the difference? You should now be able to answer the question regarding the Drosophila genome. How many nucleotides will be missing if a genome sequencing project is taken to 6x coverage?
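
Here is one way the same reasoning might be carried over to the genome, as a sketch under stated assumptions rather than the official answer. The slides give the genome size (120 Mb) and the coverage (6x) but not the read length, so the 500-nucleotide read below is purely hypothetical. The sketch also assumes reads land at random positions and ignores what happens at the ends of chromosomes.

```python
import math


def expected_missing(genome_size, read_length, coverage):
    """Expected number of nucleotides covered by no read at all."""
    # Number of reads needed so that total read length = coverage * genome size.
    n_reads = round(coverage * genome_size / read_length)
    # A given nucleotide is hit by one random read with probability
    # read_length / genome_size; Rule of Subtraction gives the miss probability.
    p_missed_by_one = 1 - read_length / genome_size
    # Rule of Multiplication over all (independent) reads.
    p_missed_by_all = p_missed_by_one ** n_reads
    return p_missed_by_all * genome_size


GENOME_SIZE = 120_000_000   # 120 Mb, from the slides
READ_LENGTH = 500           # hypothetical; not given in the slides
missing = expected_missing(GENOME_SIZE, READ_LENGTH, coverage=6)
print(f"Expected missing nucleotides: {missing:,.0f}")
print(f"Fraction missing: {missing / GENOME_SIZE:.5f}")
print(f"Compare exp(-6): {math.exp(-6):.5f}")
```

Under these assumptions roughly 300,000 nucleotides go unsequenced, about 0.25% of the genome. The answer barely changes if you pick a different read length, because (1 - L/G)^(6G/L) is very close to e^-6 whenever L is tiny compared with G.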

Final Thoughts I’ve presented several possibly useful tools in this presentation, but I must reiterate the Rule of Everything: Rules don’t give answers. They merely give you tools through which a thoughtful brain can find an answer. Figuring out how to use available tools to connect a confusing question to a satisfying answer is often difficult, and it is easy to fool yourself. Don’t let yourself be fooled! How to avoid this fate?

Check everything that can conceivably be checked! Don’t trust that your theory is correct. You think that the probability of flipping three tails with three coins is x%? Flip the coins and check! Sometimes (almost all the time in bioinformatics) it is impractical to check some theory in the real world. You need a computer to generate large numbers. Then use a computer to generate large numbers. And then check to make sure the computer is generating them correctly.
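
As an illustration of that kind of check, here is a small Python sketch (not the BioBIKE function mentioned in these slides) that flips three simulated coins many times and compares the observed frequency of three tails with the Rule of Multiplication prediction. The function name and the number of trials are arbitrary choices.

```python
import random


def estimate_three_tails(n_trials, seed=0):
    """Fraction of trials in which three simulated fair coins all land tails."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = 0
    for _ in range(n_trials):
        flips = [rng.choice("HT") for _ in range(3)]
        if flips == ["T", "T", "T"]:
            hits += 1
    return hits / n_trials


expected = 0.5 ** 3  # Rule of Multiplication: 1/2 x 1/2 x 1/2
observed = estimate_three_tails(100_000)
print(f"expected {expected:.4f}, observed {observed:.4f}")
```

The two numbers should agree to within a percent or so. If they don't, either the theory or the simulation is wrong, and it is worth finding out which.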

Here’s an example. You will soon be able to make a function in BioBIKE that will simulate coin flips. In the meantime, I’ve supplied such a function. To use the function…

Go into CyanoBIKE, mouse into the FILE menu, and click User Contributed Stuff

This will bring up a menu of user-contributed functions. Click Use this package next to Probability Games.

You’ll get for your efforts a new button, FUNCTIONS. Mouse over that button and click TOSS-COIN.

Try executing the function several times to see whether it matches your notion of what a coin tossing function ought to do.

Once satisfied… I hope you’re not satisfied by a few tosses. How do you know it simulates a coin that produces 50% heads and 50% tails? Check that! Mouse over the Options icon and select Trials.

Once you have a large number of results, you can COUNT the number of heads and tails to see whether it corresponds with your expectations.
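
If you don't have BioBIKE at hand, the same check can be sketched in Python. The `toss_coin` function below is a hypothetical stand-in written for this example, not the TOSS-COIN function from the Probability Games package. The loop plays the role of the Trials option and `Counter` plays the role of COUNT.

```python
import random
from collections import Counter


def toss_coin(rng):
    """Hypothetical stand-in for a coin-toss function: returns 'H' or 'T'."""
    return rng.choice(["H", "T"])


rng = random.Random(42)   # seeded so the run is repeatable
n_trials = 100_000
counts = Counter(toss_coin(rng) for _ in range(n_trials))

for face in ("H", "T"):
    print(f"{face}: {counts[face]} ({counts[face] / n_trials:.3f})")
```

Both fractions should come out close to 0.5. A handful of tosses can't tell you that, but a hundred thousand can.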

Once satisfied, you can do various experiments to check whether your calculated probabilities are reasonable. Don’t miss an opportunity to test anything that can be simulated (which is virtually everything). This will save you repeatedly from idiocy.
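
The sponge problem itself can be checked this way. The sketch below is a simplified simulation, assuming the 1000 sq" wall is divided into 1000 one-square-inch cells and each sponge lands exactly on one randomly chosen cell (real sponges would not line up with a grid). It paints many walls to 6x coverage and compares the average unpainted fraction with (0.999)^n. All names are made up for the example.

```python
import random


def simulate_unpainted_fraction(n_cells, n_sponges, rng):
    """Throw n_sponges at a wall of n_cells cells; return the fraction left unpainted."""
    painted = [False] * n_cells
    for _ in range(n_sponges):
        painted[rng.randrange(n_cells)] = True  # each sponge covers one random cell
    return painted.count(False) / n_cells


rng = random.Random(1)    # seeded so the run is repeatable
n_cells = 1000            # 1000 sq" wall, 1 sq" per cell
coverage = 6
n_walls = 200             # repeat the experiment to average out the noise

simulated = sum(
    simulate_unpainted_fraction(n_cells, coverage * n_cells, rng)
    for _ in range(n_walls)
) / n_walls
predicted = (1 - 1 / n_cells) ** (coverage * n_cells)

print(f"simulated: {simulated:.5f}")
print(f"predicted: {predicted:.5f}")
```

Averaging over many walls matters here, because a single wall at 6x coverage has only two or three bare cells on average, so one run is too noisy to confirm or refute the calculation.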