# Machine Learning and Bioinformatics 機器學習與生物資訊學

Statistics

Statistical test
- In statistics, a result is called statistically significant if it is unlikely to have occurred by chance.
- A statistical test determines which outcomes of an experiment would lead to a rejection of the null hypothesis, helping to decide whether experimental results contain enough information to cast doubt on conventional wisdom.
- It answers the question: assuming that the null hypothesis is true, what is the probability of observing a value of the test statistic that is at least as extreme as the one actually observed?
  - That probability is known as the p-value.

Similar to a criminal trial
- A defendant is considered not guilty until his guilt is proven.
  - The prosecutor tries to prove the guilt of the defendant.
  - Only when there is enough incriminating evidence is the defendant convicted.
- At the start of the procedure, there are two hypotheses:
  - H0: “the defendant is not guilty”
  - H1: “the defendant is guilty”
- The first is called the null hypothesis, and is accepted for the time being.
- The second is called the alternative hypothesis, which is the hypothesis one hopes to support.

- The hypothesis of innocence is rejected only when an error is very unlikely, because one doesn’t want to convict an innocent defendant.
  - Such an error is called an error of the first kind (i.e. the conviction of an innocent person), and its occurrence is controlled to be rare.
  - As a consequence of this asymmetric behavior, the probability of an error of the second kind (acquitting a person who committed the crime) is often rather large.

| | H0 is true (truly not guilty) | H1 is true (truly guilty) |
|---|---|---|
| Accept null hypothesis (acquittal) | Right decision | Wrong decision (Type II error) |
| Reject null hypothesis (conviction) | Wrong decision (Type I error) | Right decision |

Philosopher’s beans
- Few beans of this handful are white.
- Most beans in this bag are white.
- Therefore, probably, these beans were taken from another bag.
  - This is a hypothetical inference.
- Terminology
  - the beans in the bag are the population
  - the handful is the sample
  - the null hypothesis is that the sample originated from the population

- The criterion for rejecting the null hypothesis is the “obvious” difference in appearance (an informal difference in the mean).
- Again, assuming that the null hypothesis is true, what is the probability of observing a difference that is at least as extreme as the one actually observed?
- To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard.

Clairvoyant card game
- A person (the subject) is tested for clairvoyance.
  - He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to.
  - The number of hits, or correct answers, is called X.
- What is the null hypothesis? As we try to find evidence of his clairvoyance,
  - the null hypothesis is that the person is not clairvoyant
  - the alternative is, of course, that the person is (more or less) clairvoyant

- If the null hypothesis is valid, the only thing the test person can do is guess.
  - For every card, the probability (relative frequency) of any single suit appearing is ¼.
- If the alternative is valid, the test subject will predict the suit correctly with probability greater than ¼.
- Let p be the probability of guessing correctly; the hypotheses are then
  - null hypothesis (H0): p = ¼ (just guessing)
  - alternative hypothesis (H1): p > ¼ (true clairvoyant)

What’s the decision?
- When the test subject correctly predicts all 25 cards, we will consider him clairvoyant and reject the null hypothesis.
  - The same holds with 24 or 23 hits.
  - With only 5 or 6 hits, on the other hand, there is no cause to consider him so.
- But what about 12 hits, or 17 hits?
  - What is the critical number, c, of hits at which we consider the subject to be clairvoyant?
  - How do we determine the critical value c?
- It is obvious that with the choice c=25 we are more critical than with c=10.

- In practice, one decides how critical one will be.
  - That is, one decides how often one accepts an error of the first kind (false positive, or Type I error).
- With c=25 the probability of such an error is very small:

  $$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P\left(X = 25 \mid p = \tfrac{1}{4}\right) = \left(\tfrac{1}{4}\right)^{25} \approx 10^{-15}$$

- Being less critical, with c=10, yields a much greater probability of a false positive:

  $$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P\left(X \ge 10 \mid p = \tfrac{1}{4}\right) = \sum_{k=10}^{25} P\left(X = k \mid p = \tfrac{1}{4}\right) \approx 0.07$$

- These are p-values.

The probability of Type I error
- Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined.
- Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%:

  $$P(\text{reject } H_0 \mid H_0 \text{ is valid}) = P\left(X \ge c \mid p = \tfrac{1}{4}\right) \le 0.01$$

  - From all the numbers c with this property we choose the smallest, in order to minimize the probability of a Type II error (false negative).
  - For the above example, we select c=13.
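The tail probabilities and the critical value for the card game can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `binom_tail` and `critical_value` are illustrative choices, not part of the original slides.

```python
from math import comb


def binom_tail(n: int, p: float, c: int) -> float:
    """Return P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def critical_value(n: int, p: float, alpha: float) -> int:
    """Return the smallest c such that P(X >= c | H0) <= alpha."""
    for c in range(n + 1):
        if binom_tail(n, p, c) <= alpha:
            return c
    return n + 1  # no cutoff in 0..n keeps the Type I error below alpha


N_CARDS, P_GUESS = 25, 0.25  # 25 cards, four suits, pure guessing under H0

p_all_hits = binom_tail(N_CARDS, P_GUESS, 25)      # Type I error rate with c = 25
p_ten_or_more = binom_tail(N_CARDS, P_GUESS, 10)   # Type I error rate with c = 10
c_one_percent = critical_value(N_CARDS, P_GUESS, 0.01)

print(p_all_hits, p_ten_or_more, c_one_percent)
```

Choosing the smallest admissible c keeps the Type I error below α while rejecting as often as possible when the alternative is true.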

P-value vs. α

about the figure in the last slide

Where does the distribution (the blue curve) come from?

You have to choose the right one
- The hardest part for many people.
- But please understand the basics, rather than the practice.

Normal distribution
- A continuous probability distribution, defined on the entire real line, that has a bell-shaped probability density function.
- The density is known as the Gaussian function:

  $$f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

  - μ is the mean or expectation (location of the peak); σ² is the variance; σ is known as the standard deviation.
- The distribution with μ=0 and σ²=1 is called the standard normal distribution or the unit normal distribution.
- Reference: Normal distribution - Wikipedia, the free encyclopedia

- The normal distribution is considered the most prominent probability distribution in statistics.
- It arises from the central limit theorem.
  - Under mild conditions, the mean of a large number of random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution.
- It is very tractable analytically, that is, a large number of results involving this distribution can be derived in explicit form.
- For these reasons, the normal distribution is commonly encountered in practice.
  - For example, the observational error in an experiment is usually assumed to follow a normal distribution.

Z-test
- Applies to any test statistic whose distribution under the null hypothesis can be approximated by a normal distribution.
- Because of the central limit theorem, many test statistics are approximately normally distributed for large samples.
- Many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known.
  - If the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large (n < 30), the Student’s t-test may be more appropriate.
- Reference: Z-test - Wikipedia, the free encyclopedia

- If T is a statistic that is approximately normally distributed under the null hypothesis:
  - estimate the expected value θ of T under the null hypothesis
  - obtain an estimate s of the standard deviation of T
  - calculate the standard score Z = (T − θ) / s
  - one-tailed and two-tailed p-values can be calculated as Φ(−|Z|) and 2Φ(−|Z|), respectively
    - Φ is the standard normal cumulative distribution function

Z-test Example
- Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 and 12 points, respectively.
- Our interest is in the scores of 55 students in a particular school who received a mean score of 96.
- We can ask whether this mean score is significantly lower than the regional mean.
  - Are the students in this school comparable to a simple random sample of 55 students from the region as a whole?
  - Or are their scores surprisingly low?

- The standard error:

  $$SE = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{55}} = \frac{12}{7.42} = 1.62$$

- The z-score, which is the distance from the sample mean to the population mean in units of the standard error:

  $$z = \frac{M - \mu}{SE} = \frac{96 - 100}{1.62} = -2.47$$

- Looking up the table of the standard normal distribution, the probability of observing a standard normal value ≤ −2.47 is about 0.0068.
  - With 99.32% confidence we reject the null hypothesis.
- If instead of a classroom, we considered a sub-region containing 900 students whose mean score was 99, nearly the same z-score and p-value would be observed.
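The reading-test numbers can be reproduced with a few lines of Python. This is a sketch under the assumption that Φ is computed from the error function in the standard library; the helper names `phi` and `z_test` are made up for illustration.

```python
from math import erf, sqrt


def phi(z: float) -> float:
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def z_test(sample_mean: float, pop_mean: float, pop_sd: float, n: int):
    """Return (standard error, z-score, one-tailed p, two-tailed p)."""
    se = pop_sd / sqrt(n)
    z = (sample_mean - pop_mean) / se
    one_tailed = phi(-abs(z))
    return se, z, one_tailed, 2.0 * one_tailed


# School example: 55 students with mean 96; region has mean 100 and SD 12.
se, z, p_one, p_two = z_test(96, 100, 12, 55)

# Sub-region example: 900 students with mean 99.
_, z_region, p_region, _ = z_test(99, 100, 12, 900)

print(se, z, p_one, p_two)
print(z_region, p_region)
```

The second call illustrates the point of the last bullet: a much smaller difference in means gives nearly the same z-score once the sample is large, because the standard error shrinks with √n.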

Hyper-geometric distribution
- A discrete probability distribution that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing m successes.
- A random variable X follows the hyper-geometric distribution if its probability mass function is given by

  $$P(X = k) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}$$

  - N is the population size; m is the number of success states in the population; n is the number of draws; k is the number of successes.
- Reference: Hypergeometric distribution - Wikipedia, the free encyclopedia
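The probability mass function translates directly into code. The sketch below uses `math.comb` from the standard library; the card-deck numbers (52 cards, 13 hearts, 5 draws) are a hypothetical illustration and do not come from the slides.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, m: int, n: int) -> float:
    """P(X = k): k successes in n draws without replacement
    from a population of size N that contains m successes."""
    # Outside this range the event is impossible.
    if k < max(0, n - (N - m)) or k > min(m, n):
        return 0.0
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


# Hypothetical illustration: probability of exactly 2 hearts in a 5-card hand.
p_two_hearts = hypergeom_pmf(2, N=52, m=13, n=5)

# The probabilities over all possible k should sum to 1.
total = sum(hypergeom_pmf(k, N=52, m=13, n=5) for k in range(0, 6))

print(p_two_hearts, total)
```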

Fisher’s exact test
- Used in the analysis of contingency tables.
- Although in practice it is employed when sample sizes are small, it is valid for all sample sizes.
- It is called exact because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity.
- Fisher devised the test in response to a boast.
  - Try googling “lady tasting tea”.
- Reference: Fisher's exact test - Wikipedia, the free encyclopedia

| | Men | Women | Total |
|---|---|---|---|
| Dieting | 1 | 9 | 10 |
| Non-dieting | 11 | 3 | 14 |
| Total | 12 | 12 | 24 |

- The test is useful for categorical data that result from classifying objects in two different ways.
- It is used to examine the significance of the association (contingency) between the two kinds of classification.
- The numbers in the cells of the table follow a hyper-geometric distribution under the null hypothesis of independence.

Fisher’s exact test Example
- A sample of teenagers might be divided into male and female, and into those that are and are not currently dieting.
- Test whether the observed difference of proportions is significant.
  - What is the probability that the 10 dieters would be so unevenly distributed between the women and the men?
  - If we were to choose 10 of the teenagers at random, what is the probability that 9 of them would be among the 12 women, and only 1 from among the 12 men?

| | Men | Women | Total |
|---|---|---|---|
| Dieting | a | b | a+b |
| Non-dieting | c | d | c+d |
| Total | a+c | b+d | n |

- The probability follows the hyper-geometric distribution:

  $$p = \frac{\binom{a+b}{a}\binom{c+d}{c}}{\binom{n}{a+c}} = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$

  - This is the exact probability of this particular arrangement of the data under the null hypothesis of independence (that men and women are equally likely to be dieters), assuming the given marginal totals.
- We can calculate the exact probability of any arrangement.
- Fisher showed that to generate a significance level, we need consider only the more extreme cases with the same marginal totals.
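Applied to the dieting counts (1 and 9 dieters, 11 and 3 non-dieters among men and women), the formula can be evaluated as follows. This is a minimal sketch: the function names are illustrative, and the one-sided p-value here assumes that "more extreme" means even fewer men dieting, with all marginal totals held fixed.

```python
from math import comb


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Hypergeometric probability of one 2x2 table with fixed margins."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """Sum the probabilities of the observed table and of every table that is
    more extreme in the direction of a smaller top-left cell (assumption)."""
    total = 0.0
    while a >= 0 and d >= 0:
        total += table_probability(a, b, c, d)
        # Move one unit while keeping every row and column total unchanged.
        a, b, c, d = a - 1, b + 1, c + 1, d - 1
    return total


# Men/women dieting counts from the example table.
p_observed = table_probability(1, 9, 11, 3)
p_one_sided = fisher_one_sided(1, 9, 11, 3)

print(p_observed, p_one_sided)
```

Only one table is more extreme than the observed one here (no men dieting at all), so the one-sided p-value is only slightly larger than the probability of the observed table.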

so far

How do you choose the test, or how do you know the distribution?

Distribution is “assumed”
- Different tests may use the same distribution.
- One test statistic could be tested under different assumptions.

Overlap significance
- Determine the degree of the overlap:

  $$|A \cap B|; \quad \frac{|A \cap B|}{|A \cup B|}; \quad \frac{|A \cap B|}{\min(|A|, |B|)}$$

- The above statistics answer the degree but not the confidence of the overlap.
- Consider the area outside the two leaves.
- Can you formulate a statistical test based on the hyper-geometric distribution?

- Suppose that we are drawing an area as large as the first leaf.
- What is the probability of obtaining, by chance, an area whose overlap with the second leaf is at least as large?

  $$P(X \ge |A \cap B|) = \sum_{x = |A \cap B|}^{\min(|A|, |B|)} \frac{\binom{|B|}{x}\binom{N - |B|}{|A| - x}}{\binom{N}{|A|}}$$

  - N is the size of the entire area.
- Notice that the p-value answers the confidence when we claim that these two leaves overlap, but not the degree of the overlap.
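The overlap p-value is a hyper-geometric tail sum and can be sketched as below. The set sizes in the example (a universe of 100 items, sets of 20 and 30 items sharing 12) are hypothetical values chosen only for illustration, as is the function name.

```python
from math import comb


def overlap_p_value(size_a: int, size_b: int, overlap: int, universe: int) -> float:
    """P(X >= overlap) when a set of size_a is drawn at random from a universe
    that contains a fixed set of size_b (hyper-geometric upper tail)."""
    upper = min(size_a, size_b)
    # Sum exact integer counts first, then divide once.
    favourable = sum(
        comb(size_b, x) * comb(universe - size_b, size_a - x)
        for x in range(overlap, upper + 1)
    )
    return favourable / comb(universe, size_a)


# Hypothetical numbers: |A| = 20, |B| = 30, |A ∩ B| = 12, N = 100.
p_overlap = overlap_p_value(20, 30, 12, 100)

# Sanity checks: an overlap of zero is always reached, and a tiny case by hand.
p_trivial = overlap_p_value(20, 30, 0, 100)
p_small = overlap_p_value(2, 2, 2, 4)  # 1 favourable pair out of C(4, 2) = 6

print(p_overlap, p_trivial, p_small)
```

The same calculation underlies enrichment analysis: A is the gene list of interest, B is the set of genes annotated with a term, and N is the number of genes in the background.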

Gene Ontology Enrichment Analysis

Student’s t-test
- The test statistic follows a Student’s t distribution if the null hypothesis is true.
- The Z-test is commonly applied when the test statistic follows a normal distribution and the value of a scaling term is known.
- When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic follows a Student’s t distribution.
- The t-statistic was introduced in 1908 by William Sealy Gosset (“Student” was his pen name).
- Reference: Student’s t-test - Wikipedia, the free encyclopedia

Compared to normal distribution
- The probability of seeing a normally distributed value far (i.e. more than a few standard deviations) from the mean drops off extremely rapidly.
  - Thus, the normal distribution is not robust to the presence of outliers (data that are unexpectedly far from the mean, due to exceptional circumstances, observational error, etc.).
  - Data with outliers may be better described using a heavy-tailed distribution such as the Student’s t-distribution.
- If $X_1, X_2, \ldots, X_n$ are independent normally distributed random variables with mean μ and variance σ²:

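- The sample mean follows a normal distribution:

  $$\bar{X} = \frac{X_1 + X_2 + \ldots + X_n}{n} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

- The sample mean, centered at μ and divided by its estimated standard error S/√n (S is the sample standard deviation), follows the Student’s t-distribution with n−1 degrees of freedom:

  $$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{\bar{X} - \mu}{\sqrt{\frac{1}{n(n-1)}\left[(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \ldots + (X_n - \bar{X})^2\right]}} \sim t_{n-1}$$

  - This is useful to compare two sets of numerical data.
- The sum of the squares of the standardized variables has the chi-squared distribution with n degrees of freedom:

  $$\left(\frac{X_1 - \mu}{\sigma}\right)^2 + \left(\frac{X_2 - \mu}{\sigma}\right)^2 + \ldots + \left(\frac{X_n - \mu}{\sigma}\right)^2 \sim \chi_n^2$$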
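A one-sample t-statistic can be computed directly from these definitions. The sketch below uses the standard library only; the five measurements and the hypothesized mean of 5.0 are made-up values for illustration. The standard library has no Student’s t cumulative distribution function, so the resulting statistic would still have to be compared against a t table (or a statistics package) with n−1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(sample, mu0: float):
    """Return (t, degrees of freedom) for H0: the population mean equals mu0."""
    n = len(sample)
    # stdev() is the sample standard deviation S (divides by n - 1).
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - mu0) / standard_error, n - 1


# Hypothetical measurements, tested against a hypothesized mean of 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 6.0]
t_value, dof = t_statistic(sample, 5.0)

# The same statistic written out as in the expanded formula.
x_bar = mean(sample)
n = len(sample)
t_expanded = (x_bar - 5.0) / sqrt(sum((x - x_bar) ** 2 for x in sample) / (n * (n - 1)))

print(t_value, dof, t_expanded)
```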

How many tests do you remember?

That’s why we have
- Choosing the Correct Statistical Test in SAS, Stata and SPSS
- GraphPad - FAQ Choosing a statistical test
- The testing process
- Common test statistics

But…

Do not use them unless you understand the concepts introduced in these slides

Chi-squared distribution
- The chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.
- Used in chi-squared tests for
  - goodness of fit of an observed distribution to a theoretical one
  - the independence of two criteria of classification
  - confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation
- Many other statistical tests also use this distribution, like Friedman’s analysis of variance by ranks.

- A special case of the gamma distribution.
- If $Z_1, Z_2, \ldots, Z_k$ are independent, standard normal random variables, then the sum of their squares

  $$Q = \sum_{i=1}^{k} Z_i^2$$

  is distributed according to the chi-squared distribution with k degrees of freedom.
- This is usually denoted as $Q \sim \chi^2(k)$ or $Q \sim \chi_k^2$.
- Reference: Chi-squared distribution - Wikipedia, the free encyclopedia
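A quick simulation can make this definition concrete. The sketch below draws Q many times for k = 3 and compares the empirical mean and variance with the known values k and 2k for a chi-squared distribution; the seed, the choice of k, and the number of draws are arbitrary choices for illustration.

```python
import random


def sample_q(k: int, rng: random.Random) -> float:
    """Draw one value of Q = Z_1^2 + ... + Z_k^2 with standard normal Z_i."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))


rng = random.Random(0)  # fixed seed so that the run is repeatable
K = 3
draws = [sample_q(K, rng) for _ in range(20000)]

mean_q = sum(draws) / len(draws)
var_q = sum((q - mean_q) ** 2 for q in draws) / (len(draws) - 1)

# A chi-squared distribution with k degrees of freedom has mean k and variance 2k.
print(mean_q, var_q)
```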

Chi-squared tests
- Also known as chi-square test or χ² test.
- Note the distinction between the test statistic and its distribution.
- The distribution of the test statistic is a chi-squared distribution when the null hypothesis is true, or asymptotically true.
  - The sampling distribution can be approximated by a chi-squared distribution as closely as desired by enlarging the sample size.
- Often the shorthand for Pearson’s chi-squared test, also known as
  - the chi-squared goodness-of-fit test
  - the chi-squared test for independence

Pearson’s chi-squared test
- The best-known of several chi-squared tests.
- Tests the frequency distributions of events.
  - The considered events must be mutually exclusive and have total probability 1.
  - E.g., tests the “fairness” of a die.
- Used to assess two types of comparison:
  - a test of goodness of fit answers whether an observed frequency distribution differs from a theoretical one
  - a test of independence answers whether paired observations on two variables, expressed in a contingency table, are independent
- Reference: Pearson’s chi-squared test - Wikipedia

Steps
- Calculate the chi-squared test statistic, χ², which resembles a normalized sum of squared deviations between observed and theoretical frequencies.
- Determine the degrees of freedom, d, of that statistic, which is essentially the number of frequencies reduced by the number of parameters of the fitted distribution.
- χ² is then compared to the critical value in the $\chi_d^2$ distribution to obtain a p-value.
- A test that does not rely on the approximation of χ² is Fisher’s exact test, which is more accurate in obtaining a significance level, especially with few observations.

Test for fit of a distribution
- Suppose that there are N observations divided among n cells.
- A simple application is to test the hypothesis that, in the general population, values would occur in each cell with equal frequency.
  - The “theoretical frequency” for any cell (under the null hypothesis of a discrete uniform distribution) is

    $$E_i = \frac{N}{n}$$

  - The reduction in the degrees of freedom is p=1, notionally because the observed frequencies $O_i$ are constrained to sum to N.
  - The number of degrees of freedom is n−1.
- The value of the test statistic is

  $$X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i},$$

  where X² is Pearson’s cumulative test statistic, which asymptotically approaches a $\chi_d^2$ distribution.
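For a die, the goodness-of-fit statistic can be computed in a few lines. In this sketch the observed counts are hypothetical (60 rolls), and the critical value 11.070 for 5 degrees of freedom at the 5% level is taken from a standard chi-squared table rather than computed.

```python
def pearson_statistic(observed, expected) -> float:
    """Pearson's X^2: sum of (O_i - E_i)^2 / E_i over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts of faces 1..6 in 60 rolls of a die.
observed = [5, 8, 9, 8, 10, 20]
n_total = sum(observed)
n_cells = len(observed)

# Under H0 (fair die) every cell has the same theoretical frequency N / n.
expected = [n_total / n_cells] * n_cells

x2 = pearson_statistic(observed, expected)
dof = n_cells - 1  # the counts are constrained to sum to N

CRITICAL_5DOF_ALPHA_05 = 11.070  # from a standard chi-squared table
reject_fairness = x2 > CRITICAL_5DOF_ALPHA_05

print(x2, dof, reject_fairness)
```

Rolling the die more often would change the counts but not the degrees of freedom, which depend only on the number of cells.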

- When testing whether observations are random variables whose distribution belongs to a given family of distributions, the “theoretical frequencies” are calculated using a distribution from that family.
  - The reduction in the degrees of freedom is calculated as p=s+1, where s is the number of co-variates used in fitting the distribution.
  - For instance, when checking a normal distribution (where the parameters are mean and standard deviation), p=3.
  - The number of degrees of freedom is n−p.
- It should be noted that the degrees of freedom are not based on the number of observations, as they are with a Student’s t distribution.
  - If testing for a fair, six-sided die, there would be five degrees of freedom because there are six categories.
  - The number of times the die is rolled will have absolutely no effect on the number of degrees of freedom.

Test of independence

| | Men | Women | Total |
|---|---|---|---|
| Dieting | O1,1 | O1,2 | O1,1+O1,2 |
| Non-dieting | O2,1 | O2,2 | O2,1+O2,2 |
| Total | O1,1+O2,1 | O1,2+O2,2 | N |

- An “observation” consists of the values of two outcomes, and the null hypothesis is that the occurrence of these outcomes is statistically independent.
- Each observation is allocated to one cell of a two-dimensional array of cells (called a table) according to the values of the two outcomes.
- If there are r rows and c columns in the table, the value of the test statistic is

  $$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$$

- Fitting the model of “independence” reduces the number of degrees of freedom by p=r+c−1.
- The number of degrees of freedom is equal to the number of cells r×c, minus the reduction in degrees of freedom, p, which reduces to (r − 1)(c − 1).
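The test of independence can be sketched on the dieting counts used in the Fisher’s exact test example (1 and 9 dieters, 11 and 3 non-dieters). The sketch assumes the usual expected count under independence, row total × column total / N, which the slide does not spell out; the critical value 3.841 for 1 degree of freedom at the 5% level comes from a standard chi-squared table.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def independence_statistic(table):
    """Return (X^2, degrees of freedom) for an r x c contingency table."""
    expected = expected_counts(table)
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return x2, dof


# Rows: dieting, non-dieting. Columns: men, women.
dieting_table = [[1, 9], [11, 3]]
x2_indep, dof_indep = independence_statistic(dieting_table)

CRITICAL_1DOF_ALPHA_05 = 3.841  # from a standard chi-squared table
reject_independence = x2_indep > CRITICAL_1DOF_ALPHA_05

print(expected_counts(dieting_table), x2_indep, dof_indep, reject_independence)
```

With expected counts as small as 5, the chi-squared approximation is rough, which is the situation in which Fisher’s exact test is usually preferred.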

Summary
- Statistical test
  - criminal trial
  - philosopher’s beans
  - clairvoyant card game
  - P-value vs. α
- You have to choose the right distribution
  - normal distribution (z-test)
  - hyper-geometric distribution (Fisher’s exact test)
- Distinguish between distributions and tests
  - different tests with the same distribution
    - overlap significance
    - enrichment analysis
  - different distributions for the same test statistic
    - Student’s t-test
- Chi-squared tests
  - goodness of fit
  - test of independence

Feature selection
- Test whether the selected features are significantly better than the others.
- Upload and test them in our simulation system.
- Finally, commit your best version and send TA Jang a report before 23:59 1/8 (Tue).