Published by Jayden Tilden. Modified over 2 years ago.

1
Molecular Biomedical Informatics
Machine Learning and Bioinformatics

2
Statistics

3
Statistical test
In statistics, a result is called statistically significant if it is unlikely to have occurred by chance
A statistical test determines which outcomes of an experiment would lead to rejection of the null hypothesis, helping to decide whether the experimental results contain enough information to cast doubt on conventional wisdom
It answers the question
  – assuming that the null hypothesis is true, what is the probability of observing a value of the test statistic that is at least as extreme as the one actually observed?
  – that probability is known as the P-value

4
Similar to a criminal trial
A defendant is considered not guilty until his guilt is proven
  – the prosecutor tries to prove the guilt of the defendant; only when there is enough incriminating evidence is the defendant convicted
At the start of the procedure, there are two hypotheses
  – H0: the defendant is not guilty
  – H1: the defendant is guilty
The first is called the null hypothesis and is accepted for the time being
The second is called the alternative hypothesis, which is the hypothesis one hopes to support

5
The hypothesis of innocence is rejected only when an error is very unlikely, because one doesn't want to convict an innocent defendant
Such an error is called an error of the first kind (i.e. the conviction of an innocent person), and its occurrence is controlled to be rare
As a consequence of this asymmetric behavior, the probability of an error of the second kind (acquitting a person who committed the crime) is often rather large
Decision | H0 is true (truly not guilty) | H1 is true (truly guilty)
Accept null hypothesis (acquittal) | Right decision | Wrong decision (Type II error)
Reject null hypothesis (conviction) | Wrong decision (Type I error) | Right decision

6
Philosopher's beans
Few beans of this handful are white. Most beans in this bag are white. Therefore, probably, these beans were taken from another bag.
  – this is a hypothetical inference
Terminology
  – the beans in the bag are the population
  – the handful is the sample
  – the null hypothesis is that the sample originated from the population

7
The criterion for rejecting the null hypothesis is the obvious difference in appearance (an informal difference in the mean)
Again, assuming that the null hypothesis is true, what is the probability of observing a difference that is at least as extreme as the one actually observed?
To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard

8
Clairvoyant card game
A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called X.
As we try to find evidence of his clairvoyance
  – the null hypothesis is that the person is not clairvoyant
  – the alternative is, of course, that the person is (more or less) clairvoyant
Null hypothesis?

9
If the null hypothesis is valid, the only thing the test subject can do is guess
  – for every card, the probability (relative frequency) of any single suit appearing is ¼
If the alternative is valid, the test subject will predict the suit correctly with probability greater than ¼
Let p be the probability of guessing correctly; the hypotheses are then
  – null hypothesis (H0): p = ¼ (just guessing)
  – alternative hypothesis (H1): p > ¼ (true clairvoyant)

10
What's the decision?
When the test subject correctly predicts all 25 cards, we will consider him clairvoyant and reject the null hypothesis. Likewise with 24 or 23 hits.
With only 5 or 6 hits, on the other hand, there is no cause to consider him so.
But what about 12 hits, or 17 hits?
  – what is the critical number, c, of hits, at which point we consider the subject to be clairvoyant?
  – how do we determine the critical value c?
It is obvious that with the choice c = 25 we're more critical than with c = 10
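The choice of c can be made concrete with a short calculation. The sketch below assumes that, under the null hypothesis, the number of hits X follows a Binomial(25, ¼) distribution, and computes the probability of wrongly declaring clairvoyance for a given c. The function names and the 0.05 threshold are illustrative assumptions, not taken from the slides.

```python
from math import comb

def type_one_error(c, n=25, p=0.25):
    """P(X >= c) for X ~ Binomial(n, p): chance of rejecting H0 when it is true."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))

def critical_value(alpha, n=25, p=0.25):
    """Smallest c whose Type I error probability is at most alpha (illustrative helper)."""
    for c in range(n + 2):
        # For c = n + 1 the sum is empty (probability 0), so the loop always returns.
        if type_one_error(c, n, p) <= alpha:
            return c

for c in (10, 12, 17, 25):
    print(f"c = {c:2d}: P(X >= c | p = 1/4) = {type_one_error(c):.3g}")

# alpha = 0.05 is an assumed significance level, chosen only for illustration.
print("smallest c with Type I error <= 0.05:", critical_value(0.05))
```

A stricter c lowers the Type I error probability but makes it harder to detect a genuinely clairvoyant subject, which is the asymmetry described in the criminal trial analogy.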

12
The probability of Type I error

13
P-value vs. α

14
Any questions about the figure in the last slide?

15
Where does the distribution (the blue curve) come from?

16
You have to choose the right one
The hardest part for many people
But please understand the basics, rather than the practice

17
Normal distribution

18
The normal distribution is considered the most prominent probability distribution in statistics
The normal distribution arises from the central limit theorem
  – under mild conditions, the mean of a large number of random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution
It is very tractable analytically, that is, a large number of results involving this distribution can be derived in explicit form
For these reasons, the normal distribution is commonly encountered in practice
  – for example, the observational error in an experiment is usually assumed to follow a normal distribution
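The central limit theorem can be checked numerically. The following sketch draws many samples from a uniform distribution (an arbitrary choice for illustration), takes the mean of each sample, and compares the spread of those means with what a normal approximation predicts. The sample size, number of trials, and seed are assumptions.

```python
import random
import statistics

def sample_means(n, trials, seed=0):
    """Means of `trials` independent samples, each of n Uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.random() for _ in range(n)) for _ in range(trials)]

means = sample_means(n=50, trials=5000)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the mean of 50 draws
# should have standard deviation sqrt(1/12) / sqrt(50).
expected_sd = (1 / 12) ** 0.5 / 50 ** 0.5

# For a normal distribution, roughly 68% of values lie within one standard deviation.
within_one_sd = sum(abs(m - 0.5) <= expected_sd for m in means) / len(means)

print(f"mean of sample means: {statistics.fmean(means):.4f}")
print(f"sd of sample means:   {statistics.stdev(means):.4f} (predicted {expected_sd:.4f})")
print(f"fraction within one predicted sd: {within_one_sd:.3f}")
```

The original distribution here is flat, yet the sample means cluster around ½ in the bell-shaped pattern the theorem predicts.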

21
Z-test
Applies to any test statistic whose distribution under the null hypothesis can be approximated by a normal distribution
Because of the central limit theorem, many test statistics are approximately normally distributed for large samples
Many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known
  – if the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large (n < 30), the Student's t-test may be more appropriate
Z-test - Wikipedia, the free encyclopedia

22
If T is a statistic that is approximately normally distributed under the null hypothesis
  – estimate the expected value θ of T under the null hypothesis
  – obtain an estimate s of the standard deviation of T
  – calculate the standard score Z = (T − θ) / s
  – one-tailed and two-tailed p-values can be calculated as Φ(−|Z|) and 2Φ(−|Z|), respectively
  – Φ is the standard normal cumulative distribution function
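The steps above can be sketched in a few lines. This is a minimal illustration that assumes θ and s are already available; the function name `z_test` is hypothetical, and Φ is taken from the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

def z_test(t, theta, s):
    """Return (Z, one-tailed p, two-tailed p) for statistic t under the null hypothesis.

    theta: expected value of the statistic under H0
    s:     estimated standard deviation of the statistic
    """
    z = (t - theta) / s
    phi = NormalDist().cdf          # standard normal CDF
    one_tailed = phi(-abs(z))       # probability in the tail beyond |Z|
    two_tailed = 2 * phi(-abs(z))   # both tails
    return z, one_tailed, two_tailed

z, p_one, p_two = z_test(t=1.96, theta=0.0, s=1.0)
print(f"Z = {z:.2f}, one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```

Using −|Z| keeps the p-value in the tail regardless of the sign of Z; for a directional test, the direction of the observed deviation still has to match the alternative hypothesis.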

23
Z-test Example
Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96
We can ask whether this mean score is significantly lower than the regional mean
  – are the students in this school comparable to a simple random sample of 55 students from the region as a whole
  – or are their scores surprisingly low?
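The worked calculation for this example is not preserved in the transcript, so the following is a sketch of how it would usually go. It assumes the standard error of the mean is σ/√n and uses a lower-tailed test, since the question is whether the school's scores are surprisingly low.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 100, 12        # regional mean and standard deviation (from the example)
n, sample_mean = 55, 96    # school sample size and observed mean (from the example)

standard_error = sigma / sqrt(n)          # standard deviation of the mean of n scores
z = (sample_mean - mu) / standard_error   # standard score of the observed mean
p_lower = NormalDist().cdf(z)             # lower-tailed p-value, Φ(Z)

print(f"standard error = {standard_error:.3f}")
print(f"Z = {z:.3f}")
print(f"one-tailed p-value = {p_lower:.4f}")
```

The p-value is well below the conventional 0.05 level, so a random sample of 55 students from the region would rarely produce a mean this low.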

25
Hyper-geometric distribution

27
Fisher's exact test
Used in the analysis of contingency tables
Although in practice it is employed when sample sizes are small, it is valid for all sample sizes
It is called exact because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact only in the limit as the sample size grows to infinity
Fisher devised the test in response to a boast
  – try googling "lady tasting tea"
Fisher's exact test - Wikipedia, the free encyclopedia

28
The test is useful for categorical data that result from classifying objects in two different ways
It is used to examine the significance of the association (contingency) between the two kinds of classification
The numbers in the cells of the table form a hyper-geometric distribution under the null hypothesis of independence
            | Men | Women | Total
Dieting     | 1   | 9     | 10
Non-dieting | 11  | 3     | 14
Total       | 12  | 12    | 24

29
Fisher's exact test Example
A sample of teenagers might be divided into
  – male and female
  – and those that are and are not currently dieting
Test whether the observed difference of proportions is significant
  – what is the probability that the 10 dieters would be so unevenly distributed between the women and the men?
  – if we were to choose 10 of the teenagers at random, what is the probability that 9 of them would be among the 12 women, and only 1 from among the 12 men?
            | Men | Women | Total
Dieting     | 1   | 9     | 10
Non-dieting | 11  | 3     | 14
Total       | 12  | 12    | 24

30
            | Men | Women | Total
Dieting     | a   | b     | a+b
Non-dieting | c   | d     | c+d
Total       | a+c | b+d   | n
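The formula from this slide is not preserved in the transcript. The sketch below uses the standard hyper-geometric probability for a 2×2 table with fixed margins, C(a+b, a)·C(c+d, c) / C(n, a+c), and applies it to the dieting example (a = 1, b = 9, c = 11, d = 3). The function names are hypothetical, and the one-sided sum over "at least as extreme" tables is one common convention, stated here as an assumption.

```python
from math import comb

def table_probability(a, b, c, d):
    """Hyper-geometric probability of one 2x2 table, given its row and column totals."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def one_sided_p(a, b, c, d):
    """Sum the probabilities of all tables with the same margins and a top-left
    cell no larger than the observed one (assumed direction: fewer male dieters)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)   # smallest top-left count the margins allow
    return sum(
        table_probability(x, row1 - x, col1 - x, n - row1 - col1 + x)
        for x in range(lowest, a + 1)
    )

p_observed = table_probability(1, 9, 11, 3)
print(f"probability of the observed table: {p_observed:.6f}")
print(f"one-sided p-value: {one_sided_p(1, 9, 11, 3):.6f}")
```

The probability of the observed table answers the question posed in the example: how likely it is that 10 randomly chosen teenagers would include 9 of the 12 women and only 1 of the 12 men.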

31
Any questions so far?

32
How do you choose the test, or how do you know the distribution?

33
Distribution is assumed
Different tests may use the same distribution
One test statistic could be tested under different assumptions

34
Overlap significance

36
Gene Ontology Enrichment Analysis

37
Student's t-test
The test statistic follows a Student's t distribution if the null hypothesis holds
The Z-test is commonly applied when the test statistic follows a normal distribution and the value of a scaling term is known
When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic follows a Student's t distribution
The t-statistic was introduced in 1908 by William Sealy Gosset ("Student" was his pen name)
Student's t-test - Wikipedia, the free encyclopedia
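To show what "replacing the scaling term by an estimate" means in practice, here is a minimal one-sample sketch. It mirrors the Z-test calculation but uses the sample standard deviation in place of a known population value. The scores and the hypothesised mean of 100 are made-up illustration data, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def t_statistic(sample, mu0):
    """One-sample t statistic and its degrees of freedom.

    The unknown population standard deviation is replaced by the sample
    standard deviation, which is why the statistic follows a t distribution
    (with n - 1 degrees of freedom) rather than a normal distribution.
    """
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - mu0) / standard_error, n - 1

scores = [94, 101, 89, 97, 92, 99, 95, 90]   # hypothetical data
t, df = t_statistic(scores, mu0=100)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The resulting t value is then compared against a t distribution with the stated degrees of freedom, for example via a printed table or a statistics package.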

38
Compared to normal distribution

40
How many tests do you remember?

41
That's why we have
Choosing the Correct Statistical Test in SAS, Stata and SPSS
GraphPad - FAQ 1790 - Choosing a statistical test
The testing process
Common test statistics
But…

42
Do not use them unless you understand the concepts introduced in these slides

43
Chi-squared distribution
The chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables
Used in chi-squared tests for
  – goodness of fit of an observed distribution to a theoretical one
  – the independence of two criteria of classification
  – confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation
  – many other statistical tests also use this distribution, like Friedman's analysis of variance by ranks
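The definition can be checked by simulation. The sketch below builds chi-squared draws directly as sums of squares of k independent standard normal variables and compares the sample mean and variance with the known values k and 2k. The choice k = 5, the number of trials, and the seed are assumptions made for illustration.

```python
import random
import statistics

def chi_squared_draws(k, trials, seed=0):
    """Simulate chi-squared(k) values as sums of k squared standard normal draws."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) ** 2 for _ in range(k)) for _ in range(trials)]

k = 5
draws = chi_squared_draws(k, trials=20000)

# 11.070 is the tabulated upper 5% point of the chi-squared distribution with 5 df.
tail_fraction = sum(x > 11.070 for x in draws) / len(draws)

print(f"sample mean     = {statistics.fmean(draws):.3f} (theory: {k})")
print(f"sample variance = {statistics.variance(draws):.3f} (theory: {2 * k})")
print(f"fraction above 11.070 = {tail_fraction:.4f} (theory: 0.05)")
```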

45
Chi-squared tests
Also known as chi-square test or χ² test
Note the distinction between the test statistic and its distribution
The distribution of the test statistic is a chi-squared distribution when the null hypothesis is true, or asymptotically true
  – the sampling distribution can be made to approximate a chi-squared distribution as closely as desired by enlarging the sample size
Often used as shorthand for Pearson's chi-squared test, also known as
  – the chi-squared goodness-of-fit test
  – the chi-squared test for independence

46
Pearson's chi-squared test
The best-known of several chi-squared tests
Tests the frequency distributions of events
  – the considered events must be mutually exclusive and have total probability 1
  – e.g., tests the fairness of a die
Used to assess two types of comparison
  – a test of goodness of fit answers whether an observed frequency distribution differs from a theoretical one
  – a test of independence answers whether paired observations on two variables, expressed in a contingency table, are independent
Pearson's chi-squared test - Wikipedia

47
Steps

48
Test for fit of a distribution

49
When testing whether observations are random variables whose distribution belongs to a given family of distributions, the theoretical frequencies are calculated using a distribution from that family
  – the reduction in the degrees of freedom is calculated as p = s + 1, where s is the number of co-variates used in fitting the distribution
  – for instance, when checking a normal distribution (where the parameters are mean and standard deviation), p = 3
  – the number of degrees of freedom is n − p, where n is the number of categories
Note that the degrees of freedom are not based on the number of observations, as they are with a Student's t distribution
  – if testing for a fair, six-sided die, there would be five degrees of freedom because there are six categories
  – the number of times the die is rolled has absolutely no effect on the number of degrees of freedom
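The die example can be sketched as follows. The counts are hypothetical (60 rolls, invented for illustration), the statistic is Pearson's sum of (observed − expected)² / expected, and the decision uses the tabulated 5% critical value for five degrees of freedom, 11.070, rather than computing a p-value.

```python
def chi_squared_statistic(observed, expected):
    """Pearson's statistic: sum over categories of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [5, 8, 9, 8, 10, 20]        # hypothetical counts for faces 1..6
total = sum(observed)
expected = [total / 6] * 6             # a fair die gives equal expected counts

statistic = chi_squared_statistic(observed, expected)
df = len(observed) - 1                 # six categories, no fitted parameters: 5 df

CRITICAL_5PCT_DF5 = 11.070             # tabulated chi-squared critical value, alpha = 0.05
reject_fairness = statistic > CRITICAL_5PCT_DF5

print(f"chi-squared = {statistic:.2f}, df = {df}, reject fairness: {reject_fairness}")
```

Rolling the die more often changes the counts and hence the statistic, but `df` stays at 5 because it depends only on the number of categories.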

50
Test of independence
            | Men         | Women       | Total
Dieting     | O1,1        | O1,2        | O1,1 + O1,2
Non-dieting | O2,1        | O2,2        | O2,1 + O2,2
Total       | O1,1 + O2,1 | O1,2 + O2,2 | N
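As an illustration of this layout, the sketch below computes the expected count for each cell as (row total × column total) / N and then Pearson's statistic, reusing the dieting counts from the Fisher's exact test example. The function names are hypothetical, and the 5% critical value for one degree of freedom (3.841) is the tabulated figure.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared_independence(table):
    """Pearson's statistic and degrees of freedom for an r x c contingency table."""
    expected = expected_counts(table)
    statistic = sum(
        (o - e) ** 2 / e
        for observed_row, expected_row in zip(table, expected)
        for o, e in zip(observed_row, expected_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

table = [[1, 9],     # dieting:     men, women
         [11, 3]]    # non-dieting: men, women

statistic, df = chi_squared_independence(table)
CRITICAL_5PCT_DF1 = 3.841   # tabulated chi-squared critical value, alpha = 0.05, 1 df
print(f"expected counts: {expected_counts(table)}")
print(f"chi-squared = {statistic:.3f}, df = {df}, reject independence: {statistic > CRITICAL_5PCT_DF1}")
```

With expected counts this small, the chi-squared approximation is rough, which is one reason an exact test is often preferred for tables like this one.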

51
Summary
Statistical test
  – criminal trial
  – philosopher's beans
  – clairvoyant card game
P-value vs. α
You have to choose the right distribution
  – normal distribution (Z-test)
  – hyper-geometric distribution (Fisher's exact test)
Distinguish between distributions and tests
  – different tests with the same distribution: overlap significance, enrichment analysis
  – different distributions for the same test statistic: Student's t-test
Chi-squared tests
  – goodness of fit
  – test of independence

52
Feature selection
Test whether the selected features are significantly better than the others. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 on 1/8 (Tue).
