Wednesday, September 21, 2016 Farrokh Alemi, PhD.

Test of Independence
This lecture focuses on possible ways variables can relate to each other. The lecture is based on slides prepared by Dr. Lin and modified by Dr. Alemi. It is also based on Alan Agresti's treatment of the topic in the book Categorical Data Analysis.

Independence in 2 Variables
Knowing One Thing Doesn’t Tell Us About Another
We say that two variables are independent of each other when knowing the value of one tells us nothing about the other. For example, the diseases of two patients may be independent when knowing that one patient is sick does not change the likelihood of the other being sick. Of course, this is not the case when the disease is infectious, when the two patients are related, or when they ate the same food.

This concept is relatively simple when we are thinking of two variables. The probability of event B given A, that is, the probability of B given that we know A has occurred, is the same as the probability of B: P(B | A) = P(B). The vertical bar is read as “given.” In essence, knowing A does not change the probability of B.

Conditional As Shrinking Sample
Another way of saying this is that if we shrink the universe of possible events to all situations where A has occurred, B still has the same probability. A is a subset of the possible events, and since B is conditioned on A, we can select all events in which A has occurred and ignore the rest.
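The shrinking idea can be sketched in a few lines of Python. The records below are made-up data, and the names `records` and `prob` are illustrative assumptions rather than anything from the lecture.

```python
# Hypothetical records: each tuple is (A occurred, B occurred).
records = [
    (True, True), (True, False), (True, True), (True, False),
    (False, True), (False, False), (False, True), (False, False),
]

def prob(events):
    """Fraction of True values in a list of booleans."""
    return sum(events) / len(events)

# Probability of B in the full universe of cases.
p_b = prob([b for _, b in records])

# Shrink the universe to cases where A occurred, then recompute.
p_b_given_a = prob([b for a, b in records if a])

print(p_b, p_b_given_a)  # 0.5 0.5
```

In this made-up sample the two values agree, which is what independence of A and B would look like.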

What is probability of B given A?
                 Event A
                 Yes    No     Total
Event B   Yes    a      b      a+b
          No     c      d      c+d
Total            a+c    b+d    a+b+c+d

Suppose we are looking for the probability of B given A in this table. It is a contingency table that gives the count of each combination of events A and B. The total sample size is a plus b plus c plus d. By definition, the probability of any event in a sample is the number of times the event occurs divided by the size of the sample. The probability of event A is the number of times A occurs divided by the sample size. The real question is what the probability of B given A is. We can think of conditional probability as recalculating the probability of the event in a smaller sample, where the possible cases are limited to the cases that meet the condition.

What is probability of B given A?
The first step is to shrink the sample to the situations where the condition is met. The condition is that event A has occurred, so the sample shrinks to only the cases where A has occurred. All other events are no longer possible and therefore not part of the sample. In the contingency table there are “a” plus “c” cases where event A has occurred. This is now the new sample of possible events. We want to know the frequency of event B among these possible cases. We had earlier defined probability as the frequency of an event among the possible events. Within the new sample, event B occurs “a” times. So the probability of B given A is its frequency divided by the number of possible events in the new sample: P(B | A) = a / (a + c). Thus we measure a conditional probability by restricting the sample to the cases that meet the condition and examining the frequency of the event in the new sample.

What is probability of A given B?
Similarly, we can calculate the probability of A given B by reducing the sample to all situations where B has occurred. In the contingency table this reduces the cases to a plus b. The probability of event A given event B is then the number of times event A occurs in the new, smaller sample divided by the size of the new sample: P(A | B) = a / (a + b).
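As a minimal sketch of these two calculations, the function below takes the four cell counts of the contingency table and returns both conditional probabilities. The function name and the counts used in the call are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

def conditional_probabilities(a, b, c, d):
    """Return (P(B|A), P(A|B)) from the 2x2 counts in the table.

    a: A yes, B yes   b: A no, B yes
    c: A yes, B no    d: A no, B no
    """
    p_b_given_a = Fraction(a, a + c)  # sample shrunk to the A = yes column
    p_a_given_b = Fraction(a, a + b)  # sample shrunk to the B = yes row
    return p_b_given_a, p_a_given_b

# Hypothetical counts, chosen only for illustration.
p_b_given_a, p_a_given_b = conditional_probabilities(a=20, b=10, c=30, d=40)
print(p_b_given_a, p_a_given_b)  # 2/5 2/3
```

The two results differ because each one shrinks the sample in a different direction: one keeps only the cases where A occurred, the other only the cases where B occurred.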

SQL Statement to Shrink Sample
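Here is another way of saying this that may appeal to those of you who know SQL. For binary variables A and B, coded 1 when the event occurs and 0 when it does not, the first statement calculates the probability of B in all the data:

    Select Sum(B)/Count(B) As [Prob of B] From [All]

The second statement restricts the data to only the cases where the condition A has occurred:

    Select Sum(B)/Count(B) As [Prob B given A] From [All] Where A = 1

If A and B are independent, the two calculated probabilities for B should be the same.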
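The two queries can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the data are made up, the table is named `cases` (a hypothetical name, used because ALL is a reserved word in SQLite), and the `1.0 *` factor is added because SQLite would otherwise perform integer division.

```python
import sqlite3

# Hypothetical data: A and B are coded 1 (occurred) or 0 (did not occur).
rows = [(1, 1), (1, 0), (1, 1), (1, 0), (0, 1), (0, 0), (0, 1), (0, 0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (A INTEGER, B INTEGER)")
conn.executemany("INSERT INTO cases VALUES (?, ?)", rows)

# Probability of B in all cases. Multiplying by 1.0 avoids integer division.
p_b = conn.execute("SELECT 1.0 * SUM(B) / COUNT(B) FROM cases").fetchone()[0]

# Probability of B after restricting the sample to cases where A occurred.
p_b_given_a = conn.execute(
    "SELECT 1.0 * SUM(B) / COUNT(B) FROM cases WHERE A = 1"
).fetchone()[0]

print(p_b, p_b_given_a)  # 0.5 0.5
conn.close()
```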

Example of Calculating Conditionals Using SQL
Let us use what we just learned about how conditional probabilities can be calculated in a sample of cases. Here we have two events and the count of their co-occurrences.

Event A   Event B   Count
Yes       Yes       a
Yes       No        c
No        Yes       b
No        No        d

To estimate the probability of A given B, we first reduce the sample to the situations where B has occurred.

Event A   Event B   Count
Yes       Yes       a
No        Yes       b

We then calculate the probability of A in the reduced sample: a / (a + b).
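The two steps, reducing the sample and then calculating the probability, might look like this in Python. The function name and the counts are hypothetical stand-ins; the rows follow the Event A, Event B, Count layout.

```python
def prob_a_given_b(table):
    """table: list of (event_a, event_b, count) rows with 'Yes'/'No' values."""
    # Step 1: reduce the sample to rows where B occurred.
    reduced = [row for row in table if row[1] == "Yes"]
    # Step 2: probability of A in the reduced sample.
    total = sum(count for _, _, count in reduced)
    a_yes = sum(count for event_a, _, count in reduced if event_a == "Yes")
    return a_yes / total

# Hypothetical counts, chosen only for illustration.
table = [
    ("Yes", "Yes", 20),
    ("Yes", "No", 30),
    ("No", "Yes", 10),
    ("No", "No", 40),
]
result = prob_a_given_b(table)
print(result)
```

With these counts the reduced sample holds 30 cases, 20 of which have A, so the result is 20/30.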

Example
For example, knowing that the current patient has had a heart attack does not change the probability that the next patient who comes to the ER has had a heart attack. This is probably a reasonable assumption if the current and the next patient are not related and do not influence each other’s lifestyle.

Note that for independence to hold, it must hold at every level of A. It should be true when A is present, and it should also be true when A is absent. To stay with our earlier example, knowing that the current patient does not have an MI, or heart attack, does not change the probability that the next patient will have an MI. The two events are independent, and independence holds both when the condition is present and when it is absent. We typically do not show the absence, as dealing with negatives is cumbersome, but whether we show it or not you should always keep it in mind.
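A small Python sketch of this check compares the probability of B when A is present, when A is absent, and overall. The function name, the tolerance, and the counts are assumptions made for illustration, and the check is an exact comparison of sample proportions, not a statistical test.

```python
def independent_at_every_level(a, b, c, d, tol=1e-9):
    """Check P(B | A) == P(B | not A) == P(B) for 2x2 counts.

    a: A yes, B yes   b: A no, B yes
    c: A yes, B no    d: A no, B no
    """
    p_b = (a + b) / (a + b + c + d)
    p_b_given_a = a / (a + c)        # A present
    p_b_given_not_a = b / (b + d)    # A absent
    return (abs(p_b_given_a - p_b) < tol
            and abs(p_b_given_not_a - p_b) < tol)

# Hypothetical counts, chosen only for illustration.
print(independent_at_every_level(10, 20, 30, 60))  # True
print(independent_at_every_level(20, 10, 30, 40))  # False
```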

Example
Here is another example. Suppose we have two clinics, one in San Francisco and the other in the District of Columbia. If I tell you that lots of people are waiting for service in DC, it may not change the probability of people waiting in San Francisco. It would if there were a national demand for our clinical service, but in the absence of such national demand we are safe to assume that the two clinics’ waiting times are independent.

Example
As we said before, this relationship holds whether the wait time in DC is long or short.

The independence of two variables can be shown as a function I taking the two parameters A and B, separated by a comma: I(A, B). Sometimes the I is dropped and only the two variables are shown, with a comma separating them.

Graphically this is shown as two variables without a link between them.

If a line is shown between the two events, then a relationship exists between them. They are correlated events. So here A and B are related events, and knowing something about A changes the probability of B.

If a directed line is shown between the two events then a causal relationship is assumed. This graph for example says that A causes B.

If two events are independent, then neither a causal nor an associational relationship can exist between them.

Of course, sampling data creates random variation, and two samples, no matter how independent, may not have exactly the same probabilities. The chi-square test allows us to examine whether two events are related. Each observed count is compared to its expected value under the assumption of independence.
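The comparison of observed and expected counts can be sketched as follows, using the usual Pearson form of the statistic: the sum over cells of (observed minus expected) squared, divided by expected. The function name and the counts are illustrative assumptions. Under independence, the expected count for a cell is its row total times its column total divided by the sample size.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table of counts.

    Rows are [a, b] and [c, d]; columns are [a, c] and [b, d].
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if row and column events were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# Hypothetical counts, chosen only for illustration.
print(chi_square_2x2(10, 20, 30, 60))            # 0.0
print(round(chi_square_2x2(20, 10, 30, 40), 3))  # 4.762
```

The first table matches its expected counts exactly, so the statistic is zero; the second departs from them and gives a larger value.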

There are published tables that give the probability of observing different values of the chi-square test statistic. This allows us to decide whether two variables are independent even when there are small changes in the probability of one event given another.
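For a 2 by 2 table the statistic has one degree of freedom, and in that special case the tabled probability can be computed directly, because the statistic is then a squared standard normal variable. The sketch below assumes one degree of freedom and a 0.05 significance level; the function name and the example statistic are hypothetical.

```python
import math

def chi_square_p_value_df1(stat):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    # With 1 degree of freedom the statistic is a squared standard normal,
    # so P(X > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))

CRITICAL_VALUE_05 = 3.841  # tabled chi-square value for df = 1, alpha = 0.05

stat = 100 / 21  # hypothetical statistic, about 4.762
p_value = chi_square_p_value_df1(stat)
reject_independence = stat > CRITICAL_VALUE_05
print(round(p_value, 3), reject_independence)
```

A statistic above the critical value, or equivalently a probability below 0.05, would lead us to reject the assumption of independence at that level.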
