Categorical Data. Frühling Rijsdijk (1) & Caroline van Baal (2). (1) IoP, London; (2) Vrije Universiteit, A’dam. Twin Workshop, Boulder, Tuesday March 2, 2004.

1 Categorical Data. Frühling Rijsdijk (1) & Caroline van Baal (2). (1) IoP, London; (2) Vrije Universiteit, A’dam. Twin Workshop, Boulder, Tuesday March 2, 2004

2 Aims Introduce Categorical Data Define liability and describe assumptions of the liability model Show how heritability of liability can be estimated from categorical twin data Practical exercises

3 Categorical data: The measuring instrument can discriminate only between two or a few ordered categories, e.g. absence or presence of a disease. Data therefore take the form of counts, i.e. the number of individuals within each category.

4 Univariate Normal Distribution of Liability Assumptions: (1) Underlying normal distribution of liability (2) The liability distribution has 1 or more thresholds (cut-offs)

5 The standard Normal distribution: Liability is a latent variable and its scale is arbitrary; the distribution is therefore assumed to be a Standard Normal Distribution (SND) or z-distribution: mean (μ) = 0 and SD (σ) = 1. z-values are the number of SDs away from the mean. The area under the curve translates directly to probabilities. [Figure: normal probability density function, z from -3 to 3, with 68% of the area between -1 and 1.]

6 Standard Normal Cumulative Probability in right-hand tail, Area = P(z ≥ z0). (For negative z values, areas are found by symmetry.)

z0     Area
0      .50    (50%)
.2     .42    (42%)
.4     .35    (35%)
.6     .27    (27%)
.8     .21    (21%)
1      .16    (16%)
1.2    .12    (12%)
1.4    .08    (8%)
1.6    .06    (6%)
1.8    .036   (3.6%)
2      .023   (2.3%)
2.2    .014   (1.4%)
2.4    .008   (.8%)
2.6    .005   (.5%)
2.8    .003   (.3%)
2.9    .002   (.2%)
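The tabulated right-hand tail areas can be recomputed directly. The following is a minimal sketch using only the Python standard library; the function name `right_tail_area` is an illustrative choice, not something from the slides.

```python
from statistics import NormalDist


def right_tail_area(z0: float) -> float:
    """Area under the standard normal curve to the right of z0, i.e. P(z >= z0)."""
    return 1.0 - NormalDist().cdf(z0)


# Print a few rows of the table; negative z values follow by symmetry.
for z0 in (0.0, 1.0, 2.0):
    print(f"z0 = {z0:.1f}  area = {right_tail_area(z0):.3f}")
```

For a negative z0 the same function applies, since the area to the right of -z0 equals one minus the area to the right of z0.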

7 When we have one variable it is possible to find a z-value (threshold) on the SND such that the proportion above it exactly matches the observed proportion in the sample. E.g. if 150 of a sample of 1000 individuals meet the criteria for a disorder (15%), the z-value is 1.04. [Figure: SND from -3 to 3 with the threshold at 1.04.]
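As a small illustration of this step, the threshold can be obtained from the inverse of the standard normal cumulative distribution. This is a sketch with the standard library only; `threshold_from_prevalence` is a hypothetical helper name.

```python
from statistics import NormalDist


def threshold_from_prevalence(affected: int, total: int) -> float:
    """z-value on the SND with a right-hand tail area equal to affected / total."""
    prevalence = affected / total
    # The threshold leaves (1 - prevalence) of the distribution below it.
    return NormalDist().inv_cdf(1.0 - prevalence)


z = threshold_from_prevalence(150, 1000)
print(round(z, 2))
```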

8 Two categorical traits: When we have two categorical traits, the data are represented in a Contingency Table, containing cell counts that can be translated into proportions. 0 = absent, 1 = present.

             Trait2
             0     1
Trait1  0    00    01
        1    10    11

9 Categorical Data for twins: When the measured trait is dichotomous, i.e. a disorder either present or absent in an unselected sample of twins (0 = unaffected, 1 = affected):
cell a: number of pairs concordant for unaffected
cell d: number of pairs concordant for affected
cell b/c: number of pairs discordant for the disorder

10 Joint Liability Model for twin pairs Assumed to follow a bivariate normal distribution The shape of a bivariate normal distribution is determined by the correlation between the traits Expected proportions under the distribution can be calculated by numerical integration with mathematical subroutines

11 Bivariate Normal R=.00 R=.90

12 Bivariate Normal (R=0.6) partitioned at threshold 1.4 (z-value) on both liabilities

13 Expected Proportions of the BN, for R=0.6, Th1=1.4, Th2=1.4

14 Correlated dimensions: The correlation (shape) and the two thresholds determine the relative proportions of observations in the 4 cells (a, b, c, d) of the CT. Conversely, the sample proportions in the 4 cells can be used to estimate the correlation and the thresholds.

15 Twin Models: A variance decomposition (A, C, E) can be applied to liability, where the correlations in liability are determined by the path model. This leads to an estimate of the heritability of the liability.

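16 ACE Liability Model [Path diagram: latent factors A, C and E load on the liability L of Twin 1 and of Twin 2; the C factors correlate 1 and the A factors correlate 1 (MZ) / .5 (DZ); each liability is divided by a threshold into Unaffected and Affected.]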

17 Summary It is possible to estimate a correlation between categorical traits from simple counts (CT), because of the assumptions we make about their joint distributions

18 How can we fit ordinal data in Mx?
Summary statistics (CT): Mx has a built-in fit function for the maximum-likelihood analysis of 2-way Contingency Tables; analyses are limited to only two variables.
Raw data analyses: multivariate; handles missing data; moderator variables.

19 Model-fitting to CT: Mx has a built-in fit function for the maximum-likelihood analysis of 2-way Contingency Tables. The Fit Function is twice the log-likelihood of the observed frequency data, calculated as $\ln L = \sum_{i}\sum_{j} n_{ij} \ln(p_{ij})$, where n_ij is the observed frequency in cell ij and p_ij is the expected proportion in cell ij.

20 Expected proportions are calculated by numerical integration of the bivariate normal over two dimensions: the liabilities for twin1 and twin2. E.g. the probability that both twins are affected: $\int_{T_1}^{\infty}\int_{T_2}^{\infty} \Phi(L_1, L_2; \mathbf{0}, \Sigma)\, dL_2\, dL_1$, where Φ is the bivariate normal probability density function, L1 and L2 are the liabilities of twin1 and twin2, with means 0, T1 and T2 are the thresholds on the two liabilities, and Σ is the correlation matrix of the two liabilities.

21 For example: for a correlation of .9 and thresholds (z-values) of 1, the probability that both twins are above threshold (proportion d) is around .12. The probability that both twins are below threshold (proportion a) is given by another integral function with reversed boundaries (from -∞ up to the threshold on each liability), and is around .80 in this example.
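The figures quoted in this example can be checked numerically. The sketch below is not the Mx integration routine: it reduces the two-dimensional integral to one dimension by conditioning on the first liability and then applies Simpson's rule, using only the Python standard library. The function names, the integration limit of 8 SD and the labelling of the discordant cells b and c are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal: mean 0, SD 1


def both_above(t1: float, t2: float, r: float, steps: int = 2000, upper: float = 8.0) -> float:
    """P(L1 > t1, L2 > t2) for a standard bivariate normal with correlation r (|r| < 1)."""
    # Given L1 = x, L2 is normal with mean r*x and SD sqrt(1 - r^2).
    cond_sd = sqrt(1.0 - r * r)

    def integrand(x: float) -> float:
        return STD.pdf(x) * (1.0 - STD.cdf((t2 - r * x) / cond_sd))

    # Composite Simpson's rule on [t1, upper]; steps must be even.
    h = (upper - t1) / steps
    total = integrand(t1) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(t1 + k * h)
    return total * h / 3.0


def cell_proportions(t1: float, t2: float, r: float):
    """Expected proportions (a, b, c, d) of the 2 x 2 table.

    Assumed labelling: a = both below threshold, d = both above,
    b = only twin 2 above, c = only twin 1 above.
    """
    d = both_above(t1, t2, r)
    b = (1.0 - STD.cdf(t2)) - d
    c = (1.0 - STD.cdf(t1)) - d
    a = 1.0 - b - c - d
    return a, b, c, d


a, b, c, d = cell_proportions(1.0, 1.0, 0.9)
print(f"a = {a:.2f}, d = {d:.2f}")
```

The same function can be called with a correlation of 0.6 and thresholds of 1.4 to obtain expected proportions for the partitioned bivariate normal shown on the earlier slides.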

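22 χ² statistic: the log-likelihood of the data under the model is subtracted from the log-likelihood of the observed frequencies themselves, and the difference is doubled: $\chi^2 = 2\left[\sum_{i}\sum_{j} n_{ij}\ln(n_{ij}/N) - \sum_{i}\sum_{j} n_{ij}\ln(p_{ij})\right]$, where n_ij is the observed frequency in cell ij, p_ij the expected proportion in cell ij, and N the total number of observations.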

23 The model’s failure to predict the observed data, i.e. a badly fitting model, is reflected in a significant χ².
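To make the χ² computation concrete, here is a hedged sketch of a likelihood-ratio χ² for a contingency table, written as twice the difference between the log-likelihood of the observed proportions and the log-likelihood under the model. It is an illustration of the idea, not the Mx code. The counts are the MZ counts from Practical Exercise 1; the expected proportions in `model_p` are invented for illustration and do not come from any fitted model.

```python
from math import log


def log_lik(counts, proportions):
    """Sum of n_ij * ln(p_ij) over cells; empty cells contribute nothing."""
    return sum(n * log(p) for n, p in zip(counts, proportions) if n > 0)


def lr_chi_square(counts, expected_proportions):
    """Likelihood-ratio chi-square of observed counts against expected proportions."""
    total = sum(counts)
    observed_proportions = [n / total for n in counts]
    saturated = log_lik(counts, observed_proportions)  # observed frequencies themselves
    model = log_lik(counts, expected_proportions)      # data under the model
    return 2.0 * (saturated - model)


mz_counts = [508, 48, 35, 34]          # cells 00, 01, 10, 11
model_p = [0.80, 0.08, 0.08, 0.04]     # hypothetical expected proportions
print(round(lr_chi_square(mz_counts, model_p), 2))
```

A model that reproduces the observed proportions exactly gives a χ² of zero; the worse the predicted proportions, the larger the value.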

24 Model-fitting to Raw Ordinal Data (ordinal data file)

Zyg   respons1   respons2
1     0          0
1     0          1
2     1          0
2     0          0
1     1          1
2     .          1
2     0          .
2     0          1

25 Model-fitting to Raw Ordinal Data: The likelihood of a vector of ordinal responses is computed as the Expected Proportion in the corresponding cell of the multivariate normal (MN) distribution. Expected proportions are calculated by numerical integration of the MN over n dimensions. In this example there are two: the liabilities for twin1 and twin2.

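26 [Figure: bivariate normal liability distribution partitioned by the two thresholds into the four response cells (0 0), (0 1), (1 0) and (1 1).]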

27 Φ is the MN pdf, which is a function of Σ, the correlation matrix of the variables. By maximizing the likelihood of the data under an MN distribution, the ML estimates of the correlation matrix and the thresholds are obtained.

28 Practical Exercise 1
Simulated data for 625 MZ and 625 DZ pairs (h² = .40, c² = .20, e² = .40, so rMZ = .60 and rDZ = .40)
Dichotomized: 0 = bottom 88%, 1 = top 12%. This corresponds to a threshold (z-value) of 1.18.
Observed counts:

        MZ            DZ
        0     1       0     1
0       508   48      497   59
1       35    34      49    20

Raw ORD File: bin.dat
Scripts: tetracor.mx and ACEbin.mx
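The simulation values on this slide follow from the ACE model: the MZ correlation is h² + c² and the DZ correlation is ½h² + c², and the threshold is the z-value that cuts off the top 12%. A short sketch with the standard library (the function name is illustrative):

```python
from statistics import NormalDist


def expected_twin_correlations(h2: float, c2: float):
    """Expected liability correlations for MZ and DZ pairs under an ACE model."""
    r_mz = h2 + c2          # MZ pairs share all of A and all of C
    r_dz = 0.5 * h2 + c2    # DZ pairs share half of A and all of C
    return r_mz, r_dz


r_mz, r_dz = expected_twin_correlations(h2=0.40, c2=0.20)
threshold = NormalDist().inv_cdf(0.88)  # bottom 88% below, top 12% above
print(f"rMZ = {r_mz:.2f}, rDZ = {r_dz:.2f}, threshold = {threshold:.2f}")
```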

29 Practical Exercise 2
Same simulated data
Categorized: 0 = bottom 22%, 1 = mid 66%, 2 = top 12%. This corresponds to thresholds (z-values) of -0.75 and 1.18.
Observed counts:

        MZ                 DZ
        0     1     2      0     1     2
0       80    58    1      63    74    2
1       68    302   47     71    289   57
2       1     34    34     4     45    20

Raw ORD File: cat.dat
Adjust the correlation and ACE script

30 Threshold Specification in Mx
2 Categories. Threshold Matrix: 1 x 2
T(1,1) T(1,2)    threshold for twin1 & twin2
3 Categories. Threshold Matrix: 2 x 2
T(1,1) T(1,2)    threshold 1 for twin1 & twin2
T(2,1) T(2,2)    threshold 2 for twin1 & twin2

31 Threshold Specification in Mx
3 Categories: nthresh=2, nvar=2
Matrix T: nthresh x nvar (2 x 2)
T(1,1) T(1,2)    threshold 1 for twin1 & twin2
T(2,1) T(2,2)    increment
L LOW nthresh nthresh
Value 1 L 1 1 to L nthresh nthresh
Threshold Model L*T /
$\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} * \begin{pmatrix} t_{11} & t_{12} \\ t_{21} & t_{22} \end{pmatrix} = \begin{pmatrix} t_{11} & t_{12} \\ t_{11} + t_{21} & t_{12} + t_{22} \end{pmatrix}$
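The effect of pre-multiplying T by a lower-triangular matrix of ones can be seen in a few lines of plain Python. This is a sketch of the arithmetic only, not of Mx itself; the increment 1.93 is chosen here so that the second threshold lands on 1.18, matching the thresholds of Practical Exercise 2.

```python
def cumulative_thresholds(T):
    """Return L*T, where L is a lower-triangular matrix of ones.

    Row 1 of T holds the first thresholds; each later row holds increments,
    so every row of the product is the running sum of the rows above it.
    """
    n = len(T)
    L = [[1.0 if col <= row else 0.0 for col in range(n)] for row in range(n)]
    return [
        [sum(L[i][k] * T[k][j] for k in range(n)) for j in range(len(T[0]))]
        for i in range(n)
    ]


# Columns are twin1 and twin2; first threshold -0.75, increment 1.93.
T = [[-0.75, -0.75],
     [1.93, 1.93]]
for row in cumulative_thresholds(T):
    print([round(value, 2) for value in row])
```

As long as the increments are kept positive (for example by bounding them during optimisation), the resulting thresholds stay in increasing order.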

32 Using Frequency Weights: The 1250-line data file (bin.dat) can be summarized like this (ordinal data file):

Zyg   respons1   respons2   FREQ
1     0          0          508
1     0          1          48
1     1          0          35
1     1          1          34
2     0          0          497
2     0          1          59
2     1          0          49
2     1          1          20
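Collapsing a raw file into frequency-weighted lines amounts to counting identical rows. The sketch below rebuilds the 1250 rows in memory from the counts on this slide rather than reading bin.dat (whose exact layout is not shown here), and it handles complete rows only; `collapse` is an illustrative name.

```python
from collections import Counter


def collapse(rows):
    """Summarize (zyg, respons1, respons2) rows as (zyg, respons1, respons2, freq)."""
    counts = Counter(rows)
    return [(zyg, r1, r2, n) for (zyg, r1, r2), n in sorted(counts.items())]


# Rebuild one row per twin pair from the cell counts on the slide.
rows = (
    [(1, 0, 0)] * 508 + [(1, 0, 1)] * 48 + [(1, 1, 0)] * 35 + [(1, 1, 1)] * 34
    + [(2, 0, 0)] * 497 + [(2, 0, 1)] * 59 + [(2, 1, 0)] * 49 + [(2, 1, 1)] * 20
)
for line in collapse(rows):
    print(*line)
```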

33 Using Frequency Weights
G1: Data and model for MZ correlation
DAta NGroups=2 NInput_vars=4 Missing=.
Ordinal File=binF.dat
Labels zyg bin1 bin2 freq
SELECT IF zyg = 1
SELECT bin1 bin2 freq /
DEFINITION freq /
Begin Matrices;
 R STAN 2 2 FREE             ! Correlation matrix
 T FULL nthresh nvar FREE    ! thresh tw1, thresh tw2
 L LOW nthresh nthresh
 F FULL 1 1
End matrices;
Value 1 L 1 1 to L nthresh nthresh   ! initialize L
COV R /                      ! Predicted Correlation matrix for MZ pairs
Thresholds L*T /             ! to ensure t1<t2<t3 etc.
.......
FREQ F /

34 Example: 2 variables measured in twins
x has 2 categories: 0 = below T_x1, 1 = above T_x1
y has 3 categories: 0 = below T_y1, 1 = between T_y1 and T_y2, 2 = above T_y2
Ordinal response vector (x1, y1, x2, y2), for example (1 2 0 1)
The likelihood of this vector of observations is the Expected Proportion in the corresponding cell of the multivariate normal (MN) distribution.

35 Proband-Ascertained Samples For rare disorders (e.g. Schizophrenia), selecting a random sample of twins will lead to the vast majority of pairs being unaffected. A more efficient design is to ascertain twin pairs through a register of affected individuals. When an affected twin (the proband) is identified, the cotwin is followed up to see if he or she is also affected. There are several types of ascertainment

36 Types of ascertainment: Complete Ascertainment; Single Ascertainment

37 Ascertainment Correction
Omission of certain classes from observation leads to an increase of the likelihood of observing the remaining individuals. Mx corrects for incomplete ascertainment by dividing the likelihood by the proportion of the population remaining after ascertainment. A CT from ascertained data can be analysed in Mx by simply substituting a -1 for the missing cells:
CTable 2 2
-1 11
-1 13
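The correction described here can be sketched as a renormalisation: the expected proportion of each observed cell is divided by the total expected proportion of the cells that remain after ascertainment. The code below illustrates that idea and is not Mx's internal implementation. The -1 flag follows the slide, the counts 11 and 13 are the two observed cells of the example table, and the expected proportions are invented for illustration.

```python
from math import log

MISSING = -1  # flag for cells that were not observed


def ascertained_log_lik(counts, proportions):
    """Log-likelihood of the observed cells, corrected for ascertainment."""
    # Proportion of the population remaining after ascertainment.
    remaining = sum(p for n, p in zip(counts, proportions) if n != MISSING)
    return sum(
        n * log(p / remaining)
        for n, p in zip(counts, proportions)
        if n != MISSING and n > 0
    )


counts = [MISSING, 11, MISSING, 13]          # row-wise cells of the 2 x 2 table
proportions = [0.80, 0.08, 0.08, 0.04]       # hypothetical expected proportions
print(round(ascertained_log_lik(counts, proportions), 3))
```

Dividing by the remaining proportion makes the corrected proportions of the observed cells sum to one, which is why each remaining observation becomes more likely than it would be in an unselected sample.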

38 Summary: For a 2 x 2 CT: 3 observed statistics, 3 parameters (1 correlation, 2 thresholds), df = 0 → any pattern of observed frequencies can be accounted for, so there is no goodness-of-fit test of the normal distribution assumption. This problem is solved when we have a CT which is at least 3 x 2: df > 0. A significant χ² reflects departure from normality.
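The degrees-of-freedom count generalises to any table size. In this sketch, the observed statistics are the free cell proportions (cells minus one) and the parameters are one correlation plus one threshold per category boundary on each variable; the function name is illustrative.

```python
def ct_degrees_of_freedom(rows: int, cols: int) -> int:
    """Degrees of freedom when fitting a correlation and thresholds to a rows x cols CT."""
    observed_statistics = rows * cols - 1         # cell proportions sum to 1
    parameters = 1 + (rows - 1) + (cols - 1)      # 1 correlation + thresholds
    return observed_statistics - parameters


for shape in ((2, 2), (3, 2), (3, 3)):
    print(shape, ct_degrees_of_freedom(*shape))
```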

39 Summary: Power to detect certain effects increases with an increasing number of categories, so continuous data are the most powerful. For raw ordinal data analyses, the first category must be coded 0! Threshold specification is different when analyzing CTs.

