GENETIC ANALYSIS OF BINARY and CATEGORICAL TRAITS PART ONE.

TABLE 1. Twin Pair Concordances for Major Depression (Virginia Twin Study data, adapted from Neale and Cardon, 1992)

                         MZ FEMALE PAIRS             DZ FEMALE PAIRS
                         Twin B                      Twin B
                         Unaffected   Affected       Unaffected   Affected
Twin A - Unaffected         329          83             201          94
       - Affected            95          83              82          63

Prevalence = proportion of affected (here, depressed) twins in the general population.

Prevalence = (2 x concordant affected pairs + discordant pairs) / (2 x total pairs)

e.g. for MZ pairs = (166 + 95 + 83) / 1180 = 29.2%
e.g. for DZ pairs = (126 + 82 + 94) / 880 = 34.3%

Probandwise concordance rate = probability that the cotwin of a depressed twin will also have a history of depression.

Probandwise concordance rate = (2 x concordant affected pairs) / (2 x concordant affected pairs + discordant pairs)

e.g. for MZ pairs = 166 / (166 + 95 + 83) = 48.3%
e.g. for DZ pairs = 126 / (126 + 82 + 94) = 41.7%

Why do we have (2 x number of concordant affected pairs) in the numerator and denominator of the expression for the probandwise concordance rate? Consider a simple example where there are 4 affected individuals, who came from 3 twin pairs, i.e.

1 - 0
1 - 0
1 - 1

There are 4 potential probands, and 2 of them (the two members of the concordant pair) have an affected cotwin. So if we randomly select an affected individual, the probability that the cotwin of that individual is also affected will be 2/4 = 50%.
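The counting argument in this example can be checked directly. The short Python sketch below is an illustration only (the variable names are not from the slides): it enumerates every affected twin as a potential proband and compares the result with the pair-based formula.

```python
# Each pair is (twin A status, twin B status); 1 = affected, 0 = unaffected.
pairs = [(1, 0), (1, 0), (1, 1)]

# Every affected twin is a potential proband; record the status of the cotwin.
cotwin_status = []
for a, b in pairs:
    if a == 1:
        cotwin_status.append(b)
    if b == 1:
        cotwin_status.append(a)

n_probands = len(cotwin_status)
by_enumeration = sum(cotwin_status) / n_probands

# Pair-based formula: 2 x concordant / (2 x concordant + discordant).
concordant = sum(1 for a, b in pairs if a == 1 and b == 1)
discordant = sum(1 for a, b in pairs if a != b)
by_formula = 2 * concordant / (2 * concordant + discordant)

print(n_probands, by_enumeration, by_formula)
```

Both routes give the same answer because each concordant pair contributes two probands, each with an affected cotwin.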

Recurrence Risk Ratio = Probandwise concordance rate / Prevalence

e.g. for MZ pairs = 48.3 / 29.2 = 1.65
e.g. for DZ pairs = 41.7 / 34.3 = 1.22
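As a worked check of the three summary statistics, the following Python sketch computes them from the four cell counts of Table 1. The function name and argument order are assumptions made for this illustration.

```python
def twin_summary(n00, n01, n10, n11):
    """Return (prevalence, probandwise concordance, recurrence risk ratio).

    n00 = both unaffected, n01 and n10 = discordant pairs, n11 = both affected.
    """
    total_pairs = n00 + n01 + n10 + n11
    affected_twins = 2 * n11 + n01 + n10
    prevalence = affected_twins / (2 * total_pairs)
    probandwise = 2 * n11 / affected_twins
    return prevalence, probandwise, probandwise / prevalence


mz = twin_summary(329, 83, 95, 83)   # MZ female pairs, Table 1
dz = twin_summary(201, 94, 82, 63)   # DZ female pairs, Table 1

for label, (prev, conc, ratio) in (("MZ", mz), ("DZ", dz)):
    print(f"{label}: prevalence={prev:.3f} concordance={conc:.3f} ratio={ratio:.3f}")
```

Working from unrounded proportions can shift the last digit of the ratio slightly relative to dividing the rounded percentages.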

Odds Ratios
- for binary data, a widely used measure of association, especially in epidemiology

For a 2 x 2 table with cells

   a   b
   c   d

Odds Ratio = (a x d) / (b x c)

MZ Odds Ratio for depression: 3.46
DZ Odds Ratio for depression: 1.64

The odds ratio can also be estimated via a multiple logistic regression model, to allow statistical control for covariates. In some applications, a probit model may be used instead (see later). In general, logistic regression and probit regression models lead to almost identical conclusions about the statistical significance of an association.
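A minimal Python sketch of the cross-product calculation, using the Table 1 counts, is given below. The function name is an assumption for illustration, and the cells are taken as a = both unaffected, b and c = discordant, d = both affected.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for a 2 x 2 table with rows (a, b) and (c, d)."""
    return (a * d) / (b * c)


mz_or = odds_ratio(329, 83, 95, 83)   # MZ female pairs
dz_or = odds_ratio(201, 94, 82, 63)   # DZ female pairs

print(round(mz_or, 2), round(dz_or, 2))
```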

TABLE 1a. Twin Pair Concordances for Alcohol Dependence (DSM-IIIR) (Virginia Twin Study data, from Kendler et al., 1992)

                            MZ Female Pairs    DZ Female Pairs
N pairs                          590                440
Population prevalence            8.1%               10.2%
Probandwise concordance          31.6%              24.4%

Number of concordant alcoholic pairs = N pairs x prevalence x probandwise concordance
   MZ: 15 pairs    DZ: 11 pairs

Number of discordant pairs = 2 x N pairs x prevalence x (1 - probandwise concordance)
   MZ: 65 pairs    DZ: 68 pairs

Number of concordant unaffected pairs = N pairs - concordant alcoholic pairs - discordant pairs
   MZ: 510 pairs   DZ: 361 pairs
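The back-calculation of cell counts from published summary statistics can be sketched in Python as follows. The function name is hypothetical; the inputs are the values in Table 1a.

```python
def cells_from_summary(n_pairs, prevalence, probandwise):
    """Recover (concordant affected, discordant, concordant unaffected) pair counts."""
    concordant = n_pairs * prevalence * probandwise
    discordant = 2 * n_pairs * prevalence * (1 - probandwise)
    unaffected = n_pairs - concordant - discordant
    return concordant, discordant, unaffected


mz_cells = cells_from_summary(590, 0.081, 0.316)
dz_cells = cells_from_summary(440, 0.102, 0.244)

print([round(x) for x in mz_cells])
print([round(x) for x in dz_cells])
```

The recovered counts are not whole numbers, which is why the derived data file splits the discordant pairs into two equal, fractional cells.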

[Figure: Alcoholism risk. (a) Normal liability threshold model: a single threshold t1 separates UNAFFECTED from AFFECTED. (b) Multiple-threshold model: thresholds t1 and t2 separate UNAFFECTED, MILD CASES and SEVERE CASES.]

CUMULATIVE NORMAL FREQUENCY DISTRIBUTION

Threshold value (t)    Prevalence (area under the standard normal curve above t)
     -0.25                 60%
      0.0                  50%
      0.25                 40%
      0.53                 30%
      0.84                 20%
      1.04                 15%
      1.28                 10%
      1.64                  5%
      1.95                  2.5%
      2.33                  1%
      3.08                  0.1%
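The mapping between threshold and prevalence can be reproduced with the Python standard library. This is a sketch only; the two function names are assumptions for this illustration.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal liability distribution


def threshold_for(prevalence):
    """Threshold t such that the area above t equals the prevalence."""
    return STD.inv_cdf(1.0 - prevalence)


def prevalence_for(threshold):
    """Area under the standard normal curve above the threshold."""
    return 1.0 - STD.cdf(threshold)


for p in (0.60, 0.50, 0.30, 0.10, 0.01):
    print(f"prevalence {p:.2f} -> threshold {threshold_for(p):.2f}")
```

Values computed this way may differ from the tabulated ones in the second decimal place because of rounding.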

Table 3. Population distribution of pairs of relatives with both alcoholic, neither alcoholic, or only one relative alcoholic, as a function of (i) lifetime prevalence of alcoholism, and (ii) liability correlation for alcoholism in relatives

                                         Discordant (%)
             Liability     Both          (A affected only;     Both             Risk to relative      Relatives' Recurrence
Prevalence   correlation   affected (%)   same for B only)     unaffected (%)   of an alcoholic (a)   Risk Ratio
30%            0.6           17.3           12.7                 57.3             57.6                  1.9
               0.3           12.8           17.2                 52.8             42.7                  1.4
               0.15          10.9           19.1                 50.9             36.2                  1.2
20%            0.6            9.9           10.1                 69.9             49.6                  2.5
               0.3            6.6           13.4                 66.6             33.1                  1.7
               0.15           5.2           14.8                 65.2             26.2                  1.3
10%            0.6            3.9            6.1                 83.9             39.0                  3.9
               0.3            2.2            7.8                 82.2             21.6                  2.2
               0.15           1.5            8.5                 81.5             15.2                  1.5

(a) i.e. Probandwise concordance rate
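Rows of Table 3 can be approximated numerically under the liability threshold model. The Python sketch below assumes a standard bivariate normal liability distribution with a common threshold for both relatives, and uses a simple trapezoid rule; the function names, step count and integration width are choices made for this illustration, not part of the original analysis.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal liability distribution


def both_affected(prevalence, rho, steps=4000, width=10.0):
    """P(both relatives exceed the threshold); requires -1 < rho < 1."""
    t = STD.inv_cdf(1.0 - prevalence)
    s = math.sqrt(1.0 - rho * rho)
    h = width / steps

    def integrand(x):
        # density of relative A at x, times P(relative B > t | A = x)
        return STD.pdf(x) * STD.cdf((rho * x - t) / s)

    total = 0.5 * (integrand(t) + integrand(t + width))
    for i in range(1, steps):
        total += integrand(t + i * h)
    return total * h


def table_row(prevalence, rho):
    """Return (both, each discordant cell, neither, risk to relative, risk ratio)."""
    both = both_affected(prevalence, rho)
    each_discordant = prevalence - both
    neither = 1.0 - both - 2.0 * each_discordant
    risk = both / prevalence
    return both, each_discordant, neither, risk, risk / prevalence


row = table_row(0.30, 0.6)
print([round(100 * x, 1) for x in row[:4]], round(row[4], 1))
```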

EXAMPLE DATA-FILE FOR MX RAW ORDINAL DATA: MZF DEPRESSION DATA (depmzf.rec)

Twin A   Twin B   Frequency
  0        0        329
  0        1         83
  1        0         95
  1        1         83

See MX manual for fit function (pp 89-90)

EXAMPLE DATA-FILE (II): DERIVED FROM PUBLISHED SOURCES
MZF ALCOHOL DEPENDENCE DATA (alcmzf.rec)

  0        0        510
  0        1         32.5
  1        0         32.5
  1        1         15

! tetrachoric.mx
! estimating tetrachoric correlations
#define nvar 1
#define maxthresf 1 ! number of thresholds
Analysis of depression data: estimating tetrachorics & confidence intervals
data NI=3 NG=4
LAbels twina twinb countmz
Ordinal fi=depmzf.rec
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
Definition_variables countmz /
Begin matrices;
 W LO nvar nvar fr       ! w*w' is the tetrachoric correlation
 Y LO nvar nvar fr       ! y*y' is 1-tetrachoric correlation
 M FU maxthresf nvar fi  ! this is where we will store the thresholds
 S DI nvar nvar          ! Matrix that will store weight variable
end matrices;
SP M 3
MATRIX M 1.5487
! This tells MX to store the definition variable count in S
SP S -1
mat w 0.7
mat y 0.7

Begin algebra;
 R=W*W';
 E=Y*Y';
 V=R+E;
end algebra;
FREQ S;  ! tells MX that S contains the weight (frequency) variable
TH M|M;  ! tells MX that row and column thresholds contained in M|M
CO V|R_
   R'|V; ! formula for correlation matrix
!
bo 0.001 1.0 y(1,1)
bo 0.0001 0.999 w(1,1)
bo -5.0 5.0 m(1,1)
interval r(1,1) ! compute 95% confidence interval for correlation
OPT func=1.E-12
OPT RS
END

Analysis of depression data: DZm
data NI=3
LAbels twina twinb countdz
OR fi=depdzf.rec
Definition_variables countdz /
Begin matrices;
 W LO nvar nvar fr       ! w*w' is the tetrachoric correlation for DZ group
 Y LO nvar nvar fr       ! y*y' is 1-tetrachoric correlation for DZ group
 N FU maxthresf nvar fr
 S DI nvar nvar          ! Matrix that will store weight variable
end matrices;
SP N 6
MATRIX N 1.4487
SP S -1
mat w 0.6
mat y 0.8
Begin algebra;
 R=W*W';
 E=Y*Y';
 V=R+E;
end algebra;
FREQ S;
TH N|N;
CO V|R_
   R'|V;
bo 0.001 1.0 y(1,1)
bo 0.0001 0.999 w(1,1)
bo -5.0 5.0 n(1,1)
interval r(1,1) ! compute 95% confidence interval for correlation
OPT RS
END

Constraint function - constrain variances to unity for MZ group
CO NI=1
Begin matrices = group 1;
 U unit 1 nvar
end matrices;
CO \d2v(V) = u;
end

Constraint function - constrain variances to unity for DZ group
CO NI=1
Begin matrices = group 2;
 U unit 1 nvar
end matrices;
CO \d2v(V) = u;
end
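To make the raw ordinal fit function concrete, the Python sketch below evaluates -2 log-likelihood for a 2x2 table of frequency counts, given one threshold (shared by both twins) and a tetrachoric correlation. It is an independent illustration under a bivariate normal liability assumption, not the MX implementation; the function names and the trapezoid-rule integration are assumptions made here.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def p_both_affected(threshold, rho, steps=4000, width=10.0):
    """P(X > t, Y > t) for a standard bivariate normal with correlation rho (|rho| < 1)."""
    s = math.sqrt(1.0 - rho * rho)
    h = width / steps

    def f(x):
        return STD.pdf(x) * STD.cdf((rho * x - threshold) / s)

    total = 0.5 * (f(threshold) + f(threshold + width))
    total += sum(f(threshold + i * h) for i in range(1, steps))
    return total * h


def minus2_loglik(counts, threshold, rho):
    """counts = (n00, n01, n10, n11); each count weights the log of its cell probability."""
    n00, n01, n10, n11 = counts
    p_affected = 1.0 - STD.cdf(threshold)
    p11 = p_both_affected(threshold, rho)
    p10 = p_affected - p11            # equals p01 because both twins share the threshold
    p00 = 1.0 - 2.0 * p10 - p11
    return -2.0 * (n00 * math.log(p00)
                   + (n01 + n10) * math.log(p10)
                   + n11 * math.log(p11))


# MZ depression counts, evaluated at the MX estimates reported in the output
mz_fit = minus2_loglik((329, 83, 95, 83), threshold=0.5489, rho=0.4340)
print(round(mz_fit, 2))
```

Evaluated at the reported MZ estimates, the result should land close to the function value that MX prints for group 1, up to rounding of the inputs.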

Summary of VL file data for group 1

               COUNTMZ        TWINA        TWINB
Code       -1.0000E+00   1.0000E+00   2.0000E+00
Number      4.0000E+00   4.0000E+00   4.0000E+00
Mean        1.4750E+02   5.0000E-01   5.0000E-01
Variance    1.1005E+04   2.5000E-01   2.5000E-01
Minimum     8.3000E+01   0.0000E+00   0.0000E+00
Maximum     3.2900E+02   1.0000E+00   1.0000E+00

Summary of VL file data for group 2

               COUNTDZ        TWINA        TWINB
Code           -1.0000       1.0000       2.0000
Number          4.0000       4.0000       4.0000
Mean          110.0000       0.5000       0.5000
Variance     2882.5000       0.2500       0.2500
Minimum        63.0000       0.0000       0.0000
Maximum       201.0000       1.0000       1.0000

PARAMETER SPECIFICATIONS

GROUP NUMBER: 1
Analysis of depression data: estimating tetrachorics & confidence intervals

MATRIX E
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX M
This is a FULL matrix of order 1 by 1
     1
 1   3

MATRIX R
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
     1
 1  -1

MATRIX V
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
     1
 1   1

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
     1
 1   2

GROUP NUMBER: 2
Analysis of ordinal alcohol tolerance and dependence data: DZm

MATRIX E
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX N
This is a FULL matrix of order 1 by 1
     1
 1   6

MATRIX R
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
     1
 1  -1

MATRIX V
This is a computed FULL matrix of order 1 by 1
It has no free parameters specified

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
     1
 1   4

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
     1
 1   5

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of depression data: estimating tetrachorics & confidence intervals

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
          1
 1   0.5660

MATRIX M
This is a FULL matrix of order 1 by 1
          1
 1   0.5489

MATRIX R
This is a computed FULL matrix of order 1 by 1
[=W*W']
          1
 1   0.4340

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
           1
 1   83.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=R+E]
          1
 1   1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.6588

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.7523

Matrix of EXPECTED thresholds
               TWINA    TWINB
Threshold 1   0.5489   0.5489

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          TWINA    TWINB
TWINA    1.0000
TWINB    0.4340   1.0000

Function value of this group: 1383.2565
Where the fit function is -2 * Log-likelihood of raw ordinal

GROUP NUMBER: 2
Analysis of ordinal alcohol tolerance and dependence data: DZm

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
          1
 1   0.8157

MATRIX N
This is a FULL matrix of order 1 by 1
          1
 1   0.4038

MATRIX R
This is a computed FULL matrix of order 1 by 1
[=W*W']
          1
 1   0.1843

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
           1
 1   63.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=R+E]
          1
 1   1.0000

Your model has 6 estimated parameters and 18 Observed statistics
Observed statistics include 2 constraints.

-2 times log-likelihood of data >>> 2509.632
Degrees of freedom >>>>>>>>>>>>>>>> 12

1 Confidence intervals requested in group 1
Matrix Element Int.   Estimate    Lower    Upper   Lfail Ufail
R   1   1   1  95.0     0.4340   0.3086   0.5477    0 0   0 0

1 Confidence intervals requested in group 2
Matrix Element Int.   Estimate    Lower    Upper   Lfail Ufail
R   2   1   1  95.0     0.1843   0.0306   0.3316    0 0   0 0

This problem used 0.2% of my workspace

Task                     Time elapsed (DD:HH:MM:SS)
Reading script & data      0: 0: 0: 0.11
Execution                  0: 0: 0: 2.85
TOTAL                      0: 0: 0: 2.96

Total number of warnings issued: 0

** Mx startup successful **
**MX-Sunos version 1.49**

! tetra2.mx
! estimating tetrachoric correlations

The following MX script lines were read for group 1

#DEFINE NVAR 1
#DEFINE MAXTHRESF 1 ! NUMBER OF THRESHOLDS
ANALYSIS OF ALCOHOLISM DATA: ESTIMATING TETRACHORICS & CONFIDENCE INTERVALS
DATA NI=3 NO=2 NG=4
LABELS TWINA TWINB COUNTMZ
ORDINAL FI=ALCMZF.REC
Ordinal data read initiated
NOTE: Rectangular file contained 4 records with data
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
DEFINITION_VARIABLES COUNTMZ /
NOTE: Definition yields 4 data vectors for analysis
NOTE: Vectors contain a total of 8 observations

Summary of VL file data for group 1

               COUNTMZ        TWINA        TWINB
Code       -1.0000E+00   1.0000E+00   2.0000E+00
Number      4.0000E+00   4.0000E+00   4.0000E+00
Mean        1.4750E+02   5.0000E-01   5.0000E-01
Variance    4.3853E+04   2.5000E-01   2.5000E-01
Minimum     1.5000E+01   0.0000E+00   0.0000E+00
Maximum     5.1000E+02   1.0000E+00   1.0000E+00

Summary of VL file data for group 2

               COUNTDZ        TWINA        TWINB
Code       -1.0000E+00   1.0000E+00   2.0000E+00
Number      4.0000E+00   4.0000E+00   4.0000E+00
Mean        1.1000E+02   5.0000E-01   5.0000E-01
Variance    2.1088E+04   2.5000E-01   2.5000E-01
Minimum     1.1000E+01   0.0000E+00   0.0000E+00
Maximum     3.6100E+02   1.0000E+00   1.0000E+00

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of alcoholism data: estimating tetrachorics & confidence intervals

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
          1
 1   0.4688

MATRIX M
This is a FULL matrix of order 1 by 1
          1
 1   1.4017

MATRIX R
This is a computed FULL matrix of order 1 by 1
[=W*W']
          1
 1   0.5312

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
           1
 1   15.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=R+E]
          1
 1   1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.7288

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.6847

Matrix of EXPECTED thresholds
               TWINA    TWINB
Threshold 1   1.4017   1.4017

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          TWINA    TWINB
TWINA    1.0000
TWINB    0.5312   1.0000

Function value of this group: 635.6429
Where the fit function is -2 * Log-likelihood of raw ordinal

GROUP NUMBER: 2
Analysis of ordinal alcohol tolerance and dependence data: DZm

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
          1
 1   0.6482

MATRIX N
This is a FULL matrix of order 1 by 1
          1
 1   1.2687

MATRIX R
This is a computed FULL matrix of order 1 by 1
[=W*W']
          1
 1   0.3518

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
           1
 1   11.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=R+E]
          1
 1   1.0000

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.8051

Your model has 6 estimated parameters and 18 Observed statistics
Observed statistics include 2 constraints.

-2 times log-likelihood of data >>> 1207.896
Degrees of freedom >>>>>>>>>>>>>>>> 12

1 Confidence intervals requested in group 1
Matrix Element Int.   Estimate    Lower    Upper   Lfail Ufail
R   1   1   1  95.0     0.5312   0.3367   0.6903    0 0   0 0

1 Confidence intervals requested in group 2
Matrix Element Int.   Estimate    Lower    Upper   Lfail Ufail
R   2   1   1  95.0     0.3518   0.1190   0.5558    0 0   0 0

This problem used 0.2% of my workspace

Task                     Time elapsed (DD:HH:MM:SS)
Reading script & data      0: 0: 0: 0.11
Execution                  0: 0: 0: 3.41
TOTAL                      0: 0: 0: 3.52

Total number of warnings issued: 0

ESTIMATED TETRACHORIC CORRELATIONS
(estimating separate thresholds for each zygosity group)

                     DEPRESSION              ALCOHOL DEPENDENCE
                     ρ       95% CI          ρ       95% CI
MZF                  0.43    0.31-0.55       0.53    0.34-0.69
DZF                  0.18    0.03-0.33       0.35    0.12-0.56
-2 log-likelihood    2509.632                1207.896

TEST FOR ZYGOSITY DIFFERENCE IN PREVALENCE
(takes into account non-independence!)

                                   DEPRESSION                ALCOHOL DEPENDENCE
-2 ln L
 (i)  Separate thresholds model    2509.632                  1207.896
 (ii) Equal thresholds             2514.897                  1210.304
Heterogeneity (ii - i)             χ2 (1 d.f.) = 5.265,      χ2 (1 d.f.) = 2.408,
                                   p=0.02                    p=0.12
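The heterogeneity test is a likelihood-ratio test: the difference in -2 ln L between the restricted and the general model is referred to a chi-square distribution. A minimal Python sketch follows, using closed-form tail probabilities for 1 and 2 degrees of freedom only; the function name is an assumption for this illustration.

```python
import math


def lrt_p_value(m2ll_restricted, m2ll_general, df):
    """Return (chi-square, p) for a likelihood-ratio test with df of 1 or 2."""
    chi2 = max(m2ll_restricted - m2ll_general, 0.0)
    if df == 1:
        p = math.erfc(math.sqrt(chi2 / 2.0))   # upper tail of chi-square, 1 d.f.
    elif df == 2:
        p = math.exp(-chi2 / 2.0)              # upper tail of chi-square, 2 d.f.
    else:
        raise ValueError("this sketch only covers 1 or 2 degrees of freedom")
    return chi2, p


dep_chi2, dep_p = lrt_p_value(2514.897, 2509.632, df=1)   # depression
alc_chi2, alc_p = lrt_p_value(1210.304, 1207.896, df=1)   # alcohol dependence

print(f"depression: chi2={dep_chi2:.3f} p={dep_p:.2f}")
print(f"alcohol dependence: chi2={alc_chi2:.3f} p={alc_p:.2f}")
```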

This approach extends naturally to fitting univariate genetic models.

! univariate.mx
! fitting a univariate genetic model to 2x2 data
#define nvar 1
#define maxthresf 1 ! number of thresholds
Analysis of depression data: fitting ACE model
data NI=3 NG=3
LAbels twina twinb countmz
Ordinal fi=depmzf.rec
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
Definition_variables countmz /
Begin matrices;
 W LO nvar nvar fr       ! additive genetic path (A=w*w')
 X LO nvar nvar fr       ! shared environmental path (C=x*x')
 Y LO nvar nvar fr       ! non-shared environmental path (E=y*y')
 Z LO nvar nvar fi       ! non-additive genetic path (D=z*z')
 M FU maxthresf nvar fi  ! matrix of thresholds
 S DI nvar nvar          ! Matrix that will store weight variable
end matrices;
SP M 4
MATRIX M 1.5487
! This tells MX to store the definition variable count in S
SP S -1
mat w 0.5
mat x 0.5
mat y 0.7

Begin algebra;
 A=W*W';
 C=X*X';
 E=Y*Y';
 D=Z*Z';
 V=A+C+D+E;
end algebra;
FREQ S;  ! tells MX that S contains the weight (frequency) variable
TH M|M;  ! tells MX that row and column thresholds contained in M|M
CO V|A+D+C_
   A'+D'+C'|V; ! formula for correlation matrix
!
bo 0.001 1.0 y(1,1)
bo 0.0001 0.999 w(1,1) x(1,1)
bo -5.0 5.0 m(1,1)
interval a(1,1) c(1,1) e(1,1) ! compute 95% confidence intervals for the variance components
OPT func=1.E-12
OPT RS
END

Analysis of depression data: DZm
data NI=3 NO=4
LAbels twina twinb countdz
OR fi=depdzf.rec
Definition_variables countdz /
Begin matrices = group 1;
 S DI nvar nvar          ! Matrix that will store weight variable
 g DI 1 1                ! constant (=0.5) for coefficient of additive genetic component
 h DI 1 1                ! constant (=0.25) for coefficient of dominance genetic component
 n FU maxthresf nvar fi  ! matrix of thresholds
end matrices;
SP N 5
MATRIX N 1.4487
MAT g 0.5
MAT h 0.25
SP S -1
FREQ S;
TH N|N;
CO V|g@A+h@D+C_
   g@A'+h@D'+C'|V; ! formula for correlation matrix
!
bo -5.0 5.0 n(1,1)
OPT RS
END

Constraint function - constrain variance to unity
CO NI=1
Begin matrices = group 1;
 U unit 1 nvar
end matrices;
CO \d2v(V) = u;
end
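The two CO statements in this script encode the expected twin correlations on the liability scale once total variance is constrained to one: MZ pairs share all of A and D, DZ pairs share half of A and a quarter of D, and both share C. A minimal Python sketch is given below; the function name is an assumption, and the inputs are variance components taken from the MX outputs in this document.

```python
def expected_twin_correlations(a, c, d=0.0):
    """Expected (MZ, DZ) liability correlations for standardized components A, C, D."""
    r_mz = a + d + c
    r_dz = 0.5 * a + 0.25 * d + c
    return r_mz, r_dz


# Variance components reported by MX in the outputs of this document
dep_mz, dep_dz = expected_twin_correlations(a=0.4250, c=1.0e-08)
alc_mz, alc_dz = expected_twin_correlations(a=0.3588, c=0.1724)
smk_mz, smk_dz = expected_twin_correlations(a=0.5836, c=0.1823)

print(round(dep_mz, 4), round(alc_mz, 4), round(smk_mz, 4), round(smk_dz, 4))
```

The MZ values agree with the off-diagonal elements of the expected covariance matrices printed by MX for group 1 of each analysis.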

** Mx startup successful **
**MX-Sunos version 1.49**

! univariate.mx
! fitting a univariate genetic model to 2x2 data

The following MX script lines were read for group 1

#DEFINE NVAR 1
#DEFINE MAXTHRESF 1 ! NUMBER OF THRESHOLDS
ANALYSIS OF DEPRESSION DATA: FITTING ACE MODEL
DATA NI=3 NO=2 NG=3
LABELS TWINA TWINB COUNTMZ
ORDINAL FI=DEPMZF.REC
Ordinal data read initiated
NOTE: Rectangular file contained 4 records with data
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
DEFINITION_VARIABLES COUNTMZ /
NOTE: Definition yields 4 data vectors for analysis
NOTE: Vectors contain a total of 8 observations
BEGIN MATRICES;
 W LO NVAR NVAR FR       ! ADDITIVE GENETIC PATH (A=W*W')
 X LO NVAR NVAR FR       ! SHARED ENVIRONMENTAL PATH (C=X*X')
 Y LO NVAR NVAR FR       ! NON-SHARED ENVIRONMENTAL PATH (E=Y*Y')
 Z LO NVAR NVAR FI       ! NON-ADDITIVE GENETIC PATH (D=Z*Z')
 M FU MAXTHRESF NVAR FI  ! MATRIX OF THRESHOLDS
 S DI NVAR NVAR          ! MATRIX THAT WILL STORE WEIGHT VARIABLE
END MATRICES;

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of depression data: fitting ACE model

MATRIX A
This is a computed FULL matrix of order 1 by 1
[=W*W']
          1
 1   0.4250

MATRIX C
This is a computed FULL matrix of order 1 by 1
[=X*X']
              1
 1   1.0000E-08

MATRIX D
This is a computed FULL matrix of order 1 by 1
[=Z*Z']
          1
 1   0.0000

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
          1
 1   0.5750

MATRIX M
This is a FULL matrix of order 1 by 1
          1
 1   0.5493

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
           1
 1   83.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=A+C+D+E]
          1
 1   1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.6519

MATRIX X
This is a LOWER TRIANGULAR matrix of order 1 by 1
              1
 1   1.0000E-04

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.7583

MATRIX Z
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.0000

Matrix of EXPECTED thresholds
               TWINA    TWINB
Threshold 1   0.5493   0.5493

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          TWINA    TWINB
TWINA    1.0000
TWINB    0.4250   1.0000

Function value of this group: 1383.2782
Where the fit function is -2 * Log-likelihood of raw ordinal

Your model has 5 estimated parameters and 17 Observed statistics
Observed statistics include 1 constraints.

-2 times log-likelihood of data >>> 2509.788
Degrees of freedom >>>>>>>>>>>>>>>> 12

3 Confidence intervals requested in group 1
Matrix Element Int.   Estimate    Lower    Upper   Lfail Ufail
A   1   1   1  95.0     0.4250   0.1045   0.5325    0 0   0 0
C   1   1   1  95.0     0.0000   0.0000   0.2609    0 0   0 1
E   1   1   1  95.0     0.5750   0.4675   0.6940    0 0   0 0

This problem used 0.1% of my workspace

Task                     Time elapsed (DD:HH:MM:SS)
Reading script & data      0: 0: 0: 0.10
Execution                  0: 0: 0:15.76
TOTAL                      0: 0: 0:15.86

Total number of warnings issued: 1

** Mx startup successful **
**MX-Sunos version 1.49**

! univar2.mx
! fitting a univariate genetic model to 2x2 data

The following MX script lines were read for group 1

#DEFINE NVAR 1
#DEFINE MAXTHRESF 1 ! NUMBER OF THRESHOLDS
ANALYSIS OF ALCOHOL DEPENDENCE DATA: FITTING ACE MODEL
DATA NI=3 NO=2 NG=3
LABELS TWINA TWINB COUNTMZ
ORDINAL FI=ALCMZF.REC
Ordinal data read initiated
NOTE: Rectangular file contained 4 records with data
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 2x2 table
!
DEFINITION_VARIABLES COUNTMZ /
NOTE: Definition yields 4 data vectors for analysis
NOTE: Vectors contain a total of 8 observations

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of alcohol dependence data: fitting ACE model

MATRIX A
This is a computed FULL matrix of order 1 by 1
[=W*W']
          1
 1   0.3588

MATRIX C
This is a computed FULL matrix of order 1 by 1
[=X*X']
          1
 1   0.1724

MATRIX D
This is a computed FULL matrix of order 1 by 1
[=Z*Z']
          1
 1   0.0000

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
          1
 1   0.4688

MATRIX M
This is a FULL matrix of order 1 by 1
          1
 1   1.4017

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
           1
 1   15.0000

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=A+C+D+E]
          1
 1   1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.5990

MATRIX X
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.4152

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.6847

MATRIX Z
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.0000

Matrix of EXPECTED thresholds
               TWINA    TWINB
Threshold 1   1.4017   1.4017

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          TWINA    TWINB
TWINA    1.0000
TWINB    0.5312   1.0000

Function value of this group: 635.6429
Where the fit function is -2 * Log-likelihood of raw ordinal

Your model has 5 estimated parameters and 17 Observed statistics
Observed statistics include 1 constraints.

-2 times log-likelihood of data >>> 1207.896
Degrees of freedom >>>>>>>>>>>>>>>> 12

3 Confidence intervals requested in group 1
Matrix Element Int.   Estimate    Lower    Upper   Lfail Ufail
A   1   1   1  95.0     0.3588   0.0000   0.6902    0 0   0 0
C   1   1   1  95.0     0.1724   0.0000   0.5542    0 0   0 0
E   1   1   1  95.0     0.4688   0.3097   0.6628    0 1   0 0

This problem used 0.1% of my workspace

Task                     Time elapsed (DD:HH:MM:SS)
Reading script & data      0: 0: 0: 0.10
Execution                  0: 0: 0: 7.55
TOTAL                      0: 0: 0: 7.65

Total number of warnings issued: 1

VIRGINIA TWIN STUDY: Female Like-Sex Pairs
Summary Model-Fitting Results

                      Additive Genetic        Shared Environmental     Non-Shared Environmental
                      Variance    95% CI      Variance    95% CI       Variance    95% CI
Major depression        42.5     10.5-53.3      0.0      0.0-26.1        57.5     46.8-69.4
Alcohol dependence      35.9      0.0-69.0     17.2      0.0-55.4        46.9     31.0-66.3

Model-fitting results: Depression in the Virginia Twin Study

           Parameter Estimates (%)       Likelihood-ratio versus ACE model
Model       A     C     E     D          d.f.     χ2       p
A D E      30    --    57    13          --       --       --
A C E      43     0    57    --          --       --       --
A E        43    --    57    --           1       0.00     1.00
C E        --    33    67    --           1       6.29     0.01
E          --    --   100    --           2      56.40    <0.001

Singleton Twins?

We can easily handle data where only one twin has responded. HOWEVER, we are assuming that missing data are MCAR (Missing Completely at Random).

We can include twins with missing cotwins (indicated by .) in the same data-file as complete pairs. Alternatively, if we want to test for differences in prevalence for complete pairs versus singles (suggestive of an ascertainment bias), we can include singleton twins as separate groups, allowing a test of equality of thresholds.

EXAMPLE: Alcohol Dependence Data from 1992 Survey of the Australian Twin Panel (1981 cohort)

Twin A   Twin B   MZ Male   DZ Male
  0        0       274       138
  1        0        37.5      38.5
  0        1        37.5      38.5
  1        1        49        19
  0        .        34        31.5
  1        .         8         8.0
  .        0        34        31.5
  .        1         8         8

Table 2. Numbers of twin pairs concordant and discordant for smoking status in the Australian twin panel 1981 survey.

                            MZ Female (N=1232 pairs)    DZ Female (N=747 pairs)
                              I      II     III           I      II     III
I    Non-smoker              629                         310
II   Successful quitter      110     64                   98     33
III  Current smoker          124    115    190           146     61     99

                            MZ Male (N=567 pairs)       DZ Male (N=350 pairs)
                              I      II     III           I      II     III
I    Non-smoker              221                         121
II   Successful quitter       77     70                   44     27
III  Current smoker           31     61     77            61     53     44

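MULTIPLE THRESHOLD MODEL

For n categories, we need to estimate (n-1) thresholds. The safest way to estimate multiple thresholds is to estimate the first threshold together with the increments between successive thresholds:

   t0
   t1 = t0 + Δt1   (Δt1 > 0)
   t2 = t1 + Δt2   (Δt2 > 0)

and so on. This guarantees that the thresholds stay in increasing order, which is especially important when we estimate confidence intervals.

Note that if

   L = | 1  0 |    and    M = | t0  |    then    LM = | t0       |    etc.
       | 1  1 |               | Δt1 |                 | t0 + Δt1 |

Hence, we merely need to constrain Δt1 etc. > 0.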
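Multiplying a lower-triangular matrix of ones by the vector (first threshold, increments) is simply a cumulative sum. The Python sketch below illustrates this; the function name is an assumption, and the example values are the estimates that appear in the MX smoking output in this document.

```python
from itertools import accumulate


def thresholds_from_increments(first, increments):
    """Ordered thresholds from a first threshold plus strictly positive increments."""
    if any(step <= 0 for step in increments):
        raise ValueError("increments must be > 0 to keep thresholds ordered")
    # Equivalent to L*M with L a lower-triangular matrix of ones.
    return list(accumulate([first, *increments]))


# Matrix M estimates from the smoking analysis: first threshold and one increment
smoking_thresholds = thresholds_from_increments(0.2809, [0.3979])
print([round(t, 4) for t in smoking_thresholds])
```

Bounding the increments away from zero, as the script does with its bo statements, is what keeps the optimizer from ever producing crossed thresholds.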

! univariate3x3.mx
! fitting a univariate genetic model to 3x3 data
#define nvar 1
#define maxthresf 2 ! number of thresholds
Analysis of smoking data: fitting ACE model
data NI=3 NG=3
LAbels twina twinb countmz
Ordinal fi=smkmzf.rec
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 3x3 table
!
Definition_variables countmz /
Begin matrices;
 W LO nvar nvar fr          ! additive genetic path (A=w*w')
 X LO nvar nvar fr          ! shared environmental path (C=x*x')
 Y LO nvar nvar fr          ! non-shared environmental path (E=y*y')
 Z LO nvar nvar fi          ! non-additive genetic path (D=z*z')
 M FU maxthresf nvar fi     ! matrix of thresholds
 L LO maxthresf maxthresf   ! used to ensure t1 < t2
 S DI nvar nvar             ! Matrix that will store weight variable
end matrices;
SP M 4 5
MATRIX M 1.5487 0.5

MATRIX L
 1
 1 1
! This tells MX to store the definition variable count in S
SP S -1
mat w 0.5
mat x 0.5
mat y 0.7
Begin algebra;
 A=W*W';
 C=X*X';
 E=Y*Y';
 D=Z*Z';
 V=A+C+D+E;
 T=L*M;
end algebra;
FREQ S;  ! tells MX that S contains the weight (frequency) variable
TH T|T;  ! tells MX that row and column thresholds contained in T|T
CO V|A+D+C_
   A'+D'+C'|V; ! formula for correlation matrix
!
bo 0.001 1.0 y(1,1) m(2,1)
bo 0.0001 0.999 w(1,1) x(1,1)
bo -5.0 5.0 m(1,1)
interval a(1,1) c(1,1) e(1,1) ! compute 95% confidence intervals for the variance components
OPT func=1.E-12
OPT RS
END

Analysis of ordinal smoking data: DZm
data NI=3 X
LAbels twina twinb countdz
OR fi=smkdzf.rec
Definition_variables countdz /
Begin matrices = group 1;
 S DI nvar nvar          ! Matrix that will store weight variable
 g DI 1 1                ! constant (=0.5) for coefficient of additive genetic component
 h DI 1 1                ! constant (=0.25) for coefficient of dominance genetic component
 n FU maxthresf nvar fi  ! matrix of thresholds
end matrices;
SP N 6 7
MATRIX N 1.4487 0.5
MAT g 0.5
MAT h 0.25
SP S -1
Begin algebra;
 T=L*N;
end algebra;
FREQ S;
TH T|T;
CO V|g@A+h@D+C_
   g@A'+h@D'+C'|V; ! formula for correlation matrix
!
bo -5.0 5.0 n(1,1)
bo 0.001 1.0 n(2,1)
OPT RS
END

Constraint function - constrain variances to unity
CO NI=1
Begin matrices = group 1;
 U unit 1 nvar
end matrices;
CO \d2v(V) = u;
end

** Mx startup successful **
**MX-Sunos version 1.49**

! univariate3x3.mx
! fitting a univariate genetic model to 3x3 data

The following MX script lines were read for group 1

#DEFINE NVAR 1
#DEFINE MAXTHRESF 2 ! NUMBER OF THRESHOLDS
ANALYSIS OF SMOKING DATA: FITTING ACE MODEL
DATA NI=3 NO=9 NG=3
LABELS TWINA TWINB COUNTMZ
ORDINAL FI=SMKMZF.REC
Ordinal data read initiated
NOTE: Rectangular file contained 9 records with data
and 1 records where all data were missing
! Count is a definition variable that we use to tell MX the frequency count
! for each element of the 3x3 table
!
DEFINITION_VARIABLES COUNTMZ /
NOTE: Definition yields 9 data vectors for analysis
NOTE: Vectors contain a total of 18 observations
BEGIN MATRICES;
 W LO NVAR NVAR FR          ! ADDITIVE GENETIC PATH (A=W*W')
 X LO NVAR NVAR FR          ! SHARED ENVIRONMENTAL PATH (C=X*X')
 Y LO NVAR NVAR FR          ! NON-SHARED ENVIRONMENTAL PATH (E=Y*Y')
 Z LO NVAR NVAR FI          ! NON-ADDITIVE GENETIC PATH (D=Z*Z')
 M FU MAXTHRESF NVAR FI     ! MATRIX OF THRESHOLDS
 L LO MAXTHRESF MAXTHRESF   ! USED TO ENSURE T1 < T2
 S DI NVAR NVAR             ! MATRIX THAT WILL STORE WEIGHT VARIABLE
END MATRICES;

Summary of VL file data for group 1

               COUNTMZ        TWINA        TWINB
Code       -1.0000E+00   1.0000E+00   2.0000E+00
Number      9.0000E+00   9.0000E+00   9.0000E+00
Mean        1.3689E+02   1.0000E+00   1.0000E+00
Variance    3.1949E+04   6.6667E-01   6.6667E-01
Minimum     5.5000E+01   0.0000E+00   0.0000E+00
Maximum     6.2900E+02   2.0000E+00   2.0000E+00

Summary of VL file data for group 2

               COUNTDZ        TWINA        TWINB
Code           -1.0000       1.0000       2.0000
Number          9.0000       9.0000       9.0000
Mean           83.0000       1.0000       1.0000
Variance     6923.2778       0.6667       0.6667
Minimum        30.5000       0.0000       0.0000
Maximum       310.0000       2.0000       2.0000

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of smoking data: fitting ACE model

MATRIX A
This is a computed FULL matrix of order 1 by 1
[=W*W']
          1
 1   0.5836

MATRIX C
This is a computed FULL matrix of order 1 by 1
[=X*X']
          1
 1   0.1823

MATRIX D
This is a computed FULL matrix of order 1 by 1
[=Z*Z']
          1
 1   0.0000

MATRIX E
This is a computed FULL matrix of order 1 by 1
[=Y*Y']
          1
 1   0.2341

MATRIX L
This is a LOWER TRIANGULAR matrix of order 2 by 2
          1        2
 1   1.0000
 2   1.0000   1.0000

MATRIX M
This is a FULL matrix of order 2 by 1
          1
 1   0.2809
 2   0.3979

MATRIX S
This is a DIAGONAL matrix of order 1 by 1
            1
 1   190.0000

MATRIX T
This is a computed FULL matrix of order 2 by 1
[=L*M]
          1
 1   0.2809
 2   0.6788

MATRIX V
This is a computed FULL matrix of order 1 by 1
[=A+C+D+E]
          1
 1   1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.7639

MATRIX X
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.4269

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.4839

MATRIX Z
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.0000

Matrix of EXPECTED thresholds
               TWINA    TWINB
Threshold 1   0.2809   0.2809
Threshold 2   0.6788   0.6788

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          TWINA    TWINB
TWINA    1.0000
TWINB    0.7659   1.0000

Function value of this group: 4094.3378
Where the fit function is -2 * Log-likelihood of raw ordinal

MATRIX W
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.7639

MATRIX X
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.4269

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.4839

MATRIX Z
This is a LOWER TRIANGULAR matrix of order 1 by 1
          1
 1   0.0000

Matrix of EXPECTED thresholds
               TWINA    TWINB
Threshold 1   0.1992   0.1992
Threshold 2   0.6100   0.6100

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          TWINA    TWINB
TWINA    1.0000
TWINB    0.4741   1.0000

Function value of this group: 2765.7220
Where the fit function is -2 * Log-likelihood of raw ordinal

Your model has 7 estimated parameters and 37 Observed statistics
Observed statistics include 1 constraints.

-2 times log-likelihood of data >>> 6860.060
Degrees of freedom >>>>>>>>>>>>>>>> 30

3 Confidence intervals requested in group 1
Matrix Element Int.   Estimate    Lower    Upper   Lfail Ufail
A   1   1   1  95.0     0.5836   0.3990   0.7781    0 0   0 1
C   1   1   1  95.0     0.1823   0.0000   0.3510    0 0   0 0
E   1   1   1  95.0     0.2341   0.1958   0.2777    0 1   0 0

This problem used 0.1% of my workspace

Task                     Time elapsed (DD:HH:MM:SS)
Reading script & data      0: 0: 0: 0.24
Execution                  0: 0: 0:29.75
TOTAL                      0: 0: 0:30.00

Total number of warnings issued: 2

SMOKING IN WOMEN

                                       %       95% CI
Additive genetic variance             58.4    39.9-77.8
Shared environmental variance         18.2     0.0-35.1
Non-shared environmental variance     23.4    19.6-27.8

-2 log-likelihood                     6860.06

BIVARIATE GENETIC APPLICATIONS

It is a simple step to modify the univariate script to allow for bivariate (or even trivariate) genetic analyses. If the traits being analyzed have varying numbers of thresholds, maxthres will be the maximum number of thresholds, and we will have, say,

MAT M
 0.5  -0.5
 0     1.0

Each column holds one trait: the first row is the first threshold and later rows are increments. For a trait with fewer thresholds (here the first, binary trait) the unused increment is fixed at 0.

In the next example, we analyze Australian twin data on lifetime history of major depression and current smoking status. Here, the original raw data are given in depsmkmf.rec and depsmkdf.rec. Notice that the data have been sorted; this will improve the efficiency of the MX run.

! ordinal_bivariate.mx
#define nvar 2
#define nvar2 4
#define maxthres 2
Analysis of ordinal depression (0/1) and smoking ! initiation/persistence (0/1/2)
data NI=nvar2 NG=3
Ordinal fi=depsmkmf.rec
Begin matrices;
 M FU maxthres nvar fr
 L LO maxthres maxthres
 W LO nvar nvar fr
 X LO nvar nvar fr
 Y LO nvar nvar fr
end matrices;
MAT L
 1.0
 1.0 1.0
MATRIX M
 0.5294 0.7191
 0.0    0.5
SP M
 1 2
 0 3
st 0.7 y(1,1) y(2,2) w(1,1) w(2,2)
st 0.2 x(1,1) x(2,2)
st 0.2 w(2,1) x(2,1) y(2,1)

Begin algebra;
 A=W*W';
 O=\stnd(A);
 C=X*X';
 r=\stnd(C);
 E=Y*Y';
 q=\stnd(E);
 P=A+C+E;
end algebra;
TH L*M|L*M;
CO P | A + C _
   A' + C' | P ;
bo 0.001 1.0 y(1,1) y(2,2)
bo 0.0001 0.999 x(1,1) x(2,2) w(1,1) w(2,2)
bo -0.999 0.999 x(2,1) y(2,1) w(2,1)
bo 0.001 3.0 m(2,2)
bo -5.0 5.0 m(1,1)
! interval a(1,1) a(2,2) c(1,1) c(2,2) e(1,1) e(2,2) o(1,2) r(1,2) q(1,2)
OPT func=1.E-12
OPT RS
END

Analysis of ordinal depression and smoking data: DZF
data NI=nvar2
Ordinal fi=depsmkdf.rec
Begin matrices = group 1;
 N FU maxthres nvar fr
 g fu 1 1
end matrices;
MATRIX N
 0.5781 0.6884
 0      0.72
SP N
 101 102
 0   103
mat g 0.5
TH L*N | L*N ;
CO P | g@A + C _
   g@A' + C' | P ;
bo 0.001 3.0 n(2,2)
bo -5.0 5.0 n(1,1)
OPT RS
END

Data constraint
CO NI=1
Begin matrices = group 1;
 U unit 1 nvar
end matrices;
CO \d2v(P) = u;
end
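The script standardizes A, C and E with \stnd to obtain the genetic, shared-environmental and non-shared-environmental correlations between depression and smoking. The Python sketch below assumes, as the MX output suggests, that this standardization converts a covariance matrix into a correlation matrix; the function name is hypothetical, and the inputs are the A and E matrices copied from the output for group 1.

```python
import math


def standardize(cov):
    """Convert a square covariance matrix (list of lists) into a correlation matrix."""
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]


# Estimates copied from the MX output for group 1
A = [[0.3551, 0.0782], [0.0782, 0.6147]]
E = [[0.6451, 0.0270], [0.0270, 0.2343]]

genetic_corr = standardize(A)[0][1]
nonshared_corr = standardize(E)[0][1]
print(round(genetic_corr, 3), round(nonshared_corr, 3))
```

Because the inputs are rounded to four decimals, the results can differ from the printed O and Q matrices in the last digit. The shared-environmental correlation deserves caution here, since the C variance for depression is estimated at essentially zero.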

Summary of VL file data for group 1

Code           1.0000      2.0000      3.0000      4.0000
Number      1013.0000   1286.0000    982.0000   1254.0000
Mean           0.1925      0.6454      0.2169      0.6555
Variance       0.1554      0.7234      0.1699      0.7426
Minimum        0.0000      0.0000      0.0000      0.0000
Maximum        1.0000      2.0000      1.0000      2.0000

Summary of VL file data for group 2

Code           1.0000      2.0000      3.0000      4.0000
Number       598.0000    826.0000    586.0000    786.0000
Mean           0.1940      0.6525      0.2526      0.7468
Variance       0.1564      0.7086      0.1888      0.7820
Minimum        0.0000      0.0000      0.0000      0.0000
Maximum        1.0000      2.0000      1.0000      2.0000

*** WARNING! ***
I am not sure I have found a solution that satisfies
Kuhn-Tucker conditions for a minimum.
NAG's IFAIL parameter is 6
Looks like I got stuck here.
Check the following:
1. The model is correctly specified
2. Starting values are good
3. You are not already at the solution
The error can arise if the Hessian is ill-conditioned
You can try resetting it to an identity matrix and fit from the solution by putting TH=-n on the OU line
where n is the number of refits that you want to do
If all else fails try putting NAG=30 on the OU line and examine the file NAGDUMP.OUT and the NAG manual

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of ordinal depression (0/1) and smoking initiation/persistence (0/1/2)

MATRIX A
This is a computed FULL matrix of order 2 by 2
[=W*W']
          1         2
1    0.3551    0.0782
2    0.0782    0.6147

MATRIX C
This is a computed FULL matrix of order 2 by 2
[=X*X']
              1             2
1    3.5944E-08    5.0292E-05
2    5.0292E-05    1.5118E-01

MATRIX E
This is a computed FULL matrix of order 2 by 2
[=Y*Y']
          1         2
1    0.6451    0.0270
2    0.0270    0.2343

MATRIX L
This is a LOWER TRIANGULAR matrix of order 2 by 2
          1         2
1    1.0000
2    1.0000    1.0000

MATRIX M
This is a FULL matrix of order 2 by 2
          1         2
1    0.8249    0.2698
2    0.0000    0.4014

MATRIX O
This is a computed FULL matrix of order 2 by 2
[=\STND(A)]
          1         2
1    1.0000    0.1675
2    0.1675    1.0000

MATRIX P
This is a computed FULL matrix of order 2 by 2
[=A+C+E]
          1         2
1    1.0003    0.1053
2    0.1053    1.0001

MATRIX Q
This is a computed FULL matrix of order 2 by 2
[=\STND(E)]
          1         2
1    1.0000    0.0695
2    0.0695    1.0000

MATRIX R
This is a computed FULL matrix of order 2 by 2
[=\STND(C)]
          1         2
1    1.0000    0.6822
2    0.6822    1.0000

MATRIX W
This is a LOWER TRIANGULAR matrix of order 2 by 2
          1         2
1    0.5959
2    0.1313    0.7729

MATRIX X
This is a LOWER TRIANGULAR matrix of order 2 by 2
              1             2
1    1.8959E-04
2    2.6527E-01    2.8428E-01

MATRIX Y
This is a LOWER TRIANGULAR matrix of order 2 by 2
          1         2
1    0.8032
2    0.0337    0.4829
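
As a rough check on the algebra section of the script, the Python sketch below recomputes A = W*W' and the genetic correlation from the printed estimates of W. The helper names (times_transpose, standardize) are hypothetical and are not part of Mx; standardize is only assumed to mirror what \stnd does. Small differences from the printed output reflect rounding of W to four decimals.

```python
import math

def times_transpose(m):
    """Return m * m' for a square matrix given as a list of rows."""
    n = len(m)
    return [[sum(m[i][k] * m[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

def standardize(cov):
    """Rescale a covariance matrix to a correlation matrix."""
    n = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(n)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(n)] for i in range(n)]

# Printed estimates of the lower triangular matrix W (additive genetic paths).
W = [[0.5959, 0.0],
     [0.1313, 0.7729]]

A = times_transpose(W)   # additive genetic covariance matrix
O = standardize(A)       # genetic correlation matrix

print("A(1,1) =", round(A[0][0], 4))
print("A(2,1) =", round(A[1][0], 4))
print("A(2,2) =", round(A[1][1], 4))
print("genetic correlation =", round(O[0][1], 4))
```

Because P is constrained to have unit diagonal, the diagonal elements of A can be read directly as the proportions of liability variance attributed to additive genetic effects.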

Matrix of EXPECTED thresholds
                    1         2         3         4
Threshold 1    0.8249    0.2698    0.8249    0.2698
Threshold 2    0.8249    0.6712    0.8249    0.6712

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          1         2         3         4
1    1.0003
2    0.1053    1.0001
3    0.3551    0.0783    1.0003
4    0.0783    0.7659    0.1053    1.0001

Function value of this group: 6219.6048
Where the fit function is -2 * Log-likelihood of raw ordinal
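
The expected thresholds follow from the script's TH L*M statement: premultiplying M by a lower triangular matrix of ones accumulates the rows, so the second row of M acts as an increment on the first. The Python sketch below reproduces this from the printed estimates of M; matmul is a hypothetical helper, not an Mx function. Reading the repeated first-column value as "depression is binary, so only its first threshold is used" is an interpretation, not something the output states.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Lower triangular matrix of ones, as set by MAT L in the script.
L = [[1.0, 0.0],
     [1.0, 1.0]]

# Printed estimates of M: column 1 = depression, column 2 = smoking.
# M(2,1) is fixed at zero; M(2,2) is the increment to the second threshold.
M = [[0.8249, 0.2698],
     [0.0000, 0.4014]]

thresholds = matmul(L, M)

for k, row in enumerate(thresholds, start=1):
    print("Threshold", k, [round(t, 4) for t in row])
```

Bounding m(2,2) below by a small positive number, as the script does, keeps the second smoking threshold above the first.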

*** WARNING! ***
Minimization may not be successful. See above
CODE RED - Hessian/precision problem

Your model has 15 estimated parameters and 7333 Observed statistics
Observed statistics include 2 constraints.
-2 times log-likelihood of data >>> 10511.502
Degrees of freedom >>>>>>>>>>>>>>>> 7318

This problem used 1.2% of my workspace

Task                     Time elapsed (DD:HH:MM:SS)
Reading script & data    0: 0: 0: 6.63
Execution                0: 0: 5:18.42
TOTAL                    0: 0: 5:25.04

Total number of warnings issued: 2
______________________________________________________________________________

Controlling for Covariates

An advantage of fitting models to raw (binary or ordinal) data is that we can control for covariates (e.g., age) while fitting genetic models. To do this, we include the covariates as “definition variables” in our analysis and simultaneously model the “probit” regression of liability on the covariates, so that we are now testing for genetic effects on the residual variance in the outcome of interest. This approach can also be extended to test for genotype x environment interaction effects (beyond the scope of this workshop)!

Previously, we directly estimated threshold values t that were assumed to be the same for all individuals of a given gender (and sometimes zygosity group). Now we must allow thresholds to differ between individuals as a function of their covariate values $C_{i1}$, $C_{i2}$, etc.:

$t_i = t_0 - B_1 C_{i1} - B_2 C_{i2} - \dots$

The regression coefficients $B_1$, $B_2$, etc. are probit regression coefficients; thus good starting values can be obtained from standard statistical software such as Stata. In the next program, we estimate probit regression coefficients and the residual twin pair correlation.
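
To make the threshold regression concrete, the Python sketch below evaluates individual thresholds and the implied probability of being affected. It uses the estimates printed in the Mx output of this transcript (T = -1.2307; K = 0.6831, 0.3117, 0.2724) and assumes, from the script's TH -(T|T)-O*K statement, that the baseline threshold is the negative of the T element. The function names are hypothetical and the example is illustrative only.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def individual_threshold(t0, betas, covariates):
    """Threshold t_i = t0 - B1*C_i1 - B2*C_i2 - ... for one person."""
    return t0 - sum(b * c for b, c in zip(betas, covariates))

def risk(threshold):
    """Probability that standard normal liability exceeds the threshold."""
    return 1.0 - normal_cdf(threshold)

T_ELEMENT = -1.2307              # printed estimate of matrix T
t0 = -T_ELEMENT                  # assumed baseline threshold (TH = -T - O*K)
betas = [0.6831, 0.3117, 0.2724] # printed K: cond, majdep, panatt

t_none = individual_threshold(t0, betas, [0, 0, 0])
t_majdep = individual_threshold(t0, betas, [0, 1, 0])
t_all = individual_threshold(t0, betas, [1, 1, 1])

for label, t in [("no covariates", t_none),
                 ("majdep only", t_majdep),
                 ("all three", t_all)]:
    print(label, "threshold", round(t, 4), "risk", round(risk(t), 3))
```

The thresholds for a person positive on all three covariates and for a person positive on majdep only agree with the expected thresholds printed for ALCDEP and XLCDEP in group 1 of the output.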

Analysis of regression of alcdep on conduct (broad), majdep, panatt
! /data2/boulder/probitmult_withpairs.mx
! Second group is DZF complete pairs
data NI=9
la cond majdep panatt alcdep xcond xmjdep xpnatt xlcdep wt
ordinal fi=femdzf.dat
definition_variables cond majdep panatt xcond xmjdep xpnatt wt /
Begin matrices;
O FU 1 6 ! store definition variable (i.e. vara) here
K FU 6 2 fi ! regression coefficient
T FU 1 1 fi ! threshold value
V FU 1 1 ! variance (=unity)
R FU 1 1 fr ! tetrachoric correlation (to be estimated)
W FU 1 1 ! weight variable
end matrices;
SP O -1 -2 -3 -4 -5 -6
SP W -7
MAT V 1.00
MAT R 0.30
SP T 50
SP K
100 0
101 0
102 0
0 100
0 101
0 102
FREQ W;
TH -(T|T)-O*K; ! Note that we now have thresholds for both twins
CO V|R_
R|V;
bo -0.999 0.999 r(1,1)
interval r(1,1)
OPT RS
END

Analysis of regression of alcdep on conduct (broad), majdep, panatt
! This group is for singleton women
data NI=5
la cond majdep panatt alcdep wt
ordinal fi=femsing.dat
definition_variables cond majdep panatt wt /
Begin matrices;
O FU 1 3 ! store definition variable (i.e. vara) here
K FU 3 1 fr ! regression coefficient
T FU 1 1 fr ! threshold value
V FU 1 1 ! variance (=unity)
W FU 1 1 ! weight variable
end matrices;
SP O -1 -2 -3
SP W -4
MAT V 1.00
SP T 50
MAT T 0.1
SP K 100 101 102
MAT K 0.05
FREQ W;
TH -T-O*K;
CO V;
OPT func=1.E-12
OPT RS
END

** Mx startup successful **
**MX-Sunos version 1.50c**

The following MX script lines were read for group 1

ANALYSIS OF REGRESSION OF ALCDEP ON CONDUCT (BROAD), MAJDEP, PANATT
! estimating probit regression coefficients
!
! /data2/boulder/probitmult_withpairs.mx
! First group is MZF complete pairs
DATA NI=9 NG=3
LA COND MAJDEP PANATT ALCDEP XCOND XMJDEP XPNATT XLCDEP WT
ORDINAL FI=FEMMZF.DAT
Ordinal data read initiated
NOTE: Rectangular file contained 91 records with data
that contained a total of 819 observations
DEFINITION_VARIABLES COND MAJDEP PANATT XCOND XMJDEP XPNATT WT /
NOTE: Definition yields 91 data vectors for analysis
NOTE: Vectors contain a total of 182 observations
BEGIN MATRICES;
O FU 1 6 ! STORE DEFINITION VARIABLE (I.E. VARA) HERE
K FU 6 2 FR ! REGRESSION COEFFICIENT
T FU 1 1 FR ! THRESHOLD VALUE
V FU 1 1 ! VARIANCE (=UNITY)
R FU 1 1 FR ! TETRACHORIC CORRELATION (TO BE ESTIMATED)
W FU 1 1 ! WEIGHT VARIABLE
END MATRICES;
SP O -1 -2 -3 -4 -5 -6
SP W -7
MAT V 1.00
MAT R 0.60
SP T 50
MAT T 0.1
SP K
100 0
101 0
102 0
0 100
0 101
0 102
MAT K 0.05 0 0 0.05
FREQ W;
TH -(T|T)-O*K; ! NOTE THAT WE NOW HAVE THRESHOLDS FOR BOTH TWINS
CO V|R_
R|V;
BO -5.0 5.0 T(1,1) K(1,1) K(2,1) K(3,1)
BO -0.999 0.999 R(1,1)
INTERVAL K(1,1) K(2,1) K(3,1) T(1,1) R(1,1)
OPT FUNC=1.E-12
OPT RS
END

Summary of VL file data for group 1

                 WT    XPNATT    XMJDEP     XCOND    PANATT    MAJDEP
Code        -7.0000   -6.0000   -5.0000   -4.0000   -3.0000   -2.0000
Number      91.0000   91.0000   91.0000   91.0000   91.0000   91.0000
Mean         7.5385    0.3297    0.5495    0.3077    0.2967    0.5495
Variance   761.9408    0.2210    0.2476    0.2130    0.2087    0.2476
Minimum      1.0000    0.0000    0.0000    0.0000    0.0000    0.0000
Maximum    255.0000    1.0000    1.0000    1.0000    1.0000    1.0000

               COND    ALCDEP    XLCDEP
Code        -1.0000    1.0000    2.0000
Number      91.0000   91.0000   91.0000
Mean         0.3516    0.4176    0.4066
Variance     0.2280    0.2432    0.2413
Minimum      0.0000    0.0000    0.0000
Maximum      1.0000    1.0000    1.0000

Summary of VL file data for group 2

                 WT    XPNATT    XMJDEP     XCOND    PANATT    MAJDEP
Code        -7.0000   -6.0000   -5.0000   -4.0000   -3.0000   -2.0000
Number      81.0000   81.0000   81.0000   81.0000   81.0000   81.0000
Mean         6.0864    0.3086    0.5185    0.3704    0.3086    0.5679
Variance   399.1407    0.2134    0.2497    0.2332    0.2134    0.2454
Minimum      1.0000    0.0000    0.0000    0.0000    0.0000    0.0000
Maximum    170.0000    1.0000    1.0000    1.0000    1.0000    1.0000

               COND    ALCDEP    XLCDEP
Code        -1.0000    1.0000    2.0000
Number      81.0000   81.0000   81.0000
Mean         0.2840    0.3333    0.4691
Variance     0.2033    0.2222    0.2490
Minimum      0.0000    0.0000    0.0000
Maximum      1.0000    1.0000    1.0000

Summary of VL file data for group 3

                    WT        PANATT        MAJDEP          COND        ALCDEP
Code       -4.0000E+00   -3.0000E+00   -2.0000E+00   -1.0000E+00    1.0000E+00
Number      1.6000E+01    1.6000E+01    1.6000E+01    1.6000E+01    1.6000E+01
Mean        6.5000E+01    5.0000E-01    5.0000E-01    5.0000E-01    5.0000E-01
Variance    1.7240E+04    2.5000E-01    2.5000E-01    2.5000E-01    2.5000E-01
Minimum     1.0000E+00    0.0000E+00    0.0000E+00    0.0000E+00    0.0000E+00
Maximum     5.4400E+02    1.0000E+00    1.0000E+00    1.0000E+00    1.0000E+00

MX PARAMETER ESTIMATES

GROUP NUMBER: 1
Analysis of regression of alcdep on conduct (broad), majdep, panatt

MATRIX K
This is a FULL matrix of order 6 by 2
          1         2
1    0.6831    0.0000
2    0.3117    0.0000
3    0.2724    0.0000
4    0.0000    0.6831
5    0.0000    0.3117
6    0.0000    0.2724

MATRIX O
This is a FULL matrix of order 1 by 6
          1         2         3         4         5         6
1    1.0000    1.0000    1.0000    0.0000    1.0000    0.0000

MATRIX R
This is a FULL matrix of order 1 by 1
          1
1    0.5321

MATRIX T
This is a FULL matrix of order 1 by 1
          1
1   -1.2307

MATRIX V
This is a FULL matrix of order 1 by 1
          1
1    1.0000

MATRIX W
This is a FULL matrix of order 1 by 1
          1
1    1.0000

Matrix of EXPECTED thresholds
               ALCDEP    XLCDEP
Threshold 1   -0.0365    0.9190

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          ALCDEP    XLCDEP
ALCDEP    1.0000
XLCDEP    0.5321    1.0000

Function value of this group: 1017.3444
Where the fit function is -2 * Log-likelihood of raw ordinal

GROUP NUMBER: 2
Analysis of regression of alcdep on conduct (broad), majdep, panatt

MATRIX K
This is a FULL matrix of order 6 by 2
          1         2
1    0.6831    0.0000
2    0.3117    0.0000
3    0.2724    0.0000
4    0.0000    0.6831
5    0.0000    0.3117
6    0.0000    0.2724

MATRIX O
This is a FULL matrix of order 1 by 6
          1         2         3         4         5         6
1    1.0000    1.0000    1.0000    1.0000    1.0000    1.0000

MATRIX R
This is a FULL matrix of order 1 by 1
          1
1    0.2637

MATRIX T
This is a FULL matrix of order 1 by 1
          1
1   -1.2307

MATRIX V
This is a FULL matrix of order 1 by 1
          1
1    1.0000

MATRIX W
This is a FULL matrix of order 1 by 1
          1
1    1.0000

Matrix of EXPECTED thresholds
               ALCDEP    XLCDEP
Threshold 1   -0.0365   -0.0365

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          ALCDEP    XLCDEP
ALCDEP    1.0000
XLCDEP    0.2637    1.0000

Function value of this group: 808.7919
Where the fit function is -2 * Log-likelihood of raw ordinal

GROUP NUMBER: 3
Analysis of regression of alcdep on conduct (broad), majdep, panatt

MATRIX K
This is a FULL matrix of order 3 by 1
          1
1    0.6831
2    0.3117
3    0.2724

MATRIX O
This is a FULL matrix of order 1 by 3
          1         2         3
1    1.0000    1.0000    1.0000

MATRIX T
This is a FULL matrix of order 1 by 1
          1
1   -1.2307

MATRIX V
This is a FULL matrix of order 1 by 1
          1
1    1.0000

MATRIX W
This is a FULL matrix of order 1 by 1
          1
1    7.0000

Matrix of EXPECTED thresholds
               ALCDEP
Threshold 1   -0.0365

(OBSERVED MATRIX is nonexistent for raw data)

EXPECTED COVARIANCE MATRIX
          ALCDEP
ALCDEP    1.0000

Function value of this group: 916.5047
Where the fit function is -2 * Log-likelihood of raw ordinal

Your model has 6 estimated parameters and 360 Observed statistics
-2 times log-likelihood of data >>> 2742.641
Degrees of freedom >>>>>>>>>>>>>>>> 354

5 Confidence intervals requested in group 1
Matrix Element   Int.   Estimate     Lower     Upper   Lfail  Ufail
K  1  1  1       95.0     0.6831    0.5195    0.8462    0 0    0 0
K  1  2  1       95.0     0.3117    0.2012    0.4222    0 0    0 0
K  1  3  1       95.0     0.2724    0.1145    0.4283    0 0    0 0
T  1  1  1       95.0    -1.2307   -1.3034   -1.1590    0 0    0 0
R  1  1  1       95.0     0.5321    0.3863    0.6553    0 0    0 0

1 Confidence intervals requested in group 2
Matrix Element   Int.   Estimate     Lower     Upper   Lfail  Ufail
R  2  1  1       95.0     0.2637    0.0712    0.4408    0 0    0 0

This problem used 0.4% of my workspace

Task                     Time elapsed (DD:HH:MM:SS)
Reading script & data    0: 0: 0: 0.80
Execution                0: 0: 1:13.76
TOTAL                    0: 0: 1:14.56

Total number of warnings issued: 11
______________________________________________________________________________

HIGH-RISK SAMPLING SCHEMES The Ordinal data-option in MX allows us to analyze twin or family data collected under a two-stage sampling scheme, where in the first stage we study a random sample of families, but in the second stage the probability that a family will be assigned for interview is a function of phenotypic values observed at the first stage. For example, we may decide that we will do follow-up assessments with all pairs where at least one twin is affected at stage one, but only 10% of pairs where neither twin was affected at stage one.

To illustrate this, we have created a simulated data-set, with the following parameters, using multsim2_2mz.mx and multsim2_2dz.mx.

              WAVE 1 / WAVE 2    Cross-wave correlation
VA                 50%           r G = 1.00
VC                  9%           r C = 1.00
VE                 41%           r E = 0.71
Prevalence         25%

First, we analyze these data assuming that all twin pairs (1000 MZ, 1000 DZ pairs) are assessed at both waves (ordinal_bivariate_simulated.mx).

SIMULATED TWO-WAVE DATA

     TWIN A            TWIN B
Wave 1  Wave 2    Wave 1  Wave 2     MZ_FULL    DZ_FULL
  0       0         0       0         560        519.5
  0       0         0       1          33.6       37.8
  0       0         1       0          33.6       37.8
  0       0         1       1          60.35      92.4
  0       1         0       0          33.6       37.8
  0       1         0       1           5.8        4.7
  0       1         1       0           5.8        4.7
  0       1         1       1          17.25      15.3
  1       0         0       0          33.6       37.8
  1       0         0       1           5.8        4.7
  1       0         1       0           5.8        4.7
  1       0         1       1          17.25      15.3
  1       1         0       0          60.35      92.4
  1       1         0       1          17.25      15.3
  1       1         1       0          17.25      15.3
  1       1         1       1          92.7       64.6
TOTAL PAIRS                          1000.00    1000.10

Estimated parameters:

                    WAVE 1                     WAVE 2
               %      95% CI              %      95% CI           r
VA            49.7    (34.5-61.7)        49.5    (26.6-55.0)      r G = 1.00
VC             9.3    (0.0-24.5)          9.5    (0.0-25.3)       r C = 1.00
VE            41.1    (37.2-47.6)        41.1    ( -- )           r E = 0.71
Prevalence    25.0    (t 1 =0.6748)              (t 2 =0.6747)

HIGH-RISK SAMPLING SCHEMES (II) Next, we analyze the data-set that would arise under our two-stage sampling scheme, using ordinal_bivariate_hirisk.mx. This is exactly the same program as in the previous case, except that we have changed file names! In 90% of cases where neither twin was affected at stage one, the stage two phenotypic values are set to missing. What parameter estimates do we recover in this case?

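Two-Wave data simulating high-risk sampling

(Rows are labeled by twin A wave 1, twin A wave 2, twin B wave 1, twin B wave 2; "." marks a wave two value set to missing.)

Pattern     MZHIRISK    DZHIRISK
0 0 0 0       56          51.95
0 . 0 .      504         467.55
0 0 0 1        3.36        3.78
0 . 0 .       30.24       34.02
0 0 1 0       33.6        37.8
0 0 1 1       60.35       92.4
0 1 0 0        3.36        3.78
0 . 0 .       30.24       34.02
0 1 0 1        0.58        0.47
0 . 0 .        5.22        4.23
0 1 1 0        5.8         4.7
0 1 1 1       17.25       15.3
1 0 0 0       33.6        37.8
1 0 0 1        5.8         4.7
1 0 1 0        5.8         4.7
1 0 1 1       17.25       15.3
1 1 0 0       60.35       92.4
1 1 0 1       17.25       15.3
1 1 1 0       17.25       15.3
1 1 1 1       92.7        64.6

TOTAL PAIRS
 - Wave 1:  1000        1000.1
 - Wave 2:   430.30      460.275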
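
The high-risk counts can be derived from the full simulated counts by keeping wave two data for only 10% of pairs in which neither twin was affected at wave one. The Python sketch below does this arithmetic; it assumes the pattern digits are ordered twin A wave 1, twin A wave 2, twin B wave 1, twin B wave 2, and the names (FULL_COUNTS, split_pair_count, wave2_total) are hypothetical.

```python
# Expected pair counts from the SIMULATED TWO-WAVE DATA table.
# Key digits (assumed order): twin A wave 1, twin A wave 2,
# twin B wave 1, twin B wave 2. Values are (MZ_FULL, DZ_FULL).
FULL_COUNTS = {
    "0000": (560.0, 519.5), "0001": (33.6, 37.8),
    "0010": (33.6, 37.8),   "0011": (60.35, 92.4),
    "0100": (33.6, 37.8),   "0101": (5.8, 4.7),
    "0110": (5.8, 4.7),     "0111": (17.25, 15.3),
    "1000": (33.6, 37.8),   "1001": (5.8, 4.7),
    "1010": (5.8, 4.7),     "1011": (17.25, 15.3),
    "1100": (60.35, 92.4),  "1101": (17.25, 15.3),
    "1110": (17.25, 15.3),  "1111": (92.7, 64.6),
}

RETAIN_FRACTION = 0.10  # share of wave-one concordant unaffected pairs followed up

def split_pair_count(pattern, count, retain=RETAIN_FRACTION):
    """Return (count with wave 2 observed, count with wave 2 missing)."""
    neither_affected_wave1 = pattern[0] == "0" and pattern[2] == "0"
    if neither_affected_wave1:
        return count * retain, count * (1.0 - retain)
    return count, 0.0

def wave2_total(column):
    """Total pairs with wave two data; column 0 = MZ, column 1 = DZ."""
    return sum(split_pair_count(pattern, counts[column])[0]
               for pattern, counts in FULL_COUNTS.items())

mz_wave2_total = wave2_total(0)
dz_wave2_total = wave2_total(1)

print("0000 MZ split:", split_pair_count("0000", 560.0))
print("MZ pairs with wave 2 data:", round(mz_wave2_total, 2))
print("DZ pairs with wave 2 data:", round(dz_wave2_total, 2))
```

The MZ total matches the tabulated 430.30; the DZ total differs from the tabulated 460.275 only in the last digit, presumably because the slide's counts were rounded.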

Estimated parameters:

                    WAVE 1                     WAVE 2
               %      95% CI              %      95% CI           r
VA            50.2    (39.7-55.6)        50.3    (32.7-67.6)      r G = 1.00
VC             8.8                        8.6                     r C = 1.00
VE            41.0                       41.1                     r E = 0.71
Prevalence    25.0    (t 1 =0.6745)              (t 2 =0.6737)

HIGH-RISK SAMPLING SCHEMES (III)

Notice that we included all pairs who were assessed at stage one. What happens if we focus on the stage two phenotype and include only those pairs who have data at stage two? Data are in mzlistwise.rec and dzlistwise.rec; the program is univariate_listwise.mx.

Estimated parameters:

WAVE 2            %       95% CI
VA               23.0     4.8-35.6
VC                0.0     0.0-12.4
VE               77.0     64.4-90.1
Prevalence       50.0     49.7-50.3
(t 1 =0.004)              (-0.086-0.094)

When we ignore the wave one data, our estimates of population prevalence (not unexpectedly) and of genetic and environmental parameters are seriously biased!
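
The inflated prevalence can be reproduced directly from the MZ high-risk counts: among pairs with wave two data, about half of the twins are affected at wave two, even though the simulated population prevalence is 25%. The Python sketch below is illustrative only; the names are hypothetical and the pattern digits are assumed to be ordered twin A wave 1, twin A wave 2, twin B wave 1, twin B wave 2.

```python
# MZHIRISK counts for pairs whose wave two data were observed
# (rows with missing wave two values are excluded, as in a listwise analysis).
MZ_OBSERVED = {
    "0000": 56.0,  "0001": 3.36,  "0010": 33.6,  "0011": 60.35,
    "0100": 3.36,  "0101": 0.58,  "0110": 5.8,   "0111": 17.25,
    "1000": 33.6,  "1001": 5.8,   "1010": 5.8,   "1011": 17.25,
    "1100": 60.35, "1101": 17.25, "1110": 17.25, "1111": 92.7,
}

def wave2_prevalence(counts, wave2_digit):
    """Proportion affected at wave two; wave2_digit is 1 (twin A) or 3 (twin B)."""
    total = sum(counts.values())
    affected = sum(n for pattern, n in counts.items()
                   if pattern[wave2_digit] == "1")
    return affected / total

prev_twin_a = wave2_prevalence(MZ_OBSERVED, 1)
prev_twin_b = wave2_prevalence(MZ_OBSERVED, 3)

print("Listwise wave 2 prevalence, twin A:", round(prev_twin_a, 4))
print("Listwise wave 2 prevalence, twin B:", round(prev_twin_b, 4))
```

A prevalence near 50% corresponds to a threshold near zero, which is what the listwise analysis estimates.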

HIGH-RISK SAMPLING SCHEMES (IV)

Suppose that instead we acknowledge that our sample is drawn from a population where the prevalence of the observed trait is 25%, and fix the threshold at the corresponding value, t=0.67449. As in the previous example, we limit ourselves to twin pairs where wave two assessments occurred.

WAVE 2      %       95% CI
VA         45.8     22.9-55.2
VC          0.0     0.0-17.8
VE         54.2     44.8-64.3

The bias in the parameter estimates is substantially reduced! (There is still a bias, however: in particular, our estimate of the shared environmental variance is now zero.)
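
The fixed threshold is simply the standard normal deviate that leaves 25% of the liability distribution above it. A minimal Python sketch, assuming standard normal liability and using the standard library's NormalDist (the function names are hypothetical):

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def threshold_from_prevalence(prevalence):
    """Threshold t such that P(liability > t) equals the prevalence."""
    return STANDARD_NORMAL.inv_cdf(1.0 - prevalence)

def prevalence_from_threshold(threshold):
    """Proportion of the liability distribution above the threshold."""
    return 1.0 - STANDARD_NORMAL.cdf(threshold)

t_fixed = threshold_from_prevalence(0.25)
print("Threshold for 25% prevalence:", round(t_fixed, 5))  # 0.67449
print("Prevalence implied by t = 0.004:",
      round(prevalence_from_threshold(0.004), 3))
```

The same conversion explains why a threshold estimate near zero in the listwise analysis corresponds to a prevalence near 50%.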

HIGH-RISK SAMPLING SCHEMES (V)

How do we explain these results? “Missing data theory” is an active area of research in statistics which is concerned with how we should adjust for missing observations -- which may be missing because of subject non-response, or because of sampling design (e.g. our two-stage sampling design). Missing data theory distinguishes between data that are

(i) MCAR -- missing completely at random, i.e. non-response is completely unrelated to the variable we are studying (plausible for variables such as finger ridge count).

(ii) MAR -- missing at random, i.e. non-response is random, but the probability of non-response may vary as a function of observed trait values (or underlying latent variables).

Suppose we have a 5-level variable with the following probabilities of missing data at subsequent follow-up:

Trait Value    Probability of missing data
    1                  60%
    2                  20%
    3                  35%
    4                  13%
    5                  50%

These data are certainly not MCAR, but they do meet the definition of MAR.

If the probability of non-response is (i) determined by one or more correlated phenotypes that are not included in the analyses; or (ii) partly a function of the stage-two phenotype (as would be the case if individuals who were unaffected at wave one but had become affected by wave two were less likely to agree to be assessed at wave two than individuals who remained unaffected throughout), missing data will be non-ignorable.

In the case where we analyzed only the wave two data, but fixed the prevalence at 25% (i.e. assuming that missingness is determined by the stage two phenotype), missingness was still strictly non-ignorable, since it was determined by wave one and not wave two phenotypic values. However, since we simulated a very high test-retest correlation between wave one and wave two data, analyzing the data as though they were MAR greatly reduced biases to estimates of genetic and environmental parameters.

HIGH-RISK SAMPLING SCHEMES (VI) Under certain conditions, missingness is said to be ignorable, i.e. we can recover estimates of the underlying population parameters without needing to adjust for differential rates of non-response. For our two-stage high-risk sampling scheme, where we assumed random sampling at the first stage, but that only 10% of concordant unaffected pairs are assessed at the second stage, the stage-two data are MAR. Provided that we use the Ordinal option in MX (or the Raw data option, for continuous variables), and analyze all pairs observed at stage one, we can recover correct estimates of population prevalence and of genetic and environmental parameters.

Missing data theory provides a framework for thinking about several important classes of problems in behavior genetics:

(i) clinically ascertained samples;

(ii) cooperation or retention bias;

(iii) hierarchical or stage-dependent models of genetic and environmental influences on substance use initiation and outcome (e.g. smoking initiation and persistence) or risk of psychopathology.
