Genotype Susceptibility and Integrated Risk Factors for Complex Diseases
Weidong Mao, Dumitru Brinza, Nisar Hundewale, Stefan Gremalshi, Alexander Zelikovsky


1 Genotype Susceptibility and Integrated Risk Factors for Complex Diseases
Weidong Mao, Dumitru Brinza, Nisar Hundewale, Stefan Gremalshi, Alexander Zelikovsky
Department of Computer Science, Georgia State University

2 Outline
- Human genetics basics
- SNPs, haplotypes and genotypes
- Genetic epidemiology
- Prediction methods
- Genetic susceptibility to complex diseases
- Conclusions and future plans

3 Human Genetics Basics
- Genetics: DNA, gene, chromosome and genome.
- DNA = two complementary strands of nucleotides (A-T, G-C).
- Length of DNA is measured in base pairs (bp).
- Human Genome Project (1990 - 2003): 3 billion bp in the human genome; 15,000 genes.
- Over 99% of the genome is identical between individuals; 1% are SNPs.
- 3.7 million SNPs.

4 Single Nucleotide Polymorphisms (SNPs)
- A SNP is a single altered nucleotide in the genome sequence.
- Found in at least 1% of the population.
- Occurs every 100 to 300 bp.
- Bi-allelic: wild type and mutation.
- Example sequences: AAGGCATGGCTA, AACGCGTGGTTA, AACGCGTGGCTA.
- SNPs are genetic risk factors for diseases.

5 Genotypes, Haplotypes, 0/1/2 Notation
- Diploid organisms carry two different “copies” of each chromosome, which are recombined copies of the parents’ chromosomes.
- It is too expensive to examine the two versions of a chromosome separately; it is much cheaper to obtain genotype (mixed) data than haplotype (separated) data.
- Haplotype = description of a single copy (0 = wild type, 1 = minor allele).
- Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01). Values 0 and 1 mark homozygous sites; 2 marks a heterozygous site.
[Figure: two haplotypes per individual and the resulting genotype, with homozygous and heterozygous SNP sites marked.]
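The 0/1/2 notation can be illustrated with a minimal sketch. The function name `genotype_from_haplotypes` and the example haplotypes are illustrative assumptions, not taken from the slides; only the encoding (0 = 00, 1 = 11, 2 = 01) comes from the slide.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotypes into one 0/1/2 genotype string.

    Encoding from the slide: 0 = both copies are 0, 1 = both copies are 1,
    2 = the two copies differ (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    sites = []
    for a, b in zip(h1, h2):
        if a not in "01" or b not in "01":
            raise ValueError("haplotypes may only contain 0 and 1")
        # Equal alleles keep their value; different alleles become 2.
        sites.append(a if a == b else "2")
    return "".join(sites)


# Hypothetical example haplotypes (not from the slides).
print(genotype_from_haplotypes("0110", "0011"))
```

Going the other way (recovering the two haplotypes from a genotype) is ambiguous whenever a genotype has more than one heterozygous site, which is why haplotype data is described as more expensive to obtain.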

6 Genetic Epidemiology
- Genetic epidemiology: searching for genetic risk factors for diseases.
- Monogenic disease: a mutated gene is entirely responsible for the disease. Typically rare in the population: < 0.1%.
- Complex disease: affected by the interaction of multiple genes. Common: > 0.1%. In New York City, 12% of the population has type 2 diabetes.
- The significance of a risk factor is measured by risk rate or odds ratio.

7 Genetic Susceptibility to Complex Diseases
- Given: genotypes of sick and healthy persons, and the genotype of a test person.
- Find: whether the test person has the disease or not.
[Figure: table of genotypes with their disease status (healthy or sick), and a test genotype g_t = 0110211101211201 whose status s(g_t) is to be predicted.]

8 Prediction Methods
Universal prediction methods:
- Statistical methods: closest neighbor, genotype statistics
- Support Vector Machine (SVM)
- Random Forest
Ad hoc prediction methods:
- Pseudo-haplotype statistics
- Linear programming based prediction method
- Adjacent SNP pairs

9 Statistical Methods
- Closest Genotype Neighbor: for the test genotype g_t, find the closest genotype g_i using Hamming distance and then set s(g_t) = s(g_i).
  Example: g_i = ATTCTGACCGCATC, g_t = ATTGTGATCGCCTC, H(g_i, g_t) = 3.
- Genotype Statistics: a standard statistical method based on the allele frequency. For each SNP j = 1, …, m, we compute the LRR score of risk rate (RR) [formula not preserved in the transcript]. For genotype g_t, if the cumulative LRR score of all SNPs is greater than 0, then the output disease status is s(g_t) = 1 (g_t is predicted to be in control population), and -1 otherwise.
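The closest genotype neighbor rule can be sketched as follows. The function names and the tiny training set are illustrative assumptions; ties are broken by taking the first closest genotype in the training list, which the slides do not specify.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_neighbor_status(train, test_genotype):
    """Return the status of the training genotype closest to test_genotype.

    train is a list of (genotype, status) pairs. On ties, min() keeps the
    first closest entry (an assumption, not stated in the slides).
    """
    _, status = min(train, key=lambda pair: hamming(pair[0], test_genotype))
    return status


# The slide's own distance example.
print(hamming("ATTCTGACCGCATC", "ATTGTGATCGCCTC"))

# Hypothetical training data in 0/1/2 notation.
train = [("0120", 1), ("2201", -1)]
print(closest_neighbor_status(train, "0121"))
```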

10 Support Vector Machine (SVM) Algorithm
Learning task
- Given: genotypes of patients and healthy persons.
- Compute: a model distinguishing whether a person has the disease.
Classification task
- Given: genotype of a new patient and a learned model.
- Determine: whether the patient has the disease or not.
[Figure: linear SVM and non-linear SVM.]

11 Random Forest Algorithm
- A random forest grows many classification trees.
- To classify a new object from an input vector, pass the input vector down each tree in the forest. Each tree gives a classification, and we say the tree “votes” for that class.
- The forest chooses the classification having the most votes (over all the trees in the forest).
- Steps: growing trees, split selection and prediction.
- Randomness: random sub-sample of training data, random splitter selection.

12 Random Forest Algorithm
[Figure: bootstrapped samples drawn from the genotype data set, trees splitting on homozygous vs. heterozygous SNP values, and a test genotype passed down the trees.]

13 Ad hoc Classification Methods
Pseudo-Haplotype Statistics:
[Figure: genotypes and their pseudo-haplotypes (Genotype 1: 012001221000, pseudo-haplotype: 010011001000, Genotype 2: 220212021000), with disease statuses 1 and -1 and an unknown status "?" for the test genotype.]

14 LP-based Prediction Algorithm
- Certain haplotypes are susceptible to the disease while others are resistant to the disease.
- The genotype susceptibility is assumed to be the sum of the susceptibilities of its two haplotypes.
- Assign a positive weight to susceptible haplotypes and a negative weight to resistant haplotypes such that for any control genotype the sum of the weights of its haplotypes is negative and for any case genotype it is positive.
- For each vertex h_i (corresponding to a haplotype) of the graph G_X we wish to assign a weight p_i such that, for any genotype-edge e_{i,j} = (h_i, h_j), the sum p_i + p_j has the same sign as s(e_{i,j}), where s(e_{i,j}) ∈ {-1, 1} is the disease status of the genotype represented by edge e_{i,j}.
- The total sum of absolute values of genotype weights is maximized.
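The sign constraints and the objective described on this slide can be made concrete with a small checker. This sketch does not solve the linear program; it only tests whether a given weight assignment is consistent with the disease statuses and evaluates the objective. The haplotype names, weights and edges are hypothetical, and the strict-sign reading of the constraint is an assumption drawn from the slide's prose.

```python
def is_consistent(weights, edges):
    """Check that every genotype's weight has the sign of its status.

    weights maps a haplotype name to its weight p_i.
    edges is a list of (h_i, h_j, status) with status in {-1, 1}
    (1 = case, -1 = control, following the slide's description).
    """
    return all(status * (weights[hi] + weights[hj]) > 0
               for hi, hj, status in edges)


def objective(weights, edges):
    """Total sum of absolute genotype weights |p_i + p_j|."""
    return sum(abs(weights[hi] + weights[hj]) for hi, hj, _ in edges)


# Hypothetical haplotype weights and genotype edges.
weights = {"h1": 0.5, "h2": -1.0, "h3": 0.25}
edges = [("h1", "h3", 1), ("h2", "h3", -1), ("h2", "h2", -1)]

print(is_consistent(weights, edges))
print(objective(weights, edges))
```

A genotype whose two haplotypes are identical appears here as an edge from a vertex to itself, so its weight is twice the haplotype weight.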

15 Most Reliable 2 SNPs Prediction
- Chooses a pair of adjacent SNPs to predict the disease status of the test genotype by voting among the genotypes from the training set that have the same SNP values at the chosen sites.
- The most reliable 2 adjacent SNPs are those with the highest prediction rate in the training set.
[Figure: training set genotypes and a test genotype; the figure labels show 60% and 100%.]
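A minimal sketch of the adjacent-pair voting idea follows. The slides do not say how the prediction rate in the training set is computed; this sketch assumes a leave-one-out vote within the training set, and all function names and data are hypothetical.

```python
from collections import Counter


def vote(train, start, values, exclude=None):
    """Majority status among training genotypes whose SNPs at positions
    start and start + 1 equal `values`; None if no genotype matches."""
    tally = Counter(
        status
        for idx, (genotype, status) in enumerate(train)
        if idx != exclude and genotype[start:start + 2] == values
    )
    return tally.most_common(1)[0][0] if tally else None


def pair_reliability(train, start):
    """Fraction of training genotypes predicted correctly by the pair,
    each one voted on by the others (leave-one-out; an assumption)."""
    correct = sum(
        vote(train, start, genotype[start:start + 2], exclude=i) == status
        for i, (genotype, status) in enumerate(train)
    )
    return correct / len(train)


def predict_by_best_pair(train, test_genotype):
    """Pick the most reliable adjacent pair, then vote on the test genotype."""
    best = max(range(len(test_genotype) - 1),
               key=lambda j: pair_reliability(train, j))
    return vote(train, best, test_genotype[best:best + 2])


# Hypothetical training set: (genotype, status).
train = [("0012", 1), ("0011", 1), ("2210", -1), ("2200", -1)]
print(predict_by_best_pair(train, "0020"))
```

When no training genotype shares the test genotype's values at the chosen pair, `predict_by_best_pair` returns `None`; a practical implementation would need a fallback rule.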

16 Disease Tagging
- Motivation: genotyping and analyzing only a limited number of suspicious SNPs.
- Tag SNPs: the subset of SNPs that are probably responsible for the disease.
[Figure: genotypes with the tag SNP columns highlighted.]

17 Minimal Disease Tagging Problem
- Given: genotypes partitioned into groups (e.g., case/control).
- Find: the minimal number of SNPs distinguishing any case from any control.
- Greedy algorithm: drop a SNP if removing it does not collapse cases and controls, that is, does not make some case identical to some control on the remaining SNPs.
[Figure: step-by-step example of dropping SNP columns until no further SNP can be dropped (STOP).]
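The greedy rule can be sketched as below. The left-to-right order in which SNPs are tried is an assumption (the slides do not give one), and the function names and data are hypothetical. The result is minimal in the sense that no remaining SNP can be dropped, which does not guarantee the smallest possible set.

```python
def separates(cases, controls, snps):
    """True if no case equals any control when restricted to `snps`."""
    case_keys = {tuple(g[j] for j in snps) for g in cases}
    control_keys = {tuple(g[j] for j in snps) for g in controls}
    return case_keys.isdisjoint(control_keys)


def greedy_tag_snps(cases, controls):
    """Drop SNPs one at a time while cases and controls stay distinguishable."""
    kept = list(range(len(cases[0])))
    if not separates(cases, controls, kept):
        raise ValueError("cases and controls collide even with all SNPs")
    for j in range(len(cases[0])):
        trial = [k for k in kept if k != j]
        # Keep the drop only if it does not collapse a case onto a control.
        if separates(cases, controls, trial):
            kept = trial
    return kept


# Hypothetical data: only SNP 0 differs between the two groups.
cases = ["001", "011"]
controls = ["101", "111"]
print(greedy_tag_snps(cases, controls))
```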

18 [Figure: worked example with haplotype pairs, their case (+) and control (-) genotypes, and the weights assigned to them (values between -1.5 and 1.25); a genotype with weight 0 is marked "Decided by other methods".]

19 Quality Measures of Prediction
- Sensitivity: the ability to correctly detect disease. sensitivity = TP/(TP+FN)
- Specificity: the ability to avoid calling normal as disease. specificity = TN/(FP+TN)
- Accuracy = (TP+TN)/(TP+FP+FN+TN)
- Risk rate: measurement for risk factors.

Prediction    Disease +              Disease -
Test +        True Positive (TP)     False Positive (FP)
Test -        False Negative (FN)    True Negative (TN)
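The three formulas can be computed directly from the true and predicted statuses. This is a small sketch; the function name, the use of 1 as the "disease" label, and the example vectors are assumptions for illustration.

```python
def quality_measures(actual, predicted, positive=1):
    """Sensitivity, specificity and accuracy from two status lists."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined when a group is empty; report NaN rather than crash.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, fp + tn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical statuses: 1 = disease, -1 = healthy.
actual = [1, 1, 1, -1, -1]
predicted = [1, 1, -1, -1, 1]
print(quality_measures(actual, predicted))
```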

20 Cross-validation Method
- Leave-one-out test: the disease status of each genotype in the data set is predicted while the rest of the data is regarded as the training set.
- Leave-many-out test: repeatedly pick 2/3 of the population at random as the training set and predict the other 1/3.
[Figure: genotypes with real and predicted disease status; in the example shown, accuracy = 80%.]
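A leave-one-out test can be sketched as a loop that holds out one genotype at a time. The predictor used here is a simple nearest-neighbor rule chosen only to make the example self-contained; the function names and the data are hypothetical.

```python
def nearest_neighbor(train, genotype):
    """Status of the closest training genotype by Hamming distance
    (first closest entry wins on ties)."""
    def distance(other):
        return sum(x != y for x, y in zip(other, genotype))
    _, status = min(train, key=lambda pair: distance(pair[0]))
    return status


def leave_one_out_accuracy(data, predict):
    """Predict each genotype from all the others and return the accuracy.

    data is a list of (genotype, status); predict(train, genotype) returns
    a status.
    """
    correct = 0
    for i, (genotype, status) in enumerate(data):
        train = data[:i] + data[i + 1:]  # everything except the held-out item
        if predict(train, genotype) == status:
            correct += 1
    return correct / len(data)


# Hypothetical data set.
data = [("0000", -1), ("0001", -1), ("2222", 1), ("2221", 1), ("0002", 1)]
print(leave_one_out_accuracy(data, nearest_neighbor))
```

The leave-many-out variant would replace the loop with repeated random 2/3 vs. 1/3 splits and average the accuracy over the repetitions.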

21 Algorithms Evaluation
- P-value: a measure of how much evidence we have against the null hypothesis.
- Null hypothesis: the observed prediction accuracy is obtained by chance.
- To reject the null hypothesis, we require p-value < 0.05.
- Computing the p-value by randomization:
  1. Randomly permute the disease status of the population to generate 1000 instances.
  2. Apply the prediction methods to each instance to get its prediction accuracy.
  3. Compute the fraction of instances that have a higher prediction accuracy than the observed accuracy.
- Confidence intervals: bootstrapping is used to compute a 95% CI for each measure.
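The randomization procedure can be sketched as follows. The `evaluate` callback stands in for "apply a prediction method and measure its accuracy" and is an assumption, as are the function name and the fixed seed. The strict comparison follows the slide's wording ("higher"); counting ties as well is a common, more conservative variant.

```python
import random


def permutation_p_value(data, evaluate, observed, n_permutations=1000, seed=0):
    """Fraction of status-permuted instances whose accuracy beats `observed`.

    data is a list of (genotype, status); evaluate(instance) returns the
    prediction accuracy obtained on an instance of the same shape.
    """
    rng = random.Random(seed)
    genotypes = [g for g, _ in data]
    statuses = [s for _, s in data]
    higher = 0
    for _ in range(n_permutations):
        shuffled = statuses[:]
        rng.shuffle(shuffled)  # break the genotype/status association
        if evaluate(list(zip(genotypes, shuffled))) > observed:
            higher += 1
    return higher / n_permutations


# Hypothetical usage with a stand-in evaluator that always reports 0.5.
data = [("0012", 1), ("0011", 1), ("2210", -1), ("2200", -1)]
print(permutation_p_value(data, lambda instance: 0.5, observed=0.9))
```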

22 Data Sets
Crohn's disease (Daly et al.): inflammatory bowel disease (IBD)
- Location: 5q31
- Number of SNPs: 103
- Population size: 387 (case: 144, control: 243)
Autoimmune disorders (Ueda et al.):
- Location: region containing the genes CD28, CTLA4 and ICOS
- Number of SNPs: 108
- Population size: 1036 (case: 384, control: 652)

23 Experiment Results (W. Mao et al., IEEE International Conference on Granular Computing)

24 Conclusions
- SNPs are genetic risk factors for complex diseases.
- Most known methods focus on single markers and are not applicable to complex diseases.
- We propose several ad hoc algorithms to predict the genetic susceptibility and integrated risk factors for complex diseases.
- Our algorithms are shown to have higher statistical significance and a higher prediction rate than the universal methods.

25 Thank You! Questions?

