
1 Classification of Mitochondrial DNA SNPs into Haplogroups. Yuran Li, Department of Chemistry and Biochemistry, University of Delaware, Newark, DE 19717; Carol Wong, Department of Bioengineering, University of Pennsylvania, Philadelphia, PA. National Science Foundation – BioGRID REU Fellows, Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269

2 Outline: Mitochondrial DNA & Haplogroups; The Genographic Project; 1-Nearest Neighbor; Support Vector Machines; Random Forest; RF + PCA; Results; Discussion; Extensions

3 Mitochondrial DNA: found in mitochondria, with 2 to 10 copies per mitochondrion and hundreds to thousands per cell; circular, reflecting its bacterial origin; uniparental and non-recombining; high mutation rate; maternal inheritance (sperm mitochondria are tagged with ubiquitin and destroyed, so only the egg's mitochondria are passed on)

4 Haplogroups | Sequencing: sequencing targets the D-loop, a mutation hotspot that contains Hypervariable Region I (HVR-I). Each sample is tagged with a haplogroup label representing its genetic content; a haplogroup contains similar haplotypes that share a common ancestor, as defined by SNPs.

5 SNPs (single nucleotide polymorphisms): insertions, deletions, transversions, transitions, heteroplasmy. Variables = SNPs; 545 HVS-I SNPs.

6 Haplotypes/Haplogroups. Haplotype: a combination of SNPs; here, each sample's HVS-I variants. Dataset: coarse Hg labels assigned from coding-region SNPs and HVS-I motifs, considered the 'gold standard'.

7 [Figure slide: SNP illustration; source URL truncated]

8 Cladistics: classification based on shared ancestry, complicated by back mutations and homoplasy.

9 Cladistics cont.

10 The Genographic Project: run by the National Geographic Society to address anthropological and forensic questions; 78,590 samples, of which 21,141 are in the consented database. Hg labeling is done with both HVR-I motifs and the 22-SNP panel results, utilizing the Nearest Neighbor algorithm (1-NN).

11 Nearest Neighbor: pattern recognition | instance-based learning. Simple and powerful, with a high accuracy rate on large data sets. A data point is classified by a vote of its k nearest neighbors: the training data partitions the space into regions, and a new point takes the majority label among its neighbors (see the sketch below).
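A minimal sketch of 1-NN in R, using the class package's knn(); the tiny 0/1 SNP matrix and haplogroup labels below are made-up stand-ins, not the project's data:

library(class)  # provides knn()

set.seed(1)
train_snps <- matrix(rbinom(20 * 5, 1, 0.3), nrow = 20)  # 20 samples x 5 binary SNPs
train_hgs  <- factor(sample(c("H", "U", "L0"), 20, replace = TRUE))
test_snps  <- matrix(rbinom(4 * 5, 1, 0.3), nrow = 4)

# k = 1: each test sample takes the label of its single nearest training neighbor
predicted <- knn(train = train_snps, test = test_snps, cl = train_hgs, k = 1)
print(predicted)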

12 Support Vector Machines: training and testing; data → vectors; model production; mapping into a higher-dimensional space; maximum separating margin.

13 Data Processing (SVM): categorical values are numbered, e.g. {x, y, z} → (0,0,1), (0,1,0), (1,0,0); Radial Basis Function (RBF) kernel for the higher-dimensional mapping, chosen for its simplicity; optimal parameters found by grid search (see the sketch below).
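A sketch of this pipeline in R with the e1071 package (an interface to LIBSVM): one-hot encoding of categorical SNP values, an RBF-kernel SVM, and a grid search over cost and gamma. The toy data and the candidate grids are assumptions for illustration:

library(e1071)  # svm() and tune.svm() wrap LIBSVM

set.seed(1)
alleles <- data.frame(snp1 = factor(sample(c("x", "y", "z"), 30, replace = TRUE)),
                      snp2 = factor(sample(c("x", "y", "z"), 30, replace = TRUE)))
labels  <- factor(sample(c("H", "U"), 30, replace = TRUE))

# One-hot encoding: each 3-level factor becomes three 0/1 indicator columns,
# mirroring the slide's {x, y, z} -> (0,0,1), (0,1,0), (1,0,0) coding
# (model.matrix orders the indicator columns alphabetically)
x <- model.matrix(~ . - 1, data = alleles,
                  contrasts.arg = lapply(alleles, contrasts, contrasts = FALSE))

# Grid search over the RBF kernel's parameters, scored by internal cross-validation
tuned <- tune.svm(x, labels, kernel = "radial",
                  cost = 2^(-1:3), gamma = 2^(-4:0))
print(tuned$best.parameters)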

14 Random Forest: a tree-based, ensemble learning classification algorithm by Leo Breiman and Adele Cutler. The original package is written in Fortran (a computationally oriented programming language); implemented here through the R environment.

15 RF: voting for classification with random inputs (both variables and samples). Ntree single decision trees are grown, each split drawing on mtry randomly sampled variables. Each tree's training set is obtained through bootstrap sampling (random sampling with replacement), which leaves about 1/3 of the input cases out, providing the OOB (out-of-bag) data and error estimate (see the sketch below).
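A minimal sketch with the randomForest package (the R wrapper around Breiman and Cutler's code), showing ntree, mtry, and the OOB error estimate on toy data:

library(randomForest)

set.seed(1)
snps <- matrix(rbinom(100 * 20, 1, 0.3), nrow = 100)  # 100 samples x 20 binary SNPs
hgs  <- factor(sample(c("H", "U", "L0"), 100, replace = TRUE))

# ntree trees are grown; each split considers mtry randomly chosen variables;
# each tree trains on a bootstrap sample, leaving ~1/3 of cases out-of-bag
rf <- randomForest(x = snps, y = hgs, ntree = 500, mtry = 4)
print(rf)  # the printout includes the OOB estimate of the error rate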

16 Random Forest [Figures: random forest schematic from http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html and tree diagram from http://cg.scs.carleton.ca/~luc/trees.html]

17 5F Cross-Validation: test the predictive model by dividing the data into 5 subsets (5-fold). In each round of five-fold cross-validation, four subsets form the training set and the fifth is the test set of 'unseen' data; Random Forest is trained on the training set and evaluated on the testing set (a sketch follows the figure below).

18 5F CV [Figure: cross-validation folds, https://esus.genome.tugraz.at/ProClassify/help/contents/pages/images/xv_folds.gif]
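A hedged sketch of the 5-fold procedure around randomForest; the fold assignment and toy data are illustrative, not the study's setup:

library(randomForest)

set.seed(1)
snps <- matrix(rbinom(100 * 20, 1, 0.3), nrow = 100)
hgs  <- factor(sample(c("H", "U", "L0"), 100, replace = TRUE))

folds <- sample(rep(1:5, length.out = nrow(snps)))  # each sample gets a fold 1..5
acc <- numeric(5)
for (f in 1:5) {
  train <- folds != f
  rf   <- randomForest(x = snps[train, ], y = hgs[train])
  pred <- predict(rf, snps[!train, ])               # predict the held-out 'unseen' fold
  acc[f] <- mean(pred == hgs[!train])
}
mean(acc)  # cross-validated accuracy estimate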

19 RF model: genotyped mtDNA dataset with 545 SNPs in HVS-I; HVS-I haplotypes across the samples. Hg classification derives from similar haplotypes, so the SNPs dictate the Hg classifications: SNPs = variables. The coarse Hg classifications in the dataset serve as the 'gold standard'.

20 Model - Optimal mtry and ntree values for the entire dataset ◦ the pair of parameters with the lowest OOB error on the training set (mtry = number of SNPs sampled as split candidates; ntree = number of decision trees constructed). 5-fold cross-validation ◦ random forest on the training set, which is drawn by bootstrap sampling (random sampling of cases with replacement), the held-out cases forming the OOB data ◦ model: the random forest object that is output ◦ apply the random forest model to the test set ◦ output = predicted Hg classifications ◦ compare back to the 'observed' Hg classifications. A parameter-tuning sketch follows below.
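One way to pick the (mtry, ntree) pair by lowest OOB error, as described above; the candidate grids here are assumptions, not the values used in the study:

library(randomForest)

set.seed(1)
snps <- matrix(rbinom(100 * 20, 1, 0.3), nrow = 100)
hgs  <- factor(sample(c("H", "U", "L0"), 100, replace = TRUE))

grid <- expand.grid(mtry = c(2, 4, 8), ntree = c(250, 500, 1000))
grid$oob <- apply(grid, 1, function(p) {
  rf <- randomForest(x = snps, y = hgs, ntree = p["ntree"], mtry = p["mtry"])
  rf$err.rate[p["ntree"], "OOB"]   # OOB error after the last tree
})
grid[which.min(grid$oob), ]        # parameter pair with the lowest OOB error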

21 R environment: described by Bill Venables and David M. Smith; derived from the 'S' (statistical) programming language; a coherent system integrating data manipulation, calculation and graphical display.

22 R environment [screenshot omitted]

23 PCA (Principal Component Analysis): a feature-selection tool. Which variables are more informative than others? Useful for a confusing dataset with too many variables (545).

24 PCA: re-express the dataset in another basis, the principal components (PCs). This change of basis enables dimensional reduction and can reveal hidden structure and underlying relationships. Which basis best represents the dynamics of interest? One that maximizes variance and minimizes covariance (redundancy); the PCs are the new basis vectors.

25 PCA on dataset: eigendecomposition of C_X. Original dataset = X; covariance C_X = (1/n) X X^T, where n = number of samples. Transformed dataset Y = PX, where P is an orthonormal matrix whose rows are the principal components of X, i.e. the eigenvectors of C_X. C_Y, the covariance matrix of the transformed dataset Y, is diagonal; its diagonal entries are the eigenvalues, i.e. the variances.

26 PCA: eigenvalues represent variance. Rank-order the PCs (the eigenvectors of the original covariance matrix, serving as new variables) by their corresponding eigenvalues, then subselect k of them from the available pool, k = 64, 100, 200, 300, 400, 545. The first k rank-ordered PCs become the input to RF; the transformed dataset is samples by k dimensions (see the sketch below).
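A sketch of this step in R: prcomp() returns the PCs already rank-ordered by variance (eigenvalue), so keeping the first k columns of the scores re-expresses the samples in k dimensions for RF. The toy data stands in for the real 545-SNP set, with k = 64 taken from the list above:

library(randomForest)

set.seed(1)
snps <- matrix(rbinom(700 * 100, 1, 0.3), nrow = 700)  # stand-in: 700 samples x 100 SNPs
hgs  <- factor(sample(c("H", "U", "L0"), 700, replace = TRUE))

pca <- prcomp(snps)       # PCs come back ordered by decreasing variance
k <- 64
scores <- pca$x[, 1:k]    # samples re-expressed in the first k principal components
rf <- randomForest(x = scores, y = hgs)
print(rf)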

27 Results
Classifier | Feature Selection | # of features | Ntree | Mtry | Weighted Accuracy Rate (%)
RF | Raw | 545 | … | … | …
RF | PCA | 64 | … | … | …
RF | PCA | 100 | … | … | …
RF | PCA | 200 | … | … | …
RF | PCA | 300 | … | … | …
RF | PCA | 400 | … | … | …
RF | PCA | 545 | … | … | …

28 SVM Findings: macro accuracy 88.06%; micro accuracy 96.59%.

29 Comparison
Classifier | Macro Accuracy | Micro Accuracy
1-NN (LOO CV) | - | 96.73%
1-NN (5F-CV) | 87.36% | 96.26%
RF (5F-CV) | 87.35% | 96.19%
SVM (5F-CV) | 88.06% | 96.59%
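A small sketch of one common reading of the two accuracy flavors in the table: micro accuracy is the overall fraction of correct calls, while macro accuracy is the unweighted mean of per-haplogroup accuracies, so rare haplogroups weigh as much as common ones (which is why macro sits below micro on this unbalanced dataset). The six-sample vectors are invented for illustration:

observed  <- factor(c("H", "H", "H", "U", "L0", "L0"))
predicted <- factor(c("H", "H", "U", "U", "L0", "H"), levels = levels(observed))

micro <- mean(predicted == observed)                         # overall fraction correct
macro <- mean(tapply(predicted == observed, observed, mean)) # mean of per-class accuracies
c(micro = micro, macro = macro)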

30 Discussion: unbalanced dataset ◦ underrepresented haplogroups ◦ overrepresented haplogroups ◦ possibility: change class weights or merge into coarser Hgs. Further points: bootstrap sampling in RF, cross-validation, RF vs. SVM.

31 Graph [figure omitted]

32 Coarse Hg accuracy rates
Haplogroup Name | Sample Size | 1-NN Accuracy | RF Accuracy | SVM Accuracy
HV* | … | … | 62.58% | 62.58%
L0/ | … | … | 98.92% | 99.64%
N* | … | … | 49.21% | 52.38%
M* | … | … | 75.60% | 82.44%
U* | … | … | 95.83% | 96.55%
A | … | … | 98.66% | 100.00%
C | … | … | 98.70% | 99.13%
B | … | … | 95.62% | 97.64%
D | … | … | 89.64% | 88.60%
I | … | … | 98.11% | 97.89%
H | … | … | 98.09% | 98.41%
K | … | … | 99.37% | …
J | … | … | 100.00% | …
R | … | … | 64.71% | 67.65%
L | … | … | 99.07% | 99.69%
T | … | … | … | …
W | … | … | 97.54% | 96.31%
V | … | … | 88.85% | 89.18%
X | … | … | 97.34% | 98.22%
R* | … | … | 10.00% | 11.67%
L3* | … | … | 95.68% | 95.68%
N1* | … | … | 95.94% | 97.97%
R0* | … | … | 94.55% | 94.55%

33 Conclusions. RF is non-deterministic: random sampling of variables and random sampling of training cases (samples) make repeated trials vary. SVM and 1-NN are deterministic models; SVM (5F-CV) > 1-NN (5F-CV).

34 Acknowledgements. Advisor: Chih Lee; Dr. Chun-Hsi Huang; National Science Foundation REU grant CCF; Univ. of Connecticut.

35 Thank you! Any questions?

