Published by Levi Haine. Modified about 1 year ago.

1
Classification of Mitochondrial DNA SNPs into Haplogroups
Yuran Li, Department of Chemistry and Biochemistry, University of Delaware, Newark, DE
Carol Wong, Department of Bioengineering, University of Pennsylvania, Philadelphia, PA
National Science Foundation – BioGRID REU Fellows
Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269

2
- Mitochondrial DNA & Haplogroups
- The Genographic Project
- 1-Nearest Neighbor
- Support Vector Machines
- Random Forest
- RF PCA
- Results
- Discussions
- Extensions

3
Mitochondrial DNA
- Found in mitochondria
  - 2 to 10 copies per mitochondrion
  - Hundreds to thousands per cell
- Circular
  - Bacterial origin
- Uniparental
- Non-recombining
- High mutation rate
- Maternal inheritance
- Egg vs. sperm & ubiquitin marker

4
Haplogroups | Sequencing
- Coding is done at the D-loop
- Mutation hotspots
- Hypervariable Region I (HVR-I)
- Nucleotides
- Each sample is tagged with a haplogroup label representing its genetic content.
- A haplogroup contains similar haplotypes that share a common ancestor based on SNPs.

5
SNPs
- SNPs (single nucleotide polymorphisms)
- Insertion
- Deletion
- Transversion
- Transition
- Heteroplasmy
- Variables = SNPs
- 545 HVS-I SNPs

6
Haplotypes/Haplogroups
- Haplotype: combination of SNPs
- Haplotype: HVS-I variants
- Samples
- Dataset: coarse Hg labels
- Coding region SNPs / HVS-I motifs
- Considered the ‘gold standard’

7
ntent/health/pharma/snips/

8
Cladistics
- Classification based on shared ancestry
- Back mutations
- Homoplasy

9
Cladistics cont.

10
The Genographic Project
- The National Geographic Society
- Anthropological and forensic questions
- 78,590 samples
- 21,141 samples in the consented database
- Hg labeling is done with both HVR-I motifs and the 22-SNP panel results
- Utilizes the nearest neighbor algorithm (1-NN)

11
Nearest Neighbor
- Pattern recognition | Instance-based learning
- Simple and powerful
- High accuracy rate with large data sets
- A data point is classified by a vote of its k nearest neighbors
- Training data is separated in space into regions
- The data point is assigned the class with the highest number of votes among its neighbors.
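The voting rule above can be sketched in a few lines. This is a minimal illustration, not the Genographic Project's implementation: the Hamming distance on 0/1 SNP vectors, the function names, and the toy data are all assumptions made for the example.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k=1):
    """train: list of (vector, label) pairs; returns the majority label among the k nearest."""
    neighbors = sorted(train, key=lambda pair: hamming(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data (made up): each vector marks presence (1) or absence (0) of a SNP.
train = [
    ([1, 0, 0, 1], "H"),
    ([1, 0, 0, 0], "H"),
    ([0, 1, 1, 0], "L3*"),
    ([0, 1, 1, 1], "L3*"),
]
query = [1, 0, 1, 1]
prediction_1nn = knn_predict(train, query, k=1)
prediction_3nn = knn_predict(train, query, k=3)
```

With k = 1 the vote reduces to copying the label of the single closest training sample, which is the 1-NN setting named on the previous slide.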

12
Support Vector Machines
- Training and testing
- Data vectors
- Model production
- Mapping into a higher-dimensional space
- Maximum separating margin

13
Data Processing (SVM)
- Numbering of detailed data
  - {x,y,z}: (0,0,1), (0,1,0), (1,0,0)
- Radial Basis Function (RBF) kernel
  - Mapping into a higher-dimensional space
  - Simplicity
- Optimal parameters:
- Grid search:
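The indicator coding and the RBF kernel named on this slide can be sketched as follows. This is only an illustration: the helper names and the value of `gamma` are assumptions, and the slide's actual parameter values are not recoverable from the text.

```python
import math

def one_hot(value, categories):
    """Indicator vector with a 1 at the position of `value` in `categories`."""
    return [1 if value == c else 0 for c in categories]

def rbf_kernel(u, v, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

# Ordered so that x -> (0, 0, 1), y -> (0, 1, 0), z -> (1, 0, 0), as listed on the slide.
categories = ["z", "y", "x"]
encoded = {c: one_hot(c, categories) for c in categories}

# gamma = 0.5 is an arbitrary illustrative value.
same = rbf_kernel(encoded["x"], encoded["x"], gamma=0.5)
different = rbf_kernel(encoded["x"], encoded["y"], gamma=0.5)
```

Identical vectors give a kernel value of 1, and the value decays toward 0 as the vectors move apart; `gamma` controls how fast, which is why it is one of the parameters usually tuned by grid search.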

14
Random Forest
- Tree-based classification algorithm
- Original package in Fortran (computationally oriented programming language)
- Leo Breiman and Adele Cutler
- Ensemble learning algorithm
- Implementation through R environment

15
RF
- Voting for classification
- Random inputs
  - Variables
  - Samples
- Ntree single decision trees
- Mtry variables
- Random sampling
  - Training set obtained through bootstrap sampling
- OOB (out-of-bag) data/error estimate
  - Each bootstrap sample excludes certain cases
  - About 1/3 of input cases left out
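A short simulation shows why roughly one third of the cases end up out-of-bag. This is a sketch in plain Python rather than the R package the slides used; the function names, the seed, and the number of cases are assumptions chosen for illustration.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n case indices uniformly at random with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def oob_indices(n, in_bag):
    """Cases never drawn into the bootstrap sample (out-of-bag)."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]

rng = random.Random(0)   # fixed seed so the run is repeatable
n_cases = 10_000         # illustrative size, not the dataset's
in_bag = bootstrap_indices(n_cases, rng)
oob = oob_indices(n_cases, in_bag)

# Each case is missed with probability (1 - 1/n)^n, which approaches 1/e (about 0.368).
oob_fraction = len(oob) / n_cases
```

Because every tree has its own out-of-bag cases, those cases can serve as a built-in test set for that tree, which is what the OOB error estimate relies on.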

16
Random Forest
[Figures: random forest diagram (image source: http://proteomics.bioengr.uic.edu/malibu/docs/meta_classifiers.html); tree diagram (image source: http://cg.scs.carleton.ca/~luc/trees.html)]

17
5F - Cross Validation
- Test out predictive model
- Divide into 5 subsets (5-fold)
  - Training set
  - Test set: ‘unseen’ data
- Five-fold cross validation
  - Training set / testing set
  - Random Forest with training set
  - Testing set
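The 5-fold split can be sketched as below. This is a minimal version under stated assumptions: the function name, the seed, and the sample count of 23 are hypothetical, and the folds are not stratified by haplogroup.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_splits(n_samples=23, k=5))
```

Each sample lands in the test set exactly once, so every prediction that contributes to the reported accuracy is made on data the model did not see during training.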

18
5F CV https://esus.genome.tugraz.at/ProClassify/help/contents/pages/images/xv_folds.gif

19
RF model
- Genotyped mtDNA dataset
  - 545 SNPs in HVS-I
  - HVS-I haplotypes
  - samples
- Hg classification from similar haplotypes
  - SNPs dictate Hg classifications
  - SNPs = variables
- Coarse Hg classifications in dataset
  - ‘gold standard’

20
Model
- Optimal mtry and ntree values for entire dataset
  ◦ Pair of parameters with lowest OOB error (training set)
  ◦ Mtry: number of SNPs randomly sampled as candidates at each split
  ◦ Ntree: number of decision trees constructed
- 5-fold cross validation
  ◦ Random forest on training set
    - Bootstrap sampling: random sampling with replacement of cases
    - OOB data
  ◦ Model: random forest object outputted
  ◦ Apply random forest model on test set
  ◦ Output = predicted Hg classifications
  ◦ Compare back to ‘observed’ Hg classifications
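Choosing the pair of parameters with the lowest OOB error amounts to a grid search. The sketch below shows only the selection logic: `toy_oob_error` is a made-up stand-in for fitting a forest and reading its OOB error, and the candidate values are hypothetical, not the ones used in the study.

```python
from itertools import product

def grid_search(mtry_values, ntree_values, oob_error):
    """Return the (mtry, ntree) pair with the lowest OOB error, plus that error."""
    best_pair = min(product(mtry_values, ntree_values), key=lambda p: oob_error(*p))
    return best_pair, oob_error(*best_pair)

def toy_oob_error(mtry, ntree):
    """Stand-in for training a forest and returning its OOB error (made-up formula)."""
    return abs(mtry - 20) / 100 + 1 / ntree

best_pair, best_error = grid_search([10, 20, 40], [100, 500, 1000], toy_oob_error)
```

In a real run the error function would train a random forest for each pair, so the cost grows with the size of the grid.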

21
R environment
- Bill Venables and David M. Smith
- Primary programming language: ‘S’ (statistical)
- Coherent system integrating data manipulation, calculation and graphical display

22
R environment

23
PCA
- PCA = Principal Component Analysis
- Feature selection tool
- Which variables are more informative than others?
- Confusing dataset
  - Too many variables: 545

24
PCA
- Re-express dataset in another basis, the principal components (PCs)
  - Change of basis
  - Possible dimensional reduction
  - Reveal hidden structure, underlying relationships
- Which basis best represents the dynamics of interest?
  - Maximize variance
  - Minimize covariance (redundancy)
- Find PCs: new basis vectors

25
PCA on dataset
- Eigendecomposition on $C_X$
- Original dataset = $X$
  - $C_X = \frac{1}{n} X X^T$, where $n$ is the number of samples
- Transformed dataset = $Y$
  - $Y = PX$
- $P$ = orthonormal matrix
  - Rows = principal components of $X$
  - Rows = eigenvectors of $C_X$
- $C_Y$ = diagonal covariance matrix of the transformed dataset $Y$
  - Diagonal entries = eigenvalues = variances
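The step from "rows of P are eigenvectors of C_X" to "C_Y is diagonal" can be written out as a short derivation. The symbols $E$ (matrix whose columns are the orthonormal eigenvectors of $C_X$) and $D$ (diagonal matrix of eigenvalues) are introduced here for illustration and do not appear on the slide; $X$ is assumed to be mean-centered with one sample per column.

```latex
\begin{align*}
C_Y &= \frac{1}{n} Y Y^T
     = \frac{1}{n} (PX)(PX)^T
     = P \left( \frac{1}{n} X X^T \right) P^T
     = P \, C_X \, P^T \\
C_X &= E D E^T \qquad (E^T E = I,\ D \text{ diagonal}) \\
P = E^T \;\Rightarrow\; C_Y &= E^T \left( E D E^T \right) E = D
\end{align*}
```

So with $P = E^T$ the transformed variables are uncorrelated, and the diagonal entries of $C_Y$ are the eigenvalues of $C_X$, that is, the variances along each principal component.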

26
PCA
- Eigenvalues represent variance
- Rank order PCs (eigenvectors of the original covariance matrix, i.e. the new variables) by corresponding eigenvalues
- Subselection of k new variables (PCs) from available pool
  - k = 64, 100, 200, 300, 400, 545
- Select the first k rank-ordered PCs for input into RF
- Transformed dataset = by k dimensions

27
Results
Classifier | Feature Selection | # of Features | Ntree | Mtry | Weighted Accuracy Rate (%)
RF | Raw | | | |
RF | PCA | | | |
RF | PCA | | | |
RF | PCA | | | |
RF | PCA | | | |
RF | PCA | | | |
RF | PCA | | | |

28
SVM Findings Macro: 88.06% Micro: 96.59%

29
Comparison
Classifier | Macro Accuracy | Micro Accuracy
1-NN (LOO CV) | - | 96.73%
1-NN (5F-CV) | 87.36% | 96.26%
RF (5F-CV) | 87.35% | 96.19%
SVM (5F-CV) | 88.06% | 96.59%
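The gap between the macro and micro columns can be illustrated with a small example. The slides do not define the two measures, so this sketch assumes the usual meanings: micro accuracy is the fraction of all samples classified correctly, and macro accuracy is the unweighted mean of the per-haplogroup accuracies. The labels below are made up.

```python
from collections import defaultdict

def micro_accuracy(observed, predicted):
    """Fraction of all samples classified correctly."""
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)

def macro_accuracy(observed, predicted):
    """Unweighted mean of the per-class accuracies."""
    totals, hits = defaultdict(int), defaultdict(int)
    for o, p in zip(observed, predicted):
        totals[o] += 1
        hits[o] += (o == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Made-up labels: a large class "H" and a small class "R*".
observed = ["H"] * 8 + ["R*"] * 2
predicted = ["H"] * 8 + ["H", "R*"]
micro = micro_accuracy(observed, predicted)   # 9 of 10 samples correct
macro = macro_accuracy(observed, predicted)   # mean of 8/8 and 1/2
```

Under these definitions a single error in a small class barely moves the micro figure but pulls the macro figure down sharply, which is the pattern one would expect from an unbalanced dataset.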

30
Discussion
- Unbalanced dataset
  ◦ Underrepresented haplogroups
  ◦ Overrepresented haplogroups
  ◦ Possibility: change weights / coarser Hgs
- Bootstrap sampling in RF
- Cross validation
- RF vs. SVM

31
Graph

32
Coarse Hg accuracy rates
Columns: Haplogroup Name | Sample Size | 1-NN Accuracy | RF Accuracy | SVM Accuracy
Surviving accuracy values per haplogroup:
HV*: 62.58%
L0/: 98.92%, 99.64%
N*: 49.21%, 52.38%
M*: 75.60%, 82.44%
U*: 95.83%, 96.55%
A: 98.66%, 100.00%
C: 98.70%, 99.13%
B: 95.62%, 97.64%
D: 89.64%, 88.60%
I: 98.11%, 97.89%
H: 98.09%, 98.41%
K: 99.37%
J: 100.00%
R: 64.71%, 67.65%
L: 99.07%, 99.69%
T:
W: 97.54%, 96.31%
V: 88.85%, 89.18%
X: 97.34%, 98.22%
R*: 10.00%, 11.67%
L3*: 95.68%
N1*: 95.94%, 97.97%
R0*: 94.55%

33
Conclusions
- RF: ?
  - Random sampling of variables
  - Random sampling of training cases (samples)
  - Repeated trials
- SVM vs. 1-NN
  - Deterministic models
  - SVM (5F-CV) > 1-NN (5F-CV)

34
Acknowledgements
- Advisor: Chih Lee
- Dr. Chun-Hsi Huang
- National Science Foundation REU grant CCF
- Univ. of Connecticut

35
Thank you! Any questions?
