Linear Reduction Method for Tag SNPs Selection
Jingwu He, Alex Zelikovsky


Outline
- SNPs, haplotypes and genotypes
- Haplotype tagging problem
- Linear reduction method for tagging
- Maximizing tagging separability
- Conclusions & future work

Human Genome and SNPs
- Length of the human genome ≈ 3 × 10^9 base pairs
- Difference between any two people ≈ 0.1% of the genome ≈ 3 × 10^6 SNPs
- Total number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7
- SNPs are mostly bi-allelic, e.g., alleles A and C
- Minor allele frequency should be considerable, e.g., > 1%
- Diploid = two different copies of each chromosome
- Haplotype = description of a single copy (0, 1)
- Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01)
- [Figure: two haplotypes per individual and the genotype for the individual]
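
The genotype encoding on this slide can be illustrated with a minimal sketch. The function name `genotype` and the example haplotypes are assumptions made for illustration; only the 0/1/2 coding comes from the slide.

```python
def genotype(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    Coding from the slide: 0 = 00, 1 = 11, 2 = 01 (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP sites")
    # Equal alleles keep their value; differing alleles are coded as 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Hypothetical pair of haplotypes for one individual.
h1 = [0, 1, 1, 0]
h2 = [0, 1, 0, 1]
print(genotype(h1, h2))  # [0, 1, 2, 2]
```

Note that the genotype loses phase information: sites coded 2 do not say which copy carries which allele.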

Haplotype and Disease Association
- Haplotypes/genotypes define our individuality
- Genetically engineered athletes might win at the Beijing Olympics (Time, 07/2004)
- Haplotypes contribute to risk factors of complex diseases (e.g., diabetes)
- International HapMap project:
  - SNPs causing disease are hidden among 10 million SNPs
  - Too expensive to search all of them
  - HapMap tries to identify 1 million tag SNPs providing almost as much mapping information as the entire 10 million SNPs

Tagging Reduces Cost
- Decrease SNP haplotyping cost:
  - sequence only a small number of SNPs = tag SNPs
  - infer the rest of the (certain) SNPs from the sequenced tag SNPs
- Cost-saving ratio = m / k (infinite population), where m is the number of SNPs and k is the number of tags
- Traditional tagging based on linkage disequilibrium (LD) needs too many SNPs; its cost-saving ratio is too small (≈ 2)
- Proposed linear reduction method: cost-saving ratio ≈ 20

Haplotype Tagging Problem
- Given: the full pattern of all SNPs for a sample
- Find: the minimum number of tag SNPs that allow reconstructing the complete haplotype of each individual

Linear Rank of Recombinations
- Human haplotype evolution =
  - Mutations: introduce SNPs
  - Recombinations: propagate SNPs over the entire population
- Replace notation (0, 1) with (-1, 1)
- Theorem: a haplotype population generated from l haplotypes with recombinations at k spots has linear rank (l - 1)(k + 2)
- This is much less than the number of all haplotypes = l^k
- Conclusion: use only linearly independent SNPs as tags
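
The low-rank claim can be checked empirically with a small simulation. This is a sketch under stated assumptions, not the authors' experiment: the founders are random, every haplotype copies each segment between the fixed recombination spots from a randomly chosen founder, and the helper names `matrix_rank` and `recombine`, the seed, and the sizes are all invented for illustration.

```python
import random
from fractions import Fraction


def matrix_rank(rows):
    """Rank over the rationals via Gaussian elimination (exact arithmetic)."""
    a = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(a[0])):
        pivot = next((r for r in range(rank, len(a)) if a[r][col] != 0), None)
        if pivot is None:
            continue
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, len(a)):
            if a[r][col] != 0:
                factor = a[r][col] / a[rank][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[rank])]
        rank += 1
        if rank == len(a):
            break
    return rank


def recombine(founders, spots, rng):
    """Build one haplotype by copying each segment from a random founder."""
    m = len(founders[0])
    bounds = [0, *sorted(spots), m]
    hap = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        hap.extend(rng.choice(founders)[start:end])
    return hap


rng = random.Random(0)          # fixed seed, chosen arbitrarily
l, k, m = 3, 2, 30              # founders, recombination spots, SNP sites
founders = [[rng.choice((-1, 1)) for _ in range(m)] for _ in range(l)]
spots = [10, 20]                # hypothetical recombination positions
population = [recombine(founders, spots, rng) for _ in range(60)]

rank = matrix_rank(population)
bound = (l - 1) * (k + 2)       # figure quoted in the theorem
print(rank, bound)
```

In this setup the rank stays within the quoted figure even though the population has 60 haplotypes over 30 sites, which is what makes a small set of linearly independent SNPs sufficient as tags.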

Tag SNPs Selection
- Tag selecting algorithm:
  - Using Gauss-Jordan elimination, find the row reduced echelon form (RREF) X of the sample matrix S
  - Extract the basis T of the sample S
  - Factorize the sample: S = T × X
  - Output the set of tags T
- Fact: in the sample, each SNP is a linear combination of tag SNPs
- Conjecture: in the entire population, each SNP is the same linear combination of tags as in the sample
- [Figure: sample S = tags T × RREF X]
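
The selection steps above can be sketched in plain Python. This is an illustrative reading of the slide, not the authors' code: it assumes the tags are the pivot columns found by Gauss-Jordan elimination, that X is the set of nonzero RREF rows, and that rows of S are haplotypes in the (-1, 1) notation. The names `rref`, `select_tags`, `matmul` and the toy sample are hypothetical.

```python
from fractions import Fraction


def rref(rows):
    """Return (reduced row echelon form, pivot column indices), exactly."""
    a = [[Fraction(x) for x in row] for row in rows]
    pivots, r = [], 0
    for col in range(len(a[0])):
        pivot = next((i for i in range(r, len(a)) if a[i][col] != 0), None)
        if pivot is None:
            continue
        a[r], a[pivot] = a[pivot], a[r]
        lead = a[r][col]
        a[r] = [x / lead for x in a[r]]          # scale pivot to 1
        for i in range(len(a)):                  # clear the rest of the column
            if i != r and a[i][col] != 0:
                factor = a[i][col]
                a[i] = [x - factor * y for x, y in zip(a[i], a[r])]
        pivots.append(col)
        r += 1
        if r == len(a):
            break
    return a, pivots


def select_tags(sample):
    """Pick tag SNP columns and factor the sample as S = T x X."""
    reduced, tag_cols = rref(sample)
    X = reduced[:len(tag_cols)]                  # nonzero RREF rows
    T = [[row[c] for c in tag_cols] for row in sample]
    return tag_cols, T, X


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


# Toy sample: 4 haplotypes x 4 SNPs; SNP 2 = -SNP 1 and SNP 3 = SNP 0.
S = [
    [1, 1, -1, 1],
    [1, -1, 1, 1],
    [-1, 1, -1, -1],
    [-1, -1, 1, -1],
]
tag_cols, T, X = select_tags(S)
print(tag_cols)              # [0, 1]
print(matmul(T, X) == S)     # True
```

Exact `Fraction` arithmetic is used so that the check S = T × X is not disturbed by floating-point rounding; a practical implementation on large samples would likely use a numerical library instead.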

Haplotype Reconstruction
- Given: tags t of an unknown haplotype h and the RREF X of the sample matrix S
- Find: the unknown haplotype h
- Predict h' = t × X
- Errors are possible, since the predicted h' may not equal the unknown haplotype h; we assign -1 if the predicted value is negative and +1 otherwise (RLRP)
- Variant: randomly reshuffle SNPs before choosing tags (RLR)
- [Figure: predicted haplotype h' = tags t of unknown haplotype h × RREF X]
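
The prediction step can be sketched as follows. The function name `reconstruct` and the small matrix X are assumptions made for illustration; the rule itself (compute t × X, then assign -1 to negative values and +1 otherwise) is the one stated on the slide.

```python
def reconstruct(tag_values, X):
    """Predict a full +/-1 haplotype from its tag values and RREF rows X."""
    # Raw prediction h' = t x X: one dot product per SNP column of X.
    raw = [sum(t * x for t, x in zip(tag_values, col)) for col in zip(*X)]
    # Rounding rule from the slide: negative -> -1, everything else -> +1.
    return [-1 if v < 0 else 1 for v in raw]


# Hypothetical RREF with two tag SNPs (columns 0 and 1) and four SNPs.
X = [
    [1, 0, 0, 1],
    [0, 1, -1, 0],
]
t = [1, -1]                  # observed tag values of an unknown haplotype
print(reconstruct(t, X))     # [1, -1, 1, 1]
```

Under this rule a raw value of exactly zero is mapped to +1, so SNPs whose RREF column carries little information about the tags are the ones most likely to be mispredicted.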

Results for Simulated Data
- Cost-saving ratio for 2% error: 3.9 for LR and 13 for RLRP
- P = 1000 different haplotypes
- m = 25000 sites
- Sample size = k (number of tag SNPs) = 50, 100, ..., 750

Results for Real Data
- Cost-saving ratio for 5% error: 2.1 for LR and 2.8 for RLRP
- P = 158 different haplotypes (Daly et al.)
- m = 103 sites
- Sample size = k (number of tag SNPs) = 10, 15, 20, ..., 90

Tag Separability
- There is a correlation between the number of zeros for SNPs in the RREF X and the number of errors in the predicted column
- A greedy heuristic gives a more separable basis
- For 5% error, cost-saving ratio 2.8 vs 3.3 for RLRP

Conclusions and Future Work
- Our contributions:
  - new SNP tagging problem formulation
  - linear reduction method for SNP tagging
  - enhancement of linear reduction using a separable basis
- Future work:
  - application of tagging to genotype and haplotype disease association

Thank you