Evaluation of the Haplotype Motif Model using the Principle of Minimum Description Srinath Sridhar, Kedar Dhamdhere, Guy E. Blelloch, R. Ravi and Russell.

Presentation transcript:

Evaluation of the Haplotype Motif Model using the Principle of Minimum Description Srinath Sridhar, Kedar Dhamdhere, Guy E. Blelloch, R. Ravi and Russell Schwartz Computer Science Dept, Tepper School of Business and Dept of Biological Sciences. Carnegie Mellon University.

Motivation Individual characteristics are linked to genetic factors. Model the structure of correlated genetic variation (haplotypes) in the DNA. Extract haplotype patterns (motifs) from the model. Perform association studies comparing motifs to disease susceptibility.

Single Nucleotide Polymorphism (SNP)
Rows: Individual samples
Columns: Nucleotides
  ACCTGTATACGTA
  ACATGTAGACGGA
  ACCTGTAGACGGA
  ACATGTATACGTA
  ACCTGTAGACGGA
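As a minimal sketch of how the sample matrix above reduces to SNP columns, the following Python finds the columns where the samples disagree and encodes each one as 0/1. The function names and the convention that the first row's allele is coded 0 are assumptions made for illustration, not part of the presentation.

```python
def polymorphic_columns(rows):
    """Return indices of columns in which at least two different bases occur."""
    return [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]


def to_binary_matrix(rows):
    """Encode every bi-allelic SNP column as 0/1.

    Assumption for this sketch: the allele seen in the first row is coded 0
    and the other allele is coded 1.
    """
    cols = polymorphic_columns(rows)
    for j in cols:
        if len({r[j] for r in rows}) != 2:
            raise ValueError(f"column {j} is not bi-allelic")
    matrix = [[0 if r[j] == rows[0][j] else 1 for j in cols] for r in rows]
    return cols, matrix


# The five sample rows shown on the slide.
samples = [
    "ACCTGTATACGTA",
    "ACATGTAGACGGA",
    "ACCTGTAGACGGA",
    "ACATGTATACGTA",
    "ACCTGTAGACGGA",
]
snp_cols, snp_matrix = to_binary_matrix(samples)
print(snp_cols)    # 0-based indices of the variable columns
print(snp_matrix)  # one 0/1 row per sample
```

Only the variable columns are kept, which is the kind of size reduction the SNP representation relies on.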


Evolution
Two types of events: mutation and recombination
Mutation (one strand of one chromosome shown):
  ACGTACCGTATATA → ACGTACTGTATATA
Recombination (one strand of two homologous chromosomes shown):
  ACGTACCGTATATA → ACGTACCGTACGTA
  GTACTACGTACGTA → GTACTACGTATATA
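The two event types can be sketched in a few lines of Python. The helper names are hypothetical, and the crossover position of 9 is an assumption: the slide does not state it, and several nearby positions give the same offspring because the two parents agree there.

```python
def point_mutation(seq, pos, base):
    """Replace the base at 0-based position `pos` with `base`."""
    return seq[:pos] + base + seq[pos + 1:]


def crossover(a, b, point):
    """Single crossover: the two sequences exchange their suffixes at `point`."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return a[:point] + b[point:], b[:point] + a[point:]


# Mutation example from the slide: C -> T at 0-based position 6.
mutated = point_mutation("ACGTACCGTATATA", 6, "T")

# Recombination example from the slide, with an assumed crossover point of 9.
child1, child2 = crossover("ACGTACCGTATATA", "GTACTACGTACGTA", 9)
print(mutated, child1, child2)
```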

Recombination [Figure: Ancestral Sequences; Current Population]

Comparison of blocks and motifs Blocks [Daly et al. 2000] Motifs [Schwartz 2003]

Minimum Description Length (MDL)
Let:
  M represent the parameters of the model
  I represent the input matrix
  E be the explanation of I using M
  L be the length of encoding
Objective:
  Minimize L(M) + L(E(I)|M)
Complicated models are penalized
Prevents over-fitting
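To make the objective L(M) + L(E(I)|M) concrete, here is a toy two-part code in Python. The encoding is an assumption chosen for illustration (one bit per column to list each distinct pattern, and a Shannon code length of -log2 p to name each row's pattern); it ignores the cost of block boundaries and probabilities and is not the encoding used in the presentation.

```python
import math
from collections import Counter


def block_description_length(rows):
    """Toy two-part MDL cost, in bits, of treating `rows` as a single block."""
    counts = Counter(rows)
    n_cols = len(rows[0])
    # L(M): list every distinct pattern explicitly, one bit per column.
    model_bits = len(counts) * n_cols
    # L(E(I)|M): name the pattern of each row with a code of length -log2(p).
    total = len(rows)
    data_bits = sum(-c * math.log2(c / total) for c in counts.values())
    return model_bits + data_bits


rows = ["0000", "0011", "1100", "1111"]

# One block covering all four columns.
one_block = block_description_length(rows)

# Two blocks: columns 0-1 and columns 2-3, costed independently.
two_blocks = (block_description_length([r[:2] for r in rows])
              + block_description_length([r[2:] for r in rows]))

print(one_block, two_blocks)
```

In this toy input the two halves vary independently, so the two-block model needs fewer patterns and yields the shorter total description; a model with more parameters than the data justify pays for them in L(M).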

Dynamic Program - Blocks Dynamic Program [Koivisto et al. 2003], where C(j+1, i) is the cost of creating a single block from column j+1 to column i. Running time: O(n^2) Work space: O(n)
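The recurrence itself is not preserved in the transcript. A standard segmentation recurrence consistent with the definition of C(j+1, i) is best(i) = min over j < i of best(j) + C(j+1, i), with best(0) = 0. The Python below is a sketch under that assumption; `best_block_partition` and `toy_cost` are hypothetical names, and the toy cost function merely stands in for the MDL cost of a block.

```python
def best_block_partition(n, cost):
    """Partition columns 1..n into consecutive blocks of minimum total cost.

    `cost(a, b)` plays the role of C(a, b): the cost of one block spanning
    columns a..b (1-based, inclusive). Uses O(n^2) cost evaluations and
    O(n) table space.
    """
    best = [0.0] + [float("inf")] * n   # best[i]: optimal cost for columns 1..i
    back = [0] * (n + 1)                # back[i]: end column of the previous block
    for i in range(1, n + 1):
        for j in range(i):
            candidate = best[j] + cost(j + 1, i)
            if candidate < best[i]:
                best[i], back[i] = candidate, j
    # Walk the back-pointers to recover the block boundaries.
    blocks, i = [], n
    while i > 0:
        blocks.append((back[i] + 1, i))
        i = back[i]
    return best[n], blocks[::-1]


def toy_cost(a, b):
    """Illustrative cost only: a fixed overhead plus a penalty on block length."""
    length = b - a + 1
    return 3 + length ** 2


total_cost, blocks = best_block_partition(4, toy_cost)
print(total_cost, blocks)
```

The quadratic running time assumes each cost evaluation is constant time, for example via precomputed tables.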

Expectation-Maximization Algorithm - Motifs
1. Create a DAG of all possible motifs with a 'start' vertex
2. Initialize probabilities
3. For each EM iteration
   i. For each row r in R
      a. In the sub-graph corresponding to r, find the ML path from start
   ii. Re-normalize probabilities based on the number of times the vertices were used in ML paths
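The steps above can be sketched as a hard-EM loop in Python. Several details are assumptions made for this sketch rather than the authors' implementation: a motif is represented as a (start column, pattern) pair, candidate motifs are all row substrings up to `max_len` columns, probabilities are normalized among motifs sharing a start column, and the ML path is found as a shortest path under -log2 p edge costs.

```python
import math
from collections import defaultdict


def candidate_motifs(rows, max_len):
    """All (start, pattern) substrings of the rows, up to max_len columns."""
    n = len(rows[0])
    return {(s, r[s:s + k]) for r in rows for s in range(n)
            for k in range(1, min(max_len, n - s) + 1)}


def normalize(weights):
    """Scale weights so motifs sharing a start column sum to 1; drop zeros."""
    totals = defaultdict(float)
    for (start, _), w in weights.items():
        totals[start] += w
    return {m: w / totals[m[0]] for m, w in weights.items() if w > 0}


def ml_path(row, probs):
    """Most likely tiling of `row` by motifs (shortest path in -log2 p)."""
    n = len(row)
    best = [0.0] + [math.inf] * n
    back = [None] * (n + 1)
    for pos in range(n):
        if best[pos] == math.inf:
            continue
        for (start, pat), p in probs.items():
            end = start + len(pat)
            if start == pos and row[start:end] == pat:
                cost = best[pos] - math.log2(p)
                if cost < best[end]:
                    best[end], back[end] = cost, (start, pat)
    path, pos = [], n
    while pos > 0:
        path.append(back[pos])
        pos = back[pos][0]
    return path[::-1]


def hard_em(rows, max_len=3, iterations=10):
    """Alternate ML-path search and re-normalization of usage counts."""
    probs = normalize({m: 1.0 for m in candidate_motifs(rows, max_len)})
    for _ in range(iterations):
        counts = defaultdict(float)
        for row in rows:
            for motif in ml_path(row, probs):
                counts[motif] += 1.0
        probs = normalize(counts)
    return probs


rows = ["0000", "0011", "1100", "1111"]
probs = hard_em(rows, max_len=2)
print(probs)
```

Because unused motifs are dropped at each re-normalization, this sketch can lock in early choices; the small additive constants mentioned among the heuristics address exactly that.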

Example

Example - Re-normalize

Heuristics
EM finds P_1 and P_2 but cost(P_1) + cost(P_2) ≥ cost(P'_1 ∪ P'_2)
Use knowledge from previous EM iteration
Multiple shortest paths with weight: (1+ε)^(-cost(P))
Addition of small constants to prevent zero probability in first few iterations
Initialize the probabilities to favor smaller motifs
Restrict maximum length of motifs
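The path weighting (1+ε)^(-cost(P)) can be illustrated with a short Python sketch. The function name, the default ε, and the final normalization so that the weights sum to 1 are assumptions for illustration; the slide only gives the unnormalized weight.

```python
def path_weights(costs, eps=0.1):
    """Weight each candidate path by (1 + eps) ** (-cost), then normalize.

    Cheaper paths receive more weight; a smaller eps spreads the weight
    more evenly across near-optimal paths.
    """
    raw = [(1.0 + eps) ** (-c) for c in costs]
    total = sum(raw)
    return [w / total for w in raw]


# Two candidate paths with costs 1 and 2, using eps = 1 for easy arithmetic:
# raw weights are 0.5 and 0.25, so the normalized weights are 2/3 and 1/3.
weights = path_weights([1, 2], eps=1.0)
print(weights)
```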

Experimental Results
Columns: Num Seqs / Recomb Rate | Num SNPs | Desc Length, Motifs (in bits) | Desc Length, Blocks (in bits)
Rows: 100/low, /low, /low, /low, /high, /high, /high, /high
Simulated data using the ms program [Hudson, 2002]

Experimental Results

[Figure panels: Motifs: High recombination; Blocks: High recombination; Motifs: Low recombination; Blocks: Low recombination]

Conclusion
Characterized the problem of inferring haplotype structure as an optimization problem that is robust against over-fitting.
The haplotype motif model captures the structure better than haplotype blocks.
Furthermore, the motif method performs progressively better with larger input size.

Discussion & Future Work
Extensions:
  Polynomial time algorithm/NP-hardness
  Clustering and error models
  Real data – recombination hot-spots
Future directions [Diagram labels: Genotype data; Haplotype data; Motifs/Blocks/?; Disease analysis/Drug design; phasing; direct optimization; htSNP, association tests, ?; current work]

Encoding Motifs
Let s_i be the start locations of motifs
Let t_{i,j} be the number of motifs that start at i and end at j
Let E_i = {e_{i,1}, …, e_{i,k}} be the ordered set of end locations for motifs that start at i
Cost for encoding model:
Additional cost for encoding motif probabilities

Explanation Explanation of a row: specify the ordered set of block haplotypes that produce the bits of the row Cost for explanation of row r : Cost for explanation:

Human Genetic Structure Chromosomes in the nucleus of cells 23 pairs of chromosomes Double helix structure of chromosomes Chromosomes: Genes and inter-genic regions Genes: Encode for proteins

Single Nucleotide Polymorphism (SNP) Human genomes are very similar SNP: Single base with high probability of variation Bi-allelic: Two out of four possible nucleotides In humans reduction in size ~ × 300

Encoding Blocks
Let:
  s_i represent the start columns of blocks
  t_i represent the number of blocks starting at s_i
Cost of encoding Model:
Additionally, encoding for probabilities for block haplotypes

Encoding Blocks Explanation of a row: specify the ordered set of block haplotypes that produce the bits of the row Cost for explanation of row r : Cost for explanation:

DNA Building blocks (nucleotides): Adenine(A), Cytosine(C), Guanine(G) and Thymine(T) Adenine(A) pairs with Thymine(T) Cytosine(C) pairs with Guanine(G)

Haplotypes
Contiguous regions of correlated genetic variation
Two models: Blocks and Motifs
Blocks:
  Popular and widely assumed [Daly et al. 2000]
  Boundary aligned 'block haplotypes'
Motifs:
  Recently introduced [Schwartz 2003]
  Overlapping 'haplotype motifs'

Comparison of Blocks and Motifs Two models: Haplotype blocks[Daly et al. 2000] and haplotype motifs [Schwartz 2003]
