Reconstructing the Evolutionary History of Complex Human Gene Clusters


Reconstructing the Evolutionary History of Complex Human Gene Clusters. Y. Zhang, G. Song, T. Vinar, E. D. Green, A. Siepel, W. Miller. RECOMB 2008. Speaker: Fabio Vandin

Outline: Motivation; Problem definition; Simple algorithm for simple model; SIS algorithm for complex model; Results on human genome; Evaluation of SIS algorithm

Gene cluster: a group of related genes. Probably formed by duplications, which are followed by functional diversification and which, together with deletions, cause human genetic diseases.

“History” of a gene cluster? [Figure: percentage of similarity in self-alignment in the human genome; values shown: 80%, 85%, 93%, 89%, 98%]

Dot-plots of self-alignments Human UGT2 cluster

Data Preparation. Self-alignment (forward and reverse-complement): pairs A ≈ B. Transitive closure property: A ≈ B, B ≈ C ⇒ A ≈ C. Maximize each alignment. Atomic segments: Collapsible segments:
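
The transitive closure property can be illustrated with a small union-find sketch. This is not the authors' data-preparation code: the function name `transitive_closure` and the use of plain string labels for aligned regions are assumptions made for illustration only.

```python
from collections import defaultdict


def transitive_closure(pairs):
    """Group labels into equivalence classes, given aligned pairs A ≈ B.

    Hypothetical helper: labels stand in for aligned regions.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the classes of every aligned pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each class under its representative.
    classes = defaultdict(set)
    for x in list(parent):
        classes[find(x)].add(x)
    return sorted(sorted(members) for members in classes.values())


# A ≈ B and B ≈ C put A, B and C in the same class, so A ≈ C follows.
groups = transitive_closure([("A", "B"), ("B", "C"), ("D", "E")])
print(groups)
```

Union-find is only one convenient way to obtain the classes; any graph connected-components routine would give the same grouping.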

Computational Problem. Input: a sequence of atomic (non-collapsible) segments. Output: the (most probable) sequence of events (or the number of events) such that unwinding these events in the input sequence yields a sequence containing only a single atomic segment.

Simple Model. Assumptions: (1) Only duplications (possibly with reversal and tandem duplications). (2) No duplication inside the originating region. (3) The most important:

Simple Model (2). Given assumptions 1 and 2, each event involves “identical” (possibly reversed) regions of consecutive atomic segments (one region copied to the other). Problem statement
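
A minimal sketch of a single duplication event and of unwinding it. It assumes a representation that is not given on the slides: atomic segments are signed integers (the sign is the orientation), and the helpers `duplicate` and `unwind` are hypothetical names. The check on `target` mirrors assumption 2 (no duplication inside the originating region).

```python
def reverse_complement(region):
    """Reverse the order of the segments and flip each orientation."""
    return [-s for s in reversed(region)]


def duplicate(seq, start, end, target, reverse=False):
    """Copy seq[start:end] and insert the copy before index `target`."""
    if not (0 <= start < end <= len(seq)):
        raise ValueError("empty or out-of-range source region")
    if not (0 <= target <= len(seq)):
        raise ValueError("target out of range")
    if start < target < end:
        # Assumption 2: the copy may not land inside the originating region.
        raise ValueError("target lies inside the originating region")
    copy = seq[start:end]
    if reverse:
        copy = reverse_complement(copy)
    return seq[:target] + copy + seq[target:]


def unwind(seq, target, length):
    """Undo a duplication by deleting the inserted copy."""
    return seq[:target] + seq[target + length:]


seq = [1, 2, 3]
after = duplicate(seq, 0, 2, 3, reverse=True)   # [1, 2, 3, -2, -1]
restored = unwind(after, 3, 2)                  # [1, 2, 3]
tandem = duplicate(seq, 0, 2, 2)                # [1, 2, 1, 2, 3]
```

Inserting at `target == end` gives a tandem duplication, which assumption 1 explicitly allows.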

Simple Model (3) Candidate definition Is this definition reasonable?

Simple Model (4). Algorithm. Analysis

Simple Model: limitations. Assumptions: “large scale deletions are likely to occur”; “atomic boundary reuses violating assumption are not uncommon”. Same number of events, but multiple ways of reconstructing the history. NB: not solved with SIS, but you can assign a probability to each history.

Stochastic Model. Event: Duplication (possibly with reversal); Deletion (with restrictions). History: Target distribution of histories: is the number of reused atomic boundaries and

Algorithm for Stochastic Model Sample histories from the target distribution and compute the mean value (of the function we are interested in): sample from given , estimate e.g., gives the number of events Problem: how to sample from the target distribution?
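
The mean-value estimate described on this slide has the usual Monte Carlo form. The symbols below are assumptions, since the slide's own notation did not survive in the transcript: $S$ is the observed sequence of atomic segments, $\pi(\cdot \mid S)$ the target distribution of histories, $f$ the function of interest (for example, the number of events in a history), and $m$ the number of sampled histories.

```latex
\hat{\mu} \;=\; \frac{1}{m} \sum_{i=1}^{m} f\bigl(H^{(i)}\bigr),
\qquad H^{(i)} \sim \pi(\,\cdot \mid S), \quad i = 1, \dots, m
```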

Sequential Importance Sampling (SIS) Goal: compute Target distribution: Trial distribution: Sample’s weight: Output:
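
A toy sketch of sequential importance sampling in its standard textbook form, not the gene-cluster sampler from the talk. Each sample is built one step at a time from a trial distribution, its weight is the running product of (unnormalised target factor) / (trial probability), and the output is the weight-normalised mean. The binary toy target, `step_factor`, and all other names are assumptions chosen so that the estimate can be checked against exact enumeration.

```python
import itertools
import random


def step_factor(prev, cur):
    """Unnormalised target factor for one step (toy choice: favour runs)."""
    return 2.0 if cur == prev else 1.0


def sis_estimate(h, n_steps=5, n_samples=20000, seed=0):
    """Self-normalised SIS estimate of the target mean of h(x)."""
    rng = random.Random(seed)
    weighted_sum = 0.0
    weight_total = 0.0
    for _ in range(n_samples):
        x, prev, w = [], 0, 1.0
        for _step in range(n_steps):
            cur = rng.randint(0, 1)                 # trial: uniform, q = 0.5
            w *= step_factor(prev, cur) / 0.5       # incremental weight update
            x.append(cur)
            prev = cur
        weighted_sum += w * h(x)
        weight_total += w
    return weighted_sum / weight_total


def exact_mean(h, n_steps=5):
    """Exact target mean of h by enumerating all 2**n_steps sequences."""
    num = den = 0.0
    for x in itertools.product((0, 1), repeat=n_steps):
        g, prev = 1.0, 0
        for cur in x:
            g *= step_factor(prev, cur)
            prev = cur
        num += g * h(x)
        den += g
    return num / den


estimate = sis_estimate(sum)   # h = number of ones in the sequence
exact = exact_mean(sum)
print(estimate, exact)
```

Normalising by the sum of weights means the target only needs to be known up to a constant, which is the usual reason for choosing this form of the estimator.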

SIS for gene cluster history Duplication

SIS for gene cluster history (2). Deletion: only without atomic boundary reuse; only if “the atomic segment pair flanking a deletion site appears elsewhere” in the sequence.

Application to Human Gene Clusters. Human genome assembly hg18. Self-alignment: 457 duplicated regions. Alignments: at least 500 bp, >= 70% identity; segments separated by no more than 500 Kbp; only long and non-trivial regions considered. 165 biomedically interesting clusters (~111 Mbp). 5 divergence thresholds: GA, OWM, NWM, LG, DOG. 825 combinations of gene cluster and divergence threshold.
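
The alignment criteria and the count of combinations can be written down as a short sketch. The function `keep_alignment` and its argument names are hypothetical; only the numeric thresholds and the list of divergence thresholds come from the slide.

```python
MIN_LENGTH_BP = 500            # "at least 500 bp"
MIN_IDENTITY_PCT = 70.0        # ">= 70% identity"
MAX_SEPARATION_BP = 500_000    # "no more than 500 Kbp"


def keep_alignment(length_bp, identity_pct, separation_bp):
    """Return True if a self-alignment passes all three criteria."""
    return (
        length_bp >= MIN_LENGTH_BP
        and identity_pct >= MIN_IDENTITY_PCT
        and separation_bp <= MAX_SEPARATION_BP
    )


divergence_thresholds = ["GA", "OWM", "NWM", "LG", "DOG"]
n_clusters = 165
n_combinations = n_clusters * len(divergence_thresholds)   # 165 * 5 = 825
```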

Application to Human Gene Clusters (2)

Application to Human Gene Clusters (3)

Application to Human Gene Clusters (4) “… help prioritize the selection of notably interesting gene clusters for more detailed comparative genomics studies” “… to compare cluster dynamics in certain lineages to observed phenotypic differences among primates” “… such sequence data should reveal differences among primate species of possible relevance for selecting species for further biomedical studies”

Evaluation of the SIS Algorithm. Estimate parameters from the human genome for the events (e.g., duplications): 39% of duplications are “reversed”; deletions = 2% of duplications. Starting from a 500 Kbp sequence, generate 10 gene clusters for each N = 10, 20, …, 100 events.
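
A rough sketch of how such test clusters could be generated. Only the 39% reversal rate and the 2% deletion-to-duplication ratio come from the slide. Everything else is an assumption made for illustration: the sequence is a list of signed unit labels rather than base pairs, event lengths and positions are drawn uniformly, the deletion restrictions of the stochastic model are not enforced, and `simulate_cluster` is a hypothetical name.

```python
import random

P_REVERSED = 0.39              # share of duplications that are reversed
P_DELETION = 0.02 / 1.02       # deletions = 2% of duplications, as a share of all events


def simulate_cluster(n_events, n_units=500, max_len=50, seed=0):
    """Apply n_events random duplications/deletions to a fresh sequence."""
    rng = random.Random(seed)
    seq = list(range(1, n_units + 1))      # signed unit labels; sign = orientation
    history = []
    for _ in range(n_events):
        length = rng.randint(1, min(max_len, len(seq)))
        start = rng.randint(0, len(seq) - length)
        end = start + length
        if rng.random() < P_DELETION and len(seq) > length:
            seq = seq[:start] + seq[end:]
            history.append(("deletion", start, end))
            continue
        # Insertion point anywhere except strictly inside the source region.
        targets = [k for k in range(len(seq) + 1) if not start < k < end]
        target = rng.choice(targets)
        copy = seq[start:end]
        reverse = rng.random() < P_REVERSED
        if reverse:
            copy = [-u for u in reversed(copy)]
        seq = seq[:target] + copy + seq[target:]
        history.append(("duplication", start, end, target, reverse))
    return seq, history


# Ten replicates for each N = 10, 20, ..., 100, as in the evaluation setup.
replicates = {
    n: [simulate_cluster(n, seed=rep) for rep in range(10)]
    for n in range(10, 101, 10)
}
seq, history = simulate_cluster(20)
```

Keeping the true `history` next to each simulated sequence is what allows an inferred number of events to be compared with the number actually applied.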

Evaluation of the SIS Algorithm (2)

Discussion. Main limitations: details of data preparation; choice of the parameters/distributions; evaluation of the SIS algorithm. Future directions: include other types of events; understand the stochastic model; how to evaluate a model? how to evaluate an algorithm?