Genome Rearrangements in Evolution and Cancer
Guillaume Bourque, Genome Institute of Singapore
HKU-Pasteur Research Centre - Hong Kong, August 28th, 2009
2 Outline Genome Rearrangements in Evolution [ ??? ] Cancer genomics
4 High hopes Explain the physical clustering of gene families (regulation, editing or retention). Understand whether even longer linkage associations were preserved by chance or by selection (developmental or functional). Resolve the mammalian phylogeny using genomic segment exchanges as characters. Discover molecular fossils of precipitous genomic events. Identify genetic determinants of reproductive isolation, adaptation, survival and species formation. (O’Brien et al., Science 1999)
5 Comparing 2 sequences GGCACAAATCCAAATCCAAATCCGGGTTGGGGTTGGGGTTGGGGTTGCGACACATTTGGCCTGTCGTCGTCCGTCGTC GGCACAAATCCAAATCCAAATCCAATGTGTCGCAACCCCAACCCCAACCCCAACCCTGGCCTGTCGTCGTCCGTCGTC Need to reverse complement
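The two sequences on this slide share their flanking regions, and the middle segment of one is the reverse complement of the middle segment of the other. A minimal Python sketch of that check follows. The function names and the trick of trimming the longest shared prefix and suffix are illustrative assumptions, not something the talk specifies.

```python
# Complement of each DNA base (uppercase A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def split_on_shared_flanks(a: str, b: str):
    """Trim the longest common prefix and suffix.

    Returns (prefix, middle_of_a, middle_of_b, suffix). This greedy trimming
    is a simplification: a chance match at a boundary would shift the split.
    """
    n = min(len(a), len(b))
    p = 0
    while p < n and a[p] == b[p]:
        p += 1
    s = 0
    while s < n - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1
    return a[:p], a[p:len(a) - s], b[p:len(b) - s], a[len(a) - s:]


# The two sequences shown on the slide.
seq_a = "GGCACAAATCCAAATCCAAATCCGGGTTGGGGTTGGGGTTGGGGTTGCGACACATTTGGCCTGTCGTCGTCCGTCGTC"
seq_b = "GGCACAAATCCAAATCCAAATCCAATGTGTCGCAACCCCAACCCCAACCCCAACCCTGGCCTGTCGTCGTCCGTCGTC"

prefix, mid_a, mid_b, suffix = split_on_shared_flanks(seq_a, seq_b)
# True when the differing middle segments are reverse complements of each other.
is_inversion = reverse_complement(mid_a) == mid_b
print(mid_a, mid_b, is_inversion)
```

Working on both strands this way is what lets an inverted segment be recognised as the same block with the opposite sign.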
6 If you have 3 sequences… [Figure: pairwise comparisons Seq_1 vs Seq_2, Seq_1 vs Seq_3 and Seq_2 vs Seq_3]
Seq_1 : 1 -2 3 4 5
Seq_2 : 1 2 3 -4 5
Seq_3 : 1 2 3 4 5
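Once each genome is written as a signed order of blocks, a reversal can be modelled as reversing a run of blocks and flipping their signs. The sketch below is an illustration only: the function name and the 0-based indexing are assumptions. It shows how the three orders on this slide relate to one another.

```python
from typing import List


def reverse_segment(genome: List[int], i: int, j: int) -> List[int]:
    """Apply a reversal to blocks i..j (0-based, inclusive).

    The order of the blocks is reversed and each sign is flipped,
    mirroring a reverse complement at the sequence level.
    """
    flipped = [-block for block in reversed(genome[i:j + 1])]
    return genome[:i] + flipped + genome[j + 1:]


seq_1 = [1, -2, 3, 4, 5]
seq_2 = [1, 2, 3, -4, 5]
seq_3 = [1, 2, 3, 4, 5]

# A single-block reversal of block 2 turns Seq_3 into Seq_1.
print(reverse_segment(seq_3, 1, 1) == seq_1)
# A single-block reversal of block 4 turns Seq_3 into Seq_2.
print(reverse_segment(seq_3, 3, 3) == seq_2)
# Seq_1 and Seq_2 are therefore two single-block reversals apart.
print(reverse_segment(reverse_segment(seq_1, 1, 1), 3, 3) == seq_2)
```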
15 Overview of the Results Nearly 20% of chromosome breakpoint regions were reused. Gene density is higher in evolutionary breakpoint regions. Segmental duplications (SDs) populate the majority of primate-specific breakpoints.
19 Recovering true ancestral events
Analyses of genome rearrangements are typically evaluated on:
–Quality of the ancestral reconstructions
–Ability to recover the correct topology
–Total number of rearrangements in the scenario recovered (parsimony)
We decided to focus on the accuracy of the rearrangements recovered.
Start by measuring accuracy using simulations and then apply the approach to real data sets.
Why?
–Look for events that could have been involved in speciation
–Look at sequence features associated with these events (e.g. repeats, genes, etc.)
–Gain mechanistic insights into genome rearrangements
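In a simulation the true events are known, so the recovered events can be scored against them directly. The sketch below is one possible way to do that. The talk does not define its accuracy measures, so both are assumptions here: sensitivity is taken as the fraction of true events recovered, and precision as the fraction of predicted events that are true. The event tuples are made up for illustration.

```python
from typing import Hashable, Iterable, Tuple


def score_events(true_events: Iterable[Hashable],
                 predicted_events: Iterable[Hashable]) -> Tuple[float, float]:
    """Return (sensitivity, precision) of predicted events against the truth.

    Events are compared by exact equality, which is a simplifying assumption.
    """
    truth, predicted = set(true_events), set(predicted_events)
    hits = len(truth & predicted)
    sensitivity = hits / len(truth) if truth else 0.0
    precision = hits / len(predicted) if predicted else 0.0
    return sensitivity, precision


# Hypothetical simulated scenario: (event type, first block, last block).
simulated = {
    ("inversion", 2, 2),
    ("inversion", 4, 4),
    ("inversion", 6, 9),
    ("transposition", 11, 12),
}
# Hypothetical output of a reconstruction method on the same data.
recovered = {("inversion", 2, 2), ("inversion", 4, 4)}

print(score_events(simulated, recovered))
```

A method that predicts few events, all of them correct, scores high on the second measure even when the first is modest.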
20 EMRAE :: Efficient Method to Recover Ancestral Events Relies on adjacencies conserved in a significant fraction of the genomes. Combines conserved adjacencies (and nearly conserved adjacencies) to predict rearrangement events. Applicable to uni- and multi-chromosomal genomes. Currently models inversions, translocations, fusions, fissions and transpositions, and is also amenable to insertions and deletions. Achieves high specificity with comparable sensitivity.
21 Conserved adjacencies Define an adjacency $a(c_i, c_{i+1})$ as an ordered pair of integers $c_i\ c_{i+1}$, or its inverse $-c_{i+1}\ -c_i$, found in a given genome. For a given edge $e$ that splits the genomes into two sets $S_A$ and $S_B$, if the adjacency $a$ is found in every genome of $S_A$ but not in any genome of $S_B$, we say that $a$ is a conserved adjacency of $S_A$.
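The definition above translates fairly directly into set operations. The following Python sketch assumes each genome is a single linear chromosome given as a list of signed integers, and it ignores chromosome ends. The function names are illustrative and are not taken from EMRAE itself.

```python
from typing import List, Sequence, Set, Tuple

Adjacency = Tuple[int, int]


def canonical(a: int, b: int) -> Adjacency:
    """Treat the pair (a, b) and its inverse (-b, -a) as the same adjacency."""
    return min((a, b), (-b, -a))


def adjacencies(genome: List[int]) -> Set[Adjacency]:
    """All adjacencies of consecutive blocks in one linear chromosome."""
    return {canonical(a, b) for a, b in zip(genome, genome[1:])}


def conserved_adjacencies(group_a: Sequence[List[int]],
                          group_b: Sequence[List[int]]) -> Set[Adjacency]:
    """Adjacencies found in every genome of group_a and in no genome of group_b.

    group_a must be non-empty.
    """
    in_all_a = set.intersection(*(adjacencies(g) for g in group_a))
    in_any_b = set().union(*(adjacencies(g) for g in group_b))
    return in_all_a - in_any_b


# Toy genomes; the split into two groups stands in for one tree edge.
seq_1 = [1, -2, 3, 4, 5]
seq_2 = [1, 2, 3, -4, 5]
seq_3 = [1, 2, 3, 4, 5]

# Adjacencies specific to Seq_1: those on either side of the inverted block 2.
print(conserved_adjacencies([seq_1], [seq_2, seq_3]))
```

Storing each adjacency in a canonical orientation means a genome read from the opposite strand yields the same set.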
27 Human-specific breakpoints are enriched in SDs Human-specific breakpoint regions are significantly enriched in SDs as compared to size-matched random regions (p-value < 0.001). Indeed, 93.2% of the human-specific breakpoint regions (69 out of 74) contain SDs. This is true for only approximately 60% of size-matched random regions.
28 Homologous matching pairs of SDs are enriched in human-specific breakpoints Among the 74 human-specific breakpoints identified in this study, we observed 100 pairs of regions with matching pairs of SDs, compared with an average of 25 pairs in the random simulated data sets.
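Comparing an observed count with counts from random simulated data sets is commonly summarised as an empirical p-value. The talk does not say how its p-values were computed, so the sketch below is only one standard way to do it, using a one-sided test with a +1 correction. The simulated counts are generated artificially and are not the study's data.

```python
import random
from typing import Sequence


def empirical_p_value(observed: float, simulated: Sequence[float]) -> float:
    """One-sided empirical p-value: how often a simulation is at least as extreme.

    The +1 in numerator and denominator avoids reporting a p-value of zero.
    """
    at_least_as_extreme = sum(1 for value in simulated if value >= observed)
    return (at_least_as_extreme + 1) / (len(simulated) + 1)


# Hypothetical stand-in for counts from size-matched random regions.
rng = random.Random(0)
simulated_counts = [rng.randint(15, 35) for _ in range(9999)]

# Observed count taken from the slide (100 matching pairs).
print(empirical_p_value(100, simulated_counts))
```

With this estimator the smallest reportable p-value is 1 / (number of simulations + 1), so the number of simulated data sets bounds how strong a claim can be made.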
29 Primate reversals are associated with SDs The average percent identity of the SDs that are associated with reversals correlates with the relative age of these events. This helps confirm the direct link between SDs and many rearrangement events.
30 Extension from primate-specific reversals to all the predicted mammalian reversals We used BLAST to detect homology between breakpoints of the predicted reversals. Many reversals are flanked by regions of high sequence identity (BLAST score >1000). If not SDs, what?
31 Homology flanking mammalian reversals We found that 58%, 29%, 24%, 42%, 47% and 20% of the human, chimp, rhesus, rat, mouse and dog reversals, respectively, are supported by regions with BLAST scores greater than 1000. What is the source of this homology? Is it expected? We restricted our analysis to the reversals with breakpoints defined within 100 kb and assessed the overlap between these regions of homology and repeats. We assigned each reversal to a particular repeat family when the overlap between the homologous segment identified and a repeat instance was greater than 50%, and compared the results to matched simulated data sets.
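The 50% overlap rule can be sketched as a small interval computation. The slide does not say what the 50% is measured against, so this sketch assumes it is the fraction of the homologous segment covered by the repeat. Coordinates are half-open intervals on a single chromosome, and the repeat annotations in the example are invented.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def overlap_fraction(segment: Interval, repeat: Interval) -> float:
    """Fraction of `segment` that is covered by `repeat`."""
    start = max(segment[0], repeat[0])
    end = min(segment[1], repeat[1])
    length = segment[1] - segment[0]
    return max(0, end - start) / length if length > 0 else 0.0


def annotate_family(segment: Interval,
                    repeats: List[Tuple[Interval, str]],
                    threshold: float = 0.5) -> Optional[str]:
    """Family of the best-overlapping repeat, if it covers more than `threshold`."""
    best_family, best_fraction = None, 0.0
    for interval, family in repeats:
        fraction = overlap_fraction(segment, interval)
        if fraction > best_fraction:
            best_family, best_fraction = family, fraction
    return best_family if best_fraction > threshold else None


# Hypothetical homologous segment and repeat annotations.
segment = (1000, 2000)
repeats = [((900, 1700), "L1"), ((1900, 2100), "Alu")]
print(annotate_family(segment, repeats))
```

Running the same annotation on matched simulated segments gives the background rate against which each repeat family can be compared.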
35 Data Explosion
Sequencing is no longer the rate-limiting step.
This year, we expect:
–2X increase in CPU
–2X increase in memory
–10X increase in sequencing (estimate from Illumina and SOLiD) or even 100X increase (Helicos, Complete Genomics, etc.)
Informatics challenges that we face now will only grow…
37 Paradigm Shift
Things that are out:
–Storing all primary data (images)
–“All versus all” types of analysis
–Single large repository (NCBI)
–Careless data management (duplicated files, extra transferring steps, etc.)
Things that are in:
–Clusters and high performance storage
–Cloud computing
–Careful data management & planning
–Bioinformaticians & IT engineers (even for relatively small labs)
38 Sequencing Human Genomes The Human Genome (2001): $$$$$$. 1000 Genomes Project (2009): $$$. Your Genome (2011?): $.
39 New opportunities… In the study of … Evolution, Populations, Cancer
40 Outline Genome Rearrangements in Evolution [ ??? ] Cancer genomics
52 Acknowledgments
From my group:
–Zhao Hao, Chi Ho Lin, Johni Masli (NUS)
–Galih Kunarso, Justin Jeyakani
–Woo Xing Yi, Kelson Zawack
With the help of:
–Yijun Ruan, Yao Fei, Axel Hillmer, Chia-Lin Wei
–Charlie Lee, Pramila Ariyaratne, Ken Sung
–Ed Liu
–Jian Ma (UCSC), Pavel Pevzner and Glenn Tesler (UCSD)
–GIS and A*STAR for financial support