Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept.

Presentation transcript:

Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept. of Math and Statistics University of Ottawa Student Seminar Series Jan 20, 2006

2 The complete genetic material of an organism or species The Genome

3 Key genomic component: genes Genes encode proteins, the building blocks of the cell ACCCTTAGCTAGACCTTTAGGAGG... A gene is a DNA subsequence

4 Comparing Genomes. Species compared: Human, Mouse, Fly, Rice, E. coli, Chlamydia. Table rows: Chromosomes, Genes. Surviving gene counts: 20-25k, 13.6k, ~40k. Timescale label: 75 million years.

5 Human Chromosome 21 is broken into at least three pieces in mouse Accidental duplication of chromosome 21 causes Down Syndrome

6 Outline  Evolution of genome organization Why identify related genomic regions? How do we find them? Identification: Formal cluster definition Validation: Testing cluster significance

7 A simple model of a chromosome an ordered list of genes

8 What are the processes of genomic change?

9 A single species:

10 Speciation 1. Initially the two populations have identical genomes 2. The populations evolve independently 3. Eventually, there will be two new species with similar but distinct genomes

11 Types of Genomic Rearrangements Inversions Duplications/Insertions Loss Species 2

12 Types of Genomic Rearrangements Chromosomal fissions and fusions Species 2

13 Genome Comparison Species 2 Species 1 Our goal: identify chromosomal regions that descended from the same region in the genome of the common ancestor

14 Outline Evolution of genome organization  Why identify related genomic regions? How do we find them? Identification: Formal cluster definition Validation: Testing cluster significance

15 Genome Annotation Problem Given the set of genes in the genome, label each with its function Protein Cellular Pathway: Glucose Metabolism ACCCTTAGCTAGACCTTTAGGAGGTGCAGGA Gene

16 There are many aspects of gene function Gene: trpA Biochemical Function: cleaves a double bond Cellular Process: amino-acid biosynthesis Protein-protein interactions: binds trpB

17 There are many aspects of gene function Gene: a typical gene Biochemical Function: ? Biological Process: ? Protein-protein interactions: ? 40-60% of genes in most genomes have unknown function Comparisons of spatial organization within genomes can yield gene function predictions

18 In bacteria, genes in the same pathway often occur together in the genome trpD trpBtrpA trpE Chorismate Anthranilate N-5’-P-ribosyl-anthranilate 1-2 Carboxy-phenylaminodeoxy-ribulose-5’P 3-Indole Glycerol-P Tryptophan Tryptophan Synthesis Pathway trpD trpCtrpBtrpAtrpE trpF trpCF Bacillus subtilis E. coli

19 Conserved spatial organization between distantly related species suggests functional associations between the genes A: Glucose metabolism B: Glucose metabolism C: ? D: Tryptophan synthesis E: ? F: ? G: Tryptophan synthesis A B C D G E F B C D A GEF

20 Conserved spatial organization between distantly related species suggests functional associations between the genes B C D A B C G A G D EF FE A: Glucose metabolism B: Glucose metabolism C: Prediction: Glucose metabolism D: Tryptophan synthesis E: ? F: Prediction: Tryptophan synthesis G: Tryptophan synthesis A B C D G E F

21 Outline Evolution of genome organization Why identify related genomic regions?  How do we find them? Identification: Formal cluster definition Validation: Testing cluster significance

22 Closely related genomes Related regions, regions that descended from the same region in the genome of the common ancestor, are easy to identify Species 1 Species 2

23 A hundred million years...

24 Related regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly preserved More Diverged Genomes

25 Gene clusters: the signature of diverged regions Similar gene content Neither gene content nor order is perfectly preserved

26 A Framework for Identifying Gene Clusters 1. Find corresponding genes (given as input) 2. Formally define a “gene cluster” (review the most common definition) 3. Devise an algorithm to identify clusters 4. Statistically verify clusters (my work)

27 Clusters are signatures of distantly related regions. Without functional constraints... After sufficient time has passed, gene order will become randomized Uniform random data tends to be “clumpy” some genes will end up proximal in both genomes simply by chance Not all clusters have biological significance.

28 Cluster Validation via Hypothesis Testing Null hypothesis: random gene order Reject gene clusters that could have arisen under the null model Clusters that cannot be rejected are likely to be functionally constrained

29 Outline Evolution of genome organization Why find related genomic regions? How do we find them?  Identification: max-gap cluster definition Validation: Testing cluster significance

30 A max-gap chain The distance or “gap” between genes is equal to the number of intervening genes A set of genes in a genome form a max-gap chain if the gap between adjacent genes is never greater than g (a user-specified parameter) g  = 2 gap  = 3
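
The chain definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `gaps_between` and `is_max_gap_chain` and the toy genome are assumptions made for the example, and each gene is assumed to occur exactly once in the genome.

```python
def gaps_between(genome, genes):
    """Return the gaps (numbers of intervening genes) between adjacent chosen genes.

    `genome` is an ordered list of gene names; `genes` is the subset to test.
    Assumes every gene name occurs exactly once in `genome`.
    """
    positions = sorted(genome.index(gene) for gene in genes)
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]


def is_max_gap_chain(genome, genes, g):
    """True if no gap between adjacent chosen genes exceeds g."""
    return all(gap <= g for gap in gaps_between(genome, genes))


# Hypothetical toy genome: capital letters are the genes of interest.
genome = ["A", "x", "B", "y", "z", "C", "u", "v", "w", "D"]

print(gaps_between(genome, ["A", "B", "C", "D"]))         # [1, 2, 3]
print(is_max_gap_chain(genome, ["A", "B", "C"], g=2))      # True
print(is_max_gap_chain(genome, ["A", "B", "C", "D"], g=2)) # False: the last gap is 3
```

With g = 2, adding D breaks the chain because three genes intervene between C and D, which mirrors the "g = 2, gap = 3" labels on the slide.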

31 Max-Gap cluster definition A set of genes forms a max-gap cluster of two genomes if 1. the genes form a max-gap chain in each genome 2. the cluster is maximal (i.e. not contained within a larger cluster) (figure labels: g = 2, gap = 3)

32 Max-Gap cluster definition A set of genes forms a max-gap cluster of two genomes if 1. the genes form a max-gap chain in each genome 2. the cluster is maximal (i.e. not contained within a larger cluster) (figure labels: g = 2, g = 3, gap = 3)
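
Both conditions of the definition can be checked directly on small inputs. The sketch below is a brute-force illustration only: it enumerates every subset of shared genes, so it is exponential and not a practical algorithm. The names `is_chain` and `max_gap_clusters` and the toy genomes are assumptions, as are the conventions that matching genes carry the same name in both genomes and that the mapping is one-to-one.

```python
from itertools import combinations


def is_chain(genome, genes, g):
    """True if `genes` form a max-gap chain in `genome` (no gap larger than g)."""
    positions = sorted(genome.index(gene) for gene in genes)
    return all(b - a - 1 <= g for a, b in zip(positions, positions[1:]))


def max_gap_clusters(genome1, genome2, g):
    """Brute force: all maximal gene sets that are max-gap chains in both genomes.

    Assumes corresponding genes share a name and each name occurs once per genome.
    """
    shared = sorted(set(genome1) & set(genome2))
    candidates = [
        set(subset)
        for size in range(1, len(shared) + 1)
        for subset in combinations(shared, size)
        if is_chain(genome1, subset, g) and is_chain(genome2, subset, g)
    ]
    # Condition 2: keep only sets not strictly contained in another candidate.
    return [c for c in candidates if not any(c < other for other in candidates)]


# Hypothetical toy genomes; x* and y* are genes without a match.
genome1 = ["a", "b", "x1", "c", "x2", "x3", "x4", "d"]
genome2 = ["c", "a", "y1", "b", "y2", "y3", "y4", "y5", "d"]

clusters = max_gap_clusters(genome1, genome2, g=1)
print(sorted(sorted(c) for c in clusters))  # [['a', 'b', 'c'], ['d']]
```

Gene order differs between the two toy genomes, yet {a, b, c} still qualifies, because the definition constrains only the gaps. Gene d is too far from the others in both genomes and ends up as a singleton.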

33 The max-gap definition is the most widely used cluster definition in genomic analyses Allows extensive rearrangement of gene order Allows limited gene insertion and losses There is no formal statistical model for max-gap clusters

34 Outline Evolution of genome organization Why find related genomic regions? How do we find them? Identification: max-gap cluster definition  Validation: Testing cluster significance

35 Is this cluster biologically meaningful? Could it have occurred in a comparison of random genomes? The Questions Suppose two whole genomes were compared, and this max-gap cluster was identified:

36 The Inputs n: number of genes in each genome m: number of matching gene pairs g: the maximum gap allowed in a cluster h: number of matching genes in the cluster (figure: h=4)

37 The Problem What is the probability of observing a max-gap cluster containing exactly h matching gene pairs, assuming the genomes are randomly ordered? (figure: h=4)

38 Probability of a cluster of size h Basic approach: enumerate all ways to 1. Create chains of h genes in both genomes 2. Place the m-h remaining genes so they do not extend the cluster 3. Normalize to get a probability (figure labels: h genes, m genes, m-h genes)

39 Probability of observing a cluster of size h = (number of ways to place h genes so they form a chain in both genomes) × (number of ways to place the m-h remaining genes so they do not extend the cluster) / (number of configurations of m gene pairs in two genomes of size n)

40 Total number of configurations of m gene pairs in two genomes of size n m genes
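
The slide's formula did not survive in this transcript. One natural way to write the count is sketched below, under the assumption that a configuration is fixed by which m of the n slots hold matched genes in each genome, together with the pairing between the two sets of slots. The symbol N_total is introduced here for illustration only.

```latex
\[
  N_{\text{total}}
  \;=\;
  \underbrace{\binom{n}{m}}_{\text{slots in genome 1}}
  \cdot
  \underbrace{\binom{n}{m}}_{\text{slots in genome 2}}
  \cdot
  \underbrace{m!}_{\text{pairings between the chosen slots}}
\]
```

For example, with n = 3 and m = 2 this gives 3 · 3 · 2 = 18 configurations.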

41 Probability of observing a cluster of size h = (number of ways to place h genes so they form a chain in both genomes) × (number of ways to place the m-h remaining genes so they do not extend the cluster) / (number of configurations of m gene pairs in two genomes of size n)

42 Number of ways to place h genes in two genomes so they form a cluster h genesm genesm-h genes Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Select h spots in each genome, so they form a max-gap chain

43 The number of ways to create a chain of h genes The maximum length of the chain is L = (h-1)g + h. The leftmost gene in the chain can be placed in n-L+1 ways (positions 1 … n-L+1 of 1 … n) so that at least L-1 slots are left to its right

44 The number of ways to create a chain of h genes: (n-L+1) × (g+1)^(h-1), where n-L+1 is the number of ways to place the leftmost gene in the chain so that at least L-1 slots are left, g+1 is the number of choices for the size of each gap (from 0 to g), and h-1 is the number of gaps in a chain of h genes

45 Chains near the end of the genome The count (n-L+1) × (g+1)^(h-1) covers only the n-L+1 starting positions (1 … n-L+1 of 1 … n) that leave at least L-1 slots for the rest of the chain; g+1 is the number of choices for the size of each gap (from 0 to g) and h-1 is the number of gaps in a chain of h genes. Chains that start closer to the end of the genome must be counted separately

46 Number of ways to position h genes in a genome of n genes so they form a max-gap chain: (starting positions) × (ways to place the remaining h-1 genes), plus a separate term for starting positions near the end of the genome
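
The chain count can be checked numerically on small cases. The sketch below compares a brute-force enumeration with the interior term (n-L+1)(g+1)^(h-1), where L = (h-1)g + h; the difference is the contribution of chains that start near the end of the genome. The function names are assumptions for illustration, and the closed form for the near-end term is deliberately not reproduced because it is not recoverable from this transcript.

```python
from itertools import combinations


def count_chains_brute_force(n, h, g):
    """Count h-subsets of positions 1..n whose adjacent gaps never exceed g."""
    count = 0
    for positions in combinations(range(1, n + 1), h):
        if all(b - a - 1 <= g for a, b in zip(positions, positions[1:])):
            count += 1
    return count


def count_interior_chains(n, h, g):
    """Chains whose leftmost gene leaves at least L-1 slots to its right."""
    max_length = (h - 1) * g + h           # L, the longest possible chain
    starts = max(n - max_length + 1, 0)    # valid positions for the leftmost gene
    return starts * (g + 1) ** (h - 1)     # each of the h-1 gaps takes a value 0..g


n, h, g = 10, 3, 1
total = count_chains_brute_force(n, h, g)
interior = count_interior_chains(n, h, g)
print(total, interior, total - interior)   # 28 24 4
```

For n = 10, h = 3, g = 1 the longest chain spans L = 5 slots, so 6 starting positions each admit 2² gap choices. The 4 remaining chains start at positions 7 and 8, where the sum of the gaps is limited by the end of the genome.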

47 Probability of a cluster of size h Basic approach: enumerate all ways to 1. Create chains of h genes in both genomes 2. Place the m-h remaining genes so they do not extend the cluster (figure labels: h genes, m-h genes)

48 Probability of observing a cluster of size h = (number of ways to place h genes so they form a chain in both genomes) × (number of ways to place the m-h remaining genes so they do not extend the cluster) / (number of configurations of m gene pairs in two genomes of size n)

49 Approach: design a rule specifying where the genes can be placed so that the cluster is not extended count the positions Counting the number of ways to place m-h genes outside the cluster g = 1 h = 3

50 Rule 1: A gene can go anywhere except in the cluster (the white box). g = 1 gaps ≤ 1 Counting the number of ways to place m-h genes outside the cluster Too lenient

51 Rule 2: Every gene must be at least g+1 positions from the cluster (outside the grey box). g = 1 g + 1 Counting the number of ways to place m-h genes outside the cluster Too strict

52 g = 1 h = 3 gap > 1 Counting the number of ways to place m-h genes outside the cluster Rule 2: Every gene must be at least g+1 positions from the cluster (outside the grey box). Too strict

53 g = 1 gap > 1 Counting the number of ways to place m-h genes outside the cluster Too lenient Rule 3: At most one member of each gene pair can be in the grey box.

54 Rule 3: At most one member of each gene pair can be in the grey box. g = 1 Counting the number of ways to place m-h genes outside the cluster Too lenient gaps ≤ 1

55 g = 1  Acceptable positions for a gene depend on the positions of the remaining genes  Use strict and lenient rules to calculate upper and lower bounds on G Counting the number of ways to place m-h genes outside the cluster

56 Estimating G  Upper bound: Erroneously enumerates this configuration  Lower bound: Fails to enumerate this configuration

57 Probability of observing a cluster of size h, with numerator (number of ways to place h genes so they form a chain in both genomes) × (number of ways to place the m-h remaining genes so they do not extend the cluster). Hoberman, Sankoff, Durand. Journal of Computational Biology, 2005

58 What can we learn from this statistical result? Are we less likely to observe a large cluster (containing more gene pairs) than a small cluster? How large does a cluster have to be before we are surprised to observe it? How do we choose the maximum allowed gap value? Larger values will yield more clusters more of these will be false positives

59 Whole-genome comparison cluster statistics (plot: cluster probability versus h, the cluster size, for g=10 and g=20, with n=1000, m=250) With a significance threshold of 10^-4, any cluster containing 8 or more genes is significant.

60 Conclusion Statistical analysis of max-gap gene clusters 1. Provides a principled approach for choosing a gap size that will yield significant clusters 2. Allows statistically significant max-gap clusters to be identified 3. Provides insight on criteria for cluster definitions

61 Odd properties of max-gap clusters 1. Moving a gene further away may make a cluster more likely 2. A larger cluster may be less significant

62 Acknowledgements Barbara Lazarus Fellowship The Sloan Foundation The Durand Lab

63 Thanks

64 Questions?

65

66

67

68 Cluster Significance: Related Work Randomization tests Requires complete genome (confusing!) Not useful for choosing parameter values Very simple models Excessively strict simplifying assumptions Overly conservative cluster definitions A few more general statistical approaches Not applicable to max-gap clusters

69 Groups find very different clusters when analyzing the same data

70 Generative Models of Genome Rearrangement 1. Construct a probabilistic model specifying rates for each type of genomic rearrangement 2. Reject regions that are unlikely to have evolved via the model Challenges: Relative rates of rearrangement processes are not known requires identification of clusters Rates may differ significantly within regions of the genome between species over time (e.g. depending on population sizes)

71 Advantages of an analytical approach Analyzing incomplete datasets Principled parameter selection Efficiency? Accuracy? Understanding statistical trends Insight into tradeoffs between definitions

72 plot graph with fixed cluster size and varying maximum gap sizes is it monotonic? is a function of density and size monotonic?

73 not capturing difference in density between max-gap clusters partially conserved order

74 Identifying gene clusters 1. Formally define a “gene cluster” 2. Devise an algorithm to identify clusters 3. Verify that clusters indicate common ancestry...statistics...modeling...algorithms

75 Identifying gene clusters 1. Formally define a “gene cluster” 2. Devise an algorithm to identify clusters 3. Verify that clusters indicate common ancestry...statistics...modeling...algorithms

76 These are criteria…. Size and density Hard to capture The one I’ve chosen is widely used but, as we will see at the end of the talk, has some problems

77 Genome The complete set of genetic material of an organism or species Chromosome A double-stranded molecule of DNA GGGGCGGGGGGCGGGGGGGGGAGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCG CCCCGCCCCCCGCCCCCCCCCTCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGC Gene A protein coding sequence

78 Genome The complete set of genetic material of an organism or species AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC Regions where proteins bind to turn genes on and off Genes: protein coding sequences Large stretches of DNA with unknown function. … … AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC …as an ordered list of genes

79 The maximum length of a chain: L = (h-1)g + h …. n-L+1 …. n Example: h = 4 and g = 1 Ways to place the leftmost gene in the chain, so there are at least L-1 slots left

80 l < L Gaps are constrained: And sum of gaps is constrained: Ways to place the remaining h-1 genes when the gaps and length are constrained …. n-L+1 …. n A known solution:

81 g1g1 g2g2 g3g3 g m-1 ≤ l A known solution:

82 l = w-1 Gaps are constrained: And sum of gaps is constrained: Counting chains at the end of the genome l = h

83 Ways to place the remaining h-1 genes, so no gap exceeds g Chains near the end of the genome Ways to place the leftmost gene in the chain, so there are at least L-1 slots left …. n-L+1 …. n …. 1 L-h …

84 Ways to place remaining h-1 genes Starting positions near end Starting positions Number of ways to position h genes in a genome of n genes so they form a max-gap chain

85 Whole-genome comparison cluster statistics (plot: cluster probability versus h, the cluster size, for g=10 and g=20, with n=1000, m=250)

86

87 Constructive Approach number of ways to position h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster Number of configurations that contain a cluster of exactly size h number of ways to position h genes so they form a chain in a single genome

88 Constructive Approach number of ways to position h genes so they form a cluster in both genomes number of ways to place m-h remaining genes so they do not extend the cluster Number of configurations that contain a cluster of exactly size h

89 AAACATTTT E. coli GTCGGTTGG E. coli Building Phylogenetic Trees Trees are often constructed based on a single gene species with the fewest differences between their gene sequences are grouped together in the tree The history of a gene may not indicate the history of the species Construct trees based on evidence from the whole genome AAACATTTA Salmonella AAACGTTTC Chlamydia GTCGGTTGC Thermococcus GTCAGTTGC Methanococcus Genes may be laterally transferred between distantly related species

90 An Essential Task for Spatial Comparative Genomics Identify gene clusters, groups of genes that are derived from the same chromosomal region in an ancestral genome

91 Phylogenetic Trees Describe evolutionary relationships between species; each internal node represents the most recent common ancestor of its descendants; edge lengths correspond to time estimates. (figure labels: Human, Chimp, Mouse, Rat, Dog, Possum; axis: million years ago)

92 Building Phylogenetic Trees Trees can be built from: Physiological features Gene sequences Spatial genome organization AAACATTTTA AAATATTTA AACATTTTG ATCAGTTGC TGCACTTGT AACATTTCG Human Chimp Mouse Rat Dog Opossum Species with the fewest differences between their gene sequences are grouped together Opposable thumbs Single pair of incisors No placenta Opposable thumbs Flesh shearing teeth

93 1. Find gene clusters 2. Determine the minimum number of rearrangements between genome pairs 3. Use rearrangement distances to build phylogenies Guillaume Bourque et al. Genome Res. 2004; 14: Whole-genome phylogenies based on spatial organization

94 Conserved spatial organization between distantly related species suggests functional associations between the genes Snel, Bork, Huynen. PNAS 2002 B C D A B C E D A E D E D ? E D ?

95

96 Statistical Testing Provides Additional Evidence for Common Ancestry How can we verify that a gene cluster indicates common ancestry? True histories are rarely known Experimental verification is often not possible Rates and patterns of large-scale rearrangement processes are not well understood

97 Constructive Approach Enumerating configurations that contain a cluster of exactly h gene pairs 1. Select h spots in each genome, so that they form a max-gap chain 2. Choose h genes to compose the cluster 3. Assign each gene to a selected spot in each genome 4. Choose the location of the remaining m-h genes so they don’t extend the cluster h genesm genesm-h genes

98 Where are the gene clusters? Intuitive notions of what clusters look like Similar gene content Neither gene content nor order is perfectly preserved Need more rigorous criteria

99 l < L Ways to place the remaining h-1 genes when the gaps and length are constrained …. n-L+1 …. n A known solution: …but not closed form

100 Ways to place the remaining h-1 genes when the gaps and length are constrained …. n-L+1 …. n …. 1 L-h …

101 Future Work Evaluate Developed statistical tests for max-gap clusters identified by whole-genome comparison using a combinatoric approach Results raise concerns about current methods used in comparative genomics studies

102 What characteristics should we use to evaluate a cluster? Extent of gene loss/insertion: Density? (constrained by def to 1/g) Number of insertions/deletions between matches (constrained to g) Size of fragment: Number of matching genes (unconstrained) Degree of rearrangement: Number of order violations (unconstrained) …

103 Assumptions A single, linear chromosome The mapping between genes is one-to-one

104 Evaluate clusters based on size The size of a cluster is the number of matching gene pairs it contains gap  > 3 size = 4

105

106 Existing Algorithms Impose Order Constraints Typical approaches to finding max-gap clusters use a greedy, agglomerative algorithm initialize a cluster as a single matching gene pair search for a gene in proximity in both genomes either extend the cluster and repeat, or terminate and choose a new seed g = 2

107 Algorithms and Definition Mismatch Agglomerative algorithms will not find highly disordered max-gap clusters A divide-and-conquer algorithm has been developed (Bergeron et al, 2002) this work is not known by the biological community g = 2 A max-gap cluster of size four
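
The mismatch between greedy search and the max-gap definition can be reproduced on a tiny case. The sketch below is a simplified agglomerative procedure written for illustration; it is not any published tool's algorithm, and the function name `greedy_cluster` and the four-gene genomes are assumptions. It adds one gene at a time, and only if that gene lies within the allowed gap of some current cluster member in both genomes.

```python
def greedy_cluster(genome1, genome2, seed, g):
    """Grow a cluster from `seed`, one gene at a time (illustrative sketch).

    Assumes matching genes share a name and each name occurs once per genome.
    """
    pos1 = {gene: i for i, gene in enumerate(genome1)}
    pos2 = {gene: i for i, gene in enumerate(genome2)}
    shared = set(pos1) & set(pos2)
    cluster = {seed}

    def near(pos, gene):
        # At most g genes lie between `gene` and some current cluster member.
        return any(abs(pos[gene] - pos[member]) - 1 <= g for member in cluster)

    grew = True
    while grew:
        grew = False
        for gene in sorted(shared - cluster):
            if near(pos1, gene) and near(pos2, gene):
                cluster.add(gene)
                grew = True
    return cluster


# Hypothetical example with g = 0 (cluster genes must be adjacent).
genome1 = ["a", "b", "c", "d"]
genome2 = ["c", "a", "d", "b"]

for seed in genome1:
    print(seed, sorted(greedy_cluster(genome1, genome2, seed, g=0)))
```

Every seed stays a singleton here, although {a, b, c, d} occupies four consecutive slots in both genomes and is therefore a max-gap cluster of size four for g = 0. No single gene is adjacent to the seed in both genomes, so a one-gene-at-a-time extension never gets started.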

108 Future Work Generalize the model Remove the assumption that gene correspondences are one-to-one Evaluate clusters based on: density, e.g. size and total gaps the degree to which order is conserved Take phylogenetic distance into account for more closely related species, random gene order is not a reasonable null hypothesis

109 In bacteria, genes in the same pathway often occur together in the genome trpD trpBtrpA trpE Chorismate Anthranilate N-5’-Phosphoribosyl-anthranilate Enol-1-o-carboxy phenylamino-1-deoxyribulose phosphate Indole-3-glycerol phosphate L-Tryptophan trpD + trpE trpD + trpE trpCF trpA + trpB trpA + trpB Tryptophan Synthesis Pathway trpD trpCtrpBtrpAtrpE trpF trpCF Bacillus subtilis E. coli

110 Speciation An ancestral species: a uniform population

111 Speciation 1. Initially the two populations have identical genomes 2. The populations evolve independently 3. Eventually, there will be two species with similar but distinct genomes

112 Time passes, …more rearrangements accumulate

113 Common blocks are now harder to detect but there is still evidence of common ancestry Gene clusters –Similar gene content –Neither gene content nor order is perfectly preserved

114 Gene Clusters Intuitive notions of what clusters look like Similar gene content Neither gene content nor order is perfectly preserved Need more rigorous criteria

115 Genome The genetic material of an organism or species Specifies the complete blueprint for the organism Chromosome A long double-stranded molecule of DNA Gene A DNA sequence that encodes a protein Proteins are the building blocks of cells AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC

116 Benoit’s outline: example and a little motivation here are the issues, in order to solve this we need to… need to cluster ways to cluster exist but we don’t know how good they are want to have a statistical way of measuring it cluster def

117 What are the processes of genomic change? 1. Small-scale: point mutations  Change gene sequences 2. Large-scale: genomic rearrangements  Change gene content and order

118 In bacteria, genes in the same pathway often occur together in the genome trpD trpBtrpA trpE Chorismate Anthranilate N-5’-Phosphoribosyl-anthranilate Enol-1-o-carboxy phenylamino-1-deoxyribulose phosphate Indole-3-glycerol phosphate Tryptophan trpD + trpEtrpCFtrpA + trpB Tryptophan Synthesis Pathway trpD trpCtrpBtrpAtrpE trpF trpCF Bacillus subtilis E. coli trpA + trpBtrpD + trpE

119 Human genome Mouse genome as scrambled human genome X is scrambled but conserved Human Chromosome 21 is broken into at least three pieces in mouse Guillaume Bourque et al. Genome Res. 2004; 14: Human genome Accidental duplication of chromosome 21 causes Down Syndrome

120 Other applications build evolutionary trees based on rearrangements detect ancient whole genome duplications identify operons estimate rearrangement frequencies...

121 Common Blocks regions that descended from the same region in the genome of the common ancestor Species 1 Species 2

122 Common Blocks are harder to detect between more distantly related organisms, but there is still evidence of common ancestry Species 1 Species Similar gene content Neither gene content nor order is perfectly preserved

123 Gene clusters: Similar gene content Neither gene content nor order is perfectly preserved

124 Inputs 1. Two genomes (i.e, ordered lists of genes) 2. A mapping of corresponding genes

125 Hypothesis Testing Null hypothesis: random gene order Alternate hypothesis: shared ancestry Reject clusters that could have arisen under the null model

126 Number of ways to position h genes in a genome of n genes so they form a max-gap chain Probability that h randomly placed genes will form a chain in a genome of n genes:
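
The probability on this slide is the chain count divided by the number of ways to choose h positions out of n. The sketch below computes the count exactly with a small dynamic program instead of a closed form; the function names and the dynamic-programming formulation are assumptions for illustration, not the method used in the talk.

```python
from math import comb


def count_chains(n, h, g):
    """Number of ways to pick h of n positions with every adjacent gap <= g."""
    # ways[p]: number of chains of the current length whose last gene is at p.
    ways = [1] * n
    for _ in range(h - 1):
        extended = [0] * n
        for p in range(n):
            first_allowed = max(0, p - g - 1)   # previous gene at most g+1 slots back
            extended[p] = sum(ways[first_allowed:p])
        ways = extended
    return sum(ways)


def chain_probability(n, h, g):
    """Probability that h uniformly placed genes form a max-gap chain."""
    return count_chains(n, h, g) / comb(n, h)


print(count_chains(10, 3, 1))        # 28
print(chain_probability(10, 3, 1))   # 28 / 120, about 0.233
```

Because each pass costs at most n(g+1) additions, the same function can be evaluated for genome-sized inputs such as n = 1000 to trace how quickly the probability falls as h grows.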

127 Probability of h randomly placed genes forming a chain n = 1000 (total genes in genome) h (size of the chain)

128 Number of ways to place h genes in two genomes so they form a cluster h genesm genesm-h genes Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Select h spots in a genome, so they form a max-gap chain

129 Calculating the Numerator Enumerate the configurations that contain a cluster of exactly h gene pairs Choose the location of the remaining m-h genes so they don’t extend the cluster h genesm genesm-h genes Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Select h spots in a genome, so they form a max-gap chain

130 Closely related genomes Related regions, regions that descended from the same region in the genome of the common ancestor, are easy to identify Species 1 Species 2

131 More Diverged Genomes Related regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly preserved

132 Genome Comparison Species 2 Species 1 Our goal: identify chromosomal regions that descended from the same region in the genome of the common ancestor

133 Comparing Genomes
Species      Chromosomes   Millions of nucleotides   Genes
Human        …             …                         …k
Mouse        …             …                         …k
Fly          …             …                         …k
Rice         12            430                       ~40k
E. coli      …             …                         …
Chlamydia    …             …                         …
(timescale axis: million years)

134 Comparing Genomes
Species      Chromosomes   Genes
Human        …             …k
Mouse        …             …k
Fly          4             13.6k
Rice         12            ~40k
E. coli      1             3200
Chlamydia    …             …
(timescale axis: million years)