Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept.

Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept. of Math and Statistics University of Ottawa Student Seminar Series Jan 20, 2006

2 The complete genetic material of an organism or species The Genome

3 Key genomic component: genes Genes encode proteins, the building blocks of the cell ACCCTTAGCTAGACCTTTAGGAGG... A gene is a DNA subsequence

4 Comparing Genomes 75 Million years HumanMouseFlyRiceE. ColiChlamydia Chromosomes232041211 Genes 20-25k 13.6k~40k3200936

5 Human Chromosome 21 is broken into at least three pieces in mouse Accidental duplication of chromosome 21 causes Down Syndrome

6 Outline  Evolution of genome organization Why identify related genomic regions? How do we find them? Identification: Formal cluster definition Validation: Testing cluster significance

7 A simple model of a chromosome an ordered list of genes 4 5 3 7 1 6 2 8 9 11 12 13 1014 15 17 1619 20 18 3

8 What are the processes of genomic change? 4 5 3 7 1 6 2 8 9 11 12 13 1014 15 17 1619 20 18 3

9 A single species:

10 Speciation 1. Initially the two populations have identical genomes 2. The populations evolve independently 3. Eventually, there will be two new species with similar but distinct genomes

11 Types of Genomic Rearrangements 4 5 3 7 1 6 2 8 9 7 11 12 10 4 5 3 1 6 2 13 14 15 17 1619 20 18 8 9 11 12 13 1014 15 17 1619 20 18 Inversions 4 3 1 2 Duplications/Insertions 3 8 9 11 12 13 1014 15 17 1619 20 18 20 19 18 13 14 15 17 16 Loss Species 2

12 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 20 13 14 15 17 16 Chromosomal fissions and fusions 4 5 3 7 1 2 3 13 14 15 17 1619 20 18 3 8 9 11 12 10 Types of Genomic Rearrangements Species 2

13 Genome Comparison 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 2 Species 1 Our goal: identify chromosomal regions that descended from the same region in the genome of the common ancestor

14 Outline Evolution of genome organization  Why identify related genomic regions? How do we find them? Identification: Formal cluster definition Validation: Testing cluster significance

15 Genome Annotation Problem Given the set of genes in the genome, label each with its function Protein Cellular Pathway: Glucose Metabolism ACCCTTAGCTAGACCTTTAGGAGGTGCAGGA Gene

16 There are many aspects of gene function Gene: trpA Biochemical Function: cleaves a double bond Cellular Process: amino-acid biosynthesis Protein-protein interactions: binds trpB

17 There are many aspects of gene function Gene: a typical gene Biochemical Function: ? Biological Process: ? Protein-protein interactions: ? 40-60% of genes in most genomes have unknown function Comparisons of spatial organization within genomes can yield gene function predictions

18 In bacteria, genes in the same pathway often occur together in the genome trpD trpBtrpA trpE Chorismate Anthranilate N-5’-P-ribosyl- anthranilate 1-2 Carboxy- phenylaminodeoxy -ribulose-5’P 3-Indole Glycerol-P Tryptophan Tryptophan Synthesis Pathway trpD trpCtrpBtrpAtrpE trpF trpCF Bacillus Subtilis E. coli

19 Conserved spatial organization between distantly related species suggests functional associations between the genes A: Glucose metabolism B: Glucose metabolism C: ? D: Tryptophan synthesis E: ? F: ? G: Tryptophan synthesis A B C D G E F B C D A GEF

20 Conserved spatial organization between distantly related species suggests functional associations between the genes B C D A B C G A G D EF FE A: Glucose metabolism B: Glucose metabolism C: Prediction: Glucose metabolism D: Tryptophan synthesis E: ? F: Prediction: Tryptophan synthesis G: Tryptophan synthesis A B C D G E F

21 Outline Evolution of genome organization Why identify related genomic regions?  How do we find them? Identification: Formal cluster definition Validation: Testing cluster significance

22 Closely related genomes Related regions, regions that descended from the same region in the genome of the common ancestor, are easy to identify 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 1 Species 2

23 A hundred million years...

24 Related regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly preserved More Diverged Genomes 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11

25 Gene clusters Similar gene content Neither gene content nor order is perfectly 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11 The signature of diverged regions

26 A Framework for Identifying Gene Clusters 1. Find corresponding genes 2. Formally define a “gene cluster” 3. Devise an algorithm to identify clusters 4. Statistically verify clusters review the most common definition my work given as input

27 Clusters are signatures of distantly related regions. Without functional constraints... After sufficient time has passed, gene order will become randomized Uniform random data tends to be “clumpy” some genes will end up proximal in both genomes simply by chance Not all clusters have biological significance.

28 Cluster Validation via Hypothesis Testing Null hypothesis: random gene order Reject gene clusters that could have arisen under the null model Clusters that cannot be rejected are likely to be functionally constrained

29 Outline Evolution of genome organization Why find related genomic regions? How do we find them?  Identification: max-gap cluster definition Validation: Testing cluster significance

30 A max-gap chain The distance or “gap” between genes is equal to the number of intervening genes A set of genes in a genome form a max-gap chain if the gap between adjacent genes is never greater than g (a user-specified parameter) g  = 2 gap  = 3

31 Max-Gap cluster definition A set of genes form a max-gap cluster of two genomes if 1. the genes forms a max-gap chain in each genome 2. the cluster is maximal (i.e. not contained within a larger cluster) g  = 2 gap  = 3

32 Max-Gap cluster definition g  = 2 g  = 3 gap  = 3 A set of genes form a max-gap cluster of two genomes if 1. the genes forms a max-gap chain in each genome 2. the cluster is maximal (i.e. not contained within a larger cluster)

33 The max-gap definition is the most widely used cluster definition in genomic analyses Allows extensive rearrangement of gene order Allows limited gene insertion and losses There is no formal statistical model for max-gap clusters

34 Outline Evolution of genome organization Why find related genomic regions? How do we find them? Identification: max-gap cluster definition  Validation: Testing cluster significance

35 Is this cluster biologically meaningful? Could it have occurred in a comparison of random genomes? The Questions Suppose two whole genomes were compared, and this max-gap cluster was identified:

36 The Inputs n: number of genes in each genome m: number of matching genes pairs g : the maximum gap allowed in a cluster h: number of matching genes in the cluster h=4

37 What is the probability of observing a max-gap cluster containing exactly h matching gene pairs assuming the genomes are randomly ordered h=4 The Problem

38 Probability of a cluster of size h h genes m genesm-h genes 1.Create chains of h genes in both genomes 2.Place m-h remaining genes so they do not extend the cluster 3.Normalize to get a probability * Basic approach Enumerate all ways to:

39 Probability of observing a cluster of size h number of ways to place h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster All configurations of m gene pairs in two genomes of size n

40 Total number of configurations of m gene pairs in two genomes of size n m genes

42 Number of ways to place h genes in two genomes so they form a cluster h genesm genesm-h genes Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Select h spots in each genome, so they form a max-gap chain

43 1 2 3 4 5 …. n-L+1 …. n Ways to place the leftmost gene in the chain, so there are at least L-1 places left The maximum length of the chain is: L = (h-1)g + h The number of ways to create a chain of h genes

44 Ways to place the leftmost gene in the chain, so there are at least L-1 slots left There are h-1 gaps in a chain of h genes Choices for the size of each gap (from 0 to g) The number of ways to create a chain of h genes

45 Chains near the end of the genome Ways to place the leftmost gene in the chain, so there are at least L-1 slots left There are h-1 gaps in a chain of h genes 1 2 3 4 5 …. n-L+1 …. n Choices for the size of each gap (from 0 to g) The number of ways to create a chain of h genes

46 Number of ways to position h genes in a genome of n genes so they form a max-gap chain Ways to place remaining h-1 genes Starting positions near end Starting positions

47 Probability of a cluster of size h h genes m-h genes 1.Create chains of h genes in both genomes 2.Place m-h remaining genes so they do not extend the cluster * Basic approach Enumerate all ways to:

49 Approach: design a rule specifying where the genes can be placed so that the cluster is not extended count the positions Counting the number of ways to place m-h genes outside the cluster g = 1 h = 3

50 Rule 1: A gene can go anywhere except in the cluster (the white box). g = 1 gaps ≤ 1 Counting the number of ways to place m-h genes outside the cluster Too lenient

51 Rule 2: Every gene must be at least g+1 positions from the cluster (outside the grey box). g = 1 g + 1 Counting the number of ways to place m-h genes outside the cluster Too strict

52 g = 1 h = 3 gap > 1 Counting the number of ways to place m-h genes outside the cluster Rule 2: Every gene must be at least g+1 positions from the cluster (outside the grey box). Too strict

53 g = 1 gap > 1 Counting the number of ways to place m-h genes outside the cluster Too lenient Rule 3: At most one member of each gene pair can be in the grey box.

54 Rule 3: At most one member of each gene pair can be in the grey box. g = 1 Counting the number of ways to place m-h genes outside the cluster Too lenient gaps ≤ 1

55 g = 1  Acceptable positions for a gene depend on the positions of the remaining genes  Use strict and lenient rules to calculate upper and lower bounds on G Counting the number of ways to place m-h genes outside the cluster

56 Estimating G  Upper bound: Erroneously enumerates this configuration  Lower bound: Fails to enumerate this configuration

57 Probability of observing a cluster of size h number of ways to place h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster Hoberman, Sankoff, Durand Journal of Computational Biology, 2005

58 What can we learn from this statistical result? Are we less likely to observe a large cluster (containing more gene pairs) than a small cluster? How large does a cluster have to be before we are surprised to observe it? How do we choose the maximum allowed gap value? Larger values will yield more clusters more of these will be false positives

59 Cluster Probability g=10 Whole-genome comparison cluster statistics g=20 n=1000, m=250 h (cluster size) With a significance threshold of 10 -4, any cluster containing 8 or more genes is significant.

60 Conclusion Statistical analysis of max-gap gene clusters 1. Provides a principled approach for choosing a gap size that will yield significant clusters 2. Allows statistically significant max-gap clusters to be identified 3. Provides insight on criteria for cluster definitions

61 Odd properties of max-gap clusters 1. Moving a gene further away may make a cluster more likely 2. A larger cluster may be less significant

62 Acknowledgements Barbara Lazarus Women@IT Fellowship The Sloan Foundation The Durand Lab

63 Thanks

64 Questions?

68 Cluster Significance: Related Work Randomization tests Requires complete genome (confusing!) Not useful for choosing parameter values Very simple models Excessively strict simplifying assumptions Overly conservative cluster definitions A few more general statistical approaches Not applicable to max-gap clusters

69 Groups find very different clusters when analyzing the same data

70 Generative Models of Genome Rearrangement 1. Construct a probabilistic model specifying rates for each type of genomic rearrangement 2. Reject regions that are unlikely to have evolved via the model Challenges: Relative rates of rearrangement processes are not known requires identification of clusters Rates may differ significantly within regions of the genome between species over time (e.g. depending on population sizes)

71 Advantages of an analytical approach Analyzing incomplete datasets Principled parameter selection Efficiency? Accuracy? Understanding statistical trends Insight into tradeoffs between definitions

72 plot graph with fixed cluster size and varying maximum gap sizes is it monotonic? is a function of density and size monotonic?

73 not capturing difference in density between max-gap clusters partially conserved order

74 Identifying gene clusters 1. Formally define a “gene cluster” 2. Devise an algorithm to identify clusters 3. Verify that clusters indicate common ancestry...statistics...modeling...algorithms

75 Identifying gene clusters 1. Formally define a “gene cluster” 2. Devise an algorithm to identify clusters 3. Verify that clusters indicate common ancestry...statistics...modeling...algorithms

76 These are criteria…. Size and density Hard to capture One I’ve chosen is widely use, but see at end of talk has some problems

77 Genome The complete set of genetic material of an organism or species Chromosome A double-stranded molecule of DNA GGGGCGGGGGGCGGGGGGGGGAGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCG CCCCGCCCCCCGCCCCCCCCCTCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGC Gene A protein coding sequence

78 Genome The complete set of genetic material of an organism or species AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC Regions where proteins bind to turn genes on and off Genes: protein coding sequences Large stretches of DNA with unknown function. … … AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC …as an ordered list of genes

79 The maximum length of a chain: L = (h-1)g + h 1 2 3 4 5 6 …. n-L+1 …. n Example: h = 4 and g = 1 Ways to place the leftmost gene in the chain, so there are at least L-1 slots left

80 l < L Gaps are constrained: And sum of gaps is constrained: Ways to place the remaining h-1 genes when the gaps and length are constrained 1 2 3 4 5 …. n-L+1 …. n A known solution:

81 g1g1 g2g2 g3g3 g m-1 ≤ l A known solution:

82 l = w-1 Gaps are constrained: And sum of gaps is constrained: Counting chains at the end of the genome l = h

83 Ways to place the remaining h-1 genes, so no gap exceeds g Chains near the end of the genome Ways to place the leftmost gene in the chain, so there are at least L-1 slots left 1 2 3 4 5 …. n-L+1 …. n …. 1 L-h …

84 Ways to place remaining h-1 genes Starting positions near end Starting positions Number of ways to position h genes in a genome of n genes so they form a max-gap chain

85 Cluster Probability Whole-genome comparison cluster statistics g=10 g=20 n=1000, m=250 h (cluster size)

87 Constructive Approach number of ways to position h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster Number of configurations that contain a cluster of exactly size h number of ways to position h genes so they form a chain in a single genome

88 Constructive Approach number of ways to position h genes so they form a cluster in both genomes number of ways to place m-h remaining genes so they do not extend the cluster Number of configurations that contain a cluster of exactly size h

89 AAACATTTT E. coli GTCGGTTGG E. coli Building Phylogenetic Trees Trees are often constructed based on a single gene species with the fewest differences between their gene sequences are grouped together in the tree The history of a gene may not indicate the history of the species Construct trees based on evidence from the whole genome AAACATTTA Salmonella AAACGTTTC Chlamydia GTCGGTTGC Thermococcus GTCAGTTGC Methanococcus Genes may be laterally transferred between distantly related species

90 An Essential Task for Spatial Comparative Genomics Identify gene clusters, groups of genes that are derived from the same chromosomal region in an ancestral genome 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10

91 Human Phylogenetic Trees Describe evolutionary relationships between species each internal node represents the most recent common ancestor of the descendants edge lengths correspond to time estimates. Chimp Mouse Rat Dog 100 50 0 Million years Ago Possum

92 Building Phylogenetic Trees Trees can be built from: Physiological features Gene sequences Spatial genome organization AAACATTTTA AAATATTTA AACATTTTG ATCAGTTGC TGCACTTGT AACATTTCG Human Chimp Mouse Rat Dog Opossum Species with the fewest differences between their gene sequences are grouped together Opposable thumbs Single pair of incisors No placenta Opposable thumbs Flesh shearing teeth

93 1. Find gene clusters 2. Determine the minimum number of rearrangements between genome pairs 3. Use rearrangement distances to build phylogenies Guillaume Bourque et al. Genome Res. 2004; 14: 507-516 Whole-genome phylogenies based on spatial organization

94 Conserved spatial organization between distantly related species suggests functional associations betweeen the genes Snel, Bork, Huynen. PNAS 2002 B C D A B C E D A E D E D ? E D ?

96 Statistical Testing Provides Additional Evidence for Common Ancestry How can we verify that a gene cluster indicates common ancestry? True histories are rarely known Experimental verification is often not possible Rates and patterns of large-scale rearrangement processes are not well understood

97 Constructive Approach Enumerating configurations that contain a cluster of exactly h gene pairs 1. Select h spots in each genome, so that they form a max-gap chain 2. Choose h genes to compose the cluster 3. Assign each gene to a selected spot in each genome 4. Choose the location of the remaining m-h genes so they don’t extend the cluster h genesm genesm-h genes

98 Where are the gene clusters? Intuitive notions of what clusters look like Similar gene content Neither gene content nor order is perfectly preserved Need more rigorous criteria

99 l < L Ways to place the remaining h-1 genes when the gaps and length are constrained 1 2 3 4 5 …. n-L+1 …. n A known solution: …but not closed form

100 Ways to place the remaining h-1 genes when the gaps and length are constrained 1 2 3 4 5 …. n-L+1 …. n …. 1 L-h …

101 Future Work Evluate Developed statistical tests for max-gap clusters identified by whole-genome comparison using a combinatoric approach Results raise concerns about current methods used in comparative genomics studies

102 What characteristics should we use to evaluate a cluster? Extent of gene loss/insertion: Density? (constrained by def to 1/g) Number of insertions/delections between matches (constrained to g) Size of fragment: Number of matching genes (unconstrained) Degree of rearrangement: Number of order violations (unconstrained) …

103 Assumptions A single, linear chromosome The mapping between genes is one-to-one

104 Evaluate clusters based on size The size of a cluster is the number of matching gene pairs it contains gap  > 3 size = 4

106 Existing Algorithms Impose Order Constraints Typical approaches to finding max-gap clusters use a greedy, agglomerative algorithm initialize a cluster as a single matching gene pair search for a gene in proximity in both genomes either extend the cluster and repeat, or terminate and choose a new seed g = 2

107 Algorithms and Definition Mismatch Agglomerative algorithms will not find highly disordered max-gap clusters A divide-and-conquer algorithm has been developed (Bergeron et al, 2002) this work is not known by the biological community g = 2 A max-gap cluster of size four

108 Future Work Generalize the model Remove the assumption that gene correspondences are one-to-one Evaluate clusters based on: density, e.g. size and total gaps the degree to which order is conserved Take phylogenetic distance into account for more closely related species, random gene order is not a reasonable null hypothesis

109 In bacteria, genes in the same pathway often occur together in the genome trpD trpBtrpA trpE Chorismate Anthranilate N-5’- Phophoribosyl- anthranilate Enol-1-o-carboxy phenylamino-1- deoxyribulose phosphate Indole-3- glycerol phosphate L-Tryptophan trpD + trpE trpD + trpE trpCF trpA + trpB trpA + trpB Tryptophan Synthesis Pathway trpD trpCtrpBtrpAtrpE trpF trpCF Bacillus Subtilis E. coli

110 Speciation An ancestral species: a uniform population

111 Speciation 1. Initially the two populations have identical genomes 3. Eventually, there will be two species with similar but distinct genomes 2. The populations evolve independently

112 Time passes, …more rearrangements accumulate

113 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11 Common blocks are now harder to detect but there is still evidence of common ancestry Gene clusters –Similar gene content –Neither gene content nor order is perfectly preserved

114 Gene Clusters 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11 Intuitive notions of what clusters look like Similar gene content Neither gene content nor order is perfectly preserved Need more rigorous criteria

115 Genome The genetic material of an organism or species Specifies the complete blueprint for the organism Chromosome A long double-stranded molecule of DNA Gene A DNA sequence that encodes a protein Proteins are the building blocks of cells AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC

116 Benoit’s outline: example and a little motivation here are the issues, in order to solve this we need to… need to cluster ways to cluster exist but we don’t know how good they are want to have a statistical way of measuring it cluster def

117 What are the processes of genomic change? 1. Small-scale: point mutations  Change gene sequences 2. Large-scale: genomic rearrangements  Change gene content and order

118 In bacteria, genes in the same pathway often occur together in the genome trpD trpBtrpA trpE Chorismate Anthranilate N-5’-Phophoribosyl- anthranilate Enol-1-o-carboxy phenylamino-1- deoxyribulose phosphate Indole-3-glycerol phosphate Tryptophan trpD + trpEtrpCFtrpA + trpB Tryptophan Synthesis Pathway trpD trpCtrpBtrpAtrpE trpF trpCF Bacillus Subtilis E. coli trpA + trpBtrpD + trpE

119 Human genome Mouse genome as scrambled human genome X is scrambled but conserved Human Chromosome 21 is broken into at least three pieces in mouse Guillaume Bourque et al. Genome Res. 2004; 14: 507-516 Human genome Accidental duplication of chromosome 21 causes Down Syndrome

120 Other applications build evolutionary trees based on rearrangements detect ancient whole genome duplications identify operons estimate rearrangement frequencies...

121 Common Blocks regions that descended from the same region in the genome of the common ancestor 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 1 Species 2

122 Common Blocks are harder to detect between more distantly related organisms, but there is still evidence of common ancestry Species 1 Species 2 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11 Similar gene content Neither gene content nor order is perfectly preserved

123 Gene clusters: Similar gene content Neither gene content nor order is perfectly preserved 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11

124 Inputs 1. Two genomes (i.e, ordered lists of genes) 2. A mapping of corresponding genes

125 Hypothesis Testing Null hypothesis: random gene order Alternate hypothesis: shared ancestry Reject clusters that could have arisen under the null model

126 Number of ways to position h genes in a genome of n genes so they form a max-gap chain Probability that h randomly placed genes will form a chain in a genome of n genes:

127 Probability of h randomly placed genes forming a chain n = 1000 (total genes in genome) h (size of the chain)

128 Number of ways to place h genes in two genomes so they form a cluster h genesm genesm-h genes Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Select h spots in a genome, so they form a max-gap chain

129 Calculating the Numerator Enumerate the configurations that contain a cluster of exactly h gene pairs Choose the location of the remaining m-h genes so they don’t extend the cluster h genesm genesm-h genes Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Select h spots in a genome, so they form a max-gap chain

130 Closely related genomes Related regions, regions that descended from the same region in the genome of the common ancestor, are easy to identify 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 1 Species 2

131 More Diverged Genomes Related regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly preserved 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11

132 Genome Comparison 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 2 Species 1 Our goal: identify chromosomal regions that descended from the same region in the genome of the common ancestor

133 Comparing Genomes Chromo -somes Millions of nucleotides Genes Human23290020-25k Mouse20250020-25k Fly418013.6k Rice12430~40k E. coli14.73200 Chlamydia11936 75 Million years

134 Comparing Genomes Chromo -somes Genes Human2320-25k Mouse2020-25k Fly413.6k Rice12~40k E. coli13200 Chlamydia1936 75 Million years

Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept.

Similar presentations

Presentation on theme: "Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept.

Similar presentations

Presentation on theme: "Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept."— Presentation transcript:

Similar presentations

About project

Feedback