Download presentation
Presentation is loading. Please wait.
Published byRegina Richards Modified over 8 years ago
1
Identifying conserved spatial patterns in genomes Rose Hoberman Dannie Durand Depts. of Biological Sciences and Computer Science, CMU David Sankoff Dept. of Math and Statistics University of Ottawa Student Seminar Series Jan 20, 2006
2
2 The complete genetic material of an organism or species The Genome
3
3 Key genomic component: genes Genes encode proteins, the building blocks of the cell ACCCTTAGCTAGACCTTTAGGAGG... A gene is a DNA subsequence
4
4 Comparing Genomes 75 Million years HumanMouseFlyRiceE. ColiChlamydia Chromosomes232041211 Genes 20-25k 13.6k~40k3200936
5
5 Human Chromosome 21 is broken into at least three pieces in mouse Accidental duplication of chromosome 21 causes Down Syndrome
6
6 Outline Evolution of genome organization Why identify related genomic regions? How do we find them? Identification: Formal cluster definition Validation: Testing cluster significance
7
7 A simple model of a chromosome an ordered list of genes 4 5 3 7 1 6 2 8 9 11 12 13 1014 15 17 1619 20 18 3
8
8 What are the processes of genomic change? 4 5 3 7 1 6 2 8 9 11 12 13 1014 15 17 1619 20 18 3
9
9 A single species:
10
10 Speciation 1. Initially the two populations have identical genomes 2. The populations evolve independently 3. Eventually, there will be two new species with similar but distinct genomes
11
11 Types of Genomic Rearrangements 4 5 3 7 1 6 2 8 9 7 11 12 10 4 5 3 1 6 2 13 14 15 17 1619 20 18 8 9 11 12 13 1014 15 17 1619 20 18 Inversions 4 3 1 2 Duplications/Insertions 3 8 9 11 12 13 1014 15 17 1619 20 18 20 19 18 13 14 15 17 16 Loss Species 2
12
12 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 20 13 14 15 17 16 Chromosomal fissions and fusions 4 5 3 7 1 2 3 13 14 15 17 1619 20 18 3 8 9 11 12 10 Types of Genomic Rearrangements Species 2
13
13 Genome Comparison 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 2 Species 1 Our goal: identify chromosomal regions that descended from the same region in the genome of the common ancestor
14
14 Outline Evolution of genome organization Why identify related genomic regions? How do we find them? Identification: Formal cluster definition Validation: Testing cluster significance
15
15 Genome Annotation Problem Given the set of genes in the genome, label each with its function Protein Cellular Pathway: Glucose Metabolism ACCCTTAGCTAGACCTTTAGGAGGTGCAGGA Gene
16
16 There are many aspects of gene function Gene: trpA Biochemical Function: cleaves a double bond Cellular Process: amino-acid biosynthesis Protein-protein interactions: binds trpB
17
17 There are many aspects of gene function Gene: a typical gene Biochemical Function: ? Biological Process: ? Protein-protein interactions: ? 40-60% of genes in most genomes have unknown function Comparisons of spatial organization within genomes can yield gene function predictions
18
18 In bacteria, genes in the same pathway often occur together in the genome trpD trpBtrpA trpE Chorismate Anthranilate N-5’-P-ribosyl- anthranilate 1-2 Carboxy- phenylaminodeoxy -ribulose-5’P 3-Indole Glycerol-P Tryptophan Tryptophan Synthesis Pathway trpD trpCtrpBtrpAtrpE trpF trpCF Bacillus Subtilis E. coli
19
19 Conserved spatial organization between distantly related species suggests functional associations between the genes A: Glucose metabolism B: Glucose metabolism C: ? D: Tryptophan synthesis E: ? F: ? G: Tryptophan synthesis A B C D G E F B C D A GEF
20
20 Conserved spatial organization between distantly related species suggests functional associations between the genes B C D A B C G A G D EF FE A: Glucose metabolism B: Glucose metabolism C: Prediction: Glucose metabolism D: Tryptophan synthesis E: ? F: Prediction: Tryptophan synthesis G: Tryptophan synthesis A B C D G E F
21
21 Outline Evolution of genome organization Why identify related genomic regions? How do we find them? Identification: Formal cluster definition Validation: Testing cluster significance
22
22 Closely related genomes Related regions, regions that descended from the same region in the genome of the common ancestor, are easy to identify 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 1 Species 2
23
23 A hundred million years...
24
24 Related regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly preserved More Diverged Genomes 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11
25
25 Gene clusters Similar gene content Neither gene content nor order is perfectly 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11 The signature of diverged regions
26
26 A Framework for Identifying Gene Clusters 1. Find corresponding genes 2. Formally define a “gene cluster” 3. Devise an algorithm to identify clusters 4. Statistically verify clusters review the most common definition my work given as input
27
27 Clusters are signatures of distantly related regions. Without functional constraints... After sufficient time has passed, gene order will become randomized Uniform random data tends to be “clumpy” some genes will end up proximal in both genomes simply by chance Not all clusters have biological significance.
28
28 Cluster Validation via Hypothesis Testing Null hypothesis: random gene order Reject gene clusters that could have arisen under the null model Clusters that cannot be rejected are likely to be functionally constrained
29
29 Outline Evolution of genome organization Why find related genomic regions? How do we find them? Identification: max-gap cluster definition Validation: Testing cluster significance
30
30 A max-gap chain The distance or “gap” between genes is equal to the number of intervening genes A set of genes in a genome form a max-gap chain if the gap between adjacent genes is never greater than g (a user-specified parameter) g = 2 gap = 3
31
31 Max-Gap cluster definition A set of genes form a max-gap cluster of two genomes if 1. the genes forms a max-gap chain in each genome 2. the cluster is maximal (i.e. not contained within a larger cluster) g = 2 gap = 3
32
32 Max-Gap cluster definition g = 2 g = 3 gap = 3 A set of genes form a max-gap cluster of two genomes if 1. the genes forms a max-gap chain in each genome 2. the cluster is maximal (i.e. not contained within a larger cluster)
33
33 The max-gap definition is the most widely used cluster definition in genomic analyses Allows extensive rearrangement of gene order Allows limited gene insertion and losses There is no formal statistical model for max-gap clusters
34
34 Outline Evolution of genome organization Why find related genomic regions? How do we find them? Identification: max-gap cluster definition Validation: Testing cluster significance
35
35 Is this cluster biologically meaningful? Could it have occurred in a comparison of random genomes? The Questions Suppose two whole genomes were compared, and this max-gap cluster was identified:
36
36 The Inputs n: number of genes in each genome m: number of matching genes pairs g : the maximum gap allowed in a cluster h: number of matching genes in the cluster h=4
37
37 What is the probability of observing a max-gap cluster containing exactly h matching gene pairs assuming the genomes are randomly ordered h=4 The Problem
38
38 Probability of a cluster of size h h genes m genesm-h genes 1.Create chains of h genes in both genomes 2.Place m-h remaining genes so they do not extend the cluster 3.Normalize to get a probability * Basic approach Enumerate all ways to:
39
39 Probability of observing a cluster of size h number of ways to place h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster All configurations of m gene pairs in two genomes of size n
40
40 Total number of configurations of m gene pairs in two genomes of size n m genes
41
41 Probability of observing a cluster of size h number of ways to place h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster All configurations of m gene pairs in two genomes of size n
42
42 Number of ways to place h genes in two genomes so they form a cluster h genesm genesm-h genes Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Select h spots in each genome, so they form a max-gap chain
43
43 1 2 3 4 5 …. n-L+1 …. n Ways to place the leftmost gene in the chain, so there are at least L-1 places left The maximum length of the chain is: L = (h-1)g + h The number of ways to create a chain of h genes
44
44 Ways to place the leftmost gene in the chain, so there are at least L-1 slots left There are h-1 gaps in a chain of h genes Choices for the size of each gap (from 0 to g) The number of ways to create a chain of h genes
45
45 Chains near the end of the genome Ways to place the leftmost gene in the chain, so there are at least L-1 slots left There are h-1 gaps in a chain of h genes 1 2 3 4 5 …. n-L+1 …. n Choices for the size of each gap (from 0 to g) The number of ways to create a chain of h genes
46
46 Number of ways to position h genes in a genome of n genes so they form a max-gap chain Ways to place remaining h-1 genes Starting positions near end Starting positions
47
47 Probability of a cluster of size h h genes m-h genes 1.Create chains of h genes in both genomes 2.Place m-h remaining genes so they do not extend the cluster * Basic approach Enumerate all ways to:
48
48 Probability of observing a cluster of size h number of ways to place h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster All configurations of m gene pairs in two genomes of size n
49
49 Approach: design a rule specifying where the genes can be placed so that the cluster is not extended count the positions Counting the number of ways to place m-h genes outside the cluster g = 1 h = 3
50
50 Rule 1: A gene can go anywhere except in the cluster (the white box). g = 1 gaps ≤ 1 Counting the number of ways to place m-h genes outside the cluster Too lenient
51
51 Rule 2: Every gene must be at least g+1 positions from the cluster (outside the grey box). g = 1 g + 1 Counting the number of ways to place m-h genes outside the cluster Too strict
52
52 g = 1 h = 3 gap > 1 Counting the number of ways to place m-h genes outside the cluster Rule 2: Every gene must be at least g+1 positions from the cluster (outside the grey box). Too strict
53
53 g = 1 gap > 1 Counting the number of ways to place m-h genes outside the cluster Too lenient Rule 3: At most one member of each gene pair can be in the grey box.
54
54 Rule 3: At most one member of each gene pair can be in the grey box. g = 1 Counting the number of ways to place m-h genes outside the cluster Too lenient gaps ≤ 1
55
55 g = 1 Acceptable positions for a gene depend on the positions of the remaining genes Use strict and lenient rules to calculate upper and lower bounds on G Counting the number of ways to place m-h genes outside the cluster
56
56 Estimating G Upper bound: Erroneously enumerates this configuration Lower bound: Fails to enumerate this configuration
57
57 Probability of observing a cluster of size h number of ways to place h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster Hoberman, Sankoff, Durand Journal of Computational Biology, 2005
58
58 What can we learn from this statistical result? Are we less likely to observe a large cluster (containing more gene pairs) than a small cluster? How large does a cluster have to be before we are surprised to observe it? How do we choose the maximum allowed gap value? Larger values will yield more clusters more of these will be false positives
59
59 Cluster Probability g=10 Whole-genome comparison cluster statistics g=20 n=1000, m=250 h (cluster size) With a significance threshold of 10 -4, any cluster containing 8 or more genes is significant.
60
60 Conclusion Statistical analysis of max-gap gene clusters 1. Provides a principled approach for choosing a gap size that will yield significant clusters 2. Allows statistically significant max-gap clusters to be identified 3. Provides insight on criteria for cluster definitions
61
61 Odd properties of max-gap clusters 1. Moving a gene further away may make a cluster more likely 2. A larger cluster may be less significant
62
62 Acknowledgements Barbara Lazarus Women@IT Fellowship The Sloan Foundation The Durand Lab
63
63 Thanks
64
64 Questions?
65
65
66
66
67
67
68
68 Cluster Significance: Related Work Randomization tests Requires complete genome (confusing!) Not useful for choosing parameter values Very simple models Excessively strict simplifying assumptions Overly conservative cluster definitions A few more general statistical approaches Not applicable to max-gap clusters
69
69 Groups find very different clusters when analyzing the same data
70
70 Generative Models of Genome Rearrangement 1. Construct a probabilistic model specifying rates for each type of genomic rearrangement 2. Reject regions that are unlikely to have evolved via the model Challenges: Relative rates of rearrangement processes are not known requires identification of clusters Rates may differ significantly within regions of the genome between species over time (e.g. depending on population sizes)
71
71 Advantages of an analytical approach Analyzing incomplete datasets Principled parameter selection Efficiency? Accuracy? Understanding statistical trends Insight into tradeoffs between definitions
72
72 plot graph with fixed cluster size and varying maximum gap sizes is it monotonic? is a function of density and size monotonic?
73
73 not capturing difference in density between max-gap clusters partially conserved order
74
74 Identifying gene clusters 1. Formally define a “gene cluster” 2. Devise an algorithm to identify clusters 3. Verify that clusters indicate common ancestry...statistics...modeling...algorithms
75
75 Identifying gene clusters 1. Formally define a “gene cluster” 2. Devise an algorithm to identify clusters 3. Verify that clusters indicate common ancestry...statistics...modeling...algorithms
76
76 These are criteria…. Size and density Hard to capture One I’ve chosen is widely use, but see at end of talk has some problems
77
77 Genome The complete set of genetic material of an organism or species Chromosome A double-stranded molecule of DNA GGGGCGGGGGGCGGGGGGGGGAGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCG CCCCGCCCCCCGCCCCCCCCCTCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGC Gene A protein coding sequence
78
78 Genome The complete set of genetic material of an organism or species AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC Regions where proteins bind to turn genes on and off Genes: protein coding sequences Large stretches of DNA with unknown function. … … AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC …as an ordered list of genes
79
79 The maximum length of a chain: L = (h-1)g + h 1 2 3 4 5 6 …. n-L+1 …. n Example: h = 4 and g = 1 Ways to place the leftmost gene in the chain, so there are at least L-1 slots left
80
80 l < L Gaps are constrained: And sum of gaps is constrained: Ways to place the remaining h-1 genes when the gaps and length are constrained 1 2 3 4 5 …. n-L+1 …. n A known solution:
81
81 g1g1 g2g2 g3g3 g m-1 ≤ l A known solution:
82
82 l = w-1 Gaps are constrained: And sum of gaps is constrained: Counting chains at the end of the genome l = h
83
83 Ways to place the remaining h-1 genes, so no gap exceeds g Chains near the end of the genome Ways to place the leftmost gene in the chain, so there are at least L-1 slots left 1 2 3 4 5 …. n-L+1 …. n …. 1 L-h …
84
84 Ways to place remaining h-1 genes Starting positions near end Starting positions Number of ways to position h genes in a genome of n genes so they form a max-gap chain
85
85 Cluster Probability Whole-genome comparison cluster statistics g=10 g=20 n=1000, m=250 h (cluster size)
86
86
87
87 Constructive Approach number of ways to position h genes so they form a chain in both genomes number of ways to place m-h remaining genes so they do not extend the cluster Number of configurations that contain a cluster of exactly size h number of ways to position h genes so they form a chain in a single genome
88
88 Constructive Approach number of ways to position h genes so they form a cluster in both genomes number of ways to place m-h remaining genes so they do not extend the cluster Number of configurations that contain a cluster of exactly size h
89
89 AAACATTTT E. coli GTCGGTTGG E. coli Building Phylogenetic Trees Trees are often constructed based on a single gene species with the fewest differences between their gene sequences are grouped together in the tree The history of a gene may not indicate the history of the species Construct trees based on evidence from the whole genome AAACATTTA Salmonella AAACGTTTC Chlamydia GTCGGTTGC Thermococcus GTCAGTTGC Methanococcus Genes may be laterally transferred between distantly related species
90
90 An Essential Task for Spatial Comparative Genomics Identify gene clusters, groups of genes that are derived from the same chromosomal region in an ancestral genome 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10
91
91 Human Phylogenetic Trees Describe evolutionary relationships between species each internal node represents the most recent common ancestor of the descendants edge lengths correspond to time estimates. Chimp Mouse Rat Dog 100 50 0 Million years Ago Possum
92
92 Building Phylogenetic Trees Trees can be built from: Physiological features Gene sequences Spatial genome organization AAACATTTTA AAATATTTA AACATTTTG ATCAGTTGC TGCACTTGT AACATTTCG Human Chimp Mouse Rat Dog Opossum Species with the fewest differences between their gene sequences are grouped together Opposable thumbs Single pair of incisors No placenta Opposable thumbs Flesh shearing teeth
93
93 1. Find gene clusters 2. Determine the minimum number of rearrangements between genome pairs 3. Use rearrangement distances to build phylogenies Guillaume Bourque et al. Genome Res. 2004; 14: 507-516 Whole-genome phylogenies based on spatial organization
94
94 Conserved spatial organization between distantly related species suggests functional associations betweeen the genes Snel, Bork, Huynen. PNAS 2002 B C D A B C E D A E D E D ? E D ?
95
95
96
96 Statistical Testing Provides Additional Evidence for Common Ancestry How can we verify that a gene cluster indicates common ancestry? True histories are rarely known Experimental verification is often not possible Rates and patterns of large-scale rearrangement processes are not well understood
97
97 Constructive Approach Enumerating configurations that contain a cluster of exactly h gene pairs 1. Select h spots in each genome, so that they form a max-gap chain 2. Choose h genes to compose the cluster 3. Assign each gene to a selected spot in each genome 4. Choose the location of the remaining m-h genes so they don’t extend the cluster h genesm genesm-h genes
98
98 Where are the gene clusters? Intuitive notions of what clusters look like Similar gene content Neither gene content nor order is perfectly preserved Need more rigorous criteria
99
99 l < L Ways to place the remaining h-1 genes when the gaps and length are constrained 1 2 3 4 5 …. n-L+1 …. n A known solution: …but not closed form
100
100 Ways to place the remaining h-1 genes when the gaps and length are constrained 1 2 3 4 5 …. n-L+1 …. n …. 1 L-h …
101
101 Future Work Evluate Developed statistical tests for max-gap clusters identified by whole-genome comparison using a combinatoric approach Results raise concerns about current methods used in comparative genomics studies
102
102 What characteristics should we use to evaluate a cluster? Extent of gene loss/insertion: Density? (constrained by def to 1/g) Number of insertions/delections between matches (constrained to g) Size of fragment: Number of matching genes (unconstrained) Degree of rearrangement: Number of order violations (unconstrained) …
103
103 Assumptions A single, linear chromosome The mapping between genes is one-to-one
104
104 Evaluate clusters based on size The size of a cluster is the number of matching gene pairs it contains gap > 3 size = 4
105
105
106
106 Existing Algorithms Impose Order Constraints Typical approaches to finding max-gap clusters use a greedy, agglomerative algorithm initialize a cluster as a single matching gene pair search for a gene in proximity in both genomes either extend the cluster and repeat, or terminate and choose a new seed g = 2
107
107 Algorithms and Definition Mismatch Agglomerative algorithms will not find highly disordered max-gap clusters A divide-and-conquer algorithm has been developed (Bergeron et al, 2002) this work is not known by the biological community g = 2 A max-gap cluster of size four
108
108 Future Work Generalize the model Remove the assumption that gene correspondences are one-to-one Evaluate clusters based on: density, e.g. size and total gaps the degree to which order is conserved Take phylogenetic distance into account for more closely related species, random gene order is not a reasonable null hypothesis
109
109 In bacteria, genes in the same pathway often occur together in the genome trpD trpBtrpA trpE Chorismate Anthranilate N-5’- Phophoribosyl- anthranilate Enol-1-o-carboxy phenylamino-1- deoxyribulose phosphate Indole-3- glycerol phosphate L-Tryptophan trpD + trpE trpD + trpE trpCF trpA + trpB trpA + trpB Tryptophan Synthesis Pathway trpD trpCtrpBtrpAtrpE trpF trpCF Bacillus Subtilis E. coli
110
110 Speciation An ancestral species: a uniform population
111
111 Speciation 1. Initially the two populations have identical genomes 3. Eventually, there will be two species with similar but distinct genomes 2. The populations evolve independently
112
112 Time passes, …more rearrangements accumulate
113
113 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11 Common blocks are now harder to detect but there is still evidence of common ancestry Gene clusters –Similar gene content –Neither gene content nor order is perfectly preserved
114
114 Gene Clusters 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11 Intuitive notions of what clusters look like Similar gene content Neither gene content nor order is perfectly preserved Need more rigorous criteria
115
115 Genome The genetic material of an organism or species Specifies the complete blueprint for the organism Chromosome A long double-stranded molecule of DNA Gene A DNA sequence that encodes a protein Proteins are the building blocks of cells AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC AGCAGAAGTCTGGGAATCGATCTGGAAATCCTCCTAATTTTTACTCCCTCTCCCCGCCCGGGGGCGGGGGGCGGGGGGGGG TCGTCTTCAGACCCTTAGCTAGACCTTTAGGAGGATTAAAAATGAGGGAGAGGGGCGGGCCCCCGCCCCCCGCCCCCCCCC
116
116 Benoit’s outline: example and a little motivation here are the issues, in order to solve this we need to… need to cluster ways to cluster exist but we don’t know how good they are want to have a statistical way of measuring it cluster def
117
117 What are the processes of genomic change? 1. Small-scale: point mutations Change gene sequences 2. Large-scale: genomic rearrangements Change gene content and order
118
118 In bacteria, genes in the same pathway often occur together in the genome trpD trpBtrpA trpE Chorismate Anthranilate N-5’-Phophoribosyl- anthranilate Enol-1-o-carboxy phenylamino-1- deoxyribulose phosphate Indole-3-glycerol phosphate Tryptophan trpD + trpEtrpCFtrpA + trpB Tryptophan Synthesis Pathway trpD trpCtrpBtrpAtrpE trpF trpCF Bacillus Subtilis E. coli trpA + trpBtrpD + trpE
119
119 Human genome Mouse genome as scrambled human genome X is scrambled but conserved Human Chromosome 21 is broken into at least three pieces in mouse Guillaume Bourque et al. Genome Res. 2004; 14: 507-516 Human genome Accidental duplication of chromosome 21 causes Down Syndrome
120
120 Other applications build evolutionary trees based on rearrangements detect ancient whole genome duplications identify operons estimate rearrangement frequencies...
121
121 Common Blocks regions that descended from the same region in the genome of the common ancestor 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 1 Species 2
122
122 Common Blocks are harder to detect between more distantly related organisms, but there is still evidence of common ancestry Species 1 Species 2 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11 Similar gene content Neither gene content nor order is perfectly preserved
123
123 Gene clusters: Similar gene content Neither gene content nor order is perfectly preserved 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11
124
124 Inputs 1. Two genomes (i.e, ordered lists of genes) 2. A mapping of corresponding genes
125
125 Hypothesis Testing Null hypothesis: random gene order Alternate hypothesis: shared ancestry Reject clusters that could have arisen under the null model
126
126 Number of ways to position h genes in a genome of n genes so they form a max-gap chain Probability that h randomly placed genes will form a chain in a genome of n genes:
127
127 Probability of h randomly placed genes forming a chain n = 1000 (total genes in genome) h (size of the chain)
128
128 Number of ways to place h genes in two genomes so they form a cluster h genesm genesm-h genes Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Select h spots in a genome, so they form a max-gap chain
129
129 Calculating the Numerator Enumerate the configurations that contain a cluster of exactly h gene pairs Choose the location of the remaining m-h genes so they don’t extend the cluster h genesm genesm-h genes Choose h genes to compose the cluster Assign each gene to a selected spot in each genome Select h spots in a genome, so they form a max-gap chain
130
130 Closely related genomes Related regions, regions that descended from the same region in the genome of the common ancestor, are easy to identify 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 1 Species 2
131
131 More Diverged Genomes Related regions are harder to detect, but there is still spatial evidence of common ancestry Similar gene content Neither gene content nor order is perfectly preserved 4 5 3 7 1 2 8 9 11 12 10 4 5 1 6 2 3 1 2 8 9 11 12 13 1014 15 17 16 19 20 18 20 13 14 15 17 16 3 4 1 18 7 11
132
132 Genome Comparison 4 5 3 7 1 2 8 9 7 11 12 10 4 5 3 1 6 2 4 3 1 2 13 14 15 17 1619 20 18 20 13 14 15 17 16 3 4 3 1 2 8 9 11 12 10 Species 2 Species 1 Our goal: identify chromosomal regions that descended from the same region in the genome of the common ancestor
133
133 Comparing Genomes Chromo -somes Millions of nucleotides Genes Human23290020-25k Mouse20250020-25k Fly418013.6k Rice12430~40k E. coli14.73200 Chlamydia11936 75 Million years
134
134 Comparing Genomes Chromo -somes Genes Human2320-25k Mouse2020-25k Fly413.6k Rice12~40k E. coli13200 Chlamydia1936 75 Million years
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.