Genetic variation & expression - “genetical genomics”* Yaniv Loewenstein CompBio Msc seminar December 2005 * Jansen RC, Nap JP. Genetical genomics: the added value from segregation. Trends Genet Jul;17(7):
Genetic Variation and Expression 1. Morley et al. Genetic analysis of genome-wide variation in human gene expression. Nature Aug 12;430(7001): Bystrykh et al. Uncovering regulatory pathways that affect hematopoietic stem cell function using 'genetical genomics'. Nat Genet Mar;37(3): Chesler et al. Complex trait analysis of gene expression uncovers polygenic and pleiotropic networks that modulate nervous system function. Nat Genet Mar;37(3):
Overview The original papers are tiresome. But principles and ideas are fun! “Genetical genomics” (Jansen 2001) Background - basic genetics. The first experimental paper (Brem 2002). A recent paper on mouse stem cells (Bystrykh 2005). Review recent mammalian papers. Common problems etc. Discussion & future directions.
Genetical genomics: the added value from segregation. Jansen RC, Nap JP. Trends Genet Jul;17(7):
What is “Genetical genomics*” ? Genetics: marker-based* fingerprinting of each individual of a segregating population. Statistical QTL* framework. Genomics: compare GE across conditions. usually, one factor\gene at a time. “Multifactorial experimentation would allow the study of many more biologically relevant questions in parallel at the same or lower cost.” (Jansen 2003). *(Jansen 2001)
The classic genetics paradigm (details on next slides) Choose a heritable trait of interest. Mendelian or quantitative (an example soon). E.g. genetic disease, height, milk production. Information from segregation (i.e. meiosis). Given genetic markers: Classic markers (e.g. flower color). Molecular markers (e.g. SNPs, microsatellites). Is the trait correlated with the marker? (LOD scores). Deduce the whereabouts of the trait’s gene.
Meiosis [Figure: 2N → S phase → 2 x 2N → meiosis I → 2 x N → meiosis II → N (4 cells), drawn for N = 2; compared with mitosis (1 → 2 cells, 2N).] Equiprobable haploid combinations (no recombination). Under this model: Linked segregation: $P(x_1, x_2 \mid \text{linked}) = P(x_1) = P(x_2)$. Independent segregation: $P(x_1, x_2 \mid \neg\text{linked}) = P(x_1)\,P(x_2)$.
Recombination New combinations are possible for linked genes. Multiple chiasmata require dense markers. Linkage disequilibrium: Close genes are not independent (less probable to recombine)*. Remote genes (≥50cM) are independent – essentially unlinked. (*) The ratio of physical to genetic distance is variable.
Segregation creates information If segregation & recombination were a card game (shuffling or random sample), positive LOD scores mean that nature is “cheating” (correlated genes). Each marker is a hypothesis. We test multiple marker hypotheses per trait (correlation => closeness) (trait – e.g. genetic disease). Each segregation (meiosis) is another random sample. Dense markers add consistency (examples - soon).
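To make the "is nature cheating?" idea concrete, here is a minimal sketch of a two-point LOD score. It assumes the simplest textbook setting, which the slides do not spell out: every meiosis can be scored as recombinant or non-recombinant between marker and trait. The function names are hypothetical.

```python
import math


def lod_score(recombinants, meioses, theta):
    """log10 likelihood ratio: recombination fraction theta vs. free recombination (0.5)."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    non_recombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1.0 - theta))
    log_l_null = meioses * math.log10(0.5)  # unlinked: every outcome has probability 1/2
    return log_l_theta - log_l_null


def max_lod(recombinants, meioses):
    """Scan a grid of theta values and return (best theta, LOD at that theta)."""
    grid = [i / 1000.0 for i in range(1, 501)]
    best_theta = max(grid, key=lambda t: lod_score(recombinants, meioses, t))
    return best_theta, lod_score(recombinants, meioses, best_theta)


if __name__ == "__main__":
    theta, lod = max_lod(recombinants=2, meioses=20)
    print(f"theta = {theta:.3f}, LOD = {lod:.2f}")
```

With 2 recombinants in 20 meioses the scan settles on theta = 0.1 with a LOD of about 3.2; with 10 recombinants in 20 the best theta is 0.5 and the LOD is 0, i.e. no evidence of "cheating".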
Quantitative Trait Loci (QTL) For instance: blood pressure, milk production (generalization of the binary disease example). A significant QTL means that different genotypes at a polymorphic marker locus are associated with different trait values. Usually uses molecular markers. Not necessarily due to chromosomal linkage. E.g. inhibitor’s mutation correlated w. its target phenotype.
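A minimal sketch of what "different genotypes at a marker are associated with different trait values" means numerically. It uses simple single-marker regression, LOD = (n/2) log10(RSS0/RSS1), which is one standard choice and not necessarily the method of any paper discussed here. The function name and the toy data are hypothetical.

```python
import math


def marker_lod(genotypes, trait):
    """LOD for one marker: genotype-class means vs. a single grand mean."""
    if len(genotypes) != len(trait):
        raise ValueError("genotypes and trait must have the same length")
    n = len(trait)
    grand_mean = sum(trait) / n
    rss_null = sum((y - grand_mean) ** 2 for y in trait)
    if rss_null == 0.0:
        return 0.0  # a constant trait carries no information

    # Group trait values by genotype class at this marker.
    groups = {}
    for g, y in zip(genotypes, trait):
        groups.setdefault(g, []).append(y)

    rss_alt = 0.0
    for values in groups.values():
        class_mean = sum(values) / len(values)
        rss_alt += sum((y - class_mean) ** 2 for y in values)
    if rss_alt == 0.0:
        return math.inf  # genotype explains the trait perfectly
    return (n / 2.0) * math.log10(rss_null / rss_alt)


if __name__ == "__main__":
    # Toy data: four individuals, two alleles labelled "B" and "D".
    print(marker_lod(["B", "B", "D", "D"], [1.0, 1.2, 2.0, 2.2]))
```

Repeating this for every marker along a chromosome gives the kind of LOD score plot shown on the next slide.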
LOD score plot The markers: A. Need to be polymorphic. B. Could be anything – not necessarily a gene.
What is genetical genomics - II The concept: GE levels are the QT values. (SNPs are the molecular markers). Is GE heritable? Hmmm… yes! (later..) Loci that correlate w. a specific gene’s expression. Cis-regulated (same locus). Trans-regulated (on another chromosome).
SNP - Single Nucleotide Polymorphism. Natural genetic variation* - a molecular marker. Sometimes leads to phenotypic variation Millions of markers in eukaryote genomes. Genotyping is high-throughput SNP-Chips, re-sequencing arrays. Other polymorphisms (markers) exist.
(a) A qualitative expression for cDNA1 [=> this is a marker]. (c) components can’t be resolved despite F/f segregation. F by itself is not informative. (d) can be resolved based on D/d and F/f. F contributes information about other cDNAs. (b) gives a quantitative profile [grouping segregating alleles].
More “genetical genomics” III We need A segregating population. An extensive molecular marker map. Preferably an organism with known genome seq. Quantitative trait measurements (e.g. cDNA chips). Proposed for Arabidopsis (at the time). Large pedigree of F2, F3 progeny. Recombinant Inbred Lines (RIL). Think of twin experiments. Today RI mice are available.
Genetical genomics closes the circuit
So “genetical genomics” is cool! General framework for any expression profiling. Multifactorial – multiple experiments concurrently. Can detect genes: Not on the array. With low expression. Fuzzy & epigenetic gene interactions. (Who said miRNAs !?) With influential expression (long) before sampling*. (Pathways with memory could be visualized). “Likely to become instrumental in the further unraveling of metabolic, regulatory and developmental pathways” (Jansen 2001).
G. genomics – organisms to date Yeast (>5 Kruglyak papers) Plants Arabidopsis. Maize. Sugarcane. Etc. (QTLs are hip in plants). Fish Fly Mice Rat WebQTL website. Human Many reviews. Much more to come.
Genetic dissection of transcriptional regulation in budding yeast Brem RB, Yvert G, Clinton R, Kruglyak L. Science Apr 26;296(5568):752-5.
Experimental setup – Brem 2002 (* - different hybridization across strains?) Cross two S. cerevisiae strains: A standard lab strain (BY). A wild California vineyard strain (RM). genes on expression array* 3312 SNP markers. S98 Affymetrix GeneChip. covering >99% of the genome.
Chromosome XII – 4 segregants 100kb Brem 2002
Controls – Brem differentially expressed genes between strains. (P<0.005, 23 expected by chance). Median proportion of observed variation that is genetic* = 84%. A bunch of known genes correctly linked (LOD>9). 73 crossovers (86 expected). 2:2 marker segregation. Neither parent was flocculent: BY is mutant in FLO1, and RM in FLO8. 1:3 segregation after the cross.
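A sketch of how a "proportion of observed variation that is genetic" can be estimated. This assumes a common estimator: total variance among segregants minus the pooled variance of replicate parental measurements (taken as non-genetic noise), divided by the total. The slide does not say how the 84% figure was computed, so this is illustrative only. Function names and numbers are hypothetical.

```python
import statistics


def pooled_variance(groups):
    """Pool within-group sample variances, weighted by degrees of freedom."""
    numerator = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)
    return numerator / denominator


def genetic_fraction(segregant_values, parent_replicates):
    """(segregant variance - noise variance) / segregant variance, floored at 0."""
    total_var = statistics.variance(segregant_values)
    if total_var == 0.0:
        return 0.0
    noise_var = pooled_variance(parent_replicates)
    return max(0.0, (total_var - noise_var) / total_var)


if __name__ == "__main__":
    segregants = [1.0, 2.0, 3.0, 4.0, 5.0]    # one value per segregant (toy data)
    parents = [[1.0, 1.2], [4.0, 4.2]]        # replicate arrays of each parent
    print(genetic_fraction(segregants, parents))
```

Each parent needs at least two replicate measurements, otherwise there is no within-strain variance to pool.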
Cis vs trans QTLs. cis trans We check for significant correlations i.e. There is “something” near the marker that affects the quantity of the trait. (null hypothesis: the QT is independent of the marker)
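The null hypothesis above (the QT is independent of the marker) can be tested without distributional assumptions by shuffling trait values across individuals. The following is a minimal permutation-test sketch, not the procedure of any specific paper. Genotypes are coded 0/1 and all names are hypothetical.

```python
import random


def abs_mean_diff(genotypes, trait):
    """Absolute difference between the mean trait values of the two allele classes."""
    class0 = [y for g, y in zip(genotypes, trait) if g == 0]
    class1 = [y for g, y in zip(genotypes, trait) if g == 1]
    return abs(sum(class1) / len(class1) - sum(class0) / len(class0))


def permutation_p_value(genotypes, trait, n_perm=2000, seed=0):
    """Fraction of shuffles whose statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs_mean_diff(genotypes, trait)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any marker-trait association
        if abs_mean_diff(genotypes, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p > 0


if __name__ == "__main__":
    genotypes = [0] * 10 + [1] * 10
    trait = [float(i) for i in range(20)]  # toy data, perfectly separated by genotype
    print(permutation_p_value(genotypes, trait))
```

With thousands of expression traits and thousands of markers, the per-test threshold has to be much stricter than 0.05, which is why the slides keep quoting the number of linkages expected by chance.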
Results - Brem linked to 1 locus ($P < 5 \times 10^{-5}$, 53 exp). (205 for 0 FP exp)
Results II – Brem not diff. expressed in the parents. (1) Statistically insignificant (40 vs. 6 samples). (2) A false positive. (3) Transgressive segregation. P: (+,-) ; (-,+) F1: (-,-) ; (+,+)
Cis/trans-ness Cis = linkage within 10kb. 32%-36% of 570 fell into this category. (none by chance). Create 20kb bins No bin expected to have >5 linkages by chance 10 (8) bins in their analysis. 7 to 87 linked genes per bin.
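The two rules on this slide (cis = linkage within 10kb, and 20kb bins with more linkages than chance allows) can be sketched as follows. The data layout, a (chromosome, base-pair position) tuple per locus, and the function names are assumptions made for illustration.

```python
from collections import Counter

CIS_WINDOW_BP = 10_000   # "Cis = linkage within 10kb"
BIN_SIZE_BP = 20_000     # "Create 20kb bins"


def is_cis(gene_locus, marker_locus, window=CIS_WINDOW_BP):
    """True if the linked marker lies on the gene's chromosome within the window."""
    gene_chrom, gene_pos = gene_locus
    marker_chrom, marker_pos = marker_locus
    return gene_chrom == marker_chrom and abs(gene_pos - marker_pos) <= window


def hotspot_bins(linked_markers, bin_size=BIN_SIZE_BP, min_count=6):
    """Count linkages per (chromosome, bin index); keep bins with > 5 linkages."""
    counts = Counter((chrom, pos // bin_size) for chrom, pos in linked_markers)
    return {bin_id: n for bin_id, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Toy data: seven transcripts link to one region of chromosome XII, one elsewhere.
    linkages = [("XII", 650_000 + i * 1_000) for i in range(7)] + [("III", 90_000)]
    print(hotspot_bins(linkages))
    print(is_cis(("II", 100_000), ("II", 105_000)))
```

The default `min_count=6` encodes the slide's statement that no bin is expected to hold more than 5 linkages by chance.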
~ 40% fell into the 8 trans groups
Enrichments in trans groups Modulator sometimes in the group too. No enrichment for TFs in the trans-QTLs (!) (consistent in further publications from this group). A biological story for each group. Some were further checked experimentally. E.g. in group 5 a known Hap1 motif was identified in new group members.
Conclusions – Brem differentially expressed but not linked. Simulations for N linked loci (equal effect): 97% would link for N=1, 39% for N=5. >29% if the strongest locus explained 1/3. But only 308/570 20% were linked. => most mRNAs are affected by multiple loci. => most loci have an effect of less than 1/3. (Transgressive segregation adds complexity).
Summary – Brem 2002 Detects subtle effects obscured in knockout. “Even in yeast, under controlled environment GE has a polygenic basis”. “Instead of changing a condition… causal connections between modulator loci and genes they directly and indirectly affect, are made”. Regulatory genetic variation is characterized by a high rate of cis-acting alleles and a small number of trans-acting alleles with widespread transcriptional effects.
(1) Ronald J, Brem RB, Whittle J, Kruglyak L. Local Regulatory Variation in Saccharomyces cerevisiae. PLoS Genet Aug 19;1(2):e25. (2) Brem RB, Kruglyak L. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci U S A Feb 1;102(5): (3) Yvert G, Brem RB, Whittle J, Akey JM, Foss E, Smith EN, Mackelprang R, Kruglyak L. Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat Genet Sep;35(1): (4) Brem RB, Storey JD, Whittle J, Kruglyak L. Genetic interactions between polymorphisms that affect gene expression in yeast. Nature Aug 4;436(7051): (5) Storey JD, Akey JM, Kruglyak L. Multiple locus linkage analysis of genomewide expression in yeast. PLoS Biol Aug;3(8):e267. (6) Ronald J, Akey JM, Whittle J, Smith EN, Yvert G, Kruglyak L. Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res Feb;15(2): (7 - Today) Brem RB, Yvert G, Clinton R, Kruglyak L. Genetic dissection of transcriptional regulation in budding yeast. Science Apr 26;296(5568): Further work by this group: Refinement of computational methodology. Genotyping + expression concurrently on same array. New feedback loops. Cis-polymorphisms that affect GE. Enrichment in promoters (not all in TF binding sites). 3’ UTRs. Transgressive segregation is common. Most genes are affected by many other genes. Most QTLs have only weak effects. 40% of highly heritable transcripts have no QTL. Take home message: Even in yeast everything is much more complicated than we assume.
Nature Genetics 37(3), Mar: back-to-back ‘genetical genomics’ publications. i. Mice - hematopoietic stem cells [Bystrykh et al.] ii. Mice - forebrain [Chesler et al.] iii. Rat - metabolic stress syndrome [Hubner et al.] iv. A review of the above [Broman]* [* Another good review I used, by Li & Burmeister, later this year].
RI lines P : Distinct parental strains (completely homozygous) F1: completely heterozygous Fx: homozygous. mosaics of parents. duplicates. Recombinant Inbred. Sibling (or self) intercross Recombination = shuffling Broman 2005
RI* advantages - I Greater mapping resolution than intercross. Denser breakpoints on RI chromosomes. A single genome can be assayed repeatedly. Multiple individuals can be assayed. Reduce variation - noise. E.g. individual, environmental, measurement. Integration of phenotypes from multiple investigators Essentially unlimited number of phenotypes can be measured from each line. (*) RI = Recombinant Inbred
RI* advantages - II Phenotype data integration from multiple sources. Brain GE integrated with >650 previous phenotypes from these lines [Chesler et al.] e.g. measures of behaviors => identify new candidate genes underlying behavior. HSC GE with brain GE data [Bystrykh et al]. investigate the tissue specificity of trans-acting QTLs. Standardized RILs make shared DBs invaluable. WebQTL.org (*) RI = Recombinant Inbred
The WebQTL Database genetic reference populations (RI) of mouse (BXD, LXS, etc.). rat (HXB). Arabidopsis. Each with dense genetic maps Modifiers causing downstream differences in expression, and higher-order phenotypes. 3 million mouse SNPs.
What can WebQTL* do? Use your own QT values or site’s DB. Simple\composite QTL Interval Mapping Use known QTLs for background Bootstrap tools (estimate confidence intervals). Create network graphs of custom traits. A bunch of handy python scripts with some powerful C implementations (e.g. PCA). Linked to all DBs (UCSC, GNF, Entrez). *
What do we hope to learn? (Broman 2005) Coregulated genes network identification. (Dissect the pathways that connect genes). Understanding the etiology of disease phenotypes. Metabolic syndrome in rats [Hubner 2005]. Human psychiatric disorders are tested. e.g. susceptibility for type II alcoholism (very SciFi). More on that very soon.
Limitations (Broman 2005) Path from QTL to gene remains laborious. Subject to chance (remember the CF story?). Focus on genes that have differential GE between two strains that also differ in the target phenotype: neither necessary nor sufficient. Correlations are insufficient for causation. Is a gene’s response part of the etiology or pathology of the disease? (sounds familiar?)
How about a break?
Uncovering regulatory pathways that affect hematopoietic stem cell function using ‘genetical genomics’. Bystrykh et al. Nature Genetics 37, (2005)
Hematopoietic Stem cells (HSC) HSC* undergo self-renewing divisions. Forms bone, muscle, blood cells. Used in cancer therapy. Lots of GE profiling on embryonic\neural\SC. Some new SC transcripts. Limited overlap between groups. The $1,000,000 question: What is the transcriptional circuitry that distinguishes SC? (justification - soon)
Background (Bystrykh 2005) Genetic work with D2 (DBA/2) & B6 (C57BL/6), 2 mouse strains (1.2 M SNPs apart). HSC turnover rate: D2 > B6 (previous work). Cell-autonomous & environment independent. => Result of distinct GE patterns in HSCs. Scp2 – A 10 cM QTL on chromosome 11. Remember me. Modulates % cells in S phase. Associated w. mean mouse lifespan. Extensively checked with backcrossed mice. Deletions in human 5q31.1 cause AML + MS.
Current experimental setup Homozygous RI strains from D2 x B6 D2:B6 alleles 1:1 => duplicates. 3 mice per RI x 2 Affymetrix U markers (distribution of B6\D2 alleles) x 12K genes (almost all with known positions) Analyzed using webQTL.
Results I P<0.05 cis ±20Mb trans Horizontal bands - local variation in gene density + incomplete chip representation.
Results – cis-QTLs 478 cis-regulated transcripts (within 20Mb). 5 would fall within 20Mb by chance. 162 highly significant (per 12K/2600Mb). Some important to HSC function. Most contain polymorphisms in regulatory elements. 0.3% of probes contain B6/D2 SNPs. But most don’t map as cis-QTLs. Several known HSC genes are polymorphic and diff. expressed in B6/D2. These are strongly cis-regulated. Bystrykh 2005
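A back-of-the-envelope sketch of where a "would fall within 20Mb by chance" number can come from. It assumes linkage positions are uniform over a 2600Mb genome and ignores chromosome ends. The slide does not state the count of linked transcripts behind its figure of 5, so the value 325 below is purely hypothetical, chosen because it reproduces that figure under these assumptions.

```python
def expected_chance_cis(n_linked, window_mb=20.0, genome_mb=2600.0):
    """Expected number of linkages landing within +/- window_mb of their gene by chance."""
    p_within_window = min(1.0, 2.0 * window_mb / genome_mb)  # +/- window around the gene
    return n_linked * p_within_window


if __name__ == "__main__":
    print(expected_chance_cis(325))  # hypothetical count of linked transcripts
```

Under this model a random linkage has a 40/2600 (about 1.5%) chance of looking "cis", so a ±20Mb window is generous but still rarely hit by accident.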
4 examples of cis-QTLs (LRS) likelihood ratio statistic [association strength] SNP density Some of these were identified before as HSC preferentially expressed genes.
Results - trans-QTLs 136 linked (P<0.005) to a single marker. Weaker linkage statistics than cis-QTLs. Some QTLs control multiple transcripts (vertical bands). Some nice stories & anecdotes. (E.g. X chromosome linkage). Some show Mendelian inheritance. Some of the top trans-QTLs have documented associations (A lot of some’s).
Brain vs. HSC QTLs Stable QTLs (not necessarily cis) Brain vs. HSC
Comparing brain and HSC QTLs Distinct tissues GE repeatedly phenotyped. (But why use global normalization?). 297 genes w. stable regulatory QTL. Stable means within 20Mb… (too fuzzy?). 75 of the 297 are stable HSC cis-QTLs. It would be good to have another tissue. 222 stable trans-regulated, i.e. identical QTL in brain & HSC.
Show me the money! In yeast (reminder): Trans-QTLs not enriched for TFs. Enrichment for genes with similar known functions mapping to the same QTL. Everything is simpler. “Collections of coregulated transcripts*, consist largely of downstream targets of polymorphic genes.” (*) identified by vertical trans-acting bands
Money time. 1. Select 4 strongly cis-regulated genes of known function [Runx1 TF]. 2. Downstream targets := genes w. same expression pattern across strains [Tcrb, Csfr1 ds-targets]. (webQTL correlation tool). Predict (new) putative downstream targets. Some of which have documented support of interactions. (Not very convincing in my opinion).
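Step 2 above boils down to correlating expression profiles across strains. Here is a minimal sketch using plain Pearson correlation. It is not WebQTL's implementation, and the gene names, threshold and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_candidate_targets(regulator_profile, profiles, min_abs_r=0.8):
    """Return (gene, r) pairs with |r| >= min_abs_r, strongest correlation first."""
    hits = []
    for gene, values in profiles.items():
        r = pearson(regulator_profile, values)
        if abs(r) >= min_abs_r:
            hits.append((gene, r))
    return sorted(hits, key=lambda item: -abs(item[1]))


if __name__ == "__main__":
    regulator = [1.0, 2.0, 3.0, 4.0, 5.0]   # expression of a cis-regulated gene, one value per strain
    others = {
        "geneA": [2.0, 4.0, 6.0, 8.0, 10.0],
        "geneB": [5.0, 4.0, 3.0, 2.0, 1.0],
        "geneC": [1.0, 3.0, 2.0, 3.0, 1.0],
    }
    print(rank_candidate_targets(regulator, others))
```

Note that the correlation is undirected: a strongly correlated transcript may be a downstream target, an upstream regulator, or simply share a QTL, which is part of why the predictions above call for experimental follow-up.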
Scp2* genes identification. Take Affy transcripts from this interval. (~25% of mouse genes on chip) Similar variation across 30 strains for cis-regulation. 8 cis-regulated genes (in HSC). In brain: 3-cis + 1-trans. 4 HSC specific (based on 2 tissues…) * HSC 10cM QTL from previous study
Scp2 targets analysis “HSC turnover is a complex phenotype”. Probably polygenic. “A more complex model than yeast”. “highly coregulated and trans-regulated transcripts can uncover the function of the underlying QTL gene”. Look for associated transcripts genome-wide for each of the 8 cis-genes (P<0.05). Actually per cluster. Some DNA repair genes, many stories. No systematic testing.
Bystrykh summary “Molecular networks associated with phenotypic differences immediately become accessible as collections of coregulated genes controlled by a single locus”. “key candidate genes within such a locus can be identified by their physical position”. Actually the phenotypic association was made with “classic” genetic work.
Conclusions (Broman 2005) Decide what you want to learn before you start. Tremendous computational & statistical challenges. New visualization tools needed. New 1000(!) 8-way RI lines are planned. Compared with 32 2-way RILs of mice in these papers. “The focus of the computational biologist will need to change from the development of tools that answer specific questions to tools that enable biologists to carry out their own investigations – to explore, visualize and find biological signals in complex data”. (I beg to differ).
Bystrykh 2005 – my comments To check tissue specificity it would be good to have another reference tissue. Can use GNF data. Don’t use global normalization. 20Mb ‘stable’-ness – probably too fuzzy. We haven’t learned much of Scp2 from genetical genomics. No methodological analysis. Densely mark areas of a-priori interest. 10cM = 10% recombination. Select strains with recombination in this area
Discussion General trends- summary My comments Your comments The future
General trends (I) So, is GE heritable? At least to some extent – yes. But we probably oversimplify. QTL modeling assumes heritability. This is never explicitly discussed. Inherent complexity of polygenic expression. Change fewer parameters per experiment. => information vs. confidence tradeoff. Use engineered chromosomal recombinants(?).
General trends – infancy problems Answering very specific questions w. a systems biology tool. Manual analysis of single “interesting” genes: you will always find them. Inadequate planning: Marker density. SNPs on probes? (Affymetrix).
The future – my 2 cents GGI\PPI (Zohar’s lecture) issues apply to genetical genomics analysis. FP await - experimentalists needed. Unclear or suggestive correlations (undirected). Selective targets for genetic\physical interactions. Genes that share a trans-QTL, and have a cis-QTL as well are even more interesting.
The Future – my 2.5 cents New motifs, in trans-QTL clusters of targets? Combine w. TF location analysis + predicted motifs. Cis-sites are putative binding sites. Trans-sites are possible regulators. Enrich regulatory networks. Improve\test GE clustering w. trans linkage data. E.g. shared regulators vs. absolute GE correlation. Genes with same trans-QTL will probably behave the same under relevant conditions.
The future – miRNAs QTLs associate miRNA with their targets? Found correlations to 3’ UTR polymorphism. ORF-less trans-QTLs – novel miRNA genes. SNPs + cis-QTLs + UTR =? miRNA target. Trans-QTLs with no TF enrichment in yeast. validate existing miRNA predictions as well. Your suggestions please.
Summary Traditional GE: Thousands of measures to find the relevance of specific genes (groups) to a specific condition / experiment / KO / etc. Genetical genomics: We now compare thousands of phenotypes under a spectrum of multiple changing conditions (genotypes). Recombination and RI* lines sample a random but fixed combination of conditions. Still complete proof = changing 1 tested condition. Bottom line: A strong integrative & modular framework. will probably become very prominent. (*) – but this manipulation is impossible in human.
Thank you for listening
Additional bibliography Jansen RC, Nap JP. Genetical genomics: the added value from segregation. Trends Genet Jul;17(7): Li J, Burmeister M. Genetical genomics: combining genetics with gene expression analysis. Hum Mol Genet Oct 15;14 Spec No. 2:R163-9.
QTL assumes heritability. Genetical hotspots. Measure in cMs.