Targets of recent positive selection in Indian populations
Irene Gallego Romero
Leverhulme Centre for Human Evolutionary Studies, Department of Biological Anthropology
The Indian subcontinent
Probably inhabited by H. sapiens since ~50,000 YBP (coastal route out of Africa; mtDNA and Y-chromosome data)
Drastic population expansion ~35,000 YBP
Decidedly not a single panmictic population: highly stratified and fragmented
– linguistics, geography, sociocultural practices
Very high incidence of T2D and obesity (predicted highest worldwide by 2030)
Underrepresented in genomic diversity panels
All of which means…
There has been ample time for ‘recent’ evolutionary adaptations to arise
These adaptations have generally gone unexamined
– Most Indian work to date has examined population history, and has been carried out on mtDNA and the Y chromosome
Selective sweeps and haplotypes
Nielsen et al., Nat Rev Genet, 2007
Selective sweeps and haplotypes
Bamshad & Wooding, Nat Rev Genet, 2003
All we are looking for are haplotypes that are unusually long for their frequency in the sample.
Quantifying selective sweeps
EHH (extended haplotype homozygosity): probability that two chromosomes in a sample are identical, as a function of distance from a chosen ‘core’ SNP
Other related metrics:
– iHS: based on the integral under the EHH curve, sensitive to allelic ancestry (ancestral vs. derived state)
– XP-EHH: cross-population EHH; compares population pairs and detects selection acting in one population but not the other
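The EHH definition above can be sketched in a few lines. This is a minimal illustration, not the code used for the talk: it assumes phased haplotypes are given as equal-length strings of 0/1 alleles that all carry the same core allele, and the function names (`ehh`, `ehh_curve`, `integrated_ehh`) are made up for the example. Published implementations usually add details that are omitted here, such as truncating the curve once EHH becomes very small.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, end):
    """EHH between the core SNP and the SNP at index `end`.

    `haplotypes` is a list of equal-length sequences (e.g. strings of
    '0'/'1'), all carrying the same allele at index `core`.
    Returns the fraction of chromosome pairs that are identical over
    the whole interval between `core` and `end` (inclusive).
    """
    n = len(haplotypes)
    if n < 2:
        return 0.0
    lo, hi = sorted((core, end))
    # Group chromosomes by their haplotype over the interval.
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    identical_pairs = sum(comb(c, 2) for c in counts.values())
    return identical_pairs / comb(n, 2)


def ehh_curve(haplotypes, core):
    """EHH at every SNP index, measured outwards from the core."""
    return [ehh(haplotypes, core, j) for j in range(len(haplotypes[0]))]


def integrated_ehh(curve, positions):
    """Area under the EHH curve by the trapezoid rule.

    `positions` holds the physical (or genetic) coordinate of each SNP.
    """
    return sum(
        (positions[i + 1] - positions[i]) * (curve[i] + curve[i + 1]) / 2
        for i in range(len(curve) - 1)
    )


if __name__ == "__main__":
    # Four toy chromosomes sharing allele '1' at the core SNP (index 1).
    haps = ["0110", "0110", "0111", "0100"]
    curve = ehh_curve(haps, core=1)
    print(curve)  # EHH is 1.0 at the core and decays as haplotypes diverge
    print(integrated_ehh(curve, positions=[0, 100, 200, 300]))
```

The integral returned by `integrated_ehh` is the quantity that iHS and XP-EHH build on: a long, slowly decaying curve gives a large area, which is what "uncommonly long for their frequency" looks like numerically.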
Sample composition
156 Indian samples
– 31 populations
836 further samples from HGDP-CEPH and our own data
– Old World, Oceania
– Split into 8 geographic groups/40 populations
Illumina 650K and 610K chips (~550,000 autosomal SNPs)
Computational challenges
Phasing:
– Inferring haplotypes from genotypes
Calculating test statistics:
– iHS and XP-EHH
Data post-processing:
– ~550,000 data points per population per statistic
– Mapping SNPs to genes/genomic regions
Phasing
Likelihood-based methods
550,000 SNPs per individual, ~1,000 individuals
Phasing chromosome 2 (densest, ~50,000 SNPs) can take over a week
Computationally intensive, and requires a lot of disk space for storing iterations, so cannot use CamGrid
– use elephant.bio.cam.ac.uk, running multiple chromosomes simultaneously
– < 2 weeks to phase all autosomal chromosomes
Computing XP-EHH and iHS
Compute a value for each statistic for each SNP for each population or population pair (~10 per test)
– > 5,000,000 data points for each statistic
Not computationally intensive, small files
– easily run on CamGrid (each chromosome separately)
– 4-5 hours to analyse a single population
C++ code
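For orientation, here is a rough sketch of the per-SNP arithmetic behind XP-EHH as it is usually described: the log ratio of the integrated EHH in two populations, standardised across all SNPs. This is an assumption about the general form of the statistic, not a transcription of the C++ code mentioned above, and the function names are invented for the example.

```python
from math import log
from statistics import mean, pstdev


def unstandardised_xpehh(ihh_a, ihh_b):
    """Log ratio of integrated EHH in population A versus population B.

    Positive values mean longer haplotypes in A; negative values mean
    longer haplotypes in B. Both inputs must be strictly positive.
    """
    return log(ihh_a / ihh_b)


def standardise(scores):
    """Rescale scores to zero mean and unit variance (genome-wide z-scores)."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("cannot standardise scores with zero variance")
    return [(s - mu) / sd for s in scores]


if __name__ == "__main__":
    # Hypothetical integrated-EHH values for five SNPs in two populations.
    ihh_pop_a = [120.0, 95.0, 400.0, 110.0, 101.0]
    ihh_pop_b = [115.0, 100.0, 90.0, 120.0, 99.0]
    raw = [unstandardised_xpehh(a, b) for a, b in zip(ihh_pop_a, ihh_pop_b)]
    print(standardise(raw))  # the third SNP stands out as an outlier
```

Because each SNP is handled independently once the haplotypes are phased, the work splits naturally by chromosome, which fits the one-job-per-chromosome setup described above.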
Data processing
Data sets this big suffer from high false discovery rates
Multiple testing corrections can be too stringent
Need to reduce the number of data points
– windowing approach: break the genome into non-overlapping, contiguous 200 kb windows and test significance at that level
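The windowing step can be sketched as follows. The talk's version was written in R and is not shown, so this is only an illustration: the input layout (chromosome, position, score), the per-window summaries, and the |score| > 2 cut-off are all assumptions chosen for the example, since the slide does not say how significance is assessed within a window.

```python
from collections import defaultdict

WINDOW_BP = 200_000  # window size from the slide: 200 kb


def window_summaries(snps, threshold=2.0, window_bp=WINDOW_BP):
    """Summarise per-SNP scores in non-overlapping, contiguous windows.

    `snps` is an iterable of (chromosome, position_bp, score) tuples.
    Returns a dict keyed by (chromosome, window_index) holding the
    number of SNPs, the largest absolute score, and the fraction of
    SNPs whose absolute score exceeds `threshold` (an assumed cut-off).
    """
    windows = defaultdict(list)
    for chrom, pos, score in snps:
        # Integer division assigns each SNP to exactly one window.
        windows[(chrom, pos // window_bp)].append(score)

    summaries = {}
    for key, scores in windows.items():
        n_extreme = sum(abs(s) > threshold for s in scores)
        summaries[key] = {
            "n_snps": len(scores),
            "max_abs": max(abs(s) for s in scores),
            "frac_extreme": n_extreme / len(scores),
        }
    return summaries


if __name__ == "__main__":
    toy = [("1", 50_000, 0.5), ("1", 150_000, 2.5),
           ("1", 250_000, -3.0), ("2", 10, 1.0)]
    for key, summary in sorted(window_summaries(toy).items()):
        print(key, summary)
```

Windows can then be ranked on a summary such as `frac_extreme` or `max_abs`, which replaces hundreds of thousands of per-SNP tests with one value per window.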
Windowing
Done using R
– Hand-written code, no extra packages
– Requires large amounts of RAM (> 10 GB), so not suitable for CamGrid
– Again, use elephant
– Roughly 2 hours per population
From 550,000 SNPs to 13,274 windows
– Spanning ~20,000 genes
– How to tease out biological meaning?
From SNPs to genes and beyond
Selection acts on phenotypes, not genes
Mining of ontologies and other databases
– Gene Ontology terms, Mammalian Phenotype terms, other annotations
– (not actually done by high-throughput methods, but I know better by now)
– Although it still requires a lot of manual curation
Map biological function to windows, test for overrepresentation of categories relative to expectations
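One common way to test whether a category is overrepresented among top-scoring windows is a one-sided hypergeometric test. The slide does not say which test was used, so treat the sketch below as an assumed stand-in; the function name and the numbers in the usage example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) under the hypergeometric distribution.

    N: total number of windows
    K: windows annotated with the category of interest
    n: windows selected as candidates (e.g. the top-scoring ones)
    k: selected windows that carry the annotation
    """
    total = comb(N, n)
    # comb() returns 0 when more items are requested than exist, so
    # impossible combinations drop out of the sum automatically.
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Hypothetical example: 13,274 windows in total, 300 annotated with
    # some term, 133 candidate windows, 9 of which carry the term.
    p = hypergeom_upper_tail(k=9, n=133, K=300, N=13_274)
    print(f"P(X >= 9) = {p:.3g}")
```

Running this once per ontology term again raises a multiple-testing problem, so the resulting p-values still need correcting or ranking rather than being read at face value.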
Acknowledgements
Toomas Kivisild, Katie Siddle (LCHES)
Jenny Barna
Mait Metspalu, Georgi Hudjashov, Gyaneshwer Chaubey (University of Tartu)
Joe Pickrell (University of Chicago)
Richard Lempicki (NIH)
Other genome-wide statistics
Genome-wide F_ST and H_S are both computed with simple R scripts
– Hand-written code
– ~5 minutes per population
– The slowest bit is reading the data in
– Use elephant.bio.cam.ac.uk
AAF spectrum slopes are a bit more involved
– To correct for sample size effects, resample every locus 1,000 times from its own allelic distribution
– ~1 hour per population, requires high RAM, use R
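A small sketch of what these statistics involve, under stated assumptions. The talk used R scripts that are not shown; here H_S is taken to be the mean within-population expected heterozygosity at a biallelic SNP, F_ST is the simple (H_T - H_S) / H_T form for two equally weighted populations, and the resampling is a binomial draw of a fixed number of alleles at the locus's observed frequency. The estimators actually used may differ, and all function names are illustrative.

```python
import random


def expected_heterozygosity(p):
    """Expected heterozygosity at a biallelic locus with allele frequency p."""
    return 2 * p * (1 - p)


def fst_two_pops(p1, p2):
    """F_ST = (H_T - H_S) / H_T for two equally weighted populations."""
    h_s = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2
    h_t = expected_heterozygosity((p1 + p2) / 2)
    # A locus that is monomorphic overall carries no information.
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def resample_frequency(p, n_alleles, n_reps=1000, rng=None):
    """Redraw a locus `n_reps` times from its own allelic distribution.

    Each replicate samples `n_alleles` alleles at frequency `p` and
    returns the resampled frequency, so loci typed in populations of
    different sizes can be compared at a common sample size.
    """
    rng = rng or random.Random()
    return [
        sum(rng.random() < p for _ in range(n_alleles)) / n_alleles
        for _ in range(n_reps)
    ]


if __name__ == "__main__":
    print(fst_two_pops(0.2, 0.8))
    draws = resample_frequency(0.3, n_alleles=20, rng=random.Random(1))
    print(sum(draws) / len(draws))  # close to the input frequency of 0.3
```

The resampling loop is where the time goes: 1,000 draws for each of ~550,000 loci per population accounts for the jump from minutes to about an hour.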