The 1000 Genomes Project Gil McVean Department of Statistics, Oxford.

1 The 1000 Genomes Project Gil McVean Department of Statistics, Oxford

2 What is the 1000 Genomes Project?
A catalogue of all types of genetic variation, including rare variants (c. 1% frequency), obtained by sequencing at least 1000 individuals from geographic centres of major medical genetics interest
A large international collaboration
– UK, USA, China, Germany
An exploration of the use of next-generation technologies for population-scale genome sequencing
A resource for accelerating the rate of identifying disease mechanisms in the follow-up to disease-association studies

3 Samples for the main project
Major population groups composed of subpopulations of c. 100 each
[Map labels: UK, FIN, TSI, ESP, CEU, JPT, CHB, CHS, DAI, KVT, GMB, GHN, YRI, MLW, LWK, MXL, ASW, CMB, PRO; legend entry: new]

4 Population-scale genome sequencing
[Figure labels: Haplotypes, 2x, 10x]

5 Pilot experiments
Pilot 1
– Low-coverage (2x-4x) on 60 unrelated individuals from each of CEU, YRI and CHB+JPT
Pilot 2
– High-coverage (20x diploid) on 2 trios (one from CEU, one from YRI)
Pilot 3
– Exons from 1000 genes to 20x in c. 1000 samples (largely European)
Complete!

6 The 1000G Low Coverage Pilot
185 individuals from 4 populations – CEU (63), CHB (30), JPT (30), YRI (62)

Population  Technology  N Individuals  Mapped Bases (billions)  Mean Coverage / Individual
CEU         SLX         52             482                      3.09
CEU         SOLiD       30             240                      2.66
CEU         454         18             132                      2.45
CHB         SLX         30             234                      2.60
JPT         SLX         28             227                      2.70
JPT         454         2              9.6                      1.60
YRI         SLX         60             594                      3.30
YRI         SOLiD       5              20.6                     1.38
YRI         454         2              10.8                     1.80
Combined                185            1,884                    3.52
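The per-technology rows of the table are consistent with mean coverage being mapped bases divided by the number of individuals times the genome length. The slide does not state the genome length used; the sketch below assumes roughly 3 billion bases, and the function name `mean_coverage` is hypothetical.

```python
def mean_coverage(mapped_bases_billions, n_individuals, genome_size_billions=3.0):
    """Approximate fold coverage per individual.

    genome_size_billions=3.0 is an assumption (not given on the slide).
    """
    if n_individuals <= 0:
        raise ValueError("n_individuals must be positive")
    return mapped_bases_billions / (n_individuals * genome_size_billions)


# Two rows taken from the table: (population, technology, N, mapped bases in billions)
rows = [("CEU", "SLX", 52, 482.0), ("YRI", "SLX", 60, 594.0)]
for pop, tech, n, bases in rows:
    print(f"{pop} {tech}: {mean_coverage(bases, n):.2f}x")
```

Under that assumption the two rows come out at about 3.09x and 3.30x, matching the table to two decimals.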

7 Even still, a lot of data isn’t much
In the Pilot 1 sample, 1 tera-basepair leaves the CEU with…
– 6% of genotypes with 0 reads
– 16% of genotypes with < 2 reads
– 29% of genotypes with < 3 reads
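To get a feel for why low coverage leaves many genotypes with few or no reads, here is a minimal sketch under an idealised assumption: read depth at a site follows a Poisson distribution with a fixed mean. That model, the mean depth of 3.0 and the function name are all assumptions for illustration. Real depth varies more than Poisson, so the output is not expected to reproduce the percentages on the slide.

```python
import math


def prob_fewer_than(k, mean_depth):
    """P(depth < k) when depth ~ Poisson(mean_depth) (idealised assumption)."""
    return sum(
        math.exp(-mean_depth) * mean_depth**i / math.factorial(i)
        for i in range(k)
    )


mean_depth = 3.0  # illustrative value, not taken from the slide
for k in (1, 2, 3):
    print(f"P(fewer than {k} reads) = {prob_fewer_than(k, mean_depth):.3f}")
```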

8 Pilot release expected Nov/Dec 2009
www.1000genomes.org
ftp.1000genomes.ebi.ac.uk
ftp-trace.ncbi.nih.gov/1000genomes/ftp

9

10 What has the project already generated?

11 Over 9 million novel SNPs
Total 17.2M SNPs called
Previously ~12M SNPs “known” (dbSNP 129)
– 7.9M confirmed
– 9.2M novel
[Figure: SNP counts shared among CEU, YRI and CHB+JPT, in two panels, “Total SNPs” and “Novel SNPs”. Values as extracted, total: 4.84, 1.09, 0.78, 0.48, 2.80, 5.65, 1.54; novel: 0.50, 0.38, 0.29, 0.26, 2.20, 4.38, 1.35]
Le Quang

12 A near complete record of common SNPs Durbin, Le Quang

13 A set of accurate genotypes
This is about where simulations suggest we should be with 2-4x on 60 samples
Note this quality is much better than if calls were made marginally
Durbin, Le Quang

14 Many novel indels and larger structural variants

15 Zam Iqbal Up to 50kb Novel sequence from de novo assembly

16 Some interesting biology - variation in SNP density

17 Some more interesting biology – high Fst SNPs Ryan Hernandez, Adam Auton

18 Even more interesting biology – loss of function mutations Daniel MacArthur

19 A robust and modular pipeline for analysis of population-scale sequence data

20 An efficient format for storing aligned reads and a set of tools to manipulate and view the files
SAM/BAM format for storing (aligned) reads
Bioinformatics (2009)
http://samtools.sourceforge.net
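As an illustration of what a SAM record holds, the sketch below splits one tab-separated alignment line into the eleven mandatory SAM fields. The example read is invented for illustration, and `SamRecord` and `parse_sam_line` are hypothetical names, not part of samtools.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read bases
    qual: str    # base qualities (ASCII-encoded)


def parse_sam_line(line):
    """Parse the 11 mandatory fields of a SAM alignment line; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


# Invented example record, not from the 1000 Genomes data.
rec = parse_sam_line("read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII")
is_unmapped = bool(rec.flag & 0x4)  # flag bit 0x4 marks an unmapped read
print(rec.rname, rec.pos, rec.cigar, is_unmapped)
```

BAM is the compressed binary form of the same records, so it is normally read with samtools or a library rather than parsed by hand.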

21 An information-rich format for storing generic haplotype/genotype data and tools for manipulating the files www.1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2

22 Using the 1000G data now

23 IMPUTE
Reference panel (1000G)
… 11101010101011 …
… 00111110000111 …
… 11110000011101 …
… 00101011100101 …
Genotypes in additional samples from standard product
… 1.2..1.0.0..22 …
Imputation
Imputed genotypes
… 11220110200122 …

24 Imputation performance across SNP types from P1 (CEU) from Affy 500k

Annotation               # SNPs          Info measure
All                      414,321         0.780
MAF < 5%                 102,000         0.543
MAF > 5%                 312,321         0.857
UCSC Genes               6,628           0.736
Depth < 100              3,153 (0.7%)    0.611
SimpRpts                 25,625          0.607
SimpRpts + Depth < 100   1,652 (6.5%)    0.671
SegDups                  24,301          0.686
SegDups + Depth < 100    665 (2.7%)      0.388

Jonathan Marchini
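The two MAF rows partition the full SNP set (102,000 + 312,321 = 414,321), so one quick consistency check is to see whether the "All" info measure equals the SNP-count-weighted mean of the two strata. The sketch below does that arithmetic. Reading the info measure as a per-SNP average is an assumption, and the function name is hypothetical.

```python
def weighted_mean(values_and_weights):
    """Weighted mean of (value, weight) pairs."""
    total_weight = sum(w for _, w in values_and_weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in values_and_weights) / total_weight


# (info measure, number of SNPs) for the two MAF strata in the table
strata = [(0.543, 102_000), (0.857, 312_321)]
overall = weighted_mean(strata)
print(f"weighted info measure: {overall:.3f}")
```

The result rounds to 0.780, which agrees with the "All" row.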

25 Looking forward...
Already have data generated for c. 200 more Europeans
– Data generation largely complete by mid 2010
Much work still to be done on accurate inference of all types of variation from NGS data
Data already proven useful for a number of projects – please use it

26 Thanks to the many...

