Presentation on theme: "Genetic variation is meaningful only in the context of a population the minor allele frequency f says how often a particular allele (variant) occurs at."— Presentation transcript:
genetic variation is meaningful only in the context of a population: the minor allele frequency f says how often a particular allele (variant) occurs at a particular site in a given population; by definition it is at most 0.5
ccagtcagagAtgtgcacatggcttagttttcatacaGagcctgggctgggggtggggtg
ccagtcagagAtgtgcacatggcttagttttcatacacagcctgggctgggggtggggtg
ccagtcagagttgtgcacatggcttagttttcatacacagcctgggctgggggtggggtg
ccagtcagagAtgtgcacatggcttagttttcatacacagcctgggctgggggtggggtg
ccagtcagagttgtgcacatggcttagttttcatacacagcctgggctCggggtggggtg
ccagtcagagttgtgcacatggcttagttttcatacacagcctgggctgggggtggggtg
ccagtcagagttgtgcacatgTcttagttttcatacacagcctgggctgggggtggggtg
ccagtcagagttgtgcacatggcttagttttcatacaGagcctgggctgggggtggggtg
ccagtcagagttgtgcacatggcttagttttcatacacagcctgggctgggggtggggtg
f = 4/10, f = 1/10, f = 2/10, f = 1/10 (variant sites in upper case, left to right); individuals 1-10
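A minimal sketch of how the minor allele frequency could be computed from aligned sequences like those on this slide. The function name `minor_allele_frequencies` and the five-sequence toy alignment are hypothetical and are not taken from the presentation; the sketch assumes one sequence per individual and reports only bi-allelic sites.

```python
from collections import Counter

def minor_allele_frequencies(sequences):
    """Return {site_index: (minor_allele, frequency)} for each bi-allelic column.

    Assumes all sequences are aligned and of equal length.
    """
    n = len(sequences)
    result = {}
    for i, column in enumerate(zip(*sequences)):
        # Ignore case so that highlighted (upper-case) variants compare equal.
        counts = Counter(base.lower() for base in column)
        if len(counts) == 2:
            allele, count = min(counts.items(), key=lambda kv: kv[1])
            result[i] = (allele, count / n)
    return result

# Hypothetical toy alignment of five individuals (not the slide data).
toy = ["acgta", "acgta", "atgta", "acgtc", "atgta"]
mafs = minor_allele_frequencies(toy)
for site, (allele, f) in sorted(mafs.items()):
    print(f"site {site}: minor allele {allele}, f = {f}")
```

Here site 1 carries `t` in two of five sequences and site 4 carries `c` in one of five, so both frequencies are below 0.5, as the definition requires.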
the polymorphisms most analyzed are single nucleotide polymorphisms (SNPs), which replace one bp with another but do not change lengths; there is about 1 SNP per 1000 bp between any two individuals, and almost every bp is variable when we consider the world population; SNPs are essentially always bi-allelic, not because tri-allelics are impossible, just highly unlikely; other polymorphism categories include small insertions-deletions (indels) shorter than the read length, and large structural variations (insertions, deletions, and inversions) longer than the read length
SNPs approximate a 1/f distribution: this is the observed frequency distribution from the complete sequencing of a large population; however, many SNP discovery projects sequence a small population and then test for the presence or absence of those previously discovered SNPs in a large population; this is known to underestimate the number of rare variants [figure: number of SNPs (0 to 1000) plotted against minor allele frequency f (0 to 0.5); # of SNPs found = 1541; categories: NonSyn, Synon, 5'-UTR, 3'-UTR, Frame, Splice, 5'-Flank, 3'-Flank, Intron]
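One way to see why a small discovery panel underestimates rare variants: a site can only be discovered if both alleles show up in the panel. The sketch below assumes chromosomes are sampled independently from a population in which the minor allele has true frequency f; the function name `discovery_probability` and the panel size of 20 chromosomes are illustrative assumptions, not values from the presentation.

```python
def discovery_probability(f, n_chromosomes):
    """Probability that both alleles appear among n independently sampled chromosomes.

    The site is missed when every sampled chromosome carries the minor allele
    (probability f**n) or every one carries the major allele ((1 - f)**n).
    """
    return 1.0 - f ** n_chromosomes - (1.0 - f) ** n_chromosomes

# Compare rare and common variants for a hypothetical 20-chromosome panel.
for f in (0.01, 0.05, 0.2, 0.5):
    print(f"f = {f:.2f}: P(discovered) = {discovery_probability(f, 20):.3f}")
```

Under these assumptions the probability of discovery rises steeply with f, so SNPs that are later genotyped in a large population are biased toward common alleles.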
population specific SNPs are found at lower f than shared ones: minor allele frequencies classified by occurrence within individuals of either African or European descent (population specific) or presence in both (shared), as shown by Halushka MK, …, Chakravarti A. 1999. Nat Genet 22: 239-247
GENOTYPE: first we identify variant sites by sequencing a small number of individuals; then we test (i.e. genotype) only those variant sites to determine which alleles are present. RE-SEQUENCE [inconsistently used terminology]: generate low coverage (2x) sequence from one individual and compare that data against the reference genome. DE NOVO: generate high coverage (50x) sequence from one individual and perform de novo assembly of that genome without making use of the existing reference genome.
growth in public databases for “common” human polymorphisms: 27 October 2005, one million SNPs genotyped in 269 individuals from four diverse populations; 18 October 2007, 3.1 million SNPs genotyped in 270 individuals from four diverse populations; 28 October 2010, 15 million SNPs, one million short insertion-deletions, and 20,000 structural variants genotyped in up to 697 individuals from 7 diverse populations worldwide
structural variations detected by fosmid end-sequence pairs (ESPs): the fosmid cloning system generates an exceptionally narrow distribution of clone insert sizes, 40 ± 2.8 kb; each of these fosmid clones is sequenced from both ends, creating an end-sequence pair with two 500 bp sequence reads separated by a known distance in the test genome from which the fosmid clone was made; insertions, deletions, and inversions are detected by computationally aligning end-sequence pairs to the reference genome
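[figure: a 10 kb deletion relative to the reference; an end-sequence pair spanning 40 kb on the test genome maps 50 kb apart on the reference (REF) genome because 10 kb is deleted in the test genome; arrows indicate the direction of each sequence read]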
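The end-sequence-pair logic can be sketched as a simple classifier on the distance and orientation at which a pair aligns to the reference. This is only an illustration: the function name `classify_esp`, the three-standard-deviation cutoff, and the use of read orientation to flag inversions are assumptions layered on the 40 ± 2.8 kb insert size given in the slides, not the published pipeline.

```python
MEAN_INSERT = 40_000  # fosmid insert size from the slides, in bp
SD_INSERT = 2_800     # spread of the insert size, in bp

def classify_esp(ref_span, same_strand=False, n_sd=3.0):
    """Classify one end-sequence pair by how it maps to the reference.

    ref_span    -- distance between the two reads on the reference, in bp
    same_strand -- True if both reads align in the same orientation
    n_sd        -- assumed cutoff, in standard deviations (hypothetical choice)
    """
    if same_strand:
        # Reads of a concordant pair point toward each other.
        return "inversion"
    if ref_span > MEAN_INSERT + n_sd * SD_INSERT:
        # Reference holds extra sequence: deletion in the test genome.
        return "deletion"
    if ref_span < MEAN_INSERT - n_sd * SD_INSERT:
        # Reference is missing sequence: insertion in the test genome.
        return "insertion"
    return "concordant"

print(classify_esp(50_000))  # the pair from the figure: 50 kb apart on REF
print(classify_esp(40_500))
print(classify_esp(41_000, same_strand=True))
```

With the assumed cutoff, the 50 kb pair from the figure exceeds the upper bound of 48.4 kb and is called a deletion relative to the reference.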
structural variations follow a 1/f distribution just like SNPs: at 15% (261 of 1,695) of discovered sites, the non-reference configuration is more common than the one in the reference human genome. JM Kidd, et al. 2008. Mapping and sequencing of structural variation from eight human genomes. Nature 453: 56-64
human pan-genome: non-redundant collection of sequences found across the entire world’s human population; de novo assembly of an individual genome reveals ~5 Mb of novel sequence not present in the reference genome; the complete human pan-genome contains 19-40 Mb of novel sequence not present in the reference genome. R Li, et al. 2010. Building the sequence map of the human pan-genome. Nat Biotechnol 28: 57-63.
Lewontin’s (in)famous paper on non-existence of “race” in genetics: Lewontin RC. 1972. “The apportionment of human diversity”, in Evolutionary Biology 6: 381-398. Most of the variation (85%) in human populations is found within local geographic groups, and any differences attributable to race groups are just a small fraction of human genetic variability (15%); race is an invalid taxonomic construct because the probability of a racial misclassification is approximately 30% based on a single genetic locus. Edwards AW. 2003. Human genetic diversity: Lewontin's fallacy. Bioessays 25: 798-801. Even if the probability of misclassifying an individual’s race based on a single locus is as high as 30%, the misclassification probability based on 10 loci can drop to a few percent.
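The effect of combining loci can be illustrated with a deliberately simple model. The sketch assumes every locus misclassifies independently with the same probability (30%) and that an individual is assigned by majority vote across loci, with ties split evenly. This is not Edwards' own calculation, the function name `majority_vote_error` is hypothetical, and the exact numbers depend on the model chosen.

```python
from math import comb

def majority_vote_error(n_loci, per_locus_error=0.3):
    """Probability that a majority vote over independent loci is wrong.

    Each locus is assumed to point to the wrong group with the same
    probability; ties (possible for even n_loci) count as half an error.
    """
    total = 0.0
    for wrong in range(n_loci + 1):
        p = (comb(n_loci, wrong)
             * per_locus_error ** wrong
             * (1.0 - per_locus_error) ** (n_loci - wrong))
        if 2 * wrong > n_loci:
            total += p
        elif 2 * wrong == n_loci:
            total += 0.5 * p
    return total

for n in (1, 11, 21, 101):
    print(f"{n:3d} loci: error = {majority_vote_error(n):.4f}")
```

Under these assumptions the error starts at 30% for one locus and shrinks steadily as loci are added, which is the qualitative point of Edwards' reply.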
Structure clustering of genotype data: Rosenberg NA, …, Feldman MW. 2002. Science 298: 2381-2385. This analysis is based on 377 microsatellites in 1056 individuals from 52 populations. Differences among individuals within populations account for 93 to 95% of the genetic variation. Nevertheless we can identify clusters that are consistent with known populations. K is chosen in advance. For any given K, each individual is represented by a thin vertical line, which is partitioned into K colored segments indicating the individual’s estimated membership in the preordained K clusters. [figure labels: Africa, Asia, Europe]
science does not dictate public policy: science can and should inform policy but that is never the only consideration, and in the meantime there are better (or at least more fun) things to do; with the decreasing cost of sequencing, the age of personal genomics is fast approaching; we need not limit ourselves to sequencing live individuals
human genome sequence from an extinct Palaeo-Eskimo: evidence of migration from Siberia into the New World some 5,500 years ago, independent of the migrations giving rise to modern Native Americans and Inuit. M Rasmussen, et al. Feb 2010. Nature 463: 757-762. Kennewick Man is but the tip of the iceberg for the New World Entrada controversy. Who occupied the American continents first and where did they come from? These questions are intricately connected with the rights of indigenous Native Americans. Sequencing of a pre-Clovis genome over 11,500 years old would rattle the field.
personal genomes for $199: Anne Wojcicki and Sergey Brin