Can we infer historical population movements from principal component analysis of genetic data? Saharon Rosset,

Cover of Science, 1.9.1978
Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic Maps of Human Gene Frequencies in Europeans. Science 201, 786–792 (1978).

How was the map generated?
Menozzi et al. collected information on genetic markers in 67 European populations
– Before DNA, markers were: blood types, HLA types, etc.
– Total: 38 “markers” (columns in the data matrix)
They performed a principal components analysis (PCA), describing the “directions” of maximum variance in the data (i.e., what genetic patterns best separate populations)
They plotted the projections of the top PCs on a map and looked for patterns
The top PC is on the cover, and shows a clear pattern “radiating” out of the Middle East

What does the map tell us?
Main conclusion of Menozzi et al.: the leading contribution to the modern European gene pool is from a migration out of the Middle East
“Obvious” candidate: the Neolithic (farming) expansion, circa 6000 BC
We have archaeological evidence of the spread of farming; is there now also genetic evidence that people moved as well and replaced ancestral European populations?
In 1994, the seminal book by Cavalli-Sforza et al.* performed similar analyses for all continents, and reached significant conclusions
* For the rest of this talk, the '78 and '94 works combined will be referred to as “Cavalli-Sforza et al.”

Nagging 30-year-old questions
How can we differentiate statistical artifacts from real effects in such analyses?
How can we more specifically pinpoint which historical movements are responsible for the effects we see?

From Novembre & Stephens (Nature Genetics, 5/2008)
“Here, we find that gradients and waves observed in Cavalli-Sforza et al.’s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events.”
Novembre, J. & Stephens, M. Interpreting principal component analyses of spatial population genetic variation. Nature Genetics 40, 646–649 (2008).
So, are the results real or mathematical artifacts?
As we will see, Novembre and Stephens neglected to take some critical points into account and to re-analyze the original data
Our main goal: critically re-evaluate the results of the original paper and the claims of Novembre & Stephens

Other recent high-profile work on PCA in genetics
Since that 1978 breakthrough, PCA has played a significant role in various areas of genetics. Examples of recent high-impact work:
Novembre, J. et al. Genes mirror geography within Europe. Nature 456, 98–101 (6 November 2008)
– Similar goals to the '78 paper, except the data are BIGGER: ~340K SNPs on thousands of Europeans
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38, 904–909 (2006)
– This work introduced EIGENSTRAT, a central tool in the search for functional associations in the genome, used to “whiten” the stratification signal

Reminder: PCA and eigen-analysis
Assume we have an $N \times K$ matrix $X$ whose rows are the $x_i$'s
$Z = X^T X$ is a symmetric $K \times K$ matrix, and its eigenvectors $v_1, v_2, \ldots, v_K \in \mathbb{R}^K$ are an orthogonal basis of $\mathbb{R}^K$ such that:
– $Z v_k = \lambda_k v_k$, where
– $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_K \ge 0$ are the eigenvalues
PC1 (maximal spread) is defined by the direction $v_1$, and the portion of data spread it explains is $\lambda_1 / \sum_k \lambda_k$
– Similarly, $v_2$ defines PC2 and so on
In the '78 paper, $\lambda_1 / \sum_k \lambda_k > 0.3$ and $\lambda_1 / \lambda_2 \approx 2$ (this will be important later!)
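The definitions on this slide can be sketched in a few lines of NumPy. This is only an illustration on random toy data (not the HLA data); the function name `pca_eigen` and the 67 × 21 shape are assumptions chosen for the example.

```python
import numpy as np

def pca_eigen(X):
    """Eigenvalues (descending) and eigenvectors of Z = X^T X."""
    Z = X.T @ X                               # symmetric K x K matrix
    eigvals, eigvecs = np.linalg.eigh(Z)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort to descending
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative round-off
    return eigvals, eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(67, 21))                 # toy data: N = 67 rows, K = 21 columns
X = X - X.mean(axis=0)                        # center each column
lam, V = pca_eigen(X)

explained = lam / lam.sum()                   # lambda_k / sum_k lambda_k
ratio = lam[0] / lam[1]                       # lambda_1 / lambda_2
print(f"PC1 explains {explained[0]:.3f} of the spread; lambda1/lambda2 = {ratio:.3f}")
```

On pure noise like this, both quantities come out far below the values reported for the '78 paper, which is the kind of contrast the rest of the talk relies on.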

Local migrations and PCA patterns
Assume we have a collection of populations with geographic organization
– For example, on a square or rectangular grid
Each population has a low migration rate to/from its immediate neighbors
Novembre & Stephens’ main point: in this situation, if the grid is square or rectangular, the top PCs of genetic variation data will “tend to be” axis oriented
– As in the Cavalli-Sforza et al. works

A schematic of the local migration model
Pop1   Pop2   Pop3   Pop4
Pop5   Pop6   Pop7   Pop8
Pop9   Pop10  Pop11  Pop12
Pop13  Pop14  Pop15  Pop16
Populations reside on a regular grid
In every generation, most of the individuals in each population stay, while a small fraction migrate to the neighbors and are replaced by migrants from there

A schematic of the local migration model
Pop1  Pop2  Pop3
Pop4  Pop5  Pop6
Pop7  Pop8  Pop9
Populations reside on a regular grid
In every generation, most of the individuals in each population stay, while a small fraction migrate to the neighbors and are replaced by migrants from there

Resulting similarity (or covariance) between populations
Naturally, close-by populations will be more genetically similar in this setup, with similarity decaying quickly (typically exponentially) with distance
This gives rise to an $N \times N$ population covariance matrix $E(XX^T)$ that has special structure and a wonderful name: it is a block Toeplitz matrix with Toeplitz blocks
[Figure: the 9×9 covariance matrix for the 3×3 grid, with rows and columns ordered Pop1 to Pop9 and grouped into blocks by grid row (Row 1: Pop1 to Pop3, Row 2: Pop4 to Pop6, Row 3: Pop7 to Pop9)]
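To make the "block Toeplitz with Toeplitz blocks" structure concrete, here is a minimal sketch. It assumes, purely for illustration, that the covariance between two demes is `rho` raised to their Manhattan distance on the grid; the parameter `rho` and the helper names are hypothetical and not taken from the talk.

```python
import numpy as np

def grid_covariance(rows, cols, rho):
    """Assumed covariance rho**(Manhattan distance), demes in row-major order."""
    i, j = np.divmod(np.arange(rows * cols), cols)   # grid coordinates of each deme
    dist = np.abs(i[:, None] - i[None, :]) + np.abs(j[:, None] - j[None, :])
    return rho ** dist

def is_toeplitz(M):
    """True if every diagonal of the square matrix M is constant."""
    n = M.shape[0]
    return all(np.allclose(np.diag(M, k), np.diag(M, k)[0]) for k in range(-n + 1, n))

def is_bttb(H, rows, cols):
    """True if H is block Toeplitz (blocks depend only on I - J) with Toeplitz blocks."""
    # blocks[I, J] is the cols x cols covariance between grid row I and grid row J
    blocks = H.reshape(rows, cols, rows, cols).transpose(0, 2, 1, 3)
    for I in range(rows):
        for J in range(rows):
            ref = blocks[I - J, 0] if I >= J else blocks[0, J - I]
            if not (is_toeplitz(blocks[I, J]) and np.allclose(blocks[I, J], ref)):
                return False
    return True

H = grid_covariance(3, 3, rho=0.5)   # the 3 x 3 grid of Pop1..Pop9
print(H.shape, is_bttb(H, 3, 3))
```

Any covariance that depends only on the row offset and column offset between two demes would pass the same check; the exponential form is just the simplest choice.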

Eigenvectors of Toeplitz matrices
Novembre and Stephens discuss the known structure of the top eigenvectors of such matrices
– Eigenvectors are given by the 2-dimensional discrete cosine transform
– The top “population” PCs are sinusoidal patterns with “geographical structure”
– Typically axis oriented
Figure 1 of their paper demonstrates that this theory fits well with the results in the Cavalli-Sforza et al. works
– Hence their conclusion
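A one-dimensional version of this claim is easy to check numerically. The sketch below assumes an exponentially decaying Toeplitz covariance `rho**|i - j|` (with `rho = 0.95` and `n = 32` chosen arbitrarily) and compares its top eigenvectors with the DCT-II cosine basis; it illustrates the sinusoidal shape and is not a reproduction of the Novembre & Stephens analysis.

```python
import numpy as np

n, rho = 32, 0.95                                  # illustrative values only
idx = np.arange(n)
T = rho ** np.abs(idx[:, None] - idx[None, :])     # symmetric Toeplitz covariance

eigvals, eigvecs = np.linalg.eigh(T)               # ascending order
order = np.argsort(eigvals)[::-1]                  # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def dct2_basis(k, n):
    """k-th DCT-II basis vector, normalized to unit length."""
    v = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return v / np.linalg.norm(v)

# Absolute cosine similarity between the k-th eigenvector and the k-th cosine
similarity = [abs(eigvecs[:, k] @ dct2_basis(k, n)) for k in range(3)]
print([round(float(s), 3) for s in similarity])
```

The match is approximate rather than exact, which is all the "sinusoidal pattern" argument needs.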

OK, so we are done, right?
Not really! N&S neglect to discuss two critical, related aspects:
Eigenvalues: what is the ratio between the % of variance explained by the top PC(s) and by the other PCs?
Variance: do we really expect the top PCs in real data to conform exactly to theory?
Once these two aspects are brought into the equation, it is no longer true that the results of C-S et al. are explained by the Toeplitz phenomenon!

Asymptotics of Toeplitz eigenvalues
Basic formula (Böttcher and Silbermann 99):
– $T$ is a block-Toeplitz matrix of $r \times r$ blocks, each of size $k \times k$
– $A$ is the symbol of the matrix $T$
– $\|A(e^{i\theta})\|$ is the operator norm
– Lots of technical conditions and limitations (e.g., the expression on the right has to be finite for the result to be meaningful)
In English: asymptotically, the top eigenvalues are close to each other
– For all results in the Cavalli-Sforza et al. works, we have $\lambda_1 / \lambda_2 \ge 2$

Toeplitz eigenvalues (ctd.)
We can also make simple, non-asymptotic statements, such as:
If the environment is a square and migration is only local, then by symmetry the North-South and East-West eigenvalues should be similar
It seems we should have two regimes:
– For small data, variance could prevent us from recovering the “population” eigenvectors
– For large data, the asymptotics should prevent us from seeing extreme ratios for $\lambda_1 / \lambda_2$
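The symmetry argument can be illustrated at the population level (no sampling noise). The sketch below assumes a separable covariance `rho**(|di| + |dj|)` between demes, removes the grid-wide mean as PCA on centered data would, and compares the top two eigenvalues for a square and a rectangular grid. The grid sizes and `rho = 0.9` are arbitrary choices for the example.

```python
import numpy as np

def centered_grid_eigenvalues(rows, cols, rho):
    """Descending eigenvalues of the centered population covariance on a grid."""
    def toeplitz_1d(n):
        idx = np.arange(n)
        return rho ** np.abs(idx[:, None] - idx[None, :])
    # Separable covariance: Kronecker product of the two 1-D Toeplitz matrices
    H = np.kron(toeplitz_1d(rows), toeplitz_1d(cols))
    n = rows * cols
    C = np.eye(n) - np.ones((n, n)) / n        # centering matrix (removes the mean)
    return np.linalg.eigvalsh(C @ H @ C)[::-1]

square = centered_grid_eigenvalues(8, 8, rho=0.9)
rect = centered_grid_eigenvalues(4, 16, rho=0.9)

ratio_square = square[0] / square[1]   # two gradient directions share one eigenvalue
ratio_rect = rect[0] / rect[1]         # the long axis dominates
print(f"square: {ratio_square:.4f}, rectangle: {ratio_rect:.4f}")
```

In this idealized setting a square region gives a ratio of essentially 1, so a large observed ratio has to come either from the shape of the region or from something beyond local migration.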

Understanding which scenario we are in: simulation
N&S simulate populations with local migration on a 15×15 grid, with 500 genetic markers
Their Figure S1 shows a nice match of the data eigenvectors to the Toeplitz theory
But is this representative of the Cavalli-Sforza et al. data, with only 67 populations and 38 markers? Not really…
What about the ratio $\lambda_1 / \lambda_2$ in their simulations?

Simulation results: the two scenarios
Scenario          Markers  Demes        95% CI for λ1/λ2  Spatial top PCs?
Big (N&S)         500      225 (15×15)  (1.02, 1.33)      Clear
Small (like C-S)  30       64 (8×8)     (1.07, 2.01)      Marginal
The big simulations are consistent with eigenvector directions but absolutely not with eigenvalue ratios → eigenvalue asymptotics at work!
The small simulations are barely consistent with eigenvalue ratios but not at all with eigenvector directions → variance at work

Summary so far, next steps
The effects in Cavalli-Sforza et al. are too big to be explained by local correlations as suggested by Novembre and Stephens
On “similar” simulations with no signal, the magnitude of $\lambda_1 / \lambda_2$ was smaller but “comparable”
So are the geographic effects real and reproducible?
Key next step: critical reanalysis of the original data
– Does “correct” analysis still produce the same results?
– Do “modern” approaches for inference on PCA (hypothesis testing, confidence intervals) confirm validity?

Data and analysis of the '78 paper
The original paper used frequencies of 21 Human Leukocyte Antigen (HLA) alleles in 67 West Eurasian populations, and 17 other markers in a different set of populations
– Required a complex interpolation scheme to get 21+17=38 variables on the “same” populations
– Many assumptions of PCA are violated by interpolations
Realizing the difficulties, the authors also performed the analysis on the 67 × 21 matrix of HLA data
– Results were very similar
– Some basic assumptions of PCA still violated
Our task: reconstruct this HLA matrix and critically re-analyze it

Chasing the HLA data from the original paper
HLA data was taken from 1970s books and articles
– Problem 1: locating them
– Problem 2: figuring out HLA nomenclature
Finally managed to recover a 54 × 21 matrix (13 other populations found, but with inconsistent nomenclature)
Statistical challenges:
– Different number of individuals from each population → rows do not have similar variance
– Different prevalence for each HLA marker → columns do not have mean 0 and the same variance
– HLA markers have complex dependence between them → columns are not independent
But, thankfully, this dependence is weak (weaker than multinomial)

Idealized statistical model
Denote the among-populations $N \times N$ covariance matrix by $H_{pop}$
Idealized PCA model: $X = H_{pop}^{1/2} Y$, where
– $X$ ($N \times K$) is the observed data
– $Y$ ($N \times K$) is random noise (say all entries are i.i.d. Gaussian)
In other words, the matrix $XX^T = H_{pop}^{1/2} YY^T H_{pop}^{1/2}$ is a scaled “noisy” version of the matrix $H_{pop}$
Characteristics of $X$ in this model:
– i.i.d. columns with mean 0
– Rows have mean 0
– If $H_{pop}$ is “regular” (e.g., Toeplitz), then all entries are identically distributed
– Unless $H_{pop}$ is “trivial” (e.g., identity), the rows are not independent
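The idealized model is easy to simulate, which gives a feel for how noisy $\lambda_1 / \lambda_2$ is at different data sizes. The sketch below is not the simulation used in the talk or by N&S: it assumes a covariance `rho**(Manhattan distance)` on a square grid with `rho = 0.9`, uses a Cholesky factor in place of the symmetric square root (the distribution of $XX^T$ is the same), and borrows only the grid sizes and marker counts of the two scenarios.

```python
import numpy as np

def grid_cov(side, rho):
    """Assumed H_pop on a side x side grid: rho**(Manhattan distance)."""
    idx = np.arange(side)
    T = rho ** np.abs(idx[:, None] - idx[None, :])
    return np.kron(T, T)

def eigenvalue_ratios(side, n_markers, rho, n_rep, rng):
    """Simulate X = L Y (L L^T = H_pop) and return lambda1/lambda2 per replicate."""
    L = np.linalg.cholesky(grid_cov(side, rho))
    ratios = np.empty(n_rep)
    for r in range(n_rep):
        Y = rng.standard_normal((side * side, n_markers))  # i.i.d. Gaussian noise
        X = L @ Y
        X -= X.mean(axis=0)                                # center each marker
        s = np.linalg.svd(X, compute_uv=False)             # singular values of X
        ratios[r] = (s[0] / s[1]) ** 2                     # eigenvalues of X X^T
    return ratios

rng = np.random.default_rng(1)
big = eigenvalue_ratios(15, 500, rho=0.9, n_rep=100, rng=rng)    # 225 demes, 500 markers
small = eigenvalue_ratios(8, 30, rho=0.9, n_rep=100, rng=rng)    # 64 demes, 30 markers
print("big   2.5%-97.5%:", np.quantile(big, [0.025, 0.975]).round(2))
print("small 2.5%-97.5%:", np.quantile(small, [0.025, 0.975]).round(2))
```

The exact intervals depend on the assumed covariance, so they should not be expected to match the numbers reported in the talk; the qualitative point is that fewer markers spread the ratio much more widely.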

Matching data and model
Entry $X_{ij}$ in our data matrix has a $\mathrm{Bin}(n_i, p_{ij})$ distribution
– Assume $X_{ij}, X_{ik}$ (= columns) are independent (though this is not exactly true)
We can think of the $p_{ij}$’s as representing our idealized $X$ matrix
– Independent stochastic evolution of each marker based on (finite) population relationships
– Still have to worry about centering and “standardizing” them
In this situation, the binomial noise in the actual $X_{ij}$’s is a “nuisance” that does not follow our model

Matching data and model (ctd.)
First step: variance stabilizing transformation
– Fact: $\arcsin(\sqrt{X_{ij} / n_i})$ has variance approximately $1/(4n_i)$,
– Independently of $p_{ij}$
Second step: center the columns
Result: approximately i.i.d. columns, approximately mean 0
– Rows still have different variance
Interpretation: our standardized data has the form $X = H_{pop}^{1/2} Y + Z$, with independent $Z_{ij} \sim N(0, 1/(4n_i))$ (second argument is the variance)
– Geometrically: “perturbed” versions of our desired data, but with small noise
Bad third steps: standardize columns, or multiply each row $i$ by $2\sqrt{n_i}$ to standardize variance
– Problem: this rescales our “signal” in the $p_{ij}$’s
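The variance-stabilizing fact can be checked by simulation. The sketch below draws binomial counts for a few values of `p` (the sample size `n = 200` and the values of `p` are arbitrary) and compares the variance of the transformed values with $1/(4n)$, alongside the variance of the raw proportions, which does depend on `p`.

```python
import numpy as np

def arcsine_transform(counts, n):
    """Variance-stabilizing transform arcsin(sqrt(X / n)) for binomial counts."""
    return np.arcsin(np.sqrt(counts / n))

rng = np.random.default_rng(0)
n = 200                                     # individuals sampled (illustrative)
target = 1.0 / (4 * n)                      # approximate variance after the transform

raw_var, stabilized_var = {}, {}
for p in (0.2, 0.5, 0.8):
    x = rng.binomial(n, p, size=200_000)    # simulated allele counts
    raw_var[p] = (x / n).var()              # about p(1-p)/n, depends on p
    stabilized_var[p] = arcsine_transform(x, n).var()
    print(f"p={p}: raw {raw_var[p]:.5f}, transformed {stabilized_var[p]:.5f}, "
          f"1/(4n) = {target:.5f}")
```

The approximation degrades when `n * p` or `n * (1 - p)` is very small, which is worth keeping in mind for rare alleles in small samples.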

Comparison: first PC in the '78 paper vs ours
Main messages remain the same:
Exceptionally high variance in first PC
Clear correlation with North/South axis or distance from Middle East
Conclusion: the main findings of Cavalli-Sforza et al. withstand a reanalysis.
* dist from ME: distance from Middle East (Baghdad)
In parentheses: bootstrap 95% confidence intervals
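The talk does not say how its bootstrap was set up, so the following is only one plausible sketch: resample markers (columns) with replacement and take percentile intervals for the fraction of variance explained by PC1. The toy data, the helper names, and the choice to resample columns are all assumptions made for illustration.

```python
import numpy as np

def pc1_fraction(X):
    """Fraction of total variance explained by the first PC of column-centered X."""
    Xc = X - X.mean(axis=0)
    s2 = np.linalg.svd(Xc, compute_uv=False) ** 2
    return s2[0] / s2.sum()

def bootstrap_ci(X, stat, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling columns (markers) with replacement."""
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    stats = np.array([stat(X[:, rng.integers(0, k, size=k)]) for _ in range(n_boot)])
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return lo, hi

# Toy 54 x 21 matrix with a planted one-dimensional gradient plus noise
rng = np.random.default_rng(42)
gradient = np.linspace(-1, 1, 54)[:, None]
X_toy = gradient @ rng.normal(size=(1, 21)) + 0.3 * rng.normal(size=(54, 21))

lo, hi = bootstrap_ci(X_toy, pc1_fraction, n_boot=200)
print(f"PC1 fraction: {pc1_fraction(X_toy):.3f}, 95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```

Resampling populations (rows) instead would answer a different question, namely how sensitive the result is to which populations happened to be sampled.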

Conclusions
PCA was, is (and will be?) an important tool in genetics
The works of Cavalli-Sforza et al. show geographic patterns of genetic variation that cannot be discarded as either artifacts or coincidence due to variance
We have also reanalyzed parts of their data to verify that their PCA results hold up to renewed examination
– Bootstrap study also confirms validity of main findings (not shown here)
Can we consistently differentiate local migration that prefers North-South to East-West from a real “expansion”?
– If yes, is this really the Neolithic expansion?
Novembre & Stephens do point to an intriguing connection between local migration and “global” effects

[Figure: maps of PC1, PC2 and PC3 for the N&S big simulation and for the small simulation]
