DATA ANALYSIS EXERCISES
Analyses of the genetic structure of 6 Italian populations subject to geographical and/or cultural isolation factors
Aim: identify in their genetic makeup the signatures of genetic isolation
- Lower intra-population diversity compared to open populations
- Higher inter-population diversity compared to open populations

The dataset
Group                    Population         Label  Sample size
German-speaking islands  Sappada            SAP    24
German-speaking islands  Sauris             SAU    10
German-speaking islands  Timau              TIM    24
Sardinia                 North Sardinia     NSA    25
Sardinia                 Sulcis Iglesiente  SUL    23
Sardinia                 Benetutti          BEN    25
Open                     Aosta              AOS    22
Open                     Lucca              LUC    25
Open                     Bologna            BOL    22
Open                     Cosenza            CAL    21
Open                     Palermo            SIC    29
Final dataset after quality control: 250 samples, 87818 SNPs

The dataset
The dataset includes populations genotyped with two different microarrays:
- Genochip 2.0 – 150,000 SNPs (130,000 autosomal SNPs). This chip has been specifically designed for anthropological research as it includes only neutral SNPs.
- Illumina HumanOmniExpress BeadChip – 713,599 SNPs. Designed to capture the greatest amount of common SNP variation and drive the discovery of novel associations with traits and diseases.
QUALITY CONTROL
- SNPs: genotyping success rate > 90%
- Individuals: genotyping success rate > 92%
- MAF (minor allele frequency) > 0.05
- IBD: pairs with PI_HAT > 0.185 are treated as related, and one individual per pair is removed
DATA MERGING
Comparing the SNPs of two or more chips and selecting only those that the chips have in common

The analyses
Quality control
- Check for cryptic relatedness (IBD)
- Eliminate relatives from dataset
Intra-population diversity
- Analysis of Runs of Homozygosity (RoHs)
- Graphical representation of RoHs (boxplots and scatterplot)
- Analysis of intra-population pairwise IBS identities
- Graphical representation of intra-population pairwise IBS identities (violin plots)
Inter-population diversity
- Principal components analysis
- Admixture analysis (Frappè 1.1)

Software
- PLINK 1.9
- R software package
- Frappè 1.1

PLINK
A toolset for whole genome association analysis
https://www.cog-genomics.org/plink2
Command-line software working with flags (--) and modifiers (numbers or commands that follow the flag)
- Data management
- Summary statistics
- Population stratification
- Association analysis
- Linkage disequilibrium and haplotype analysis
- Shared segment analysis
- Copy number analysis

Data management
- Recode dataset (A,C,G,T → 1,2)
- Reorder, reformat dataset
- Flip DNA strand
- Extract/remove individuals/SNPs
- Swap in new phenotypes, covariates
- Filter on covariates
- Merge 2 or more filesets
[Figure: the same genotype data shown in different layouts: people in rows and SNPs in columns, SNPs in rows and people in columns, one genotype per line, SNPs in CNVs, compact binary format, list by genotype, numeric coding (0/1/2/NA)]

Summary statistics
Filters and reports for standard metrics
- Genotyping rate
- Allele, genotype, haplotype frequencies
- Hardy-Weinberg
- Mendel errors
- Tests of non-random missingness by phenotype and by (unobserved) genotype
- Individual homozygosity estimates
- Check/impute sex based on X chromosome
- LD-based detection of strand flips (A/T and C/G SNPs are potentially ambiguous)
- Automated search for plate effects, with subsequent masking of specific SNP/individual genotypes

Population stratification
- Pairwise allele-sharing metric
- Hierarchical clustering
- Multidimensional scaling/PCA
[Figure: Han Chinese and Japanese samples plotted against a reference; pairs labelled as same population or different population]

Estimation of IBD sharing (relatedness)
[Figure: pedigree sketch with parents and two individuals of genotypes AB and AC; labels: IBS = 1, IBD = 0; most recent common ancestor from a homogeneous random-mating population]

Association analysis
Population-based
- Allelic, trend, genotypic, Fisher’s exact
- Stratified tests (Cochran-Mantel-Haenszel, Breslow-Day)
- Linear & logistic regression models: multiple covariates, interactions, joint tests, etc.
Family-based
- Disease traits: TDT / sib-TDT
- Continuous traits: QFAM (between/within model, QTDT)
Permutation procedures
- “adaptive”, max(T), gene-dropping, between/within, rank-based, within-cluster
Multilocus tests
- Haplotype estimation, set-based tests, Hotelling’s T², epistasis

OUR DATA FILESET
data.bed - Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files
data.bim - Extended variant information file accompanying a .bed binary genotype table
data.fam - Sample information file accompanying a .bed binary genotype table
FAM FILE (one line per sample)
1. Family ID ('FID')
2. Within-family ID ('IID'; cannot be '0')
3. Within-family ID of father ('0' if father isn't in dataset)
4. Within-family ID of mother ('0' if mother isn't in dataset)
5. Sex code ('1' = male, '2' = female, '0' = unknown)
6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)
BIM FILE (one line per variant)
1. Chromosome code or name
2. Variant identifier
3. Position in morgans or centimorgans
4. Base-pair coordinate
5. Allele 1 (corresponding to clear bits in .bed; usually minor)
6. Allele 2 (corresponding to set bits in .bed; usually major)

R SOFTWARE PACKAGE
What R is and what it is not
R is:
- a programming language
- a statistical package
- an interpreter
- Open Source
R is not:
- a database
- a collection of “black boxes”
- a spreadsheet software package
- commercially supported
Interpreter: in computer science, a computer program that directly executes instructions written in a programming or scripting language, without previously compiling them into a machine language program.
Black box: in science, computing, and engineering, a device, system or object which can be viewed in terms of its inputs and outputs (or transfer characteristics), without any knowledge of its internal workings.

R SOFTWARE PACKAGE
Why R?
- It's free!
- It runs on a variety of platforms including Windows, Unix and MacOS.
- It provides an unparalleled platform for programming new statistical methods in an easy and straightforward manner (packages).
- It contains advanced statistical routines not yet available in other packages.
- It has state-of-the-art graphics capabilities.

R SOFTWARE PACKAGE
History of R
- Statistical programming language S developed at Bell Labs since 1976 (at the same time as UNIX)
- Intended to interactively support research and data analysis projects
- Exclusively licensed to Insightful (“S-Plus”)
- R: open source platform similar to S, developed by R. Gentleman and R. Ihaka (University of Auckland, NZ) during the 1990s
- Since 1997: international “R-core” development team
- Updated versions available every couple of months

R SOFTWARE PACKAGE – basic knowledge
Math:
    > 1 + 1
    [1] 2
    > 1 + 1 * 7
    [1] 8
    > (1 + 1) * 7
    [1] 14
Variables (three equivalent ways to assign a value):
    > x <- 1
    > x
    [1] 1
    > y = 2
    > y
    [1] 2
    > 3 -> z
    > z
    [1] 3
Calculations:
    > (x + y) * z
    [1] 9

R SOFTWARE PACKAGE – basic knowledge
Built-in functions
R has many built-in functions that compute different statistical procedures. Functions in R are followed by ( ). Inside the parentheses we write the object (vector, matrix, array, data frame) to which we want to apply the function and the arguments of the function.

R SOFTWARE PACKAGE – basic knowledge
Objects
- Vector: a sequence of data elements of the same basic type (one-dimensional object)
- Factor: a variable which takes on a limited number of different values (categorical variable)
- Matrix: a collection of data elements of the same type arranged in a two-dimensional rectangular layout
- Data frame: a list of vectors of equal length (the columns), which may be of different types
- List: a generic vector containing other objects
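As a minimal sketch of these object types and of how a function is applied to them, the lines below build one object of each kind. The sample sizes are the ones listed for SAP, SAU and TIM in the dataset slide; the object names and the "isolate"/"open" labels are illustrative only.

```r
# Vector: elements of the same basic type
sizes <- c(24, 10, 24)

# Factor: categorical variable with a limited set of levels
group <- factor(c("isolate", "isolate", "open"))

# Matrix: two-dimensional layout, all elements of the same type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: columns are vectors of equal length, possibly of different types
pops <- data.frame(label = c("SAP", "SAU", "TIM"), size = sizes)

# List: generic container that can hold objects of any kind
everything <- list(sizes = sizes, group = group, m = m, pops = pops)

# Functions take the object (and any arguments) inside the parentheses
mean(sizes)
nlevels(group)
dim(m)
str(pops)
```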

R SOFTWARE PACKAGE – basic knowledge
Naming objects
- names should start with a letter (A-Z or a-z); a leading period is also allowed if it is not followed by a digit
- names can contain letters, digits (0-9), periods “.” and underscores “_”
- names are case-sensitive: mydata is different from MyData
- “<-” is used to indicate assignment, for example mydata<-read.table(……..)

R SOFTWARE PACKAGE – basic knowledge
Managing objects
Objects that you create during an R session are held in memory; the collection of objects that you currently have is called the workspace. This workspace is not saved on disk unless you tell R to do so. This means that your objects are lost when you close R without saving them or, worse, when R or your system crashes during a session.
To see which objects are currently in the workspace, type ls()
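A short sketch of how objects can be listed, saved to disk and restored. The file name session.RData is a hypothetical choice, and a temporary directory is used so that nothing is written to the working directory.

```r
x <- 1
y <- 2
ls()                          # names of the objects currently in the workspace

ws_file <- file.path(tempdir(), "session.RData")  # hypothetical file name
save(x, y, file = ws_file)    # write the selected objects to disk
rm(x, y)                      # remove them from memory
load(ws_file)                 # restore them from the saved file
x + y
```

save.image() works the same way but writes every object in the workspace at once.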

R SOFTWARE PACKAGE – basic knowledge
Some useful R commands, just to start getting used to the interface
    # load a file as a data frame
    mydata<-read.table("filename.extension", header=FALSE, sep="\t", row.names=NULL, col.names=NULL………)
    # export a data frame to a file
    write.table(mydata, "filename.extension", quote=TRUE, sep="\t", row.names=TRUE, col.names=TRUE……..)
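To see the two commands working together, here is a small round trip under stated assumptions: a two-row data frame is written to a temporary tab-separated file and read back. The file name mydata.txt and the two example rows are illustrative.

```r
# Small example data frame (two individuals from the Aosta sample)
mydata <- data.frame(FID = c("AOS", "AOS"), IID = c("AOS1", "AOS2"))

out_file <- file.path(tempdir(), "mydata.txt")   # hypothetical file name

# Export: tab-separated, with a header line, without quotes or row names
write.table(mydata, out_file, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = TRUE)

# Import: header = TRUE because the first line holds the column names
mydata2 <- read.table(out_file, header = TRUE, sep = "\t",
                      stringsAsFactors = FALSE)
mydata2
```

The separator and header settings used when reading must match those used when writing, otherwise the columns are not split correctly.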

R SOFTWARE PACKAGE – basic knowledge
How to use help in R?
R has a very good help system built in.
If you know which function you want help with, simply use ? followed by the name of the function (Ex: ?hist)
If you don’t know which function to use, then use help.search("the analysis or procedure you want to carry out"). Ex: help.search("histogram")

FRAPPÈ 1.1 Frappe is a program for estimating individual ancestry and admixture proportions using high-density SNP data

QUALITY CONTROL
Check for cryptic relatedness (IBD)
    # sort individuals by family ID and within-family ID (natural sort)
    system("plink --bfile data --indiv-sort n --make-bed --out data1")
--indiv-sort sorts individuals; the modifier selects the order:
- n: natural sort of family and within-family IDs
- f: use the order in another file (named in the second parameter)

QUALITY CONTROL
Check for cryptic relatedness (IBD) – initial sample size = 263
Identity by Descent (IBD) is a measure of how many alleles at any marker in each of the two samples came from the same ancestral chromosomes
    # IBD/IBS calculations
    system("plink --bfile data1 --genome rel-check --min 0.185 --out ibd")
--genome invokes an IBS/IBD computation and writes a report
- The rel-check modifier removes pairs of samples with different FIDs
--min [minimum PI_HAT value] removes lines with PI_HAT values below the given cutoff

QUALITY CONTROL
Check for cryptic relatedness (IBD)
Output file ibd.genome
FID1    Family ID for first sample
IID1    Individual ID for first sample
FID2    Family ID for second sample
IID2    Individual ID for second sample
RT      Relationship type inferred from .fam/.ped file
EZ      IBD sharing expected value, based on just .fam/.ped relationship
Z0      P(IBD=0)
Z1      P(IBD=1)
Z2      P(IBD=2)
PI_HAT  Proportion IBD, i.e. P(IBD=2) + 0.5*P(IBD=1)
PHE     Pairwise phenotypic code (1, 0, -1 = AA, AU, and UU pairs, respectively)
DST     IBS distance, i.e. (IBS2 + 0.5*IBS1) / (IBS0 + IBS1 + IBS2)
PPC     IBS binomial test
RATIO   HETHET : IBS0 SNP ratio (expected value 2)
Z0, Z1, Z2: the probabilities that the two individuals of a pair share 0, 1 or 2 alleles IBD at the marker locus. These probabilities may either refer to given markers or be thought of as sample-wide.
PI_HAT: the expected proportion of alleles shared IBD at any given marker. Expected values:
- 1 for identical twins
- 0.5 for first-degree relatives
- 0.25 for second-degree relatives
- 0.125 for third-degree relatives
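As a worked check of the PI_HAT definition, the sketch below recomputes PI_HAT from Z1 and Z2 for two pairs. The Z values are copied from the ibd.genome output shown in these slides (SAP8/SAP26 and TIM5/TIM9); the data frame name z is an arbitrary choice.

```r
# Z1 = P(IBD=1) and Z2 = P(IBD=2) for two pairs from the ibd.genome output
z <- data.frame(
  IID1 = c("SAP8", "TIM5"),
  IID2 = c("SAP26", "TIM9"),
  Z1   = c(0.5676, 0.5162),
  Z2   = c(0.0191, 0.3339)
)

# PI_HAT = P(IBD=2) + 0.5 * P(IBD=1)
z$PI_HAT <- z$Z2 + 0.5 * z$Z1
round(z$PI_HAT, 4)   # 0.3029 0.5920, matching the PI_HAT column of the report
```

The first pair falls between the values expected for second-degree and first-degree relatives, the second is slightly above the first-degree value.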

QUALITY CONTROL
Check for cryptic relatedness (IBD)
Output file ibd.genome
FID1  IID1   FID2  IID2   RT  EZ  Z0      Z1      Z2      PI_HAT  PHE  DST       PPC  RATIO
SAP   SAP8   SAP   SAP26  OT  0   0.4133  0.5676  0.0191  0.3029       0.782408  1    5.5136
SAP   SAP20  SAP   SAP22  OT  0   0.6519  0.3252  0.0229  0.1855       0.757841  1    3.2562
SAU   SAU6   SAU   SAU8   OT  0   0.4763  0.4813  0.0424  0.283        0.77987   1    4.2709
SUL   SUL13  SUL   SUL14  OT  0   0.4388  0.5465  0.0147  0.2879       0.778927  1    5.392
SUL   SUL15  SUL   SUL16  OT  0   0.4129  0.5871  0       0.2936       0.778254  1    5.6008
TIM   TIM1   TIM   TIM27  OT  0   0.5177  0.4541  0.0282  0.2553       0.77298   1    4.0868
TIM   TIM2   TIM   TIM8   OT  0   0.3225  0.4843  0.1932  0.4353       0.822874  1    7.1703
TIM   TIM5   TIM   TIM9   OT  0   0.1499  0.5162  0.3339  0.592        0.866074  1    18.25
TIM   TIM5   TIM   TIM25  OT  0   0.6341  0.3239  0.042   0.204        0.763115  1    3.3646
TIM   TIM5   TIM   TIM30  OT  0   0.001   0.9819  0.0171  0.5081       0.825702  1    1556
TIM   TIM6   TIM   TIM14  OT  0   0.2284  0.4678  0.3038  0.5377       0.852441  1    11.4723
TIM   TIM7   TIM   TIM10  OT  0   0.4963  0.4586  0.0451  0.2744       0.778243  1    3.9923
TIM   TIM9   TIM   TIM25  OT  0   0.6471  0.3194  0.0335  0.1932       0.760233  1    3.0965
TIM   TIM9   TIM   TIM30  OT  0   0.4315  0.5524  0.0161  0.2923       0.779949  1    4.8404
TIM   TIM16  TIM   TIM24  OT  0   0.0005  0.949   0.0505  0.525        0.831675  1    3188
TIM   TIM25  TIM   TIM26  OT  0   0.0005  0.9688  0.0307  0.5151       0.828169  1    3051

QUALITY CONTROL
Check for cryptic relatedness (IBD)
    #read output of IBD/IBS analysis
    #first argument: file to open (between "")
    #header: whether the first row is a header (logical TRUE/FALSE)
    ibd<-read.table("ibd.genome", header=T)
    #eliminate columns that are of no interest (just to simplify view)
    ibd<-subset(ibd, select=c(FID1, IID1, FID2, IID2, PI_HAT))
    #see on screen just the first 6 rows
    head(ibd)

QUALITY CONTROL
Check for cryptic relatedness (IBD)
Select individuals to remove from the dataset
We don't need to drop all individuals; we can keep one from each pair.
Exercise: select, from each pair, one individual to exclude from the dataset. Beware that some individuals are present in more than one pair.
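One possible way to automate this choice is a greedy heuristic: repeatedly drop the individual that appears in the largest number of remaining pairs until no pair is left. This is a sketch, not the selection made in the slides, and it may return a different (equally valid) set of individuals. The pairs are copied from the ibd.genome output; the object names are illustrative.

```r
# Related pairs (PI_HAT >= 0.185) copied from the ibd.genome output
pairs <- data.frame(
  IID1 = c("SAP8", "SAP20", "SAU6", "SUL13", "SUL15", "TIM1", "TIM2", "TIM5",
           "TIM5", "TIM5", "TIM6", "TIM7", "TIM9", "TIM9", "TIM16", "TIM25"),
  IID2 = c("SAP26", "SAP22", "SAU8", "SUL14", "SUL16", "TIM27", "TIM8", "TIM9",
           "TIM25", "TIM30", "TIM14", "TIM10", "TIM25", "TIM30", "TIM24", "TIM26"),
  stringsAsFactors = FALSE
)

to_remove <- character(0)
remaining <- pairs
while (nrow(remaining) > 0) {
  # Count how many remaining pairs each individual belongs to
  counts <- table(c(remaining$IID1, remaining$IID2))
  # Drop the individual involved in the most pairs (first one in case of ties)
  worst <- names(counts)[which.max(counts)]
  to_remove <- c(to_remove, worst)
  # Discard every pair that contained this individual
  remaining <- remaining[remaining$IID1 != worst & remaining$IID2 != worst, ,
                         drop = FALSE]
}
to_remove
```

After the loop, every related pair has lost at least one member, so no two retained individuals are listed together in the report.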

QUALITY CONTROL
Check for cryptic relatedness (IBD)
FID1  IID1   FID2  IID2   PI_HAT
SAP   SAP8   SAP   SAP26  0.3029
SAP   SAP20  SAP   SAP22  0.1855
SAU   SAU6   SAU   SAU8   0.283
SUL   SUL13  SUL   SUL14  0.2879
SUL   SUL15  SUL   SUL16  0.2936
TIM   TIM1   TIM   TIM27  0.2553
TIM   TIM2   TIM   TIM8   0.4353
TIM   TIM5   TIM   TIM9   0.592
TIM   TIM5   TIM   TIM25  0.204
TIM   TIM5   TIM   TIM30  0.5081
TIM   TIM6   TIM   TIM14  0.5377
TIM   TIM7   TIM   TIM10  0.2744
TIM   TIM9   TIM   TIM25  0.1932
TIM   TIM9   TIM   TIM30  0.2923
TIM   TIM16  TIM   TIM24  0.525
TIM   TIM25  TIM   TIM26  0.5151

QUALITY CONTROL
Check for cryptic relatedness (IBD)
Individuals selected for removal (at least one from each related pair):
FID  IID
SAP  SAP26
SAP  SAP22
SAU  SAU8
SUL  SUL14
SUL  SUL16
TIM  TIM27
TIM  TIM8
TIM  TIM14
TIM  TIM10
TIM  TIM24
TIM  TIM26
TIM  TIM5
TIM  TIM9
We have to build a file listing these IIDs. The file must have the family ID (FID) in the first column and the individual ID (IID) in the second.

QUALITY CONTROL
Check for cryptic relatedness (IBD)
    #create data frame with FID and IID
    fid<-c("SAP", "SAP", "SAU", "SUL", "SUL", "TIM", "TIM", "TIM", "TIM", "TIM", "TIM", "TIM", "TIM")
    iid<-c("SAP26", "SAP22", "SAU8", "SUL14", "SUL16", "TIM27", "TIM24", "TIM8", "TIM26", "TIM5", "TIM9", "TIM14", "TIM10")
    ibd_out<-data.frame(fid, iid)
    #write table to file
    write.table(ibd_out, "ibd-out.txt", quote=F, col.names=F, row.names=F)
    #remove individuals
    system("plink --bfile data1 --remove ibd-out.txt --make-bed --out data2")

RUNS OF HOMOZYGOSITY
Stretches of consecutive homozygous genotypes
    system("plink --bfile data2 --homozyg --homozyg-snp 14 --homozyg-kb 500 --out roh")
--homozyg: flag to perform the RoH analysis
--homozyg-snp: minimum number of SNPs that a genomic stretch must contain to be considered a RoH (default value = 100)
--homozyg-kb: minimum length of a RoH, in kb (default value = 1000 kb, i.e. 1 Mb)
How RoHs are called: take a window of X SNPs and slide it across the genome. At each window position determine whether this window looks 'homozygous' enough (yes/no), i.e. allowing for some number of hets or missing calls. Then, for each SNP, calculate the proportion of 'homozygous' windows that overlap that position. Call segments based on this metric, e.g. based on a threshold for the average.
Settings used here: one heterozygous call and 5 missing calls allowed per window; threshold for the proportion of homozygous windows 0.05

RUNS OF HOMOZYGOSITY
Number of SNPs: 87818
Total size of the human genome: 3.2 × 10^9 bp
If we want to analyse genome segments of at least 500 kb we have to check, on the basis of our SNP density, the number of SNPs that such a segment should contain.
Calculate density: 3200000000 / 87818 = 36439, meaning that we have on average one SNP every 36439 bp
A segment of 500 kb should contain 500000 / 36439 = ~14 SNPs
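The same arithmetic can be reproduced in R, which makes it easy to redo the calculation for a different SNP count or minimum segment length. The variable names are illustrative.

```r
n_snps     <- 87818    # SNPs after quality control
genome_bp  <- 3.2e9    # approximate size of the human genome in bp
min_roh_bp <- 500e3    # minimum RoH length of interest (500 kb)

bp_per_snp   <- genome_bp / n_snps        # average spacing between SNPs
snps_per_roh <- min_roh_bp / bp_per_snp   # SNPs expected in a 500 kb segment

round(bp_per_snp)       # 36439
ceiling(snps_per_roh)   # 14, the value passed to --homozyg-snp
```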

RUNS OF HOMOZYGOSITY
OUTPUTS: .hom, .hom.indiv, .hom.summary
.hom.summary (one line per SNP)
CHR  SNP         BP       AFF  UNAFF
1    rs12562034  768448   0    3
1    rs12726255  1049950  0    3
1    rs2887286   1156131  0    3
1    rs6685064   1211292  0    3
1    rs6603793   1505255  0    3
.hom.indiv (one line per individual)
FID  IID   PHE  NSEG  KB       KBAVG
AOS  AOS1  1    5     31368.4  6273.67
AOS  AOS2  1    18    61831.2  3435.06
AOS  AOS3  1    12    31397.8  2616.48
AOS  AOS4  1    8     11240.5  1405.06
AOS  AOS5  1    7     17885.9  2555.13
AOS  AOS6  1    7     18833.3  2690.48
.hom (one line per RoH)
FID  IID   PHE  CHR  SNP1       SNP2        POS1      POS2      KB        NSNP  DENSITY  PHOM   PHET
AOS  AOS1  1    1    rs6656088  rs10926264  2.26E+08  2.36E+08  9869.323  518   19.053   0.998  0.002
AOS  AOS1  1    2    rs1445131  rs12620067  18796756  19604309  807.554   48    16.824   0.979  0.021
AOS  AOS1  1    3    rs4380449  rs6767173   32310130  45220085  12909.96  351   36.781   0.997  0.003
AOS  AOS1  1    5    rs3909885  rs3892476   1.27E+08  1.33E+08  6331.052  172   36.808   1      0
AOS  AOS1  1    15   rs4357892  rs4366688   71822297  73272784  1450.488  76    19.085   0.974  0.026
AOS  AOS2  1    1    rs325921   rs11249395  1.12E+08  1.21E+08  9647.457  458   21.064   0.998  0

RUNS OF HOMOZYGOSITY
    #Load output files to R
    roh<-read.table("roh.hom.indiv", header=T)
    #Load ggplot package
    library(ggplot2)
    #Perform a boxplot analysis on the total lengths of RoHs
    plot_kb<-ggplot(roh, aes(x=FID, y=KB, fill=FID))+geom_boxplot()

RUNS OF HOMOZYGOSITY
    #Perform a boxplot analysis on the total lengths of RoHs with x axis sorted according to median values
    plot_kb<-ggplot(roh, aes(x=reorder(FID, KB, FUN=median), y=KB, fill=FID))+geom_boxplot()+labs(title="RoH total length", x="Populations", y="RoH length (Kb)")
    #Save graph as pdf
    ggsave("boxplot_KB.pdf", plot=plot_kb)

RUNS OF HOMOZYGOSITY
    #Perform a boxplot analysis on the number of RoHs with x axis sorted according to median values
    plot_nseg<-ggplot(roh, aes(x=reorder(FID, NSEG, FUN=median), y=NSEG, fill=FID))+geom_boxplot()+labs(title="Number of RoHs", x="Populations", y="RoH number")
    #Save graph as pdf
    ggsave("boxplot_nseg.pdf", plot=plot_nseg)

RUNS OF HOMOZYGOSITY
Scatterplot of the average values of number and total length of RoHs
The roh.hom.indiv file we loaded and named roh has individual values. From this file we have to obtain the average values of total length and number of RoHs for each population.
    #Load the plyr package
    library(plyr)
    #Create a data frame with the average values of total length and number of RoHs for each population
    roh_sum<-ddply(roh, ~FID, summarise, mean_KB=mean(KB), mean_NSEG=mean(NSEG))
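For readers without plyr, the same per-population averages can be sketched with base R's aggregate(). This is an alternative to the ddply call, not the method used in the slides. The example uses only the six Aosta rows of the .hom.indiv output shown earlier, so it yields a single row.

```r
# Six AOS individuals copied from the .hom.indiv output
roh <- data.frame(
  FID  = rep("AOS", 6),
  IID  = paste0("AOS", 1:6),
  NSEG = c(5, 18, 12, 8, 7, 7),
  KB   = c(31368.4, 61831.2, 31397.8, 11240.5, 17885.9, 18833.3)
)

# Mean total length and mean number of RoHs for each population (FID)
roh_sum <- aggregate(cbind(KB, NSEG) ~ FID, data = roh, FUN = mean)
names(roh_sum) <- c("FID", "mean_KB", "mean_NSEG")
roh_sum
```

With the full roh data frame the result has one row per population and the same column names as the ddply version, so the plotting code that follows can be used unchanged.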

RUNS OF HOMOZYGOSITY
    #Draw the scatterplot
    scatterplot<-ggplot(roh_sum, aes(x=mean_KB, y=mean_NSEG, colour=FID))+geom_point()

RUNS OF HOMOZYGOSITY
    #Draw the scatterplot with axis titles and larger points
    scatterplot<-ggplot(roh_sum, aes(x=mean_NSEG, y=mean_KB, colour=FID))+geom_point(size=2)+labs(x="Average number of RoHs", y="Average total length of RoHs (Kb)")
    #add labels to points
    scatterplot+geom_text(aes(label=factor(FID)), size=3, hjust=1, vjust=1)

GGPLOT CHANGE COLORS
    #one colour per population: blue for open populations, red for isolates
    pop_colours<-c(AOS="#25418A", BOL="#25418A", LUC="#25418A", CAL="#25418A", SIC="#25418A", BEN="#CC1F26", NSA="#CC1F26", SUL="#CC1F26", SAP="#CC1F26", SAU="#CC1F26", TIM="#CC1F26")
    #change colors of the boxplots
    plot_kb<-plot_kb+scale_fill_manual(values=pop_colours)
    #change colors of the scatterplot
    scatterplot<-scatterplot+scale_colour_manual(values=pop_colours)


INTRA-POPULATION IBS IDENTITIES
A DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment. In our case, it is the proportion of loci that are IBS in each pair of individuals.
    #Perform IBS analysis
    system("plink --bfile data2 --distance square0 ibs --out ibs")
--distance ... ibs: IBS calculation
square0: square matrix with all cells in the upper right triangle zeroed out
Output files:
- ibs.mibs: a square matrix with 250 rows and 250 columns
- ibs.mibs.id: the FID and IID of each individual, in the same order as the rows of the matrix
We have to perform the analysis for each population separately and then merge the files.

INTRA-POPULATION IBS IDENTITIES
1. We will use the --keep flag of PLINK to perform the analysis only on the selected individuals
    #Perform the analysis only within the Aosta population
    system("plink --bfile data2 --keep aos.txt --cluster --matrix --out ibs")
aos.txt: a file with the list of individuals that we want to use for the specific analysis. The format should be FID in the first column and IID in the second, without a header line:
AOS AOS1
AOS AOS2
AOS AOS3
AOS AOS4
AOS AOS5
AOS AOS6
AOS AOS7
AOS AOS8

INTRA-POPULATION IBS IDENTITIES
    #generate files to use with the --keep flag
    #read the .fam file into R
    fam<-read.table("data2.fam")
    #keep only the first two columns (FID and IID)
    fam<-fam[,c("V1","V2")]
    #select only the rows of interest
    aos<-subset(fam, V1=="AOS", select=c(V1, V2))
    #save file to computer
    write.table(aos, "aos.txt", quote=F, row.names=F, col.names=F)
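Since the same steps have to be repeated for every population, a loop can generate all the --keep files at once. The sketch below assumes a small stand-in for the first two columns of data2.fam (the four rows are illustrative, as are the BEN individual IDs), writes the files to a temporary directory and only builds the PLINK command strings without running them.

```r
# Illustrative stand-in for the first two columns of data2.fam
fam <- data.frame(V1 = c("AOS", "AOS", "BEN", "BEN"),
                  V2 = c("AOS1", "AOS2", "BEN1", "BEN2"),
                  stringsAsFactors = FALSE)

out_dir <- tempdir()
cmds <- character(0)
for (pop in unique(fam$V1)) {
  # One keep file per population, e.g. aos.txt, ben.txt
  keep_file <- file.path(out_dir, paste0(tolower(pop), ".txt"))
  write.table(fam[fam$V1 == pop, c("V1", "V2")], keep_file,
              quote = FALSE, row.names = FALSE, col.names = FALSE)
  # Command string only; pass it to system() to actually run PLINK
  cmds[pop] <- paste0("plink --bfile data2 --keep ", keep_file,
                      " --distance square0 ibs --out ibs_", tolower(pop))
}
cmds
```

Wrapping each element of cmds in system() would then run the per-population IBS analysis, provided PLINK is installed and data2 is in the working directory.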

INTRA-POPULATION IBS IDENTITIES
    #Perform the analysis only within the Aosta population
    system("plink --bfile data2 --keep aos.txt --distance square0 ibs --out ibs_aos")
    #load file to R
    aos<-read.table("ibs_aos.mibs")
    #load file with names
    aos_n<-read.table("ibs_aos.mibs.id")
    #add column with the population label to the data file
    aos$POP<-aos_n$V1

INTRA-POPULATION IBS IDENTITIES
    #vectorize matrix
    library(reshape2)
    aos<-melt(aos, id.vars="POP")
    #subset only values different from 1 and 0
    aos<-subset(aos, value>0 & value<1, select=c(POP, value))
EXERCISE: Repeat the whole procedure for the remaining 10 populations
AOS BEN BOL CAL LUC NSA SAP SAU SIC SUL TIM
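The purpose of dropping the 0 and 1 values is to discard the zeroed upper triangle and the diagonal (each individual compared with itself), leaving one value per pair. A base R sketch of the same idea, using lower.tri() on a made-up 3 x 3 matrix in the square0 layout (the IBS values are illustrative, not taken from the data):

```r
# Illustrative 3 x 3 IBS matrix in the square0 layout:
# diagonal = 1, upper right triangle zeroed out
ibs_mat <- matrix(c(1.00, 0.00, 0.00,
                    0.76, 1.00, 0.00,
                    0.78, 0.77, 1.00),
                  nrow = 3, byrow = TRUE)

# Keep only the cells below the diagonal: one value per pair of individuals
pairwise <- ibs_mat[lower.tri(ibs_mat)]

# Long format with the population label, as used for the violin plot
aos_long <- data.frame(POP = "AOS", value = pairwise)
aos_long
```

For n individuals this gives n(n-1)/2 values. Unlike the filter on 0 and 1, it would also keep a pair whose IBS happened to be exactly 0 or 1.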

INTRA-POPULATION IBS IDENTITIES
    #merge population data
    ibs<-rbind(aos, ben, bol, cal, luc, nsa, sap, sau, sic, sul, tim)
    #perform violin plot
    library(ggplot2)
    plot_ibs<-ggplot(ibs, aes(x=reorder(POP, value, FUN=median), y=value, fill=POP))+geom_violin()

INTRA-POPULATION IBS IDENTITIES
    #add title and axis labels
    plot_ibs<-plot_ibs+labs(title="Pairwise IBS identities", x="Populations", y="IBS")
    #change colors: blue for open populations, red for isolates
    plot_ibs<-plot_ibs+scale_fill_manual(values=c(AOS="#25418A", BOL="#25418A", LUC="#25418A", CAL="#25418A", SIC="#25418A", BEN="#CC1F26", NSA="#CC1F26", SUL="#CC1F26", SAP="#CC1F26", SAU="#CC1F26", TIM="#CC1F26"))
    #Save graph as pdf
    ggsave("ibs_plot.pdf", plot=plot_ibs)

INTER-POPULATION ANALYSES
These analyses aim at quantifying the differences between populations from a genetic point of view.
Multivariate analyses: analyses based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time.
Principal components analysis: a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.
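To make the definition concrete, the sketch below runs a PCA with base R's prcomp() on a small simulated genotype matrix. The data are random and not real genotypes, and prcomp is used here only for illustration; the exercises compute the PCA with PLINK.

```r
set.seed(1)
# Simulated genotypes coded 0/1/2: 10 individuals x 5 SNPs (not real data)
geno <- matrix(rbinom(50, size = 2, prob = 0.3), nrow = 10, ncol = 5)

# PCA on the centred genotype matrix
pca_toy <- prcomp(geno, center = TRUE)
scores <- pca_toy$x          # coordinates of each individual on each component

ncol(scores)                 # number of components, at most the number of SNPs
round(cov(scores), 10)       # off-diagonal entries are 0: components are uncorrelated
```

The diagonal of the covariance matrix holds the variance of each component, in decreasing order from PC1 onwards.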

INTER-POPULATION ANALYSES
Example: "Genes mirror geography within Europe", Novembre et al. 2008, Nature (3000 individuals, 197146 SNPs)

INTER-POPULATION ANALYSES
    #Perform PCA analysis
    system("plink --bfile data2 --pca header var-wts --out pca")
--pca: command to perform the PCA analysis. By default it extracts the first 20 components (to change this, add the number of components as a modifier)
- header: adds a header line to the .eigenvec file
- var-wts: produces a file (pca.eigenvec.var) with variant weights
3 output files:
- pca.eigenval: the eigenvalues (the amount of variance explained) of each extracted component
- pca.eigenvec: the principal component scores for each row (individual)
- pca.eigenvec.var: the weights of the factors (variants, SNPs)

INTER-POPULATION ANALYSES
pca.eigenvec (first rows)
FID IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20
AOS AOS1 0.008617 -0.02036 -0.03062 -0.00659 -0.18797 -0.01756 0.014795 0.024845 0.007103 -0.03884 -0.00046 -0.00127 -0.01503 -0.00385 -0.06458 0.041162 -0.0073 -0.00519 0.020903 0.071287
AOS AOS2 0.025512 -0.01396 -0.02877 0.010402 -0.15436 -0.02154 -0.03089 -0.05812 0.189323 0.025221 0.084268 -0.00405 0.051677 0.034457 -0.0751 -0.06764 -0.10291 -0.05117 0.026115 0.021141
AOS AOS3 0.01381 -0.02024 -0.02239 -0.03321 -0.16674 0.02277 0.087292 -0.05727 0.00389 0.062924 0.010222 -0.02824 0.055983 -0.02018 0.040646 0.000623 0.062372 0.056792 -0.19187 0.07628
AOS AOS4 0.006152 -0.01458 -0.03944 -0.01818 -0.2132 0.013522 -0.00798 -0.0442 0.003575 0.039466 -0.13441 0.026946 -0.05145 -0.06973 0.038007 0.160309 0.055466 0.090239 0.04795 0.010133
AOS AOS5 0.011842 -0.02861 -0.02741 -0.00212 -0.14644 -0.01495 -0.01327 0.009121 0.06378 0.092168 0.088561 0.077854 0.04919 0.036782 0.11644 -0.03209 -0.0223 -0.05094 -0.07046 0.099836
AOS AOS6 0.00867 -0.02366 -0.03155 0.010667 -0.16222 0.000497 0.008434 -0.05693 0.035603 -0.02957 -0.01628 -0.07763 0.042746 -0.10533 0.060278 0.025482 0.031055 0.02284 0.029388 -0.00146
AOS AOS7 0.01593 -0.02055 -0.02476 -0.01547 -0.11194 0.030705 0.007193 0.012007 0.030792 0.042568 0.034269 0.033076 0.014141 -0.00834 0.030864 0.097743 0.011035 -0.02487 0.075431 0.054567
    #Load PCA files to R
    pca<-read.table("pca.eigenvec", header=T)
    pca_eig<-read.table("pca.eigenval")
    pca_var<-read.table("pca.eigenvec.var", header=T)

INTER-POPULATION ANALYSES
    #Calculate the proportion of variation explained by each component
    colnames(pca_eig)<-"EIG"
    #Create a vector with the proportions of variance (relative to the sum of the extracted eigenvalues)
    x<-pca_eig$EIG/sum(pca_eig$EIG)
    #Add this vector as a new column in the data frame
    pca_eig$VAR<-x
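A worked example of this calculation on made-up eigenvalues (the four numbers below are illustrative and stand in for the contents of pca.eigenval). The CUM column is an addition for illustration and is not part of the original code.

```r
# Made-up eigenvalues standing in for the contents of pca.eigenval
pca_eig <- data.frame(EIG = c(4, 3, 2, 1))

# Proportion of the summed eigenvalues accounted for by each component
pca_eig$VAR <- pca_eig$EIG / sum(pca_eig$EIG)

# Cumulative proportion, useful to decide how many components to inspect
pca_eig$CUM <- cumsum(pca_eig$VAR)

pca_eig$VAR   # 0.4 0.3 0.2 0.1
```

Because PLINK writes only the extracted components (20 by default) to pca.eigenval, these proportions are relative to the variance captured by those components, not to the total variance in the data.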

INTER-POPULATION ANALYSES
    #Produce a bar chart with the proportion of variation explained by each component
    eig<-ggplot(pca_eig, aes(x=reorder(row.names(pca_eig), VAR, FUN=max), y=VAR))+geom_bar(stat="identity")

INTER-POPULATION ANALYSES
    #Produce the PCA plot for the first two principal components
    pca12<-ggplot(pca, aes(x=PC1, y=PC2, colour=FID))+geom_point()
    #Add labels
    pca12<-pca12+geom_text(aes(label=FID), size=3)
But maybe we don't want to show the points, so:
    pca12<-ggplot(pca, aes(x=PC1, y=PC2, colour=FID))+geom_point(size=0)
    pca12<-pca12+geom_text(aes(label=FID), size=3)
    ggsave("pca12.pdf", plot=pca12)

INTER-POPULATION ANALYSES
    #Produce the PCA plot for the third and fourth principal components
    pca34<-ggplot(pca, aes(x=PC3, y=PC4, colour=FID))+geom_point(size=0)
    #Add labels
    pca34<-pca34+geom_text(aes(label=FID), size=3)
    ggsave("pca34.pdf", plot=pca34)

INTER-POPULATION ANALYSES
To have a better look at the open populations we must crop the graph:
    #limit graph view to specific values of X and/or Y
    pca34+xlim(-0.12, 0)

INTER-POPULATION ANALYSES
Weights of the factors (variants) in determining the principal component scores
    #Subset the data: keep only variants with a PC1 weight of at least 2 in absolute value
    p<-subset(pca_var, PC1>=2 | PC1<=-2, select=c(CHR, VAR, PC1))
    #Produce graphics
    var<-ggplot(p, aes(x=reorder(VAR, CHR, FUN=max), y=PC1, colour=factor(CHR)))+geom_point(stat="identity", alpha=.2)
