The R genetics package: Tools for statistical genetics. Gregory R. Warnes, Associate Director, NonClinical Statistics, Pfizer Global R&D, Groton CT.


1 The R genetics package: Tools for statistical genetics
Gregory R. Warnes, Associate Director, NonClinical Statistics, Pfizer Global R&D, Groton CT

2 Outline (CT ASA Mini Conference: 2005-03-05)
- Project Goals
    - Simplify Population Genetic Analysis
- Design Details
    - Extend R Factor objects
- Functions Included
    - Genetic data: Importing & Creation, Manipulation, Information, Annotation, Transformation, Export
    - Statistical Functions: Hardy-Weinberg (Dis-)Equilibrium, Linkage Disequilibrium, Haplotype Imputation, Sample-size tools
- Simple Examples
    - Creating Genotype Objects
- Example Session
- Future Development:
    - Emulate BioConductor Project
    - Large scale SNP analysis
    - Formal Object Class
    - Multi-team collaboration

3 Abstract
The genetics package for the R statistical environment provides convenient classes and methods for handling genetic data. Features include:
- creating, representing, and manipulating variables containing single-locus genetic information
- performing common genetic calculations, including genotype and allele frequencies
- estimating and testing for departure from Hardy-Weinberg equilibrium (HWE) of individual markers
- estimating, testing and plotting linkage disequilibrium (LD) between sets of markers
- performing sample size calculations for genetic markers
- tools for representing the specific relationship among marker alleles (e.g. dominant, recessive, additive, heterozygote advantage) in standard R statistical models
In addition to these standard methods, the package also provides two novel capabilities:
- Estimation of departure from HWE for multi-allelic markers
- Confidence intervals for HWE and LD which account for the bounded nature of the estimates in order to achieve proper coverage
The genetics package makes it significantly easier to manipulate and analyse genetic marker information.

4 Abstract
During this presentation I will
- Describe the goals of the R genetics package,
- Introduce the basic features of the R genetics package,
- Provide a brief worked example, including the application of the novel capabilities, and
- Discuss the ongoing project to develop a next-generation package for efficiently handling large volumes of genetic marker data (e.g. for whole genome SNP scans).

5 Problem
- At each genetic position within a gene, diploid cells have two alleles.
- This suggests storing each allele as a separate variable.
- However, most laboratory methods cannot distinguish between A/B and B/A, yielding three observed genotypes at each position: (A/A), (A/B or B/A), (B/B). Consequently, the observed alleles are confounded.
- This suggests the use of a single genotype variable.
- This duality is not directly handled by standard statistical packages.
- As a consequence, the need to handle both views creates complexity when manipulating or including genotype data in statistical analysis.

6 Initial Project Goals
Simplify Statistical Analysis using Genetic Data by providing:
- A genotype object class that appropriately captures the single variable / separate allele duality
- Methods to import and manipulate genotype objects without string manipulation
- Simple tools for including different views of genotype variables in standard statistical models
    - Dominant (at least one copy of X)
    - Recessive (both alleles are X)
    - Additive (number of copies of X)
    - Heterozygote Effect (differing alleles)
    - Independent (separate effect for each allele combination: A/A, A/B=B/A, B/B)
- Functions for computing and visualizing common genetic summaries and statistical tests
    - Allele Frequencies
    - Hardy-Weinberg Equilibrium
    - Linkage Disequilibrium
    - Other statistical methods
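
As a rough illustration of the five views listed on this slide, the following base-R sketch derives each coding from two allele vectors. It does not use the genetics package; the variable names (a1, a2, target) are invented for the example, and the data mirror the eight-subject toy sample used in the later example slides.

```r
# Illustrative base-R sketch; the genetics package provides its own helpers.
a1 <- c("A", "A", "C", "C", NA, "A", "A", "A")   # first allele per subject
a2 <- c("A", "C", "C", "A", NA, "A", "C", "C")   # second allele per subject
target <- "A"                                     # allele X of interest (assumed)

additive     <- (a1 == target) + (a2 == target)   # number of copies of X: 0, 1 or 2
dominant     <- additive >= 1                     # at least one copy of X
recessive    <- additive == 2                     # both alleles are X
heterozygous <- a1 != a2                          # the two alleles differ

# Independent view: one factor level per unordered allele pair,
# so that A/C and C/A fall into the same level.
independent <- factor(ifelse(a1 <= a2,
                             paste(a1, a2, sep = "/"),
                             paste(a2, a1, sep = "/")))
```

Each resulting vector can be placed in a data frame and used as an ordinary covariate in a standard model formula; missing genotypes propagate as NA in every view.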

7 Design Details
- Design:
    - Genotypes are stored in Factor objects, with factor levels formatted as A/C.
    - A translation table is constructed to quickly extract individual allele information:

        Genotype  Allele 1  Allele 2
        A/A       A         A
        A/B       A         B
        B/B       B         B

- Consequences:
    - Can be stored in standard data frames
    - Can be efficiently manipulated (space & time)
    - Permits both biallelic (C/T) and multi-allelic genetic markers (SSLPs)
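
The factor-plus-translation-table idea can be sketched in base R as follows. This is only an approximation of the design described on the slide, not the package's actual internals; the helper normalise() and the object names are hypothetical.

```r
# Base-R sketch only; the genetics package's real implementation may differ.
raw <- c("A/A", "A/C", "C/C", "C/A", NA, "A/A", "A/C", "A/C")

# Normalise each genotype so that "C/A" and "A/C" become the same level.
normalise <- function(g) {
  if (is.na(g)) return(NA_character_)
  paste(sort(strsplit(g, "/", fixed = TRUE)[[1]]), collapse = "/")
}
geno <- factor(vapply(raw, normalise, character(1), USE.NAMES = FALSE))

# Translation table: one row per factor level, one column per allele.
allele_map <- do.call(rbind, strsplit(levels(geno), "/", fixed = TRUE))
dimnames(allele_map) <- list(levels(geno), c("Allele 1", "Allele 2"))

# Extracting alleles is an integer index into the small table,
# so no per-subject string parsing is needed.
alleles <- allele_map[as.integer(geno), , drop = FALSE]
```

Because the genotypes live in an ordinary factor, the vector fits into a data frame column, and the table has one row per distinct genotype rather than one per subject.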

8 Genotype Manipulation
- Importing & Creation
    - genotype(), as.genotype(), makeGenotypes(), …
    - haplotype(), as.haplotype(), makeHaplotypes(), …
- Manipulation
    - [] (subsetting), []<- (subset assignment), == (equality)
- Information
    - summary() (Allele and genotype counts and frequencies)
    - allele.names()
    - allele() (Extract individual alleles)
    - nallele() (Number of distinct allele values)
- Annotation
    - locus(), gene(), marker(), …
- Transformation
    - carrier(), homozygote(), heterozygote(), allele.count()
- Export
    - write.marker.file(), write.pedigree.file(), write.pop.file()

9 Installation
Windows GUI:
Command Line:

    > install.packages("genetics", dependencies=TRUE)

10 Statistical Functions
- Hardy-Weinberg (Dis-)Equilibrium: D, D', r, r^2, X^2
    - diseq(), diseq.ci() (Confidence Intervals!)
    - HWE.test(), HWE.chisq(), HWE.exact()
- Linkage Disequilibrium: D, D', r, r^2
    - LD(), LDplot(), LDtable()
- Haplotype Imputation:
    - hap(), hapambig(), hapmcmc(), hapenum(), hapshuffle()
- Sample-size tools
    - gregorius() (Probability of observing a marker of given frequency with specified sample size)
    - power.casectrl()
- Utilities
    - Bootstrap.ci
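
To make the Hardy-Weinberg quantities concrete, here is a minimal base-R sketch for a single biallelic marker, using the genotype counts from the toy sample in the example slides (2 A/A, 4 A/C, 1 C/C). It is not the package's implementation: D follows one common convention (observed A/A frequency minus p^2), which may differ from the coefficient that diseq() reports, and the chi-square approximation is crude for a sample this small.

```r
# Base-R sketch under the assumptions stated above; not the genetics package code.
obs <- c(AA = 2, AC = 4, CC = 1)                      # observed genotype counts
n   <- sum(obs)                                       # genotyped subjects
p   <- unname((2 * obs["AA"] + obs["AC"]) / (2 * n))  # frequency of allele A
q   <- 1 - p                                          # frequency of allele C

# Expected genotype counts under Hardy-Weinberg equilibrium.
expected <- n * c(AA = p^2, AC = 2 * p * q, CC = q^2)

# Disequilibrium coefficient: excess (or deficit) of A/A homozygotes.
D <- unname(obs["AA"] / n - p^2)

# Pearson chi-square statistic; 1 degree of freedom
# (3 genotype classes - 1 - 1 estimated allele frequency).
X2      <- sum((obs - expected)^2 / expected)
p_value <- pchisq(X2, df = 1, lower.tail = FALSE)
```

With so few subjects, an exact test (the slide lists HWE.exact()) would normally be preferred over the chi-square approximation.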

11 Simple Examples: Creating Genotype Objects
A single vector with a character separator:

    > g1 <- genotype( c('A/A','A/C','C/C','C/A',
    +                   NA,'A/A','A/C','A/C') )

    > g3 <- genotype( c('A A','A C','C C','C A',
    +                   '','A A','A C','A C'),
    +                 sep=' ', remove.spaces=F)

12 Simple Examples: Creating Genotype Objects
A single vector with a positional separator:

    > g2 <- genotype( c('AA','AC','CC','CA','',
    +                   'AA','AC','AC'), sep=1 )

Two separate vectors:

    > g4 <- genotype(
    +   c('A','A','C','C','','A','A','A'),
    +   c('A','C','C','A','','A','C','C')
    + )

13 Simple Examples: Creating Genotype Objects
A dataframe or matrix with two columns:

    > gm <- cbind(
    +   c('A','A','C','C','','A','A','A'),
    +   c('A','C','C','A','','A','C','C') )
    > gm
         [,1] [,2]
    [1,] "A"  "A"
    [2,] "A"  "C"
    [3,] "C"  "C"
    [4,] "C"  "A"
    …
    > g5 <- genotype( gm )
    > g5
    [1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
    Alleles: A C

14 Simple Examples: Creating Genotype Objects
Convert 1-column genotype variables read from a file:

    > gm1 <- makeGenotypes(
    +          read.csv("gm1.csv"))
    > gm1
      Age Sex   G1  G2
    1  31   M  A/A G/T
    2  27   F  A/C G/G
    3  35   M  C/C G/T
    4  19   M  A/C G/T
    5  55   M <NA> G/G
    6  34   F  A/A G/G
    7  45   F  A/C T/T
    8  32   M  A/C G/T
    > gm1$G1
    [1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
    Alleles: A C

Contents of gm1.csv:

    Age,Sex,G1,G2
    31,M,A/A,G/T
    27,F,A/C,G/G
    35,M,C/C,G/T
    19,M,A/C,G/T
    55,M,,G/G
    34,F,A/A,G/G
    45,F,A/C,T/T
    32,M,A/C,G/T

15 Simple Examples: Creating Genotype Objects
Convert 2-column genotype variables read from a file:

    > gm2 <- makeGenotypes(
    +          read.csv("gm2.csv"),
    +          convert=list(3:4,5:6))
    > gm2
      Age Sex G1.1/G1.2 G2.1/G2.2
    1  31   M       A/A       G/T
    2  27   F       A/C       G/G
    3  35   M       C/C       G/T
    4  19   M       A/C       G/T
    5  55   M      <NA>       G/G
    6  34   F       A/A       G/G
    7  45   F       A/C       T/T
    8  32   M       A/C       G/T

Contents of gm2.csv:

    Age,Sex,G1.1,G1.2,G2.1,G2.2
    31,M,A,A,G,T
    27,F,A,C,G,G
    35,M,C,C,T,G
    19,M,C,A,G,T
    55,M,,,G,G
    34,F,A,A,G,G
    45,F,A,C,T,T
    32,M,A,C,T,G

16 Simple Examples: Displaying Genotype Information
Raw:

    > g5
    [1] "A/A" "A/C" "C/C"
    [4] "A/C" NA    "A/A"
    [7] "A/C" "A/C"
    Alleles: A C

Summary:

    > summary(g5)

    Allele Frequency:
       Count Proportion
    A      8       0.57
    C      6       0.43
    NA     2         NA

    Genotype Frequency:
        Count Proportion
    A/A     2       0.29
    A/C     4       0.57
    C/C     1       0.14
    NA      1         NA
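
The counts and proportions in this summary can be reproduced by hand. The base-R sketch below starts from the eight genotype strings and does not call the genetics package; the object names are chosen for the example only.

```r
# Base-R sketch reproducing the summary figures; not the package's summary() method.
geno <- c("A/A", "A/C", "C/C", "A/C", NA, "A/A", "A/C", "A/C")

# Genotype counts and proportions among the non-missing subjects.
geno_count <- table(geno)
geno_prop  <- round(prop.table(geno_count), 2)

# Split each non-missing genotype into its two alleles and count them.
alleles      <- unlist(strsplit(geno[!is.na(geno)], "/", fixed = TRUE))
allele_count <- table(alleles)
allele_prop  <- round(prop.table(allele_count), 2)
```

The proportions are taken over observed values only (7 genotypes, 14 alleles), which is why the missing subject contributes a count but no proportion in the summary.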

17 Simple Examples: Extracting allele information
Genotypes (Independent factor levels):

    > g5
    [1] "A/A" "A/C" "C/C" "A/C"
    [5] NA    "A/A" "A/C" "A/C"
    Alleles: A C

Allele Counts (Additive Effect):

    > allele.count(g5, "A")
    [1]  2  1  0  1 NA  2  1  1
    attr(,"allele")
    [1] "A"

Allele presence (Dominant Effect):

    > carrier(g5,'A')
    [1]  TRUE  TRUE FALSE  TRUE
    [5]    NA  TRUE  TRUE  TRUE

Allele Homozygote (Recessive Effect):

    > homozygote(g5,'A')
    [1]  TRUE FALSE FALSE FALSE
    [5]    NA  TRUE FALSE FALSE

Heterozygote (Heterozygote Advantage Effect):

    > heterozygote(g5,'A')
    [1] FALSE  TRUE FALSE  TRUE
    [5]    NA FALSE  TRUE  TRUE

18 Simple Examples: Extracting allele information
First allele:

    > allele(g5, 1)
    [1] "A" "A" "C" "A" NA  "A"
    [7] "A" "A"
    attr(,"which")
    [1] 1
    attr(,"allele.names")
    [1] "A" "C"

Both alleles:

    > allele(g5)
         [,1] [,2]
    [1,] "A"  "A"
    [2,] "A"  "C"
    [3,] "C"  "C"
    [4,] "A"  "C"
    [5,] NA   NA
    [6,] "A"  "A"
    [7,] "A"  "C"
    [8,] "A"  "C"
    attr(,"which")
    [1] 1 2
    attr(,"allele.names")
    [1] "A" "C"

19 Example Session

20 Future Development: R GeneticsNG
- Mission:
    GeneticsNG is a collaborative project to develop a core set of data structures and analytic tools for the management, visualization, and analysis of genetic data. This core will provide sufficient ease of use, stability, features, documentation, and community support to inspire users and developers to utilize, contribute and extend the system.
- Goals:
    - Scalable to Whole-Genome genetic analysis (>1e5 SNPs)
    - Read/Write common genetics data storage formats
    - Port existing open-source genetics codes
        - Current R genetics packages (genetics, haplo.score, gap, …)
        - Other open-source packages…
    - Provide good documentation, including tutorials and training
    - Engage the entire R genetics user/developer community

21 Future Development: R GeneticsNG
- Current Team
    - Pfizer: Gregory Warnes, Nitin Jain
    - Channing Laboratory (Harvard): Ross Lazarus
    - BMS: Scott D Chasalow, Giovanni Montana
    - Insightful: Michael O'Connell
    - Univ. Chicago: Junsheng Cheng
    - Join us!
- Project Page: http://r-genetics.sf.net/

22 References
- R Project: http://www.r-project.org
- R genetics package: http://cran.r-project.org/contrib/main/Descriptions/genetics.html
- R-News article: Warnes GR. "The Genetics Package," R News, Volume 3, Issue 1, June 2003.
- R GeneticsNG project: http://r-genetics.sf.net/
- Me: http://www.warnes.net, Gregory.R.Warnes@Pfizer.com

