The R genetics package: Tools for statistical genetics. Gregory R. Warnes, Associate Director, NonClinical Statistics, Pfizer Global R&D, Groton, CT.



Page 2 CT ASA Mini Conference: Outline

- Project Goals
    - Simplify Population Genetic Analysis
- Design Details
    - Extend R Factor objects
- Functions Included
    - Genetic data: Importing & Creation, Manipulation, Information, Annotation, Transformation, Export
    - Statistical Functions: Hardy-Weinberg (Dis-)Equilibrium, Linkage Disequilibrium, Haplotype Imputation, Sample-size tools
- Simple Examples
    - Creating Genotype Objects
- Example Session
- Future Development
    - Emulate BioConductor Project
    - Large scale SNP analysis
    - Formal Object Class
    - Multi-team collaboration

Page 3: Abstract

The genetics package for the R statistical environment provides convenient classes and methods for handling genetic data. Features include:

- creating, representing, and manipulating variables containing single-locus genetic information
- performing common genetic calculations, including genotype and allele frequencies
- estimating and testing for departure from Hardy-Weinberg equilibrium (HWE) of individual markers
- estimating, testing and plotting linkage disequilibrium (LD) between sets of markers
- performing sample size calculations for genetic markers
- tools for representing specific relationships among marker alleles (e.g. dominant, recessive, additive, heterozygote advantage) in standard R statistical models

In addition to these standard methods, the package also provides two novel capabilities:

- Estimation of departure from HWE for multi-allelic markers
- Confidence intervals for HWE and LD which account for the bounded nature of the estimates in order to achieve proper coverage

The genetics package makes it significantly easier to manipulate and analyse genetic marker information.

Page 4: Abstract

During this presentation I will

- Describe the goals of the R genetics package,
- Introduce the basic features of the R genetics package,
- Provide a brief worked example, including the application of the novel capabilities, and
- Discuss the ongoing project to develop a next-generation package for efficiently handling large volumes of genetic marker data (e.g. for whole genome SNP scans).

Page 5: Problem

- At each genetic position within a gene, diploid cells have two alleles.
    - This suggests storing each allele as a separate variable.
- However, most laboratory methods cannot distinguish between A/B and B/A, yielding three observed genotypes at each position: (A/A), (A/B or B/A), (B/B). Consequently, the observed alleles are confounded.
    - This suggests the use of a single genotype variable.
- This duality is not directly handled by standard statistical packages.
- As a consequence, the need to handle both views creates complexity when manipulating or including genotype data in statistical analysis.
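The duality described on this slide can be illustrated in base R. This is only a sketch on made-up data, not the package's implementation: the helper name unordered_genotype() and the allele vectors are hypothetical. It shows how two allele columns collapse to a single genotype label once A/B and B/A are treated as the same observation.

```r
# Hypothetical two-column allele data, one entry per individual.
a1 <- c("A", "A", "B", "B", NA)
a2 <- c("A", "B", "A", "B", NA)

# Collapse to an unordered genotype: order the two alleles within each
# individual so that A/B and B/A receive the same label.
unordered_genotype <- function(x, y, sep = "/") {
  lo <- ifelse(x <= y, x, y)       # alphabetically first allele
  hi <- ifelse(x <= y, y, x)       # alphabetically second allele
  out <- paste(lo, hi, sep = sep)
  out[is.na(x) | is.na(y)] <- NA   # keep missing genotypes missing
  out
}

g <- unordered_genotype(a1, a2)
print(g)
print(table(g, useNA = "ifany"))
```

Individuals 2 and 3 carry the alleles in opposite order but end up with the same genotype label, which is the "single variable" view; the original a1 and a2 vectors remain the "separate allele" view.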

Page 6: Initial Project Goals

Simplify Statistical Analysis using Genetic Data by providing:

- A genotype object class that appropriately captures the single variable / separate allele duality
- Methods to import and manipulate genotype objects without string manipulation
- Simple tools including different views of genotype variables in standard statistical models
    - Dominant (at least one copy of X)
    - Recessive (both alleles are X)
    - Additive (number of copies of X)
    - Heterozygote Effect (differing alleles)
    - Independent (separate effect for each allele combination: A/A, A/B=B/A, B/B)
- Functions for computing and visualizing common genetic summaries and statistical tests
    - Allele Frequencies
    - Hardy-Weinberg Equilibrium
    - Linkage Disequilibrium
    - Other statistical methods
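The genotype "views" listed on this slide can be written out by hand in base R. The sketch below uses made-up allele pairs, with the allele "A" standing in for X on the slide, and is only meant to make the definitions concrete; it does not use or reproduce the package's own functions.

```r
# Hypothetical allele pairs for six individuals at one marker.
alleles <- cbind(c("A", "A", "C", "A", NA, "C"),
                 c("A", "C", "C", "C", NA, "A"))

target <- "A"                          # allele of interest ("X" on the slide)
copies <- rowSums(alleles == target)   # 0, 1 or 2 copies; NA if missing

codings <- data.frame(
  additive     = copies,                       # number of copies of the target
  dominant     = copies >= 1,                  # at least one copy
  recessive    = copies == 2,                  # both alleles are the target
  heterozygote = alleles[, 1] != alleles[, 2]  # the two alleles differ
)
print(codings)
```

Each column could then be used as a predictor in a standard model formula, which is the use case the slide has in mind.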

Page 7: Design Details

Design:

- Genotypes are stored in Factor objects, with factor levels formatted as A/C.
- A translation table is constructed to quickly extract individual allele information:

    Genotype  Allele 1  Allele 2
    A/A       A         A
    A/B       A         B
    B/B       B         B

Consequences:

- Can be stored in standard data frames
- Can be efficiently manipulated (space & time)
- Permits both biallelic (C/T) and multi-allelic genetic markers (SSLPs)
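The factor-plus-translation-table design can be sketched in base R. This is an illustration of the idea under simple assumptions (plain factor, "/" separator, made-up data), not the package's actual internals; the variable names are hypothetical.

```r
# Genotypes stored as an ordinary factor; levels use the "A/C" format.
g <- factor(c("A/A", "A/C", "C/C", "A/C", NA, "A/A"))

# Translation table: one row per factor level, one column per allele.
# It is built once from the levels, not from every observation.
translation <- do.call(rbind, strsplit(levels(g), "/", fixed = TRUE))
dimnames(translation) <- list(levels(g), c("Allele 1", "Allele 2"))
print(translation)

# Individual alleles are recovered by indexing the table with the
# factor's integer codes, so no per-observation string parsing is needed.
alleles <- translation[as.integer(g), , drop = FALSE]
print(unname(alleles))

# Because g is a plain factor, it fits in a standard data frame.
df <- data.frame(id = seq_along(g), genotype = g)
str(df)
```

The string splitting cost depends on the number of distinct genotypes rather than the number of samples, which is one plausible reading of the "efficiently manipulated" claim on the slide.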

Page 8: Genotype Manipulation

- Importing & Creation
    - genotype(), as.genotype(), makeGenotypes(), …
    - haplotype(), as.haplotype(), makeHaplotypes(), …
- Manipulation
    - [] (subsetting), []<- (subset assignment), == (equality)
- Information
    - summary() (allele and genotype counts and frequencies)
    - allele.names()
    - allele() (extract individual alleles)
    - nallele() (number of distinct allele values)
- Annotation
    - locus(), gene(), marker(), …
- Transformation
    - carrier(), homozygote(), heterozygote(), allele.count()
- Export
    - write.marker.file(), write.pedigree.file(), write.pop.file()

Page 9: Installation

Windows GUI: [screenshot]

Command Line:

    > install.packages("genetics", dependencies=TRUE)

Page 10: Statistical Functions

- Hardy-Weinberg (Dis-)Equilibrium: D, D', r, r^2, X^2
    - diseq(), diseq.ci() (Confidence Intervals!)
    - HWE.test(), HWE.chisq(), HWE.exact()
- Linkage Disequilibrium: D, D', r, r^2
    - LD(), LDplot(), LDtable()
- Haplotype Imputation:
    - hap(), hapambig(), hapmcmc(), hapenum(), hapshuffle()
- Sample-size tools
    - gregorius() (probability of observing a marker of given frequency with specified sample size)
    - power.casectrl()
- Utilities
    - Bootstrap.ci
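For a biallelic marker, the Hardy-Weinberg quantities named on this slide can be computed by hand. The sketch below uses made-up genotype counts and one common convention for D (observed minus expected homozygote frequency); the package's diseq() and HWE.chisq() may use different parameterizations or corrections, so treat this as an illustration of the textbook calculation only.

```r
# Hypothetical genotype counts for a biallelic marker (made-up numbers).
counts <- c("A/A" = 30, "A/C" = 50, "C/C" = 20)
n <- sum(counts)

# Allele frequency of A: each A/A carries two copies, each A/C one.
p <- unname((2 * counts["A/A"] + counts["A/C"]) / (2 * n))
q <- 1 - p

# Disequilibrium coefficient D under the convention stated above:
# observed frequency of A/A minus its Hardy-Weinberg expectation p^2.
D <- unname(counts["A/A"] / n - p^2)

# Pearson chi-square test against Hardy-Weinberg expectations
# (1 degree of freedom: 3 classes - 1 - 1 estimated allele frequency).
expected <- n * c(p^2, 2 * p * q, q^2)
chisq <- sum((counts - expected)^2 / expected)
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)

cat("p(A) =", p, " D =", D, " X^2 =", chisq, " p-value =", p_value, "\n")
```

For two alleles the statistic also equals n * D^2 / (p * q)^2, which ties the chi-square test directly to the disequilibrium coefficient.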

Page 11: Simple Examples: Creating Genotype Objects

A single vector with a character separator:

    > g1 <- genotype( c('A/A','A/C','C/C','C/A',
    +                   NA,'A/A','A/C','A/C') )

    > g3 <- genotype( c('A A','A C','C C','C A',
    +                   '','A A','A C','A C'),
    +                 sep=' ', remove.spaces=FALSE)

Page 12: Simple Examples: Creating Genotype Objects

A single vector with a positional separator:

    > g2 <- genotype( c('AA','AC','CC','CA','',
    +                   'AA','AC','AC'), sep=1 )

Two separate vectors:

    > g4 <- genotype(
    +   c('A','A','C','C','','A','A','A'),
    +   c('A','C','C','A','','A','C','C')
    + )

Page 13: Simple Examples: Creating Genotype Objects

A dataframe or matrix with two columns:

    > gm <- cbind(
    +   c('A','A','C','C','','A','A','A'),
    +   c('A','C','C','A','','A','C','C') )
    > gm
         [,1] [,2]
    [1,] "A"  "A"
    [2,] "A"  "C"
    [3,] "C"  "C"
    [4,] "C"  "A"
    …
    > g5 <- genotype( gm )
    > g5
    [1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
    Alleles: A C

Page 14: Simple Examples: Creating Genotype Objects

Convert 1-column genotype variables read from a file:

    > gm1 <- makeGenotypes(
    +   read.csv("gm1.csv"))
    > gm1
      Age Sex   G1  G2
    1  31   M  A/A G/T
    2  27   F  A/C G/G
    3  35   M  C/C G/T
    4  19   M  A/C G/T
    5  55   M <NA> G/G
    6  34   F  A/A G/G
    7  45   F  A/C T/T
    8  32   M  A/C G/T
    > gm1$G1
    [1] "A/A" "A/C" "C/C" "A/C" NA    "A/A" "A/C" "A/C"
    Alleles: A C

Contents of gm1.csv:

    Age,Sex,G1,G2
    31,M,A/A,G/T
    27,F,A/C,G/G
    35,M,C/C,G/T
    19,M,A/C,G/T
    55,M,,G/G
    34,F,A/A,G/G
    45,F,A/C,T/T
    32,M,A/C,G/T

Page 15: Simple Examples: Creating Genotype Objects

Convert 2-column genotype variables read from a file:

    > gm2 <- makeGenotypes(
    +   read.csv("gm2.csv"),
    +   convert=list(3:4,5:6))
    > gm2
      Age Sex G1.1/G1.2 G2.1/G2.2
    1  31   M       A/A       G/T
    2  27   F       A/C       G/G
    3  35   M       C/C       G/T
    4  19   M       A/C       G/T
    5  55   M      <NA>       G/G
    6  34   F       A/A       G/G
    7  45   F       A/C       T/T
    8  32   M       A/C       G/T

Contents of gm2.csv:

    Age,Sex,G1.1,G1.2,G2.1,G2.2
    31,M,A,A,G,T
    27,F,A,C,G,G
    35,M,C,C,T,G
    19,M,C,A,G,T
    55,M,,,G,G
    34,F,A,A,G,G
    45,F,A,C,T,T
    32,M,A,C,T,G

Page 16: Simple Examples: Displaying Genotype Information

Raw:

    > g5
    [1] "A/A" "A/C" "C/C"
    [4] "A/C" NA    "A/A"
    [7] "A/C" "A/C"
    Alleles: A C

Summary:

    > summary(g5)
    Allele Frequency:
       Count Proportion
    A      8       0.57
    C      6       0.43
    NA     2         NA

    Genotype Frequency:
        Count Proportion
    A/A     2       0.29
    A/C     4       0.57
    C/C     1       0.14
    NA      1         NA
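The counts and proportions that a summary of g5 reports can be cross-checked with base R alone. The sketch below treats the genotypes as plain strings (no genetics package needed) and is not the package's summary() method; variable names are hypothetical.

```r
# The eight genotypes of g5, as plain character strings.
g5 <- c("A/A", "A/C", "C/C", "A/C", NA, "A/A", "A/C", "A/C")

# Genotype counts (including the missing value) and proportions
# (computed over the non-missing genotypes only).
geno_count <- table(g5, useNA = "ifany")
geno_prop  <- prop.table(table(g5))

# Allele counts: split every genotype into its two alleles.
# A missing genotype contributes two missing alleles.
alleles <- unlist(lapply(strsplit(g5, "/", fixed = TRUE),
                         function(a) if (anyNA(a)) c(NA, NA) else a))
allele_count <- table(alleles, useNA = "ifany")
allele_prop  <- prop.table(table(alleles))

print(geno_count);   print(round(geno_prop, 2))
print(allele_count); print(round(allele_prop, 2))
```

Seven typed individuals contribute fourteen alleles, so the allele proportions are taken over fourteen, while the one untyped individual appears as two missing alleles.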

Page 17: Simple Examples: Extracting allele information

Genotypes (Independent factor levels):

    > g5
    [1] "A/A" "A/C" "C/C" "A/C"
    [5] NA    "A/A" "A/C" "A/C"
    Alleles: A C

Allele Counts (Additive Effect):

    > allele.count(g5, "A")
    [1]  2  1  0  1 NA  2  1  1
    attr(,"allele")
    [1] "A"

Allele presence (Dominant Effect):

    > carrier(g5,'A')
    [1]  TRUE  TRUE FALSE  TRUE
    [5]    NA  TRUE  TRUE  TRUE

Allele Homozygote (Recessive Effect):

    > homozygote(g5,'A')
    [1]  TRUE FALSE FALSE FALSE
    [5]    NA  TRUE FALSE FALSE

Heterozygote (Heterozygote Advantage Effect):

    > heterozygote(g5,'A')
    [1] FALSE  TRUE FALSE  TRUE
    [5]    NA FALSE  TRUE  TRUE

Page 18: Simple Examples: Extracting allele information

First allele:

    > allele(g5, 1)
    [1] "A" "A" "C" "A" NA  "A"
    [7] "A" "A"
    attr(,"which")
    [1] 1
    attr(,"allele.names")
    [1] "A" "C"

Both alleles:

    > allele(g5)
         [,1] [,2]
    [1,] "A"  "A"
    [2,] "A"  "C"
    [3,] "C"  "C"
    [4,] "A"  "C"
    [5,] NA   NA
    [6,] "A"  "A"
    [7,] "A"  "C"
    [8,] "A"  "C"
    attr(,"which")
    [1] 1 2
    attr(,"allele.names")
    [1] "A" "C"

Page 19: Example Session

Page 20: Future Development: R GeneticsNG

Mission:

GeneticsNG is a collaborative project to develop a core set of data structures and analytic tools for the management, visualization, and analysis of genetic data. This core will provide sufficient ease of use, stability, features, documentation, and community support to inspire users and developers to utilize, contribute and extend the system.

Goals:

- Scalable to Whole-Genome genetic analysis (>1e5 SNPs)
- Read/Write common genetics data storage formats
- Port existing open-source genetics codes
    - Current R genetics packages (genetics, haplo.score, gap, …)
    - Other open-source packages…
- Provide good documentation, including tutorials and training
- Engage the entire R genetics user/developer community

Page 21: Future Development: R GeneticsNG

Current Team:

- Pfizer: Gregory Warnes, Nitin Jain
- Channing Laboratory (Harvard): Ross Lazarus
- BMS: Scott D Chasalow, Giovanni Montana
- Insightful: Michael O'Connell
- Univ. Chicago: Junsheng Cheng

Join us!

Project Page:

Page 22: References

- R Project:
- R genetics package:
- R-News article: Warnes GR. "The Genetics Package," R News, Volume 3, Issue 1, June 2003.
- R GeneticsNG project:
- Me: