Presentation on theme: "GWAS: Installing and Testing"— Presentation transcript:
1 GWAS: Installing and Testing
Dustin Landers & Troy Kling
2 Introduction to GWAS
[Diagram: Genotype and Trait Data → GWAS tool → Knowledge about how genotypes relate to traits]
GWAS tools (e.g. PLINK, FaST-LMM) are used to identify how markers (regions of the DNA sequence) relate to some trait.
For example, what changes in the DNA sequence will translate to increased plant height?
GWAS tools take relational data about the different markers we have on the DNA sequence (these are called Single Nucleotide Polymorphisms, or SNPs) and use it to model changes in a quantitative or categorical trait.
3 Troy’s work so far
Installing Genome-wide Association Studies tools on Atmosphere & Discovery Environment.
Working mostly with GWAS packages in R, e.g. SKAT, aml, BATools.
Installing a new tool that uses an R package requires writing a wrapper script for it.
Wrappers for R packages can be broken down into three main chunks:
1) Grab command-line arguments.
2) Execute an association test on user-supplied inputs.
3) Return the results.
The wrapper script creation process can be tedious and time-consuming.
Designing software to automate the creation of wrapper scripts for R packages.
My new project, called wrapR, takes the name of an R package and automatically generates a wrapper script for each function within that package.
These wrapper scripts are ready to be chained together and executed from Atmosphere or the DE.
Surprisingly, teaching R how to interpret different types of input is the most difficult part.
Simple/Complex dichotomy.
Applications to Artificial Neural Networks and Machine Learning.
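The three-chunk wrapper structure described on this slide can be sketched roughly as follows. This is only an illustration in Python, not wrapR output and not an R wrapper: the flag names (`--geno`, `--pheno`, `--out`), the file layouts, and the use of a plain least-squares slope as a stand-in "association test" are all assumptions made for the example.

```python
import argparse
import csv


def parse_args(argv):
    """Chunk 1: grab command-line arguments (all flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Illustrative association-test wrapper")
    parser.add_argument("--geno", required=True,
                        help="CSV: header row of SNP names, then one row of 0/1/2 counts per sample")
    parser.add_argument("--pheno", required=True,
                        help="text file: one numeric trait value per line, same sample order")
    parser.add_argument("--out", required=True, help="path for the results CSV")
    return parser.parse_args(argv)


def regression_slope(x, y):
    """Least-squares slope of y on x; None when x has no variation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        return None
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def run_association(snp_columns, trait):
    """Chunk 2: execute a (stand-in) association test for every SNP."""
    return {name: regression_slope(col, trait) for name, col in snp_columns.items()}


def write_results(results, path):
    """Chunk 3: return the results, here as a two-column CSV."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["snp", "effect"])
        for name, effect in results.items():
            writer.writerow([name, "NA" if effect is None else effect])


def main(argv):
    """Tie the three chunks together."""
    args = parse_args(argv)
    with open(args.geno, newline="") as handle:
        rows = list(csv.reader(handle))
    names = rows[0]
    snp_columns = {name: [float(r[i]) for r in rows[1:]] for i, name in enumerate(names)}
    with open(args.pheno) as handle:
        trait = [float(line) for line in handle if line.strip()]
    write_results(run_association(snp_columns, trait), args.out)
```

Calling `main(sys.argv[1:])` from a `__main__` guard would turn this into a command-line script. Keeping the three chunks as separate functions is what makes a wrapper like this easy to generate mechanically: only the middle function changes from tool to tool.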
4 Dustin’s problems and what he’s done so far
How do we judge how well a tool works?
Run a known-truth data set through the tool and examine the output.
But…
Any one test is atypical, so how do we run lots of known-truth data sets through a tool?
Obvious problems:
Problem 1) Realistic data sets are massive (our Syngenta ped-map pairs are around 1.5 gigabytes each!)
Problem 2) What are the best ways to summarize information from a single run?
Problem 3) How do we make this easy so that everyone will do it?
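One way to picture "run known-truth data through a tool and summarize the output" is the sketch below. It is not the Validate tool's implementation: the function names, the choice of sensitivity and specificity as the summaries, and the significance threshold `alpha` are assumptions for illustration. Each run is represented as a mapping from SNP name to the p-value the tool reported, and the known truth is the set of SNPs that were simulated as causal.

```python
def confusion_counts(p_values, causal_snps, alpha):
    """Compare one run's calls (p < alpha) against the known causal SNPs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for snp, p in p_values.items():
        called = p < alpha
        causal = snp in causal_snps
        if called and causal:
            counts["tp"] += 1
        elif called:
            counts["fp"] += 1
        elif causal:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def rates(counts):
    """Summarize a single run; a rate is None when its denominator is zero."""
    positives = counts["tp"] + counts["fn"]
    negatives = counts["tn"] + counts["fp"]
    return {
        "sensitivity": counts["tp"] / positives if positives else None,
        "specificity": counts["tn"] / negatives if negatives else None,
    }


def mean_rates(runs, causal_snps, alpha):
    """Average each rate over many runs, since any single run is atypical."""
    per_run = [rates(confusion_counts(run, causal_snps, alpha)) for run in runs]
    summary = {}
    for key in ("sensitivity", "specificity"):
        values = [r[key] for r in per_run if r[key] is not None]
        summary[key] = sum(values) / len(values) if values else None
    return summary
```

For example, `mean_rates([run_a, run_b], {"rs1", "rs2"}, alpha=0.05)` would average the two rates over two hypothetical runs `run_a` and `run_b`.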
5 So in recap…
[Diagram: Genotype and Trait Data → GWAS tool → Knowledge about how genotypes relate to traits]
Troy’s work and mine have both involved the middle part of this diagram.
Troy’s work has been in developing ways to easily integrate new tools into the iPlant CI.
Dustin’s work has been in developing ways to easily test new tools on the iPlant CI.
Notice that both of these statements involve the word “easily”. That’s because iPlant is interested in infrastructure, and we believe that ease of use will encourage people to use it!
6 So what has Dustin done so far?
Created two different tools: Aggregate and Validate.
Validate accepts a folder-wide input and returns performance metrics (it is public on the Discovery Environment).
Aggregate is more of a data management tool (it’s a standalone executable) that accesses your iPlant Data Store and allows you to aggregate massive amounts of outputs with relative ease.
It’s basically a formalization of what would otherwise be a bash scripting process using curl or the like.
Basically, we discovered that any tester would need to do a lot of scripting, and we want to cut back on that as much as possible.
7 Where we want Validate to go next
The clear next step for us is to somehow integrate the whole process…
Meaning supplying simulations, running the tools, and being able to support a larger breadth of analyses in a single swoop.
Also, to make sure we are including all the *right* kinds of analyses…
A particular example to follow.
8 Our recent job overlap
Troy installed GEMMA.
Dustin needed to user-test Validate and Aggregate.
Late last year, Dustin tested PLINK and FaST-LMM and wrote a report outlining the results.
So where does GEMMA fall in this line-up?
11 We noticed that GEMMA excluded certain SNPs from analysis automatically.
Actually, excluding SNPs from the analysis is common if those SNPs have low minor allele frequency.
But in some cases, researchers may not want to exclude SNPs on this basis alone…
What is an acceptable cut-off?
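To make the cut-off question concrete, here is a minimal sketch of a minor-allele-frequency filter. It is not how GEMMA or any other tool implements its exclusion: the genotype coding (0/1/2 copies of one allele, `None` for a missing call), the function names, and any particular cut-off value are assumptions for illustration.

```python
def minor_allele_frequency(genotypes):
    """MAF for one SNP from per-sample allele counts (0, 1, 2; None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    # Each called sample carries two alleles, so the denominator is 2 * n.
    freq = sum(called) / (2 * len(called))
    # The minor allele is whichever of the two alleles is rarer.
    return min(freq, 1 - freq)


def filter_by_maf(snps, cutoff):
    """Split SNP names into kept and excluded lists at a chosen MAF cut-off."""
    kept, excluded = [], []
    for name, genotypes in snps.items():
        maf = minor_allele_frequency(genotypes)
        if maf is None or maf < cutoff:
            excluded.append(name)
        else:
            kept.append(name)
    return kept, excluded
```

Running `filter_by_maf` on the same data with several cut-offs shows directly how many SNPs each choice would drop, which is the trade-off the slide is asking about.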
12 These are the kinds of questions Validate intends to provide answers to!
We think every researcher should be thinking about these things, but it’s understandable if they don’t.
Having the proper infrastructure already in place to provide these kinds of analyses is essentially the point.
14 Why did we show you this?
We are playing the role of both analysts and developers.
We have to understand what makes a tool work better than other tools.
In this case, we recently spotted that how a tool handles minor allele frequency is extremely important.
For example, FaST-LMM doesn’t need to remove SNPs with low MAF and still performs better than most tools. In fact, FaST-LMM returns SNP effect size information for every single SNP.
15 How can we improve Validate and wrapR?
Any thoughts, or questions?
Additional performance metrics (we are calling them performetrics) to include?
Is there a better way?
Is there something we are missing?
16 Why the better estimates and reductions in standard errors?
We have a simple demonstration of this.
To show why this happens, we simulate 500 SNPs with allele frequencies ranging up to 0.05.
We then simulate a quantitative trait: if the allele isn’t present then y ~ N(0,1); if it is, then y ~ N(10,1). This is roughly equivalent to a heritability value of 0.8.
Then we try to predict the trait using each SNP in turn, giving 500 different models, and record the estimates and the standard errors.
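The demonstration described on this slide can be approximated with the sketch below, which uses only the Python standard library. Several details are assumptions, because the transcript does not give them: the sample size (`n_samples`), the lower bound of the allele-frequency range (`min_freq`), the even spacing of frequencies, and coding each individual simply as carrier (1) or non-carrier (0). The fit is an ordinary least-squares regression of the trait on that indicator.

```python
import math
import random


def simulate_carriers(n_samples, freq, rng):
    """1 if the individual carries the allele, else 0 (assumed coding)."""
    return [1 if rng.random() < freq else 0 for _ in range(n_samples)]


def simulate_trait(carriers, rng):
    """y ~ N(0, 1) without the allele, y ~ N(10, 1) with it."""
    return [rng.gauss(10.0 if c else 0.0, 1.0) for c in carriers]


def fit_simple_regression(x, y):
    """Return (slope estimate, standard error), or None if x has no variation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0 or n < 3:
        return None
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope: sqrt(residual variance / Sxx).
    return slope, math.sqrt(rss / (n - 2) / sxx)


def run_demo(n_snps=500, n_samples=1000, min_freq=0.001, max_freq=0.05, seed=1):
    """One model per SNP; returns (allele frequency, estimate, standard error) rows."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_snps):
        freq = min_freq + (max_freq - min_freq) * i / max(n_snps - 1, 1)
        carriers = simulate_carriers(n_samples, freq, rng)
        trait = simulate_trait(carriers, rng)
        fit = fit_simple_regression(carriers, trait)
        if fit is not None:  # skip SNPs where nobody carries the allele
            rows.append((freq, fit[0], fit[1]))
    return rows
```

Sorting the rows returned by `run_demo()` by allele frequency and comparing the standard errors at the rare and common ends is one way to look at the effect the slide describes. SNPs with no carriers at all in the sample are skipped, since no estimate can be computed for them.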