GWAS: Installing and Testing Dustin Landers & Troy Kling
Introduction to GWAS
GWAS tools (e.g. PLINK, FaST-LMM) are used to identify how markers (regions of the DNA sequence) relate to some trait. For example, which changes in the DNA sequence translate to increased plant height?
GWAS tools take relational data about the markers we have on the DNA sequence (called Single Nucleotide Polymorphisms, or SNPs) and use it to model changes in a quantitative or categorical trait.
[Diagram: Genotype and Trait Data → GWAS Tool → Knowledge about how genotypes relate to traits]
Troy’s work so far
Installing Genome-Wide Association Studies tools on Atmosphere and the Discovery Environment.
– Working mostly with GWAS packages in R, e.g. SKAT, aml, BATools.
Installing a new tool that uses an R package requires writing a wrapper script for it.
– Wrapper scripts for R packages break down into three main chunks:
  1. Grab the command-line arguments.
  2. Execute an association test on the user-supplied inputs.
  3. Return the results.
– The wrapper-script creation process can be tedious and time-consuming.
Designing software to automate the creation of wrapper scripts for R packages.
– My new project, called wrapR, takes the name of an R package and automatically generates a wrapper script for each function in that package.
– These wrapper scripts are ready to be chained together and executed from Atmosphere or the DE.
– Surprisingly, teaching R how to interpret different types of input is the most difficult part (the simple/complex dichotomy).
– Potential applications to artificial neural networks and machine learning.
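The three-chunk wrapper structure above can be sketched as a small generator. This is an illustrative mock-up in Python, not wrapR's actual implementation: the function `make_wrapper` and the fixed `output.csv` destination are assumptions, and the emitted R script only shows the three chunks (grab arguments, run the test, return results).

```python
# Hypothetical sketch of the kind of R wrapper script a wrapR-style
# generator might emit for one function. Names here are illustrative.

WRAPPER_TEMPLATE = """#!/usr/bin/env Rscript
# Auto-generated wrapper for {pkg}::{fn}
# 1. Grab the command-line arguments.
args <- commandArgs(trailingOnly = TRUE)
{arg_lines}
# 2. Execute an association test on the user-supplied inputs.
library({pkg})
result <- {fn}({call_args})
# 3. Return the results.
write.csv(result, "output.csv", row.names = FALSE)
"""

def make_wrapper(pkg, fn, params):
    """Generate the text of an R wrapper script for pkg::fn."""
    # Bind each positional command-line argument to a named R variable.
    arg_lines = "\n".join(f"{p} <- args[{i + 1}]" for i, p in enumerate(params))
    # Pass those variables through to the wrapped function by name.
    call_args = ", ".join(f"{p} = {p}" for p in params)
    return WRAPPER_TEMPLATE.format(pkg=pkg, fn=fn,
                                   arg_lines=arg_lines, call_args=call_args)

script = make_wrapper("SKAT", "SKAT", ["Z", "obj"])
print(script)
```

One wrapper per package function, generated from a template like this, is what makes the scripts uniform enough to chain together from Atmosphere or the DE.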
Dustin’s problems and what he’s done so far
How do we judge how well a tool works? Run a known-truth dataset through the tool and examine the output. But any one test is atypical, so how do we run lots of known-truth datasets through a tool?
Obvious problems:
– Problem 1: Realistic datasets are massive (our Syngenta ped-map pairs are around 1.5 gigabytes each!).
– Problem 2: What are the best ways to summarize information from a single run?
– Problem 3: How do we make this easy so that everyone will do it?
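For Problem 2, one natural single-run summary is to compare the SNPs a tool flags as significant against the SNPs known to carry a true effect. A minimal sketch (the SNP sets and the function name `known_truth_metrics` are made-up illustrations, not Validate's actual output or API):

```python
# Summarize one known-truth run as a true-positive rate and a
# false-positive rate. Illustrative only; not Validate's real interface.

def known_truth_metrics(truth, detected, all_snps):
    """Return (TPR, FPR) for one tool run against a known-truth SNP set."""
    tp = len(truth & detected)            # truly causal SNPs the tool found
    fp = len(detected - truth)            # SNPs flagged but not causal
    negatives = len(all_snps - truth)     # SNPs with no simulated effect
    return tp / len(truth), fp / negatives

all_snps = {f"snp{i}" for i in range(100)}
truth = {"snp1", "snp2", "snp3", "snp4"}   # SNPs simulated to affect the trait
detected = {"snp1", "snp2", "snp50"}       # SNPs the tool called significant

tpr, fpr = known_truth_metrics(truth, detected, all_snps)
print(tpr, fpr)   # tpr = 0.5 (2 of 4 causal SNPs found), fpr = 1/96
```

Running many known-truth datasets then reduces to collecting these per-run numbers, which is exactly the aggregation step that motivates the tooling below.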
So in recap…
Both Troy’s and Dustin’s work has involved the middle part of this diagram.
Troy’s work has been in developing ways to easily integrate new tools into the iPlant CI.
Dustin’s work has been in developing ways to easily test new tools on the iPlant CI.
Notice that both of these statements involve the word “easily”: that’s because iPlant is interested in infrastructure, and we believe that ease of use will encourage people to use it!
[Diagram: Genotype and Trait Data → GWAS Tool → Knowledge about how genotypes relate to traits]
So what has Dustin done so far?
Created two different tools: Aggregate and Validate.
Validate accepts a folder-wide input and returns performance metrics (it is public on the Discovery Environment).
Aggregate is more of a data-management tool (a standalone executable) that accesses your iPlant Data Store and lets you aggregate massive amounts of output with relative ease.
– It is basically a formalization of what would otherwise be a bash-scripting process using curl or the like.
– Basically, we discovered that any tester would need to do a lot of scripting, and we want to cut that back as much as possible.
Where we want Validate to go next
The clear next step for us is to integrate the whole process: supplying simulations, running the tools, and supporting a larger breadth of analyses in a single sweep.
We also want to make sure we are including all the *right* kinds of analyses. A particular example follows.
Our recent job overlap
Troy installed GEMMA.
Dustin needed to user-test Validate and Aggregate.
Late last year, Dustin tested PLINK and FaST-LMM and wrote a report outlining the results.
So where does GEMMA fall in this line-up?
[Results chart; * indicates population structure]
We noticed that GEMMA automatically excluded certain SNPs from the analysis. Excluding SNPs with low minor allele frequency is in fact common practice, but in some cases researchers may not want to exclude SNPs on this basis alone. What is an acceptable cut-off?
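The cut-off question is easy to make concrete. A minimal sketch of MAF filtering on a 0/1/2-coded genotype matrix (the random data and the 5% threshold are illustrative assumptions, not a recommendation or GEMMA's actual default):

```python
import numpy as np

# Compute minor allele frequency (MAF) from genotypes coded 0/1/2
# (count of one allele per individual) and filter at a cut-off.

rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 10))    # 200 individuals x 10 SNPs

allele_freq = genotypes.mean(axis=0) / 2          # frequency of the counted allele
maf = np.minimum(allele_freq, 1 - allele_freq)    # fold onto the minor allele

cutoff = 0.05                                     # illustrative threshold only
kept = maf >= cutoff
print(f"{kept.sum()} of {len(maf)} SNPs pass the MAF >= {cutoff} filter")
```

Every choice of `cutoff` trades false positives at rare SNPs against discarding potentially causal variants, which is exactly the trade-off a testing framework like Validate can quantify.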
These are the kinds of questions Validate intends to answer! We think every researcher should be thinking about these things, but it’s understandable if they don’t. Having the proper infrastructure already in place to provide these kinds of analyses is essentially the point.
Why did we show you this? We are playing the role of both analysts and developers, so we have to understand what makes one tool work better than another. In this case, we recently spotted that how a tool handles minor allele frequency is extremely important. For example, FaST-LMM doesn’t need to remove SNPs with low MAF and still performs better than most tools; in fact, FaST-LMM returns SNP effect-size information for every single SNP.
How can we improve Validate and wrapR? Any thoughts or questions? Additional performance metrics (we are calling them “performetrics”) to include? Is there a better way? Is there something we are missing?
Why the better estimates and reductions in standard errors? We have a simple demonstration. To show why this happens, we simulate 500 SNPs with allele frequencies ranging from 0.0000001 to 0.05. We then simulate a quantitative trait: if the allele isn’t present, y ~ N(0, 1); if it is, y ~ N(10, 1). This is roughly equivalent to a heritability of 0.8. Then we try to predict the trait using the SNP in 500 separate models, and record the estimates and the standard errors.
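The demonstration above can be sketched as follows. This is our reading of the described setup, not the original code: the sample size of 1,000 individuals per model is an assumption (the slide doesn't state it), and SNPs whose allele never appears in the sample are skipped since the regression is undefined there.

```python
import numpy as np

# 500 SNPs with allele frequencies from 1e-7 to 0.05; trait is N(0,1) for
# non-carriers and N(10,1) for carriers; one simple regression per SNP,
# recording the effect estimate and its standard error.

rng = np.random.default_rng(42)
n = 1000                                          # individuals per model (assumed)
freqs = np.linspace(1e-7, 0.05, 500)

estimates, std_errors = [], []
for f in freqs:
    x = rng.binomial(1, f, size=n).astype(float)  # allele present (1) / absent (0)
    y = np.where(x == 1, rng.normal(10, 1, n), rng.normal(0, 1, n))
    if x.sum() == 0:
        continue                                  # allele never observed; skip
    # Ordinary least squares for y = b0 + b1*x, via the centered predictor.
    xc = x - x.mean()
    b1 = (xc * y).sum() / (xc ** 2).sum()
    resid = y - y.mean() - b1 * xc
    se = np.sqrt(resid.var(ddof=2) / (xc ** 2).sum())   # SE(b1) with RSS/(n-2)
    estimates.append(b1)
    std_errors.append(se)

# Standard errors shrink as the minor allele becomes more common,
# since SE(b1) scales like 1 / sqrt(n * p * (1 - p)).
print(f"first SE: {std_errors[0]:.2f}, last SE: {std_errors[-1]:.2f}")
```

The pattern is the point of the demonstration: for the rarest retained SNPs the standard error is roughly an order of magnitude larger than at the 0.05 frequency end, which is why MAF handling dominates tool performance on rare variants.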