Presentation on theme: "BioConductor Steffen Durinck Robert Gentleman Sandrine Dudoit November 28, 2003 NETTAB Bologna."— Presentation transcript:
BioConductor Steffen Durinck Robert Gentleman Sandrine Dudoit November 28, 2003 NETTAB Bologna
Outline what is R what is Bioconductor packages getting and using Bioconductor
R R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. GNU project
R what sorts of things is R good at? –there are very many statistical algorithms –there are very many machine learning algorithms –visualization –it is possible to write scripts that can be reused –R is a real computer language
R R supports many data technologies –XML,database integration,SOAP R interacts with other languages –C; FORTRAN; Perl; Python; Java R has good visualization capabilities R has a very active development environment R is largely platform independent –Unix; Windows; OSX
Overview of the Bioconductor Project
Bioconductor Bioconductor is an open source and open development software project for the analysis of biomedical and genomic data. The project was started in the Fall of 2001 and includes 23 core developers in the US, Europe, and Australia. R and the R package system are used to design and distribute software. Releases –v 1.0: May 2 nd, 2002, 15 packages. –v 1.1: November 18 th, 2002,20 packages. –v 1.2: May 28 th, 2003, 30 packages. –v 1.3: October 28 th, 2003, 54 packages. ArrayAnalyzer: Commercial port of Bioconductor packages in S-Plus.
Goals Provide access to powerful statistical and graphical methods for the analysis of genomic data. Facilitate the integration of biological metadata (GenBank, GO, LocusLink, PubMed) in the analysis of experimental data. Allow the rapid development of extensible, interoperable, and scalable software. Promote high-quality documentation and reproducible research. Provide training in computational and statistical methods.
Bioconductor packages Bioconductor software consists of R add-on packages. An R package is a structured collection of code (R, C, or other), documentation, and/or data for performing specific types of analyses. E.g. affy, cluster, graph, hexbin packages provide implementations of specialized statistical and graphical methods.
Bioconductor packages Release 1.3, October 28 th, 2003 AnnBuilder Bioconductor annotation data package builderAnnBuilder Biobase Biobase: Base functions for BioconductorBiobase DynDoc Dynamic document toolsDynDoc MAGEML handling MAGEML documentsMAGEML MeasurementError.cor Measurement Error model estimate for correlation coefficientMeasurementError.cor RBGL Test interface to boost C++ graph libRBGL ROC utilities for ROC, with uarray focusROC RdbiPgSQL PostgreSQL accessRdbiPgSQL Rdbi Generic database methodsRdbi Rgraphviz Provides plotting capabilities for R graph objectsRgraphviz Ruuid Ruuid: Provides Universally Unique ID valuesRuuid SAGElyzer A package that deals with SAGE librariesSAGElyzer SNPtools Rudimentary structures for SNP dataSNPtools affyPLM affyPLM - Probe Level ModelsaffyPLM Affy Methods for Affymetrix Oligonucleotide ArraysAffy Affycomp Graphics Toolbox for Assessment of Affymetrix Expression MeasuresAffycomp Affydata Affymetrix Data for Demonstration PurposeAffydata Annaffy Annotation tools for Affymetrix biological metadataAnnaffy Annotate Annotation for microarraysAnnotate
Ctc Cluster and Tree Conversion.Ctc daMA Efficient design and analysis of factorial two-colour microarray datadaMA Edd expression density diagnosticsEdd externalVector Vector objects for R with external storageexternalVector factDesign Factorial designed microarray experiment analysisfactDesign Gcrma Background Adjustment Using Sequence InformationGcrma Genefilter Genefilter: filter genesGenefilter Geneplotter Geneplotter: plot microarray dataGeneplotter Globaltest Global TestGlobaltest Gpls Classification using generalized partial least squaresGpls Graph graph: A package to handle graph data structuresGraph Hexbin Hexagonal Binning RoutinesHexbin Limma Linear Models for Microarray DataLimma Makecdfenv CDF Environment MakerMakecdfenv marrayClasses Classes and methods for cDNA microarray datamarrayClasses marrayInput Data input for cDNA microarraysmarrayInput marrayNorm Location and scale normalization for cDNA microarray datamarrayNorm marrayPlots Diagnostic plots for cDNA microarray datamarrayPlots marrayTools Miscellaneous functions for cDNA microarraysmarrayTools Bioconductor packages Release 1.3, October 28 th, 2003
Matchprobes Tools for sequence matching of probes on arraysMatchprobes Multtest Multiple Testing ProceduresMulttest ontoTools graphs and sparse matrices for working with ontologiesontoTools Pamr Pam: prediction analysis for microarraysPamr reposTools Repository tools for RreposTools Rhdf5 An HDF5 interface for RRhdf5 Siggenes Significance and Empirical Bayes Analyses of MicroarraysSiggenes Splicegear splicegearSplicegear tkWidgets R based tk widgetstkWidgets Vsn Variance stabilization and calibration for microarray dataVsn widgetTools Creates an interactive tcltk widgetswidgetTools
annotate, annafy, and AnnBuilder Assemble and process genomic annotation data from public repositories. Build annotation data packages or XML data documents. Associate experimental data in real time to biological metadata from web databases such as GenBank, GO, KEGG, LocusLink, and PubMed. Process and store query results: e.g., search PubMed abstracts. Generate HTML reports of analyses. AffyID 41046_s_at ACCNUM X95808 LOCUSID 9203 SYMBOL ZNF261 GENENAME zinc finger protein 261 MAP Xq13.1 PMID GO GO: GO: GO: many other mappings Metadata package hgu95av2 mappings between different gene identifiers for hgu95av2 chip.
multtest package Multiple hypothesis testing Control type I error rate by using e.g. Bonferroni method
heatmap mva package -clustering
mva package – principal component analysis
Installation 1.Main R software: download from CRAN (cran.r-project.org), use latest release, now cran.r-project.org 2.Bioconductor packages: download from Bioconductor (www.bioconductor.org), use latest release, now 1.3.www.bioconductor.org Available for Linux/Unix, Windows, and Mac OS.
Installation After installing R, install Bioconductor packages using getBioC install script. From R > source("http://www.bioconductor.org/getBioC.R") > getBioC() In general, R packages can be installed using the function install.packages. In Windows, can also use Packages pull- down menus.
User interaction R Command-line Widgets. Small-scale graphical user interfaces (GUI), providing point & click access for specific tasks. –E.g. File browsing and selection for data input, basic analyses.
Widgets tkMIAME tkphenoData tkSampleNames Reading in phenoData
Documentation and help R manuals and tutorials:available from the R website or on-line in an R session. R on-line help system: detailed on-line documentation, available in text, HTML, PDF, and LaTeX formats. > help.start() > help(lm) > ?hclust > apropos(mean) > example(hclust) > demo() > demo(image)
Short courses Bioconductor short courses –modular training segments on software and statistical methodology; –lectures notes, computer labs, and course packages available on WWW for self- instruction.
Vignettes Bioconductor has adopted a new documentation paradigm, the vignette. A vignette is an executable document consisting of a collection of code chunks and documentation text chunks. Vignettes provide dynamic, integrated, and reproducible statistical documents that can be automatically updated if either data or analyses are changed. Each Bioconductor package contains at least one vignette, providing task-oriented descriptions of the package's functionality.
Vignettes vExplorer HowTos: Task-oriented descriptions of package functionality. Executable documents consisting of documentation text and code chunks. Dynamic, integrated, and reproducible statistical documents. Can be used interactively – vExplorer. Generated using Sweave ( tools package).
References R cran.r-project.orgwww.r-project.orgcran.r-project.org –software (CRAN); –documentation; –newsletter: R News; –mailing list. Bioconductor –software, data, and documentation (vignettes); –training materials from short courses; –mailing list. Personal
acknowledgements Robert Gentleman Department of Biostatistical Science, Dana Faber Cancer Institute, Boston Sandrine Dudoit Division Biostatistics, University of California, Berkeley