Large-Scale Expression Data Mining and Management Dr Ewan Hunter Senior Scientist (Europe)


1 Large-Scale Expression Data Mining and Management Dr Ewan Hunter Senior Scientist (Europe)

2 Silicon Genetics Founded in 1998 to provide scientists with software that efficiently analyzes, interprets, and manages large volumes of expression data

3 Customer List Continues to Grow Over 4000 customers at over 500 organizations, including leading research institutes, biotech and pharmaceutical companies: Pfizer, Stanford University, Merck & Co, Celera, Bristol-Myers Squibb, TGen, Novartis, NASA Ames Research, Cold Spring Harbor, UCLA, UCSF, UC Davis, Lawrence Berkeley Labs, Merck KGaA, Baylor College of Medicine, Applied Biosystems, Genentech, Cedars-Sinai, SRI International, Vancouver General Hospital, GlaxoSmithKline, Cornell University, AstraZeneca, US EPA, US NIH, US FDA, Wyeth/AHP, Roche, Schering-Plough, Boehringer Ingelheim, Bayer, Affymetrix, Swiss Array Consortium, Celgene, NERC, Biogen, Emory University, Aventis

4 Our recipe for success Corporate focus on expression informatics Independently owned and profitable since inception six years ago Customer-driven software development Responsive, knowledgeable and proactive technical support team Extensive experience in software implementation

5 The expression informatics leader Market leadership confirmed in recent GenomeWeb survey of 167 industry professionals.

6 Extensive citations in leading journals Year End GeneSpring citations appearing in leading peer-reviewed journals

7 Gene Expression Data Flow Validation Scanning Image Processing Data Processing Analysis & Visualization Silicon Genetics GeneSpring and GeNet Validation Normalization Scaling Error models Formatting Clustering ANOVA Class Prediction Pathways Cross validation Biochemical Literature Output from Affymetrix®, Clontech™, Agilent™ and others

8 Different data types must be integrated Raw Data –Residing in custom-developed databases –Residing in LIMS –Residing in flat-files/spreadsheets Sample and gene annotation –Residing in custom-developed databases –Residing in LIMS –Residing in flat-files/spreadsheets Pre-processed data –From third-party applications –From flat-files Analysis results –From GeneSpring –From flat-files –From third-party applications

9 GeNet as a Centralized Workspace

10 Automated synchronization capabilities Synchronize existing data repositories with GeNet using SampleLoader API Integrate data from LIMS systems and corporate databases Integrate pre-processed data from third-party applications

11 SampleLoader populating the workspace with annotated raw data

12 Integration of Standard Annotation with Sample Data Treatment Type Age Gender Array Design Stage Duration Concentration Dosage Compound Sample ID Author Time Disease Type Organ Tissue Type

13 Enforceable Annotation Standards via SampleLoader Compliance with MIAME or in-house annotation standards can be easily enforced (via XML DTD) Attributes can be chosen from a standard list complete with drop-down options Attributes can be indicated as required, recommended and optional
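
The required / recommended / optional distinction above can be illustrated with a small check. This is only a sketch in Python: the slides say enforcement is done via an XML DTD, whereas the dictionary `STANDARD`, the function `check_annotation`, and the attribute levels chosen here are hypothetical stand-ins, not SampleLoader's actual mechanism.

```python
# Hypothetical in-house standard: attribute name -> requirement level.
# The attribute names come from the annotation slide; the levels are assumed.
STANDARD = {
    "Sample ID": "required",
    "Organ": "required",
    "Treatment Type": "recommended",
    "Dosage": "optional",
}


def check_annotation(sample, standard=STANDARD):
    """Return (errors, warnings) for one sample's attribute dictionary."""
    errors, warnings = [], []
    for name, level in standard.items():
        value = sample.get(name)
        missing = value is None or str(value).strip() == ""
        if missing and level == "required":
            errors.append(f"missing required attribute: {name}")
        elif missing and level == "recommended":
            warnings.append(f"missing recommended attribute: {name}")
    # Attributes outside the standard list are flagged but not rejected.
    for name in sample:
        if name not in standard:
            warnings.append(f"attribute not in standard: {name}")
    return errors, warnings


errors, warnings = check_annotation({"Sample ID": "S-001", "Dosage": "10 mg/kg"})
print(errors)
print(warnings)
```

A sample would be rejected only when `errors` is non-empty; warnings could be shown to the submitter without blocking the upload.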

14 Integration of clinical information to create a searchable and standardized repository

15 SampleLoader populating the workspace with pre-processed data

16 Mining GeNet with GeneSpring

17 The GeneSpring client – list of powerful analysis capabilities continues to grow Scripting Language (automated analysis) Ontology and Homology Tools MIAME support (Published Meta Data Structure) Two-Way ANOVA Post-Hoc tests Find Similar Samples Algorithm Boolean Filtering PCA on conditions SVM classifier Multiple Clustering Algorithms –QT clustering –Hierarchical –K-means –SOM

18 Seamless interaction with GeNet Sample data residing in GeNet can be accessed from the Sample Manager in GeneSpring Users can easily search for samples of interest and proceed with analysis in GeneSpring

19 Populating the workspace with analysis results

20 The GeNet Workspace

21 Configuring data upload to GeNet Data upload and download via GeneSpring to and from GeNet is seamless –Users can upload important analysis results to GeNet with a click of the mouse –Data residing in GeNet is automatically available to the GeneSpring user upon login Data upload to GeNet via SampleLoader can be configured via customizable XML files –“SampleLoader Configuration Files” –Nightly cron jobs can be run to synchronize existing data repositories with GeNet

22 More on Integration… External Program Interface The External Program Interface (EPI) allows you to run external programs from within GeneSpring Used to integrate GeneSpring directly with other analysis and visualization packages –Ex: SAS, S+, R, JMP, MATLAB Extends out-of-the-box capabilities of GeneSpring –Interface with custom code (C, C++, Java, Perl, etc.) Results can be stored in GeNet

23 Example EPIs Ariadne’s PathwayAssist Lion’s SRS and LTE SAS Bioconductor/R S+

24 Utilizing the Full Power of the GeNet Workspace

25 Custom Integration and Future APIs Further customization and integration work can be performed with the help of Silicon Genetics’ Professional Services Additional APIs are in development that will allow other applications to query GeNet for data –Key customer input and assistance in developing specs is welcome and encouraged

26 Making GeNet a Workspace GeNet API – Architecture SOAP (Simple Object Access Protocol): “a lightweight XML-based messaging protocol used to encode the information in Web service request and response messages before sending them over a network.” (Webopedia) In addition we will distribute WSDL (Web Services Description Language) files that allow specialized applications to auto-generate routines in a variety of languages that generate GeNet-specific SOAP objects.
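
To make the SOAP description above concrete, here is a minimal sketch of how a client might encode a request as a SOAP 1.1 envelope using only the Python standard library. The operation name `addSampleAttachment`, its parameters, and the namespace `urn:example:genet` are invented for illustration; the slides do not describe the actual GeNet API operations, which would in practice be generated from the distributed WSDL files.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
GENET_NS = "urn:example:genet"  # hypothetical namespace, not from the slides


def build_request(operation, params):
    """Wrap an operation and its parameters in a SOAP envelope string."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{GENET_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(op, f"{{{GENET_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical call: attach a third-party QC report to a sample.
request_xml = build_request(
    "addSampleAttachment",
    {"sampleId": "S-001", "fileName": "qc_report.pdf"},
)
print(request_xml)
```

The resulting string would be sent as the body of an HTTP POST to the service endpoint; the response would be another envelope to parse in the same way.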

27 Making GeNet a Workspace GeNet API – Use Cases –Generating normalized samples using a third party application –Updating genomic annotations using a “custom spidering” application –Adding sample attachments created by third-party applications

28 Scenario #1 Biologist analyzes own data Biologist Data pre-processing and normalization Statistical Data Analysis Data visualization Clustering and pathway analysis Report Generation Results GeneLists P-values Fold Change Pathways Clusters Graphs

29 Example Workflow for Scenario #1 LIMS SampleLoader GeNet Biologist GeneSpring Annotated Data Finished Results

30 Scenario #2 Bioinformatician analyzes data for Biologist Bioinformatician Data pre-processing and normalization Statistical Data Analysis Clustering and pathway analysis Results GeneLists P-values Fold Change Pathways Clusters Graphs Biologist Data visualization Report Generation

31 Example Workflow for Scenario #2 LIMS GeNet SampleLoader Primary Results Raw Data Biologist API GeNet Viewer Finished Results 3rd party tool Finished Results Statistician/ Bioinformatician R SAS GeneSpring Primary Results EPI

32 Scenario #3 Analysis responsibilities shared Statistician/ Bioinformatician Data pre-processing and normalization Statistical data analysis Biologist Data visualization Clustering and pathway analysis Report Generation Results GeneLists P-values Fold Change Pathways Clusters Graphs

33 Catering to the statistician and biologist sharing analysis responsibilities Statistician/Bioinformatician –Can perform initial analysis with both 3rd party applications and GeneSpring –3rd party apps Probe-level analysis Data normalization and QC Statistical tests –GeneSpring Data normalization and QC Boolean Filters Statistical tests Biologist –Can complete analysis with GeneSpring Data visualization Clustering Ontology Builder Homology Tools Sequence support Pathway Analysis Final Report Generation Image Export

34 Example Workflow for Scenario #3 LIMS R, dChip Probe-level analysis SAS, S+ Data processing Statistical analysis Statistician/ Bioinformatician GeNet SampleLoader Primary Results Biologist GeneSpring Data & Analyses Finished Results Raw Data API

35 Key GeneSpring Features Automated integration of biological information Filtering Statistics Clustering PCA Class Prediction Sequence Analysis Pathway Analysis

36 Automated Ontology Builder Builds hierarchical, ontological classifications based on annotation in master gene table Genelists categorized by biological process, molecular function and cellular component

37 Automated Homology Builder Homology tables between organisms can be automatically created Aids in comparing functionality in model systems Aids in comparing identical genes from different technologies

38 Key GeneSpring Features Automated integration of biological information Filtering Statistics Clustering PCA Class Prediction Sequence Analysis Pathway Analysis

39 Statistics in GeneSpring Basic Statistics –Mean –Standard Deviation –Standard Error One-sample t-test p-values with Multiple Testing Correction option One-way and Two-way ANOVA with Multiple Testing Correction option Post-Hoc tests Global Error Model-derived Statistics Similar Lists p-values Correlation for Similar Samples External Program Interfaces to other statistical packages
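
As a rough illustration of the basic statistics and the multiple testing correction option listed above, the sketch below computes the mean, standard deviation, standard error and one-sample t statistic for one gene, then adjusts a set of p-values. The slides do not say which correction GeneSpring applies; the Benjamini-Hochberg procedure used here is an assumption chosen because it is a common default, and the function names are illustrative only.

```python
import math
import statistics


def one_sample_t(values, mu0=0.0):
    """Return (mean, sd, se, t) for a one-sample t-test against mu0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1)
    se = sd / math.sqrt(n)          # standard error of the mean
    t = (mean - mu0) / se
    return mean, sd, se, t


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


mean, sd, se, t = one_sample_t([1.0, 2.0, 3.0])
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(mean, sd, se, t)
print(adj)
```

Converting the t statistic to a p-value needs the t distribution, which is left out here to keep the sketch free of third-party dependencies.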

40 Easy to execute ANOVA tests User can execute both 1-way and 2-way ANOVA tests from a simple interface Choose 1-way or 2-way test Choose variable to test Choose test type Choose MTC Choose Post-hoc for 1-way ANOVA Run test

41 Easy to interpret ANOVA results Results from 2-way ANOVA are returned in a spreadsheet format Lists can be saved and viewed in GeneSpring or displayed in a Venn Diagram Post-hoc test summary by groups

42 Key GeneSpring Features Automated integration of biological information Filtering Statistics Clustering PCA Class Prediction Sequence Analysis Pathway Analysis

43 An impressive list of clustering options Gene Tree Condition Tree K-means SOM QT clustering
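
Of the clustering options listed, K-means is the simplest to show in a few lines. The following is a generic textbook version (Lloyd's algorithm) with a naive deterministic start, not GeneSpring's implementation; the function name, the fixed iteration count and the toy expression profiles are all assumptions for illustration.

```python
import math


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]  # naive start: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy expression profiles for four genes across three conditions.
profiles = [[0.1, 0.2, 0.1], [5.0, 5.2, 4.9], [0.0, 0.3, 0.2], [5.1, 4.8, 5.0]]
assignment, centroids = kmeans(profiles, k=2)
print(assignment)
```

Real tools typically use randomized or smarter initialization and stop on convergence rather than after a fixed number of iterations.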

44 Key GeneSpring Features Automated integration of biological information Filtering Statistics Clustering PCA Class Prediction Sequence Analysis Pathway Analysis

45 Principal Components Analysis GeneSpring enables the user to easily perform both PCA on genes and PCA on conditions

46 Key GeneSpring Features Automated integration of biological information Filtering Statistics Clustering PCA Class Prediction Sequence Analysis Pathway Analysis

47 Class Prediction Used to identify genes that discriminate well among phenotypes Used for quality control or class discovery –Samples representing potential outliers Uses K-nearest neighbors algorithm Leave-one-out cross validation tests accuracy of prediction rule
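
The combination named above, K-nearest neighbors with leave-one-out cross validation, can be sketched as follows. This is a generic version using Euclidean distance and majority voting; the distance measure, the value of `k`, and the toy samples are assumptions, since the slides do not give GeneSpring's specific settings.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (vector, label). Majority label among k nearest."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample in turn and predict it from the rest."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if knn_predict(rest, vector, k) == label:
            correct += 1
    return correct / len(samples)


# Toy data: two phenotypes, two discriminating genes per sample.
samples = [
    ([1.0, 1.1], "A"), ([0.9, 1.0], "A"), ([1.2, 0.8], "A"),
    ([3.0, 3.2], "B"), ([3.1, 2.9], "B"), ([2.8, 3.0], "B"),
]
accuracy = leave_one_out_accuracy(samples, k=3)
print(accuracy)
```

A sample that is consistently misclassified when held out is a candidate outlier, which is how the same procedure can serve the quality-control use mentioned on the slide.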

48 Key GeneSpring Features Automated integration of biological information Filtering Statistics Clustering PCA Class Prediction Sequence Analysis Pathway Analysis

49 Sequence Analysis Entire sequence information for entire organisms can be loaded and visualized Advanced searches for potential regulatory sequences and specific promoters can be performed Genes and sequences can be visualized on organism-specific chromosomal maps

50 Key GeneSpring Features Automated integration of biological information Filtering Statistics Clustering PCA Class Prediction Sequence Analysis Pathway Analysis

51 Powerful Pathway Analysis Powerful capabilities to visualize and manipulate pathways Genes automatically placed on KEGG and GenMAPP pathways and linked with expression data Pathways are easily converted to genelists for further analysis Seamless integration with popular pathway tools, such as Ariadne Genomics’ PathwayAssist Utilizes Natural Language Processing (NLP) to data mine biological publications via PubMed (NIH db)

52 Utilizing GeNet as a Workspace These powerful analysis features in GeneSpring can be easily extended to mine the entire GeNet repository of data –“Database” converted to a “Workspace” Complex querying through scripting function Compute Farm for computationally intensive analysis procedures

53 Mining Data in the Workspace using GeneSpring Example: Find Similar Samples Sample Pool can be chosen from all samples stored locally and on GeNet

54 Interpreting the Results Results will show correlation value of all samples in the user-specified pool to the target sample Is the sample of interest highly correlated to a sample that previously exhibited toxicity in a different study? Is the sample of interest highly correlated to a sample that demonstrated high therapeutic potential?
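
A ranking of pool samples by correlation to a target sample could look like the sketch below. The slides only say that a "correlation value" is reported; Pearson correlation is assumed here, and the sample names and expression values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_similar(target, pool):
    """Return (sample name, correlation) pairs, most similar first."""
    scores = {name: pearson(target, profile) for name, profile in pool.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


target = [1.0, 2.0, 3.0, 4.0]
pool = {  # hypothetical sample names and values
    "tox_study_07": [2.0, 4.0, 6.0, 8.0],
    "control_12": [4.0, 3.0, 2.0, 1.0],
    "dose_low_03": [1.0, 3.0, 2.0, 4.0],
}
ranked = rank_similar(target, pool)
for name, r in ranked:
    print(f"{name}: {r:.2f}")
```

A high correlation to a sample annotated with a known toxic response would be the cue to look more closely, as the questions on the slide suggest.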

55 GeNet as a Workspace Example: Find Similar Gene Lists When creating a new gene list in GeneSpring, GeNet is automatically queried to find if previously created lists are statistically similar to the new list Does my genelist have a significant number of similar members to an important genelist in a previous study?
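
One common way to decide whether two gene lists share "a significant number" of members is a hypergeometric tail probability. The slides do not say which test GeNet uses, so the approach, the function name `overlap_p_value`, and the toy lists below are assumptions made for the sake of a worked example.

```python
from math import comb


def overlap_p_value(list_a, list_b, universe_size):
    """Return (overlap size, P(overlap >= observed)) under random draws.

    Assumes list_b is a random draw of len(list_b) genes from a universe
    of universe_size genes, of which len(list_a) belong to list_a.
    """
    a, b = set(list_a), set(list_b)
    k = len(a & b)
    n_a, n_b = len(a), len(b)
    total = comb(universe_size, n_b)
    tail = sum(
        comb(n_a, i) * comb(universe_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return k, tail / total


new_list = ["g1", "g2", "g3", "g4", "g5"]
old_list = ["g2", "g3", "g4", "g5", "g9"]
k, p = overlap_p_value(new_list, old_list, universe_size=20)
print(k, p)
```

With real arrays the universe would be the thousands of genes on the chip, and the same calculation would be repeated against every stored list, which makes a multiple testing correction advisable.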

56 Complex Querying Capabilities Leverage the knowledge of the entire organization by performing highly customizable, database-wide queries through the scripting function Queries can be performed on a large scale using the entire GeNet repository or on a smaller scale using the individual researcher's sample repository

57 Automating Analyses through Scripts Complex and routine analyses with a series of steps can be bundled into one push-button operation with the ScriptEditor™

58 A Flexible Visual Scripting Language Power users create standard scripts using our powerful visual scripting language Scripts are easily executed by novice GeneSpring users Scripts can execute any external program/algorithm Scripts can be bundled within scripts All scripts and results can be stored and shared on GeNet

59 Simple script execution Choose to run script locally or remotely Specify inputs using mini-navigator Script Description If needed, specify knob values

60 The BioScript Library Major Categories in BioScript Library 1. QC Filtering 2. Study-Centric Queries - Analysis of groups – Multi-group comparison - Analysis of series – Single time/dose series analysis 3. Biological Queries - Biological fold analysis - Gene Ontology (GO) analysis - Biological pathway analysis - Sequence… promoter analysis 4. Gene-Centric Queries 5. GeNet-Wide Queries 6. Analysis via External Applications

61 Accessing Scripts in GeneSpring The BioScript Library, as well as custom scripts, is available in the GeneSpring Navigator or via a connection to GeNet

62 An architecture designed specifically for high-volume data mining Our unique architecture gives users the best of both worlds –Flexibility and responsiveness of the desktop –High power and administrative ease of the server Effectively bypasses disadvantages associated with desktop-only and server-only systems –Computational limitations of lower-memory PCs –Slowness of server and long wait-times for results

63 One-click Remote Computation Computationally intensive analyses can be sent to a compute farm with one click You’re now free to keep working while the analysis is completed

64 GeNet Public Data Repository (GPDR) Fully annotated, ready-for-analysis public data repository with over 6,500 samples available to GeNet customers

65 Open and Scalable System Open system based on industry standards Architecture scales easily to support a large number of users –Easy to connect additional clients to GeNet –Easy to add Remote Servers as computational needs grow –SampleLoader can connect to an unlimited number of external data sources Powerful scripting language and external program interface allow for further customization and standardization API for querying GeNet allows for greater integrative capacities

66 A look into the future… GeNet will intelligently integrate other data types that are valuable to investigate in the context of gene expression data Genotyping data integration in prototype stage –SNP analysis tool Extended support for: –Proteomics data –Metabolomics data –Diagnostic and clinical statistics –Sequence data –Other data types…

