Presentation on theme: "Classical and myGrid approaches to data mining in bioinformatics"— Presentation transcript:
1 Classical and myGrid approaches to data mining in bioinformatics Peter LiSchool of Computing ScienceUniversity of Newcastle upon Tyne
2 Outline Real life bioinformatics use cases Graves’ disease Williams-Beuren syndromeClassical approach to bioinformatics data analysisBioinformatics workflowsUsing myGrid workflows for data analysisIssues for further work
3 Application scenario1 Graves’ disease Simon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle
5 In silico experiments in Graves’ disease Identification of genes:Microarray data analysisGene annotation pipelineDesign of genotype assays for SNP variations in genesDistributed bioinformatics services from Japan, Hong Kong, various sites in UKDifferent data types: textual, image, gene expression, etc.
6 Classical approach to the bioinformatics of Graves’ disease Data Analysis - MicroarrayImport microarray data to Affymetrix data mining tool, run analyses and select geneStudy annotations for many different genesUsing web html based resourcesExperiment design to test hypotheses Find restriction sites and design primers by eye for genotyping experimentsSelect gene and visually examine SNPS lying within gene
7 Application scenario2 Williams-Beuren Syndrome Hannah Tipney, May Tassabehji, St Mary’s Hospital, Manchester, UKGene prediction; gene and protein annotationServices from USA, Japan, various sites in UK
8 Williams-Beuren Syndrome (WBS) Contiguous sporadic gene deletion disorder1/20,000 live births, caused by unequal crossover (homologous recombination) during meiosisHaploinsufficiency of the region results in the phenotypeMultisystem phenotype – muscular, nervous, circulatory systemsCharacteristic facial featuresUnique cognitive profileMental retardation (IQ , mean~60, ‘normal’ mean ~ 100 )Outgoing personality, friendly nature, ‘charming’
9 Williams-Beuren Syndrome Microdeletion STAG3PMS2LBlock AFKBP6TPOM121NOLR1Block CGTF2IPNCF1PGTF2IRD2PBlock BWilliams-Beuren Syndrome MicrodeletionEicher E, Clark R & She, X An Assessment of the Sequence Gaps: Unfinished Business in a Finished Human Genome. Nature Genetics Reviews (2004) 5:Hillier L et al. The DNA Sequence of Human Chromosome 7. Nature (2003) 424:C-cenA-cenB-cenChr 7 ~155 Mb~1.5 Mb7q11.23C-midB-midA-midB-telA-telC-telWBSCR1/E1f4HWBSCR5/LABPOM121NOLR1WBSCR14WBSCR18WBSCR22WBSCR21GTF2IRD1FKBP6BAZ1BBCL7BGTF2IGTF2IRD2FZD9TBL2STX1ACLDN3CLDN4LIMK1CYLN2NCF1ELNRFC2*WBSSVASPatient deletionsCTA-315H11CTB-51J22‘Gap’Physical Map
10 Filling a genomic gap in Silico 12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaaIdentify new, overlapping sequences of interestCharacterise the new sequences at nucleotide and amino acid levelCutting and pasting between numerous web-based services i.e. BLAST, InterProScan etc
11 Classical approachFrequently repeated - info rapidly added to public databasesTime consuming and mundaneDon’t always get resultsHuge amount of interrelated data is produced – handled in notebooks and files saved to local hard driveMuch knowledge remains undocumented:Bioinformatician does the analysisAdvantages:Specialist human intervention at every step, quick and easy accessto distributed servicesDisadvantages:Labour intensive, time consuming, highly repetitive and error proneprocess, tacit procedure so difficult to share both protocol and results
12 In silico experiments in bioinformatics Bioinformatics analyses - in silico experiments - workflowsResources/ServicesGenscanEMBLClustal-WBLASTExample workflow: Investigate the evolutionary relationships between proteinsMultiplesequencealignmentClustal-WProteinsequencesQuery
13 Why workflows and services? Workflow = general technique for describing and enacting a processWorkflow = describes what you want to do, not how you want to do itWeb Service = how you want to do itWeb Service = automated programmatic internet access to applicationsAutomationCapturing processes in an explicit mannerTedium! Computers don’t get bored/distracted/hungry/impatient!Saves repeated time and effortModification, maintenance, substitution and personalisationEasy to share, explain, relocate, reuse and buildAvailable to wider audience: don’t need to be a coder, just need to know how to do BioinformaticsReleases Scientists/Bioinformaticians to do other workRecordProvenance: what the data is like, where it came from, its qualityManagement of data (LSID - Life Science IDentifiers)
14 myGridEPSRC e-Science pilot research projectManchester, Newcastle, Sheffield, Southampton, Nottingham, EBI and industrial partners.‘Targeted to develop open source software to support personalised in silico experiments in biology on a Grid.’Which means enabling scientists to….Distributed computing – machines, tools, databanks, peopleProvenance and data managementWorkflow enactment and notificationA virtual lab ‘workbench’, a toolkit which serves life science communities.
15 Workflow components Freefluo Freefluo Workflow engine to run workflows SOAPLABWeb ServiceAny Applicatione.g. DDBJ BLASTScufl Simple Conceptual Unified Flow LanguageTaverna Writing, running workflows & examining resultsSOAPLAB Makes applications available
16 The workflow experience Have workflows delivered on their promise?YES!Correct and biologically meaningful resultsAutomationSaved time, increased productivityBut you still require humans!SharingOther people have used and want to develop the workflowsChange of work practisesPost hoc analysis. Don’t analyse data piece by piece receive all data all at onceData stored and collected in a more standardised mannerResults management
17 The workflow experience Activation Energy versus Reusability trade-offLack of ‘available’ services, levels of redundancy can be limitedBut once available can be reused for the greater good of the communityInstability of external bioinformatics web servicesResearch levelReliant on other peoples serversTaverna can retry or substitute before graceful failureNeed Shim services in workflows
18 Modelling in silico experiments as workflows requires Shims Unrecorded ‘steps’ which aren’t realised until attempting to build somethingEnable services to fit togetherSemantic, syntactic and format typing of data in workflowData has to be filtered, transformed, parsed for consumption by servicesAnnotation PipelineGOMEDLINEKEGGSwissProtInterProPDBBlastHGBASEQuery
19 Shims‘I want to identify new sequences which overlap with my query sequence and determine if they are useful’Sequence database entryFasta format sequenceGenbank format sequenceAlignment of full query sequence V full ‘new’ sequenceOld BLAST resultSimplify and CompareListerRetrieveBLAST2Sequencei.e. lastknown 3000bpMaskBLASTIdentify new sequences and determine their degree of identity
20 Biological results from WB syndrome Four workflow cycles totalling ~ 10 hoursThe gap was correctly closed and all known features identifiedCLDN4CLDN3STX1AWBSCR18WBSCR21WBSCR22WBSCR24WBSCR27WBSCR28WBSCR14ELNCTA-315H11CTB-51J22RP11-622P13RP11-148M21RP11-731K22314,004bp extensionAll nine known genes identified(40/45 exons identified)
21 GD results: Differential expression and variations of the I kappa B-epsilon gene 3’ UTR SNP – 3948 C/An=30Mean NFKBIE expression levels -Controls: / (SEM)GD: / (SEM)P= (T-test)- Mnl restriction site- χ2 = 9.1, p = , Odds Ratio = 1.4
22 ConclusionsIt works – a new tool has been developed which is being utilised by biologistsMore regularly undertaken, less mundane, less error proneMore systematic collection and analysis of resultsIncreased productivityServices: only as good as the individual services, lots of them, we don’t own them, many are unique and at a single site, research level software, reliant on other peoples servicesActivation energy
23 Issues and future directions1 Transfer of large data sets between services (microarray data)Passing data by value breaks Web servicesStreaming (Inferno)Pass by reference and use third party data transfer (GridFTP, LSID)
24 Issues and future directions2 Data visualisationHow to visualise results mined from data using workflows?
25 Workflow results Large amounts of information (or datatypes) Results are implicitly linked within itselfResults are implicitly linked outside of itselfGenomic sequence is central co-ordinating point, but there are a number of different co-ordinate systemsNeed holistic view
26 What’s the problem? No domain model in myGrid We need a model for visualisationBut domain models are hardIt’s not clear that the domain model should be in the middleware
27 What have we done!? Bioinformatics PM (pre myGrid) One big distributed data heterogeneity and integration problem
28 What have we done!? Bioinformatics PM (post myGrid) One big data heterogeneity and integration problem
29 Initial Solutions Take the data Use something (Perl script or an MSc student) to map the data into a (partial) data modelVisualise results which are linked via HTML pages
30 A second solutionStart to build visualisation information into the workflow, using beanshell scripts.But what if we change the workflow?
31 Summary Domain models are hard Workflows can obfuscate the model Visualisation requires oneWe can build some knowledge of a domain model into the workflowIs there a better way?
32 Acknowledgements Core Matthew Addis, Nedim Alpdemir, Neil Davis, Alvaro Fernandes, Justin Ferris, Robert Gaizaukaus, Kevin Glover, Carole Goble, Chris Greenhalgh, Mark Greenwood, Yikun Guo, Ananth Krishna, Peter Li, Phillip Lord, Darren Marvin, Simon Miles, Luc Moreau, Arijit Mukherjee, Tom Oinn, Juri Papay, Savas Parastatidis, Norman Paton, Terry Payne, Matthew Pocock Milena Radenkovic, Stefan Rennick-Egglestone, Peter Rice, Martin Senger, Nick Sharman, Robert Stevens, Victor Tan, Anil Wipat, Paul Watson and Chris Wroe.UsersSimon Pearce and Claire Jennings, Institute of Human Genetics School of Clinical Medical Sciences, University of Newcastle, UKHannah Tipney, May Tassabehji, Andy Brass, St Mary’s Hospital, Manchester, UKPostgraduatesMartin Szomszor, Duncan Hull, Jun Zhao, Pinar Alper, John Dickman, Keith Flanagan, Antoon Goderis, Tracy Craddock, Alastair Hampshire