
Improving the Reuse of Scientific Workflows and their By-products


1 Improving the Reuse of Scientific Workflows and their By-products
Xiaorong Xiang, National Evolutionary Synthesis Center (NESCent): Duke University, University of North Carolina - Chapel Hill, and North Carolina State University
Gregory Madey, Department of Computer Science and Engineering, University of Notre Dame
2007 IEEE International Conference on Web Services (ICWS 2007), Salt Lake City, Utah, July 2007
Supported in part by the Indiana Center for Insect Genomics (ICIG) & the Indiana 21st Century Fund

2 Collaborators: Xiaorong Xiang & Jeanne Romero-Severson

3 Outline: two parts
Production system (MoGServ) for bioinformatics workflow
- Bioinformatics application
- Productivity improvement
Prototype system exploring ideas for end-user composition
- Workflow reuse
- Knowledge management/discovery

4 Bioinformatics today
- Rapidly accumulating data: DNA sequences, contigs, expression data, annotations, etc.
- Non-standard, independently developed, heterogeneous data sources
- Data sharing and security
- Productivity problem!
From the article "Genome Sequencing vs. Moore's Law: Cyber Challenges for the Next Decade" by Folker Meyer, CTWatch Quarterly, volume 2, number 3, August 2006

5 SOA in Bioinformatics
- More community efforts needed to provide more shared and reliable services
- More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.
Recent exposure of data & analysis tools as services
- Large public databases and bioinformatics tools
Middleware projects
- Provide infrastructure to compose, manage, execute, and connect the distributed services

6 Mother of Green (MoG) project
Biological science
- In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame
- Study the deep phylogeny of plastids
Computer science
- Provide an environment to support scientists' investigations
- A case study of using SOA for data and application integration
- A prototype for future research in the service-oriented architecture domain

7 Mother of Green
- Malaria causes 1.5 - 2.7 million deaths every year
- 3,000 children under age five die of malaria every day
- Plasmodium falciparum (a protozoan parasite) causes human malaria
- Drug resistance is a world-wide problem
- Targeted drug design through phylogenomics
Figure label: P. falciparum

8 Mother of Green
- P. falciparum has three genomes: nuclear, mitochondrial, plastid
- Animals and insects have only two
- Target the third genome
- No harm to animals
- New antimalarial drug
- High risk, high tech, high payoff
J. Romero-Severson, Department of Biological Sciences
Greg Madey & Xiaorong Xiang, Department of Computer Science & Engineering

9 Mother of Green
Plastids are the third genome
- Intracellular organelles
- Terrestrial plants, algae, apicomplexans
Functions in plants and algae
- Photosynthesis
- Oxidation of water
- Reduction of NADP
- Synthesis of ATP
- Fatty acid biosynthesis
- Aromatic amino acid biosynthesis
Functions in apicomplexans: ?
Figure labels: Chloroplast in plant cell; Plastid in Toxoplasma sp.; Apicoplast in P. falciparum

10 Mother of Green
- The apicoplast appears to code for <30 proteins
- Repair, replication and transcription proteins
- Why is the apicoplast essential?

11 Mother of Green Phylogenomics
- Find the ancestors of the apicoplast
- Identify genes in the ancestors
- Determine gene function
- Look for these genes in the P. falciparum nucleus
- Then study regulatory mechanisms in candidate genes

12 Phylogenomics of plastids
- Very old lineage (> 2.5 billion years)
- Cyanobacterial ancestor
Three main plastid lineages
- Glaucophytes: group of freshwater algae; chloroplast resembles intact cyanobacteria
- Chlorophytes: green plant lineage; chloroplast genome reduced; many chloroplast genes now in nuclear genome
- Rhodophytes: red algal lineage; chloroplast genome bigger than in green plants
Oomycetes, Apicomplexans

13 Phylogenomics of plastids
- One cyanobacterial ancestor? Many?
- Lineages are not linear
Figure labels: One plastid origin; Multiple plastid origins

14 The process of endosymbiosis. Horizontal Gene Transfer (arrows) from the plastid to the nucleus. The nucleomorph is a remnant of the original endosymbiont nucleus.
Figure labels: Cyanobacteria; Primitive eukaryote; Endosymbiont plastid; Nucleus; Second eukaryote; Secondary endosymbionts; Nucleomorph; Secondary nonphotosynthetic endosymbiont; Plastid disappears

15 Tertiary endosymbiosis. Horizontal Gene Transfer
Figure labels: Secondary endosymbiont; Third eukaryote; Tertiary endosymbionts; Tertiary nonphotosynthetic endosymbiont; Plastid disappears; P. falciparum

16 The information gathering problem
Rapid accumulation of raw sequence information
- ~100 sequenced chloroplast genomes
- ~57 sequenced cyanobacterial genomes
- Rate of accumulation is increasing
Information accumulates faster than analyses finish
Information in forms not readily accessible
Solution
- Semi-automated web services
- "Smart" web services
- Semantic web

17 A typical in-silico investigation: data-driven research
A: Query complete genome sequences for a given taxon
B: Query protein-coding genes for each genome sequence
C: Eliminate vector sequences
D: Sequence alignment
E: Phylogenetic analysis

18 Time-consuming manual web-based operations
- Data collection: copy & paste!
- Analysis tool usage: copy & paste!
- Experiment data recording: copy & paste!
- Repetitive experiments for scientific discovery: copy & paste!
- Repeat as new data becomes available: copy & paste!

19 MoGServ system architecture
MoGServ interface
- Web interface
- Application interface
MoGServ middle layer
- Data access storage
- Data and analysis services
- Service and workflow registry
- Indexing and querying metadata
- Service and workflow enactment
Acting in two roles: service requester and service provider

20 MoGServ System Architecture
Diagram labels: Web Interface; Applications; Application Server; MoGServ Middle Layer; Data Access Services; Data Analysis Services; Job Manager; Job Launcher; Service/Workflow Registry; Metadata Search; Local Data Storage; Workflow/SOAP Engines; Services; Services Access Client; Data/Services Providers (NCBI, DDBJ, EMBL, Others)

21 Data storage and access services
Local database
- Integrates data from multiple data sources according to scientists' interests
- Supports repetitive investigations against several subsets of sequences
- Avoids network traffic and service failures when retrieving data on the fly from public data sources
The data in the local database is accessed through services

22 Service and workflow registry
A table-based description with the necessary properties
- Text description
- Service location
- Input/output
- Provider
- Version
- Algorithm
- Invocation method
Not intended for supporting service discovery or composition
Answers end users' questions about their results
- Provenance: "Which algorithm was used to generate the data and what is the source of the input data?"
A repository of services and workflows used by local application developers
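A registry row of this kind can be sketched as a small record type. This is only an illustration under stated assumptions: the class names, the `provenance` helper, and all example values (including the `example.org` location) are hypothetical and are not taken from MoGServ.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RegistryEntry:
    """One registry row; the fields follow the property list on the slide."""
    name: str
    description: str
    location: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    provider: str
    version: str
    algorithm: str
    invocation: str


class Registry:
    """In-memory stand-in for the table-based registry (hypothetical API)."""

    def __init__(self) -> None:
        self._rows: Dict[str, RegistryEntry] = {}

    def register(self, entry: RegistryEntry) -> None:
        self._rows[entry.name] = entry

    def lookup(self, name: str) -> Optional[RegistryEntry]:
        return self._rows.get(name)

    def provenance(self, name: str) -> str:
        """Answer 'which algorithm produced this result?' for a named service."""
        row = self._rows[name]
        return f"{row.name} {row.version}: algorithm={row.algorithm}, provider={row.provider}"


registry = Registry()
registry.register(RegistryEntry(
    name="ClustalW",
    description="Multiple sequence alignment",  # illustrative values only
    location="http://example.org/services/ClustalW?wsdl",
    inputs=("setid",),
    outputs=("alignment_report",),
    provider="MoG",
    version="1.0",
    algorithm="clustalw",
    invocation="SOAP",
))
print(registry.provenance("ClustalW"))
```

Keeping the algorithm and provider on the row is what lets a provenance question be answered from the registry alone, without re-running anything.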

23 Indexing and querying metadata
Metadata
- Service and workflow descriptions
- Descriptions of sequence data, in order to track the origin of data
- Experimental data: output, input, and intermediate data
Indexing and querying with keywords
- Lucene
- Implemented as services
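Keyword indexing of metadata can be illustrated with a tiny inverted index. This is a plain-Python stand-in for the idea, not the Lucene API; the class, the document identifiers, and the sample descriptions are all assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


class KeywordIndex:
    """Minimal inverted index: token -> set of document ids."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in tokenize(text):
            self._postings[token].add(doc_id)

    def search(self, query: str) -> Set[str]:
        """Return ids of documents containing every query token (AND semantics)."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        hits = set(self._postings.get(tokens[0], set()))
        for token in tokens[1:]:
            hits &= self._postings.get(token, set())
        return hits


index = KeywordIndex()
index.add("service:ClustalW", "ClustalW multiple sequence alignment service")
index.add("job:17", "alignment report for ATP gene sequence set")
print(sorted(index.search("alignment")))
print(sorted(index.search("alignment atp")))
```

A real deployment would add ranking, stemming, and persistence, which is what a library such as Lucene provides.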

24 Service and workflow enactment
- Input: parameters, task name, timer
- Job Manager: finds the service/workflow definition in the Service/Workflow Registry using the task name, then forms a job description
- Output: job ID
- Job Launcher: receives the job information and starts instances of workflow/service engines
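The enactment flow (task name and parameters in, job ID out, launcher runs the job) might look roughly like the sketch below. The `JobManager` class, its method names, and the `align` task are hypothetical; the timer input is left out, and a plain Python callable stands in for a workflow or service engine.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class JobDescription:
    job_id: int
    task_name: str
    parameters: Dict[str, Any]
    status: str = "queued"
    result: Any = None


class JobManager:
    """Looks up a task definition by name, forms a job description, returns a job ID."""

    def __init__(self, registry: Dict[str, Callable[..., Any]]) -> None:
        self._registry = registry          # task name -> callable standing in for an engine
        self._ids = itertools.count(1)
        self.jobs: Dict[int, JobDescription] = {}

    def submit(self, task_name: str, parameters: Dict[str, Any]) -> int:
        if task_name not in self._registry:
            raise KeyError(f"no service or workflow registered as {task_name!r}")
        job = JobDescription(next(self._ids), task_name, dict(parameters))
        self.jobs[job.job_id] = job
        return job.job_id

    def launch(self, job_id: int) -> JobDescription:
        """Play the Job Launcher role: run the job and record its outcome."""
        job = self.jobs[job_id]
        job.status = "running"
        job.result = self._registry[job.task_name](**job.parameters)
        job.status = "done"
        return job


manager = JobManager({"align": lambda set_id: f"alignment of set {set_id}"})
job_id = manager.submit("align", {"set_id": 42})
finished = manager.launch(job_id)
print(job_id, finished.status, finished.result)
```

Returning the job ID before execution is what allows a web interface to poll for results later instead of blocking on a long analysis.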

25 Implementation
Development and deployment
- J2EE, JSP, XSLT
- Tomcat 5.0.18 / Axis 1.2
Database
- PostgreSQL 8.1
Index and search of metadata
- Apache Lucene library
Service implementation
- Java2WSDL
- Wrap command-line applications with the JLaunch library
Workflow
- Taverna workbench, part of the myGrid project
- Freefluo workflow engine

26 Data and services
Services, workflows
- Data collection from remote databases
- Query local database
- Data analysis tools: blast, clustalw
- Data format conversion: readseq
- Management of data sets and jobs
- Download and upload
Data
- Complete genome sequences
- ATP gene sequences
- Sequence sets
- Saved jobs

27 Taverna workbench

28 A workflow created using the Taverna workbench tool

29 Improvement opportunities
- Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data
- Integrate semantic web technology to support end users' workflow creation based on their knowledge of the scientific domain
- Support users with limited knowledge of scientific processes
- Record various workflow representations
- Facilitate the discovery and reuse of prior workflows: knowledge management, knowledge discovery

30 Service composition and workflows
Service composition
- Ad hoc
- Semi-automated: semantic annotation + reasoning
- Automated: semantic annotation + planning
Scientific workflows
- Workflows composed on a service-oriented architecture to assist scientists in accessing and analyzing data

31 Current workflow management systems
Existing workflow management systems and bioinformatics middleware
- Taverna, Kepler, Triana, Pegasus
- Design, execute, monitor, re-run
Support ad hoc, semi-automated, and automated service discovery and composition from scratch

32 Our approach
Reuse the verified knowledge and workflows in the community
- Increase the correctness of composed workflows over time
- Provide more accurate guidelines for users
A four-level hierarchical workflow structure
An enhanced workflow system

33 Three user-defined workflows from different views
Question: "Are gene genealogies for ATP subunits α, β, and γ different?"
- Workflow A, defined by a less experienced user using the functional definition of services: Retrieving → Aligning
- Workflow B, defined by an intermediate user with executable services: queryGene → clustalW
- Workflow C, defined by an expert user with two extra executable services to ensure the accurate output of the biological process: queryGene → setIds → setFilter → clustalW

34 Enhanced workflow system (MoGServ)
Diagram components:
- User: creates an abstract workflow using the ontology
- Service annotator: annotates services using the ontology
- Ontology and OWL DL reasoner
- Semantics-enabled service registry
- Semantics-enabled service discovery and service matchmaking: find appropriate service
- Workflow composer (software agent/experienced users): abstract workflow to concrete workflow
- Workflow execution engine
- Data provenance management: collects and manages information about data origination
- Knowledge base management
- Knowledge discovery

35 Our hierarchical workflow structure
- Abstract workflow (Task A, Task B)
- Concrete workflow (Service A, Service B, Service C, Service D): encode and convert the high-level definition to a low-level executable
- Optimal workflow (Service A, Service B, Service C', Service D): replace individual services with their optimal alternatives
- Workflow instance (input, Service A, Service B, Service C', Service D, output): invoke a workflow with specific input data and record the data provenance and the performance of services and workflows
Diagram label: Pegasus workflow structure

36 Reusable knowledge
- Connectivity: helps to convert from abstract workflow to concrete workflow
- Alternative services: helps to convert from concrete workflow to optimal workflow
- Quality profile of services: helps discover optimal workflows
- Mapping of abstract workflow and concrete workflow: helps to choose reusable workflows
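How the reusable knowledge drives the move from abstract to concrete to optimal workflow can be sketched as two small functions. Everything here is an assumption for illustration: the function names, the `bindings` and `alternatives` tables, the `clustalW_alt` service, and the quality scores are hypothetical. Only the task and service names `retrieving`, `aligning`, `queryGene`, and `clustalW` come from the slides.

```python
from typing import Dict, List


def to_concrete(abstract: List[str], bindings: Dict[str, List[str]]) -> List[str]:
    """Expand each abstract task into the executable services bound to it."""
    concrete: List[str] = []
    for task in abstract:
        concrete.extend(bindings[task])
    return concrete


def to_optimal(concrete: List[str],
               alternatives: Dict[str, List[str]],
               quality: Dict[str, float]) -> List[str]:
    """Swap each service for its best-scoring alternative (higher score is better)."""
    optimal: List[str] = []
    for service in concrete:
        candidates = [service] + alternatives.get(service, [])
        # max() keeps the original service when scores tie, since it is listed first.
        optimal.append(max(candidates, key=lambda s: quality.get(s, 0.0)))
    return optimal


abstract = ["retrieving", "aligning"]
bindings = {"retrieving": ["queryGene"], "aligning": ["clustalW"]}
alternatives = {"clustalW": ["clustalW_alt"]}                          # hypothetical
quality = {"queryGene": 0.8, "clustalW": 0.7, "clustalW_alt": 0.9}    # hypothetical

concrete = to_concrete(abstract, bindings)
optimal = to_optimal(concrete, alternatives, quality)
print(concrete, optimal)
```

The abstract-to-concrete mapping table is itself reusable knowledge: once a binding has been verified in one workflow, it can be offered to the next user who defines the same abstract task.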

37 Connectivity identification (match detection)
Parameter notation: name(data type, semantic type)
Service: QueryLocal
  Operation: createSet
  performTask: mygrid:retrieving
  inputPara: Settype(String, mog:gene), Queryterm(String, null)
  outputPara: Setid(string, mog:geneset)
  useResource: MoG
Service: ClustalW
  Operation: runClustalWdf
  performTask: mygrid:aligning
  inputPara: Setid(String, mog:set), Sequencetype(String, mog:sequence)
  outputPara: filen(string, mygrid:sequence_alignment_report)
  useResource: EBI
Service: FormatConversion
  Operation: convert
  performTask: mygrid:translating
  inputPara: filen(String, mygrid:sequence_alignment_report)
  outputPara: Out(String, mygrid:nexus_paup_format)
  useResource: MoG
Matching rule: operation ij → operation mn if there exists a parameter k that is an output parameter of operation ij, and a parameter o that is an input parameter of operation mn, such that data type(o) = data type(k) and semantic type(o) = semantic type(k)
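The matching rule can be written out directly. The sketch below is a literal reading of the rule with strict equality on both types; the class and function names are hypothetical, comparing data types case-insensitively ("string" vs "String") is an assumption, and parameters with no semantic type are treated as unmatched.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Parameter:
    name: str
    data_type: str
    semantic_type: Optional[str]


@dataclass(frozen=True)
class Operation:
    service: str
    name: str
    inputs: Tuple[Parameter, ...]
    outputs: Tuple[Parameter, ...]


def matches(src: Operation, dst: Operation) -> List[Tuple[Parameter, Parameter]]:
    """Return (output k of src, input o of dst) pairs whose data and semantic types agree."""
    return [
        (k, o)
        for k in src.outputs
        for o in dst.inputs
        if k.data_type.lower() == o.data_type.lower()   # assumption: case-insensitive
        and k.semantic_type is not None
        and k.semantic_type == o.semantic_type
    ]


query_local = Operation(
    "QueryLocal", "createSet",
    inputs=(Parameter("Settype", "String", "mog:gene"),
            Parameter("Queryterm", "String", None)),
    outputs=(Parameter("Setid", "string", "mog:geneset"),),
)
clustalw = Operation(
    "ClustalW", "runClustalWdf",
    inputs=(Parameter("Setid", "String", "mog:set"),
            Parameter("Sequencetype", "String", "mog:sequence")),
    outputs=(Parameter("filen", "string", "mygrid:sequence_alignment_report"),),
)
convert = Operation(
    "FormatConversion", "convert",
    inputs=(Parameter("filen", "String", "mygrid:sequence_alignment_report"),),
    outputs=(Parameter("Out", "String", "mygrid:nexus_paup_format"),),
)

print(len(matches(clustalw, convert)))      # semantic types are identical
print(len(matches(query_local, clustalw)))  # mog:geneset vs mog:set differ under strict equality
```

Under strict equality, ClustalW connects to FormatConversion, but QueryLocal does not connect to ClustalW because `mog:geneset` and `mog:set` are different labels. Accepting that pair would need a subclass check in the ontology, which this sketch does not attempt.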

38 Need for verified service connectivity
The mismatching problem: match detection output (yes/no) compared with the real match (yes/no)
- TP (match detected, real match): accurate annotation
- FP (match detected, no real match): inaccurate annotation, lack of semantic annotation, inaccurate reasoning
- FN (no match detected, real match): inaccurate annotation, lack of semantic annotation, inaccurate reasoning
- TN (no match detected, no real match): accurate annotation
Examples
- GenBankService (out: GenBank record) → Blastp (in: protein sequence): no match (X)
- DDBJ-XML (out: sequence data record) → NCBI blast (in: sequence data record): self-defined format vs fasta format; requires a mediator, adaptor, or shim
Diagram notes (FP, TN): may be detected by experts at design time or after a run; can be detected automatically

39 Connectivity graph implementation
Registration process
- Automatically identify the connectivity when a service is registered: identifyConnect(single service, RDF repository)
- Store the connectivity in the knowledge base: connect(service a, operation ai, parameter c, service b, operation bi, parameter d)
Workflow translation / service composition process
- Refine, update, decompose the workflow
- Search at the syntactic level: search for a path between two nodes, search for the next available service, automatic composition based on input and output
Implementation: Dijkstra's shortest path algorithm
Finding connectivity between services is converted to finding a path between two nodes in a graph
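Treating connectivity as a graph search can be sketched with Dijkstra's algorithm over an adjacency map. The function name, the node naming scheme (`Service.operation`), the unit edge weights, and the specific edges are assumptions for illustration; in particular, the first edge is assumed to be a connection already verified and stored in the knowledge base.

```python
import heapq
from typing import Dict, List, Optional

Graph = Dict[str, Dict[str, float]]


def shortest_path(graph: Graph, source: str, target: str) -> Optional[List[str]]:
    """Dijkstra's algorithm; returns the node sequence or None if target is unreachable."""
    dist: Dict[str, float] = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nxt, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(nxt, float("inf")):
                dist[nxt] = candidate
                prev[nxt] = node
                heapq.heappush(heap, (candidate, nxt))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical connectivity graph: an edge means "an output of A can feed an input of B".
connectivity: Graph = {
    "QueryLocal.createSet": {"ClustalW.runClustalWdf": 1.0},
    "ClustalW.runClustalWdf": {"FormatConversion.convert": 1.0},
}

print(shortest_path(connectivity, "QueryLocal.createSet", "FormatConversion.convert"))
print(shortest_path(connectivity, "FormatConversion.convert", "QueryLocal.createSet"))
```

With unit weights this reduces to a breadth-first search; weights become useful if edges carry a cost such as service quality or latency.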

40 Ontological modules used for semantic description of data, services & workflows
- Generic Service Description Ontology (myGrid/Feta model)
- Service Domain Ontology (myGrid)
- MoGServ Application Domain Ontology (MoGServ)
Diagram labels: Data; Services; Workflows; Software components for annotation; RDF store

41 MoGServ Application Domain Ontology
- To better track the data origination
- To support the automation of workflow creation
- To better share the data on the web in the future
Example concepts and properties defined in MoGServ (property: domain, range)
- invokedby: Job, User
- isParentOf: Set
- isInstanceOf: Job, Service
- hasSetName: Set, XML:String
Ontological modules (number of concepts, number of object properties, number of datatype properties; the digits run together in the transcript)
- MoGServ: 1297
- myGrid: 4198
- myGrid/Feta model: 261117

42 Sample service/workflow annotation
Question: Which service has an operation that accepts nucleotide_sequence as a parameter?
Answer:
  Uri: http://www.ebi.ac.uk …/alignment:blastn_ncbi
  OperationName: Run
Displayed by RDF-Gravity

43 Implementation of annotation and query components for data, services & workflows
Sesame 1.2.6 library
- Supports files, RDBMS, SeRQL
Diagram labels: Sesame RDF store; Annotation templates (data); Annotation templates (service); Annotation components; Query templates; Query components
SeRQL query template:
  Select Y, W, X
  from {Y} mg:hasOperation {W} mg:inputParameter {X} rdf:type {mog:set}
  using namespace rdf =, mg =, mog =
Result:
  Service: http:host.cse.nd.edu/axis/services/ClustalW?wsdl
  Operation: runClustalWdf
  inputParameter: setid

44 Experiment
- Used 418 concepts from the domain ontology for semantic types; defined 10 concepts for data types
- Randomly generated service annotations, each with 1 input and 1 output
- Test machine: Intel Pentium mobile, 1.5 GHz
Services | Matched pairs | Load RDF repository (ms) | Average match detection time per service (ms)
200      | 10            | 1547                     | 12.02
400      | 34            | 2346                     | 13.01
600      | 84            | 2600                     | 12.31
800      | 138           | 3015                     | 12.35
1000     | 225           | 3325                     | 12.51
Connectivity graph for 1000 services
- Number of nodes: 724
- Number of arcs: 587
- Average path search time: less than 1 ms
- Connectivity graph load time: 220 ms
- Paths by length: length 0 = 724, length 1 = 587, length 2 = 448, length 3 = 281, length 4 = 114, length 5 = 71, length 6 = 28, length 7 = 16, length 8 = 4, length 9 = 2
Conclusion: feasible solution.

45 Reuse of workflows
- Reuse of abstract workflows
- Reuse of concrete workflows
Compare the structural similarity of two workflows
- Implementation: SUBDUE algorithm
- SUBDUE has a graph match utility that is part of its data mining system
- A given workflow is converted to a graph and fed to the SUBDUE match algorithm
Abstract example, graph view:
  task (retrieving) hasNext task (aligning)
  task (retrieving) hasInput input; input hasParameter query_term
  task (aligning) hasOutput output; output hasParameter multiple_alignment_report
Abstract example, SUBDUE input format:
  v 1 input
  v 2 output
  v 3 task
  v 4 task
  v 5 query_term
  v 6 retrieving
  v 7 aligning
  v 8 multiple_aligning_report
  e 3 4 hasNext
  e 3 1 hasInput
  e 4 2 hasOutput
  e 3 6 performTask
  e 4 7 performTask
  e 1 5 hasParameter
  e 2 8 hasParameter
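Converting a workflow graph into the vertex and edge lines shown on the slide can be sketched as follows. The function name and the short vertex keys (`t1`, `rep`, and so on) are hypothetical; the labels and edges reproduce the slide's example, including its `multiple_aligning_report` label. The `v`/`e` line layout mirrors the slide rather than any formal specification of SUBDUE's file format.

```python
from typing import List, Sequence, Tuple


def to_subdue(vertices: Sequence[Tuple[str, str]],
              edges: Sequence[Tuple[str, str, str]]) -> List[str]:
    """Emit 'v <id> <label>' and 'e <src> <dst> <label>' lines, numbering vertices from 1."""
    number = {key: i for i, (key, _label) in enumerate(vertices, start=1)}
    lines = [f"v {number[key]} {label}" for key, label in vertices]
    lines += [f"e {number[src]} {number[dst]} {label}" for src, dst, label in edges]
    return lines


# Keys are local identifiers; two vertices may share the label "task".
vertices = [
    ("in", "input"), ("out", "output"),
    ("t1", "task"), ("t2", "task"),
    ("q", "query_term"), ("r", "retrieving"), ("a", "aligning"),
    ("rep", "multiple_aligning_report"),
]
edges = [
    ("t1", "t2", "hasNext"),
    ("t1", "in", "hasInput"),
    ("t2", "out", "hasOutput"),
    ("t1", "r", "performTask"),
    ("t2", "a", "performTask"),
    ("in", "q", "hasParameter"),
    ("out", "rep", "hasParameter"),
]

subdue_lines = to_subdue(vertices, edges)
print("\n".join(subdue_lines))
```

Separating keys from labels matters because structural matching compares labels, while the numbering only has to be unique within one graph.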

46 Conclusion
Pro
- Increases the correctness of the formed workflow over time: avoids incorrect or inaccurate semantic annotations, takes advantage of verified knowledge, avoids the ontological reasoning process
- Better support for semi-automated and automated service composition over time: provides more accurate guidelines to users over time
Con
- The connectivity graph can be big, depending on the number of parameters and the number of services
- Searching for the connectivity of a service when it is registered in the system may take a relatively long time, depending on the complexity of the matching rule and the number of parameters
- May not have high accuracy at the beginning

47 Future work
- Integrate GridSAM into MoGServ for execution and monitoring
- Integrate Grid computing technology for resource allocation
- Refine the MoGServ application domain ontology
- Create an interface for end-user workflow creation
- Create an interface for individual workspaces
- Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services

48 Thank you. Questions?

