PANACEA WP3: The Platform. WP participants: UPF, ILC, ILSP, LG, DCU, ELDA. Final Annual Review, 19th February 2013. Marc Poch, UPF


1 PANACEA WP3: The Platform. WP participants: UPF, ILC, ILSP, LG, DCU, ELDA. Final Annual Review, 19th February 2013. Marc Poch, UPF (marc.pochriera@upf.edu)

2 Summary
Objectives
Platform components / Demo
Achievements
–Functional platform
–Interoperability: Travelling Object, Common Interfaces, format converters, etc.
–Scalability
WP7 Evaluation
Conclusions and future work

3 Objectives
Development of a platform (a space of interoperability defined by standardized protocols and common interfaces) for the easy integration of a variety of software components, tools and methodologies deployed as web services to configure a factory for the automation of acquisition, processing and annotation of language resources.
WP3.1 (T1-T6) Architecture and design of the platform
WP3.2 (T15-T30) Workflow editor and engine
WP3.3 (T7-T30) Common interfaces, middleware and temporary files, journaling, etc.
WP3.4 (T15-T30) The Registry
WP3.5 (T7-T30) Deployment of web services of the components supplied by WP4 to WP6

4 From local tools to sharing workflows
Tools to be integrated
Web Service wrapper
The Registry
Common Interfaces
Format Converters
Workflow editor and engine
Sharing workflows

5 Platform definition
The PANACEA platform is an interoperability space based on tools, guidelines, a Common Interface definition, and a “Travelling Object” specification.
Version 3:
Tools: Taverna, BioCatalogue, myExperiment, Soaplab, storage system
Common Interface: WS interoperability
Travelling Object: XCES, GrAF, CoNLL, LMF
Documentation
Formal definition
Technical definition

6 Platform tools and portals
Web Services: share tools (remotely run distributed tools). SOAP or REST. Soaplab; JAX-WS, Axis, CXF, etc.
Registry: share and find Web Services. BioCatalogue. PANACEA Registry: registry.elda.org
Workflows: call / chain Web Services. Taverna: www.taverna.org.uk
Social Network: share and find workflows. myExperiment. PANACEA myExperiment: myexperiment.elda.org
Clients: Java, Python, Perl, etc.
PANACEA Platform: uses, adapts and improves myGrid tools for eScience (used in biology, social science, music, astronomy, multimedia and chemistry).

7 Technological option: Web Services
SOAPLAB 2 (SOAP)
Easy deployment of command line tools as WS (Java, Python, C++, UIMA, etc.)
Clients: Java, Python, Perl, Taverna, etc.
No coding needed! Only metadata
“Polling” techniques for long lasting tasks
Web form to run the web services
URL input / output ready
PANACEA improvement for SOAP messaging (network usage and memory)
PANACEA limit on parallel requests from multiple users
Diagram: Web Services (Soaplab 2, SOAP), Registry (BioCatalogue), Workflow editor (Taverna), Social network (myExperiment)
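The “polling” idea for long lasting tasks can be sketched as below. This is a minimal illustration, not Soaplab's client API: `poll_until_done`, the `get_status` callable and the status strings are hypothetical names, and the interval, backoff and maximum-interval parameters only mirror the kind of settings a workflow client exposes.

```python
import time
from typing import Callable


def poll_until_done(get_status: Callable[[], str],
                    interval_s: float = 2.0,
                    backoff: float = 1.0,
                    max_interval_s: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Ask a long-running job for its state until it finishes.

    Returns the number of status checks that were needed.
    `get_status` is a hypothetical callable that queries the service.
    """
    checks = 0
    wait = interval_s
    while True:
        checks += 1
        status = get_status()
        if status == "COMPLETED":
            return checks
        if status == "FAILED":
            raise RuntimeError("job failed on the server")
        # The job is still running: wait, then lengthen the wait up to the cap.
        sleep(wait)
        wait = min(wait * backoff, max_interval_s)
```

Because the client only sends short status requests, no single HTTP call has to stay open for the whole duration of the job, which is what avoids client timeouts.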

8 Technological option: Registry
User friendly GUI
Free, open source, continuously maintained
Search function
User ratings (user feedback)
Service annotations and language categorization (PANACEA)
Monitoring system (web service status and data results)
Monitoring statuses: Passed, Warning, Failed, Unchecked
Diagram: Web Services (Soaplab 2, SOAP), Registry (BioCatalogue), Workflow editor (Taverna), Social network (myExperiment)

9 Technological option: Taverna
User friendly GUI
Free and open source
Continuously maintained (v. 2.4)
SOAP and REST web services
Credentials manager (passwords, certificates, etc.)
Multiple file processing (“lists”)
PANACEA workflows, best practices, videos, etc.: parallelization, error recovery (“retries”), polling
PANACEA collaboration: bug fixing and pre-release tests
Diagram: Web Services (Soaplab 2, SOAP), Registry (BioCatalogue), Workflow editor (Taverna), Social network (myExperiment)

10 PANACEA

11 Demos
Previous Review:
–PANACEA Registry / PANACEA myExperiment
–Run Web Services and Workflows
–Design and merging of workflows in Taverna
Final Review: Specific examples
–Creation of a bilingual dictionary
–Twitter NLP
–Web cleaner and anonymizer
–PANACEA Registry / PANACEA myExperiment

12 Demos I
Creation of a bilingual dictionary
–http://myexperiment.elda.org/workflows/93
–Input: pairs of Basic XCES documents
English: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/1.xml
French: http://nlp.ilsp.gr/panacea/Bilingual/data/20101222/LAB_EN_FR/www.ilo.org/191.xml
1. Sentence alignment: Hunalign (3rd party tool) (Interoperability)
2. PoS tagging: Treetagger (3rd party tool) (Interoperability)
3. Build phrase tables: Moses (3rd party tool) (Interoperability)
4. Bilingual dictionary extractor
Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_bilingual_dictionary_extraction_v01.mp4

13 Demos II
Twitter NLP + Registry (3rd party tool)
This web service is based on the Twitter NLP tool developed by Noah's ARK, Noah Smith's research group at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University.
1. Search for the WS in the Registry
2. Check the monitoring system
3. Use the web client with example data

14 Demos III
Web cleaner and anonymizer
http://myexperiment.elda.org/workflows/98
Input: a list of URLs to process
–Example: a web article from www.fifa.com
1. ILSP Web cleaner and text extractor WS
2. UPF Anonymizer WS
–Internally calls Freeling NER WS (3rd party tool) (Interoperability)
Video: http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_web_cleaner_and_anonymization_v01.mp4

15 WP3 Achievements
Functional and Operational Platform
–Multiple tools, web portals and features
–Ready to use
–Usability
–Real users
Interoperability
–Common Interfaces
–Travelling Object
–3rd party tools integration
–Format converters
Scalability
–Web service scalability: long lasting tasks
–Workflow design optimization: robustness
–Machine resources: handling parallel requests

16 Functional and Operational Platform
PANACEA Registry
–157 web services (PANACEA WS benefits: WS are easy to deploy, low maintenance cost)
–More than 1300 annotations (Usability / Doc.)
–A cloud of 164 tags
–Monitoring system: WS up and running 94.82% since their deployment (97%) (Availability)
PANACEA myExperiment
–74 shared workflows
Storage System (Usability)

17 Functional and Operational Platform: Tutorials and Documentation
Tutorials
–Specific and general tutorials
–More than 12 videos (Usability)
–Frequently Asked Questions
Documentation
–Registry annotations, tags and categories
–Common Interfaces documentation: xml, web, etc.
–Travelling Objects documentation

18 Functional and Operational Platform: Users
WP7 Validators
Linguatech (WP8)
Qualia (Business intelligence)
CNGL (Centre for Next Generation Localisation)
INCYTA (Translation)
Master's and PhD students make use of the PANACEA platform
http://ws02.iula.upf.edu/panacea/statistics/upf-statistics.html

19 Interoperability
WS advantages for users:
–No installation
–No maintenance
–No machine resources
–Easily found on the Registry
–Usability
–Can be combined in workflows (share experiments)
–And more...
How to chain them to create workflows?

20 Interoperability
Three levels of interoperability:
–COMMUNICATION PROTOCOLS: SOAP, REST
–DATA
–PARAMETERS
Diagram: Format N / Tool A, Format M / Tool B, Format L / Tool C: Tool B does not “understand” format N! With Format N shared by Tool A, Tool B and Tool C, all tools understand the previous format.
Diagram: Tool A (ABCD) and Tool B (ABCD) versus Tool A (YTQZ) and Tool B (ABCD).
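The data-level problem on this slide (Tool B does not “understand” the format produced by Tool A) can be sketched as a small chain planner. This is only an illustration under stated assumptions: the `Tool` class, `plan_chain` and the converter table are hypothetical names and are not part of the PANACEA platform.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Tool:
    name: str
    input_format: str
    output_format: str


def plan_chain(tools: List[Tool],
               converters: Dict[Tuple[str, str], str]) -> List[str]:
    """Return the ordered step names needed to run `tools` one after another.

    A converter step is inserted wherever the previous output format differs
    from the next tool's input format. A ValueError is raised when no
    converter is known for that pair of formats.
    """
    steps: List[str] = []
    for previous, current in zip([None] + tools[:-1], tools):
        if previous is not None and previous.output_format != current.input_format:
            key = (previous.output_format, current.input_format)
            if key not in converters:
                raise ValueError(
                    f"{current.name} does not understand {previous.output_format}")
            steps.append(converters[key])
        steps.append(current.name)
    return steps
```

With a shared format every tool can read the previous output directly and the plan contains no converter steps; with tool-specific formats each mismatch costs one extra conversion step or breaks the chain.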

21 Common Interface
A Common Interface (CI) defines the mandatory parameters for every functionality:
http://panacea-lr.eu/en/info-for-professionals/documents/
http://registry.elda.org
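A check of mandatory parameters per functionality could look like the sketch below. The functionality names and parameter names in `COMMON_INTERFACE` are invented for illustration; the actual Common Interface definitions are in the documents linked on the slide.

```python
from typing import Dict, Mapping, Set

# Hypothetical table: functionality -> mandatory parameter names.
# These entries are assumptions, not the published Common Interface.
COMMON_INTERFACE: Dict[str, Set[str]] = {
    "tokenization": {"input", "language"},
    "pos_tagging": {"input", "language"},
    "sentence_alignment": {"source_input", "target_input"},
}


def missing_parameters(functionality: str,
                       request: Mapping[str, str]) -> Set[str]:
    """Return the mandatory parameters that `request` does not supply."""
    mandatory = COMMON_INTERFACE[functionality]
    return mandatory - set(request)
```

If two services that offer the same functionality accept the same mandatory parameter names, a workflow can swap one for the other without rewiring its inputs.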

22 Travelling Object
The Travelling Object (TO) is the common data and metadata format used in PANACEA to make components understand each other. (Interoperability)
TO1 is the minimal common vertical in-line format, based on the XCES standard, used by the deployed tools since the first version of the platform
TO2 uses the GrAF standard: the Graph Annotation Format (Ide and Suderman, 2007) is the XML serialization of LAF (ISO 24612, 2009)
LMF for lexical resources
CoNLL for parsers
Converters and adapted WS outputs

23 Format Converters
31 format converters on the PANACEA Registry
Freeling to TO. CNR http://registry.elda.org/services/207
KAF to TO. CNR http://registry.elda.org/services/208
Basic Xces to txt. CNR http://registry.elda.org/services/209
PoS tag. (Freeling treetagger) to GrAF. UPF http://registry.elda.org/services/142
Dependency parsing (Freeling) to GrAF. UPF http://registry.elda.org/services/197
Dependency CoNLL to GrAF. CNR http://registry.elda.org/services/254
Word doc to txt. UPF http://registry.elda.org/services/112
In-house mwe to LMF. CNR http://registry.elda.org/services/296
Pdf to text. UPF http://registry.elda.org/services/116
Multi. encodings converter (ISO, UTF, etc.). UPF http://registry.elda.org/services/114
Aligner to TO. DCU http://registry.elda.org/services/69
Sentence alignment to TMX. DCU http://registry.elda.org/services/219
Treetagger to MOSES. DCU http://registry.elda.org/services/275
UIMA to GrAF. ILSP http://registry.elda.org/services/182
METASHARE metadata generators http://myexperiment.elda.org/workflows/96

24 3rd party tools integration
PANACEA WS wrapper (Soaplab) and the CI make it easy for WS providers to integrate 3rd party tools.
ILSP tools are UIMA tools: UIMA
Freeling: UPC
Treetagger: University of Stuttgart
Twitter NLP: Carnegie Mellon University
MALT Parser: Uppsala University
DeSR: Università di Pisa
MOSES / Giza++
DELiC4MT (MT evaluation): DCU
Berkeley tagger, parser, aligner: University of California, Berkeley

25 Web Services Scalability
Web services are being deployed using Soaplab 2.3.2:
–Service providers only need to use metadata (ACD) files (Usability)
–Web client application to test WSs: Spinet (Usability)
–PANACEA developers have been in contact with Soaplab developers (Collaboration)
–SOAP protocol standard (Interoperability)
WS can be called from Taverna or other workflow editors
WS can be called with many programming languages: Python, Perl, Ruby, Java, etc.
–Soaplab polling to avoid client timeouts (Scalability)
–PANACEA improvements (Scalability)
Parallel request limit system
SOAP messaging optimization
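The idea behind a parallel request limit can be sketched with a semaphore that caps how many requests run at once. This is an assumption-laden illustration, not the PANACEA implementation: `RequestLimiter` and `process_all` are hypothetical names, and the real limit is applied inside the web service deployment rather than in client code like this.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


class RequestLimiter:
    """Allow at most `max_parallel` tasks to run at the same time."""

    def __init__(self, max_parallel: int) -> None:
        self._slots = threading.BoundedSemaphore(max_parallel)
        self._lock = threading.Lock()
        self.active = 0   # tasks currently running
        self.peak = 0     # highest number of simultaneous tasks observed

    def run(self, task: Callable, *args):
        # Block here until one of the slots is free.
        with self._slots:
            with self._lock:
                self.active += 1
                self.peak = max(self.peak, self.active)
            try:
                return task(*args)
            finally:
                with self._lock:
                    self.active -= 1


def process_all(limiter: RequestLimiter, task: Callable,
                items: Iterable, workers: int = 10) -> List:
    """Submit every item from a larger worker pool; the limiter caps concurrency."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: limiter.run(task, item), items))
```

Requests beyond the limit wait for a free slot instead of being rejected, so a burst from several users slows down gracefully rather than exhausting the machine.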

26 The Registry
The PANACEA Registry is a BioCatalogue instance (the source code has been used to deploy the registry on a server)
Features:
Annotation capabilities and categorization
Search function
Automatic status check system for web services
Modifications:
–Logos, colours, categorization system, Spinet link improved (Usability)
PANACEA developers are in contact with BioCatalogue developers for bug reporting, error fixing, etc., with mutual benefit:
–Specific WSDL type registering bug solution
–Improve performance
Monitoring statuses: Passed, Warning, Failed, Unchecked
http://registry.elda.org

27 Workflow design optimization: Robustness
Building workflows with Taverna
–Version 2.4.2 (Scalability)
–Polling (Soaplab): long lasting web service calls without timeouts (Scalability)
–Retries (Scalability)
–Parallelization (Scalability)
–Tutorials and videos (Usability)
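The “retries” setting can be pictured as the loop below: a failed call is repeated after a delay that grows by a factor up to a maximum. This is a generic sketch, not Taverna's code; `call_with_retries` and its parameter names are hypothetical and only echo the retry settings (number of retries, initial delay, maximum delay, factor) that a workflow engine lets the designer configure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(call: Callable[[], T],
                      max_retries: int = 3,
                      initial_delay_s: float = 1.0,
                      max_delay_s: float = 30.0,
                      factor: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`; on failure wait and try again, up to `max_retries` extra times."""
    delay = initial_delay_s
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: report the last error
            sleep(delay)
            delay = min(delay * factor, max_delay_s)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

In a run over thousands of documents a transient failure then costs one short wait for one document instead of aborting the whole workflow.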

28 myExperiment
The PANACEA myExperiment is a myExperiment instance (the source code has been used to deploy it on a server)
Features:
Annotation capabilities
Search function
“Services tab beta” added to PANACEA myExperiment. Users can list web services from the Registry and see in which workflows they have been used.
Modifications:
–Logos, colours and presentation
PANACEA developers are in contact with myExperiment developers for bug reporting, error fixing, etc. (Collaboration):
–Improve performance
–Database problems fixed

29 Machine Resources: handling parallel requests
Parallelization level 3 (3 parallel requests per service * 2 services = 6 concurrent requests)
Workflow name: Freeling_tagging_for_crawled_data_with_output_download
File: massive_freeling_for_crawled_data_v11_download.t2flow
myExperiment URL: http://myexperiment.elda.org/workflows/32
Taverna: 2.4.0 workbench
VM | Cores | RAM | HD
iula04 (UPF) | 4 | 8 | 40GB (SAS)
WS | parall. | poll. int., poll. backoff, poll. max int., retries, ini. delay, max, factor
WS1: python_preprocess + freeling_tagging + python_postprocessing | 3 | 20001100002500015000020
WS2: postagger_to_xces_converter | 3 | 20001100002500015000020
corpus | list file | urls | url example | Tokens
MCv2 | LAB_ES_list.sorted.txt | 13188 | http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml | 61 M
Name | Status | Queued it. | It. done | It. w/error | Average time/it.
Freeling_tagging_for_crawled_data_with_output_download | Finished | - | - | - | 5.2 h
download_dataUrl | Finished | 0 | 13188 | 0 | 31 ms
freeling_tagging | Finished | 0 | 13188 | 5 | 4.2 s
postagger_to_xces_converter | Finished | 0 | 13188 | 0 | 4.1 s

30 Machine Resources: handling parallel requests
Parallelization level 10 (10 parallel requests per service * 2 services = 20 concurrent requests)
Workflow name: Freeling_tagging_for_crawled_data_with_output_download
File: massive_freeling_for_crawled_data_v11_download.t2flow
myExperiment URL: http://myexperiment.elda.org/workflows/32
Taverna: 2.4.0 workbench
VM | Cores | RAM | HD
iula04 (UPF) | 4 | 8 | 40GB (SAS)
WS | parall. | poll. int., poll. backoff, poll. max int., retries, ini. delay, max, factor
WS1: python_preprocess + freeling_tagging + python_postprocessing | 10 | 20001100002500015000020
WS2: postagger_to_xces_converter | 10 | 20001100002500015000020
corpus | list file | urls | url example | Tokens
MCv2 | LAB_ES_list.sorted.txt | 13188 | http://nlp.ilsp.gr/panacea/D4.3/data/201109/LAB_ES/1.xml | 61 M
Name | Status | Queued it. | It. done | It. w/error | Average time/it.
Freeling_tagging_for_crawled_data_with_output_download | Finished | - | - | - | 2.2 h
download_dataUrl | Finished | 0 | 13188 | 0 | 29 ms
freeling_tagging | Finished | 0 | 13188 | 5 | 5.9 s
postagger_to_xces_converter | Finished | 0 | 13188 | 0 | 4.8 s
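The arithmetic behind the two runs (parallelization level 3 on the previous slide, level 10 here) can be worked through as below. The function names are only for illustration; the inputs are the figures reported on the slides (13188 URLs, 5.2 h and 2.2 h total run time), and the derived values are rounded.

```python
def concurrent_requests(parallel_per_service: int, services: int) -> int:
    """Total simultaneous requests when every service gets the same level."""
    return parallel_per_service * services


def speedup(baseline_hours: float, improved_hours: float) -> float:
    """How many times faster the improved run finished."""
    return baseline_hours / improved_hours


def seconds_per_document(total_hours: float, documents: int) -> float:
    """Wall-clock time per document, averaged over the whole run."""
    return total_hours * 3600 / documents


print(concurrent_requests(3, 2))                   # 6
print(concurrent_requests(10, 2))                  # 20
print(round(speedup(5.2, 2.2), 2))                 # 2.36
print(round(seconds_per_document(5.2, 13188), 2))  # 1.42
print(round(seconds_per_document(2.2, 13188), 2))  # 0.6
```

Going from 6 to 20 concurrent requests shortened the run by a factor of about 2.4 rather than 3.3, which is consistent with the slightly higher per-iteration times reported for the level 10 run on the same 4-core machine.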

31 Machine Resources: handling parallel requests
From 1x to 10x experiment
http://ws02.iula.upf.edu/panacea/examples/videos/Panacea_parallelization_scalability_v01.mp4
–Two Taverna instances running at the same time
–100 documents to be processed
–1 workflow with NO parallelization / the other with 10x
–The same server: ws04 with 8GB RAM and 4 CPUs
More resources > more parallel requests

32 Machine Resources: handling parallel requests
Conclusions:
–PANACEA fulfils the large data scalability goal (Scalability)
–Requirements:
Robust WS deployment: Soaplab (with PANACEA improvements) or other robust frameworks
Taverna 2.4
Workflow design must follow the PANACEA massive data tutorial (retries, polling, etc.)
The architecture is highly scalable: growth is just a matter of resources
Statistics
Typical PANACEA server: 2 - 4 cores, 4 - 8 GB RAM, 30 - 100 GB HDD: 100 Freeling WS parallel requests
EMBL-EBI (European Bioinformatics Institute in Cambridge): 200 servers, 2000 cores, server request balancing software, etc.: more than 50000 Freeling WS parallel requests

33 WP7 Evaluation

34 Conclusions
Functional platform
–Web services software
–Registry / myExperiment
Usability for users and providers
Interoperability:
–Data formats
–Common Interfaces
Tutorials and Documentation
Scalability

35 The future
Authentication Web Services (Business opportunity)
–Institutions and companies can sell their services and/or machine resources
Automatically build workflows (Usability and interoperability)
–Based on input data and user desired output, etc.
Data visualization tools / widgets (Usability)
Improve total throughput (Scalability)
–With more machine resources we can achieve faster experiment results
–Software optimization: task splitting and parallelization
Publications with experiments (Research)
–Researchers could link their publications to real experiments (WS, workflows, data, etc.)
–Fostering research by making experiments easily replicable
–Improved experiments: more data, more machine resources, faster results, etc.

36 Thank you. Questions?

