CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration Elmer Garduno 2, Zi Yang 1, Yan Fang 3, Avner Maiberg 1, Collin McCormack.

Slides:



Advertisements
Similar presentations
Improving Learning Object Description Mechanisms to Support an Integrated Framework for Ubiquitous Learning Scenarios María Felisa Verdejo Carlos Celorrio.
Advertisements

Designing Services for Grid-based Knowledge Discovery A. Congiusta, A. Pugliese, Domenico Talia, P. Trunfio DEIS University of Calabria ITALY
Software Engineering CSE470: Process 15 Software Engineering Phases Definition: What? Development: How? Maintenance: Managing change Umbrella Activities:
Key-word Driven Automation Framework Shiva Kumar Soumya Dalvi May 25, 2007.
Test Automation An Approach to Automated Software Regression Testing Presented by Adnet, Inc Feb 2015.
Spark: Cluster Computing with Working Sets
 delivers evidence that a solution developed achieves the purpose for which it was designed.  The purpose of evaluation is to demonstrate the utility,
Roadmap to Continuous Integration Testing and Benefits Gowri Selka, Walgreens Natalie Koltun, Walgreens May 20th, 2014 ©2013 Walgreen Co. All rights reserved.
IBM Software Group ® Recommending Materialized Views and Indexes with the IBM DB2 Design Advisor (Automating Physical Database Design) Jarek Gryz.
Search Engines and Information Retrieval
Rutgers Components Phase 2 Principal investigators –Paul Kantor, PI; Design, modelling and analysis –Kwong Bor Ng, Co-PI - Fusion; Experimental design.
Fundamentals of Information Systems, Second Edition
+ Doing More with Less : Student Modeling and Performance Prediction with Reduced Content Models Yun Huang, University of Pittsburgh Yanbo Xu, Carnegie.
Methodology Conceptual Database Design
Lecture Nine Database Planning, Design, and Administration
Business Intelligence System September 2013 BI.
Chapter 2- Software Process Lecture 4. Software Engineering We have specified the problem domain – industrial strength software – Besides delivering the.
Regression testing Tor Stållhane. What is regression testing – 1 Regression testing is testing done to check that a system update does not re- introduce.
OpenEdge BPM. 2 Challenges Process implementation not documented Processes should be explicit – not buried within an application or handled thru “tribal.
Business Process Performance Prediction on a Tracked Simulation Model Andrei Solomon, Marin Litoiu– York University.
Database System Development Lifecycle © Pearson Education Limited 1995, 2005.
Petter Nielsen Information Systems/IFI/UiO 1 Software Prototyping.
Chapter 2 The process Process, Methods, and Tools
Dillon: CSE470: SE, Process1 Software Engineering Phases l Definition: What? l Development: How? l Maintenance: Managing change l Umbrella Activities:
Search Engines and Information Retrieval Chapter 1.
SWE 316: Software Design and Architecture – Dr. Khalid Aljasser Objectives Lecture 11 : Frameworks SWE 316: Software Design and Architecture  To understand.
The Challenge of IT-Business Alignment
Graph Data Management Lab, School of Computer Science gdm.fudan.edu.cn XMLSnippet: A Coding Assistant for XML Configuration Snippet.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
1 Minggu 9, Pertemuan 17 Database Planning, Design, and Administration Matakuliah: T0206-Sistem Basisdata Tahun: 2005 Versi: 1.0/0.0.
EMI INFSO-RI SA2 - Quality Assurance Alberto Aimar (CERN) SA2 Leader EMI First EC Review 22 June 2011, Brussels.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.
Delivering business value through Context Driven Content Management Karsten Fogh Ho-Lanng, CTO.
Carnegie Mellon School of Computer Science Copyright © 2001, Carnegie Mellon. All Rights Reserved. JAVELIN Project Briefing 1 AQUAINT Phase I Kickoff December.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
Making Watson Fast Daniel Brown HON111. Need for Watson to be fast to play Jeopardy successfully – All computations have to be done in a few seconds –
Combining GATE and UIMA Ian Roberts. University of Sheffield NLP 2 Overview Introduction to UIMA Comparison with GATE Mapping annotations between GATE.
“Isolating Failure Causes through Test Case Generation “ Jeremias Rößler Gordon Fraser Andreas Zeller Alessandro Orso Presented by John-Paul Ore.
CSCE 548 Secure Software Development Security Operations.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.
Reviews Crawler (Detection, Extraction & Analysis) FOSS Practicum By: Syed Ahmed & Rakhi Gupta April 28, 2010.
Using Symbolic PathFinder at NASA Corina Pãsãreanu Carnegie Mellon/NASA Ames.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
SWE 4743 Abstraction Richard Gesick. CSE Abstraction the mechanism and practice of abstraction reduces and factors out details so that one can.
UML (Unified Modeling Language)
MEMT: Multi-Engine Machine Translation Guided by Explicit Word Matching Alon Lavie Language Technologies Institute Carnegie Mellon University Joint work.
Generalized Point Based Value Iteration for Interactive POMDPs Prashant Doshi Dept. of Computer Science and AI Institute University of Georgia
Whole Test Suite Generation. Abstract Not all bugs lead to program crashes, and not always is there a formal specification to check the correctness of.
2005WICSA Towards an Operational Framework for Architectural Prototyping Klaus Marius Hansen Henrik Bærbak Christensen Author Department of Computer.
Combining GATE and UIMA Ian Roberts. 2 Overview Introduction to UIMA Comparison with GATE Mapping annotations between GATE and UIMA.
哈工大信息检索研究室 HITIR ’ s Update Summary at TAC2008 Extractive Content Selection Using Evolutionary Manifold-ranking and Spectral Clustering Reporter: Ph.d.
Laurea Triennale in Informatica – Corso di Ingegneria del Software I – A.A. 2006/2007 Andrea Polini XVIII. Software Testing.
Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit
Ian Bird, CERN WLCG Project Leader Amsterdam, 24 th January 2012.
Advanced Software Engineering Dr. Cheng
Curator: Self-Managing Storage for Enterprise Clusters
RECENT TRENDS IN METADATA GENERATION
Designing Software for Ease of Extension and Contraction
Rapid fire performance testing of 250 websites
An Adaptive Middleware for Supporting Time-Critical Event Response
DAT381 Team Development with SQL Server 2005
Copyright © JanBask Training. All rights reserved Top 10 Charming IT jobs that would be High in Demand in 2019.
Recommending Materialized Views and Indexes with the IBM DB2 Design Advisor (Automating Physical Database Design) Jarek Gryz.
Regression testing Tor Stållhane.
Scriptless Test Automation through Graphical User Interface
On the notion of Variability in Software Product Lines
L. Glimcher, R. Jin, G. Agrawal Presented by: Leo Glimcher
Presentation transcript:

CSE Framework: A UIMA-based Distributed System for Configuration Space Exploration Elmer Garduno 2, Zi Yang 1, Yan Fang 3, Avner Maiberg 1, Collin McCormack 4, Eric Nyberg 1 1) Carnegie Mellon University {ziy, amaiberg, 2) Sinnia 3) Oracle Corporation 4) The Boeing Company

Motivation Question Analysis Keywords Document Retrieval Corpus Docs Answer Extraction Answer candidates Answer Selection Which city in China has the largest number of foreign financial companies? Keywords: China largest foreign financial company Answer type: location (city) Answer candidates ScoreDocument extracted Beijing0.7AP Hong Kong0.65WSJ Shanghai0.64FBIS3-58 Taiwan0.5FT Shanghai0.4FBIS Document IDRank FBIS3-58 (relevant)1 AP WSJ FBIS (relevant)4 FT Answer Shanghai Typical QA Pipeline

CURRENT RESEARCH IN QA

What did we learn from Watson? QA systems can be fast enough, accurate enough, and confident enough to perform in the real world Key factors: – Scalable, parallel architecture – Agile, open advancement process Next big challenge: rapid domain adaptation

Automatic Optimization of QA for TREC Genomics Questions

Results of Automatic Optimization [ Yang, Z., Garduno, E., Fang, Y., Maiberg, A., McCormack, C. and Nyberg, E. (2013). “Building Optimal Information Systems Automatically: Configuration Space Exploration for Biomedical Information Systems”, Proceedings of the ACM Conference on Information and Knowledge Management ]

Automatically Building an Information System by Another Meta-System?

Building an Information System Automatically CSE framework

The benefit of CSE framework Accelerate the system development cycle by automating the component selection and tuning! Save cost! The benefit of CSE framework Accelerate the system development cycle by automating the component selection and tuning! Save cost! It requires Identify the tool, knowledge base, task algorithm candidates Provide information needs with known outcomes, e.g. answers to questions in the domain. It requires Identify the tool, knowledge base, task algorithm candidates Provide information needs with known outcomes, e.g. answers to questions in the domain.

CSE - FRAMEWORK

Definition: Phase An information system Phase t The processing unit as the t-th step in a process

Definition: Component, Configuration Inside phase t An instantiated processing unit in phase t

Definition: Trace An execution path that involves a single configured component for each phase

Exponential problem The number of traces grows exponentially with the phases and the number of components. Space should be pruned when possible to keep the space bounded.

Definition: Configuration space Pipeline Phase 1 Phase 2 Phase 3 Set of all configured components

UIMA - EXTENDED CONFIGURATION DESCRIPTOR

Extended Configuration Descriptor YAML format A simple yet complete configuration descriptor configuration: name: testqa-ziy-test author: ziy persistence-provider: inherit: jdbc.db.persistence-provider collection-reader: inherit: jdbc.db.collection-reader dataset: BIO-COMBINED sequence-start: 160 sequence-end: 187

Extended Configuration Descriptor Phases and components inherit configuration properties or are declared as classes. pipeline: - inherit: jdbc.cse.phase name: keyterm-extractor options: | - inherit: default.keyterm.default - inherit: default.keyterm.faster - inherit: jdbc.cse.phase name: retrieval-stategist options: | - inherit: default.retrieval.default - inherit: default.retrieval.better - inherit: jdbc.cse.phase name: passage-extractor options: | - class: cmu.edu.default.ie.Default

Component configuration class: edu.cmu.lti.oaqa.ecd.example.FirstPhaseAnnotatorA1 extract: true cross-opts param-a: [value100,value200] param-b: [value300,value400] This evaluates to the following Object[] param lists. [extract: true, param-a: value100, param-b: value300] [extract: true, param-a: value200, param-b: value300] [extract: true, param-a: value100, param-b: value400] [extract: true, param-a: value200, param-b: value400]

Extended Configuration Descriptor Evaluation metrics are pluggable, and can be specified at the local or global level. - inherit: jdbc.eval.cse-retrieval-aggregator-consumer - inherit: bioqa.eval.cse-passage-map-aggregator-consumer post-process: - inherit: jdbc.eval.cse-retrieval-evaluator-consumer - inherit: report.csv-report-generator builders: | - inherit: jdbc.report.f-measure-report-component - inherit: bioqa.eval.cse-passage-map-evaluator-consumer - inherit: report.csv-report-generator builders: | - inherit: bioqa.report.map-report-component

In-phase pipelines 21

IMPLEMENTATION

Implementation details Built on top of uimaFIT Combinatorial features are implemented using CAS Multiplier. CASes are persisted as compressed XMI – Once per trace at each phase. – Experiments can be restarted at any arbitrary point. Experimentation specific Type System Use UIMA-AS for external resources.

Distributed execution 24

Incremental improvement 25

Per trace visibility 26

Error analysis 27

Other domains:QA4MRE Question Answering for Machine Reading Configuration space: – 12 UIMA components were first developed – Replace UIMA descriptors with ECD CSE – 46 configurations – 1,040 combinations – 1,322 executions The best trace identified by CSE achieved 59.6% performance gain over the original pipeline. [Building Optimal Question Answering System Automatically using Configuration Space Exploration (CSE) for QA4MRE 2013 Tasks Alkesh Patel, Zi Yang, Eric Nyberg and Teruko Mitamura]

FUTURE WORK AND COLLABORATION

Future work Advanced Configuration Space exploration and pruning (Bagpipes Framework). Run arbitrary UIMA pipelines on top of industry grade distributed systems (Spark, Mesos, HDFS). Further investigation on space, time, resources constraining. Use differential CAS storage.

Collaboration

Thanks! 32