FlashExtract : A General Framework for Data Extraction by Examples

Presentation transcript:

FlashExtract: A General Framework for Data Extraction by Examples
Vu Le (UC Davis), Sumit Gulwani (MSR)

Hello everyone. My name is Vu Le, from UC Davis. Today I will talk about our work on FlashExtract, a framework that allows easy extraction of data from semi-structured documents using examples.

motivation

We have entered an information era where data can be found anywhere. Unfortunately, such data is often embedded and hidden in some format. To retrieve it, we computer scientists have to write Shell, Perl, or Python scripts, and this task can be tricky. If we computer scientists have a hard time parsing our data, what about biologists, financial analysts, teachers, students, you name it? It would be really great to have a technique that everybody can use, including those who are unable to program. And when it comes to user-friendliness, nothing can beat examples.

demo

Now let me show you a demo of how FlashExtract helps these people extract their data. This is the FlashExtract user interface. Now I open a file that we need to extract data from. This file stores customer information. Each customer record has a name, an address, and a phone number. Now suppose I want to extract the name and the city in this file. In FlashExtract, I can just highlight the field that I want to extract. Let’s extract the name. Now we extract the city. FlashExtract learns the city from only one example because it is able to exploit the one-to-one relationship between city and name.

Multiple lines: in FlashExtract, we can extract a field that spans multiple lines. If we select a field inside this multi-line field, the outer field is promoted to a struct. We can perform extraction in any order. For instance, we can select…

FlashExtract can be combined with other technologies to enable more sophisticated data manipulation. For instance, suppose we want to change the name in this file into Last Name and First Initial. …

The file that I just showed has a flat structure. We can use FlashExtract to perform extraction on more complicated, hierarchical data as well. This is a file that I collected from a user help forum. It contains a sequence of sample readings; each reading lists all chemicals and all of their properties. The user wants to extract the sample ID and, from each sample, the chemical and some of its characteristics.

schema extraction program
Output schema
Field extraction programs for all fields in the schema

We call a program that performs the whole extraction across multiple fields a schema extraction program. A schema extraction program consists of an output schema, which defines the structure of the output data, and a set of field extraction programs, each of which is responsible for extracting one individual field in the schema.

output schema
XML-like: sequence and structure
Seq([blue] Struct(Name: [green] String, City: [yellow] String))

An output schema is built using sequence and structure constructs. It is similar to XML. For example, this is the schema for the extraction of name and city. The schema is a sequence of blue structs. Each struct contains a green Name and a yellow City.
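
To make the sequence and structure constructs concrete, here is a minimal Python sketch of the schema shown on the slide. The class names `Leaf`, `Struct`, and `Seq` and the `render` helper are illustrative assumptions, not the data structures used by FlashExtract itself.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """A leaf field: a highlight color plus a primitive type name."""
    color: str
    type_name: str = "String"


@dataclass
class Struct:
    """A structure: a highlight color plus named child fields."""
    color: str
    fields: Dict[str, "Schema"]


@dataclass
class Seq:
    """A sequence whose elements all share one schema."""
    element: "Schema"


Schema = Union[Leaf, Struct, Seq]


def render(schema: Schema) -> str:
    """Print a schema in the notation used on the slide."""
    if isinstance(schema, Seq):
        return f"Seq({render(schema.element)})"
    if isinstance(schema, Struct):
        inner = ", ".join(
            f"{name}: {render(child)}" for name, child in schema.fields.items()
        )
        return f"[{schema.color}] Struct({inner})"
    return f"[{schema.color}] {schema.type_name}"


# The name/city schema from the slide (hypothetical encoding).
name_city = Seq(Struct("blue", {"Name": Leaf("green"), "City": Leaf("yellow")}))
print(render(name_city))
```

Nesting a `Seq` inside a `Struct` in the same way would describe hierarchical data such as the chemical readings from the demo.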

field extraction program
An ancestor
A program in the DSL
Examples: Green = <Blue, PRegion>, Yellow = <, PSeqRegion>

A field extraction program is responsible for extracting a field in the schema. It has two elements. The first is an ancestor: to extract a field, we must relate it to an ancestor, which defines a boundary for the learning of that field. The top-level ancestor is the entire file. The second element is the program in the underlying DSL that performs the extraction within the boundaries defined by the ancestor. For example, in the field extraction program of the Green (or name) field, the ancestor is the Blue field, and the program extracts a green region within the blue area. In the field extraction program of the Yellow (or city) field, the ancestor is the entire file, and the program extracts a sequence of yellow regions in the entire file.
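
The pairing of an ancestor with a program can be sketched as follows. This is only an illustration under stated assumptions: the sample document, the `FieldExtractionProgram` class, the `run_schema_extraction` helper, and the three hand-written lambdas are hypothetical stand-ins for programs that FlashExtract would learn from examples, and `None` is used here to denote the entire file.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical customer file: records separated by blank lines.
DOCUMENT = (
    "Ana Trujillo\n357 21st Place SE\nRedmond, WA\n(757) 555-1634\n"
    "\n"
    "Antonio Moreno\n515 93rd Lane\nRenton, WA\n(411) 555-2786\n"
)


@dataclass
class FieldExtractionProgram:
    ancestor: Optional[str]  # name of the ancestor field; None = entire file
    program: Callable        # region or sequence-of-regions extractor


def run_schema_extraction(
    fields: Dict[str, FieldExtractionProgram], document: str
) -> Dict[str, List[str]]:
    """Run every field program inside the boundary set by its ancestor.

    Assumes `fields` lists each ancestor before the fields that use it.
    """
    results: Dict[str, List[str]] = {}
    for name, fep in fields.items():
        if fep.ancestor is None:
            # Sequence program applied to the whole file.
            results[name] = fep.program(document)
        else:
            # Region program applied once per ancestor region.
            results[name] = [fep.program(r) for r in results[fep.ancestor]]
    return results


fields = {
    # Blue: one region per customer record.
    "Blue": FieldExtractionProgram(None, lambda doc: doc.strip().split("\n\n")),
    # Green = <Blue, PRegion>: the name is the first line of each record.
    "Green": FieldExtractionProgram("Blue", lambda rec: rec.splitlines()[0]),
    # Yellow = <entire file, PSeqRegion>: cities found across the whole file.
    "Yellow": FieldExtractionProgram(
        None,
        lambda doc: [l[: -len(", WA")] for l in doc.splitlines() if l.endswith(", WA")],
    ),
}

results = run_schema_extraction(fields, DOCUMENT)
print(results["Green"], results["Yellow"])
```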

data extraction DSL
DSL is a tuple (G, N1, N2)
G : grammar defining extraction strategies
N1 : top-level SeqRegion nonterminal
N2 : top-level Region nonterminal
Each nonterminal has a learn method

Now let’s talk about the data extraction DSL that enables field extraction. A data extraction DSL is a tuple of three elements: a grammar G that defines extraction strategies, a nonterminal N1 that extracts a sequence of regions, and a nonterminal N2 that extracts a single region. Each nonterminal in the DSL is associated with a learn method, and these methods can inductively call each other.

core algebra
Decomposable Map Operator Filter Operators Merge Operator Pair Operator

We use the following core operators to build the DSL. Please refer to the paper for their formal descriptions. Let me give you an example that illustrates their usage.

city example

I will demonstrate the extraction of the Yellow City field.

city example
Filter lines that end with “WA”

In the first step, we use a filter construct to select lines that end with “WA”.

city example
Filter lines that end with “WA”
Map each selected line to a pair of positions

We then map each of the selected lines to a pair of positions, which delimit the city in that line.

city example
Filter lines that end with “WA”
Map each selected line to a pair of positions
Learn two leaf exprs for the two positions

We then learn two expressions for the two positions. These leaf expressions are built on top of regular expressions.
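
The three steps above can be sketched in Python. This is a hand-written illustration, not a program produced by FlashExtract: the sample document, the helper names `pos_before` and `pos_after`, and the two regular expressions are assumptions, and positions are taken relative to each line for simplicity.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical customer file in the spirit of the demo.
DOCUMENT = (
    "Ana Trujillo\n357 21st Place SE\nRedmond, WA\n(757) 555-1634\n"
    "\n"
    "Antonio Moreno\n515 93rd Lane\nRenton, WA\n(411) 555-2786\n"
)


def pos_before(pattern: str) -> Callable[[str], int]:
    """Leaf expression: the position where the first match of `pattern` starts."""
    return lambda line: re.search(pattern, line).start()


def pos_after(pattern: str) -> Callable[[str], int]:
    """Leaf expression: the position where the first match of `pattern` ends."""
    return lambda line: re.search(pattern, line).end()


# Two leaf expressions, one per position of the pair.
start_expr = pos_after(r"^")      # start of the line
end_expr = pos_before(r", WA$")   # just before the trailing ", WA"


def extract_cities(document: str) -> List[str]:
    # Step 1: filter the lines that end with "WA".
    selected = [line for line in document.splitlines() if line.endswith("WA")]
    # Step 2: map each selected line to a pair of positions.
    pairs: List[Tuple[int, int]] = [
        (start_expr(line), end_expr(line)) for line in selected
    ]
    # Step 3: each pair delimits the city region within its line.
    return [line[s:e] for line, (s, e) in zip(selected, pairs)]


cities = extract_cities(DOCUMENT)
print(cities)
```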

learning algorithm
Inductive on the grammar structure
Learn city = learn a map operator
The lines that hold the city
The pair that identifies the city within a line

At a high level, our learning algorithm is inductive on the grammar structure. For example, to learn the city field, we learn a map operator, which specifies the lines that hold the city, and the pair that …

learning algorithm
Inductive on the grammar structure
Learn city = learn a map operator
The lines that hold the city
The pair that identifies the city within a line
Learn lines = learn a Boolean filter

To learn the lines that hold the city, we learn a Boolean filter that is satisfied by those lines.
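
As a rough illustration of learning a Boolean filter from example lines, the sketch below enumerates a tiny, assumed predicate space (`starts_with` and `ends_with` over tokens of the first example) and keeps every predicate that holds on all examples. The predicate space, the `Predicate` class, and the sample lines are hypothetical; the filters and learners used by FlashExtract are described in the paper.

```python
from typing import List, NamedTuple


class Predicate(NamedTuple):
    kind: str   # "starts_with" or "ends_with"
    token: str

    def __call__(self, line: str) -> bool:
        if self.kind == "starts_with":
            return line.startswith(self.token)
        return line.endswith(self.token)


def learn_filter(positive_lines: List[str]) -> List[Predicate]:
    """Return every candidate predicate that all example lines satisfy."""
    if not positive_lines:
        return []
    # Candidate tokens come from the first example (an assumed heuristic).
    tokens = sorted(set(positive_lines[0].split()))
    candidates = [
        Predicate(kind, token)
        for kind in ("starts_with", "ends_with")
        for token in tokens
    ]
    return [p for p in candidates if all(p(line) for line in positive_lines)]


# Hypothetical file contents and two user-highlighted example lines.
LINES = [
    "Ana Trujillo", "357 21st Place SE", "Redmond, WA", "(757) 555-1634", "",
    "Antonio Moreno", "515 93rd Lane", "Renton, WA", "(411) 555-2786",
]
learned = learn_filter(["Redmond, WA", "Renton, WA"])
selected = [line for line in LINES if learned[0](line)]
print(learned, selected)
```

With only the first example, `starts_with "Redmond,"` would also be consistent, which is one way to see why a second example or a ranking step is needed to resolve ambiguity.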

inductive synthesis
Problem Definition: Identify a vertical domain of tasks that users struggle with
Domain-Specific Language (DSL): Design a DSL that can succinctly describe tasks in that domain
Synthesis Algorithm: Develop an algorithm that can efficiently translate examples into likely programs in DSL
Machine Learning: Rank the various programs
User Interface: Provide an appropriate interaction mechanism to resolve ambiguities

What I have described so far is a case of inductive synthesis. In general, to create an inductive synthesizer, we first need to identify the problem domain. We then design the DSL to describe tasks in the domain and the algorithm to learn programs from examples. Next we define how we rank programs, and provide a user interface to enable user interaction.

pros & cons
Advantages: Efficient synthesizer; Easier ranking control; Tighter integration with user interaction model
Disadvantages: Non-constructive: require thinking & implementation; Non-modular: DSL is not extensible

The advantages of this specialized approach are … The disadvantage of this approach is that it is not trivial to instantiate the synthesizer for a new domain. It requires a lot of thinking and reimplementation of the algorithms. The DSL is also not easily extensible to the new domain.

inductive meta-synthesis
A synthesizer for a related family of DSLs that supports a common user interaction model
Alleviate disadvantages of the generic methodology

In this work, we introduce an inductive meta-synthesis methodology in which we need to create only a single synthesizer for a related family of DSLs that support a common UI. Our methodology alleviates the disadvantages.

inductive meta-synthesis
Identify a family of vertical task domains
Design an algebra for DSLs
Implement a search algorithm for each algebra operator

To do this, we need to identify a family of domains that share the common UI model. We need to design a core algebra for building DSLs, and we need to implement the search algorithm for …

inductive meta-synthesis
Test-Driven Synthesis (Perelman et al.), Synthesis Track @11:15am Wed

extraction meta-synthesis
Identify a family of vertical task domains: Extraction of semi-structured documents
Design an algebra for DSLs: Merge, Map, FilterBool, FilterInt, Pair
Implement a search algorithm for each algebra operator: Compositional and inductive learners

Here is how we apply our methodology to the domain of extracting data from semi-structured documents. The task, of course, is to extract data from documents. Our algebra contains the following operators. For each of these operators, we implement an efficient search algorithm that works across all domains. These algorithms are compositional and inductive. I will not discuss these operators and their learning algorithms here; please refer to the paper for the detailed discussion.

synthesis algorithm
Top-down: Top-level SeqRegion, Region symbols N1, N2
Grammar-guided: Grammar built from the algebra operators

At a high level, our synthesis algorithm is top-down and grammar-guided. We start at N1 if we want to learn a sequence of regions, or at N2 if we want to learn a region. The algorithm inductively invokes the learners of other operators, as defined in the grammar, to learn the program.

key insight
Reduce learning task for an expression to learning tasks for its sub-expressions
Example: Learn Map (λx : F, S)
Learn the scalar expression F
Learn the sequence expression S

The key insight of our approach is to reduce the learning of a complicated expression to the learning of its less complicated sub-expressions. As an example, our map construct consists of a sequence S and a function F that maps each element in S to a new element. The result is a sequence of new elements. We reduce the learning of the map operator to the learning of two sub-problems.
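
The reduction can be sketched as follows: the examples for the map are split into examples for S (the lines that hold a region) and examples for F (how a region sits inside its line), and each part is learned on its own. The two toy learners below (a common-suffix line filter and a fixed prefix/suffix trim) and all function names are assumptions made for illustration; they are much weaker than the learners in the paper.

```python
import os
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (line that holds the region, region text)


def common_suffix(strings: List[str]) -> str:
    """Longest suffix shared by all strings."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def learn_sequence(witness_lines: List[str]) -> Callable[[List[str]], List[str]]:
    """Learn S: keep the lines that end with the witnesses' common suffix."""
    suffix = common_suffix(witness_lines)
    return lambda lines: [line for line in lines if line and line.endswith(suffix)]


def learn_scalar(examples: List[Example]) -> Callable[[str], str]:
    """Learn F: trim a fixed prefix and suffix shared by all examples."""
    contexts = set()
    for line, region in examples:
        start = line.index(region)
        contexts.add((line[:start], line[start + len(region):]))
    if len(contexts) != 1:
        raise ValueError("examples are not explained by one prefix/suffix pair")
    prefix, suffix = contexts.pop()
    return lambda line: line[len(prefix): len(line) - len(suffix)]


def learn_map(examples: List[Example]) -> Callable[[List[str]], List[str]]:
    """Learn Map(λx : F, S) by learning S and F independently."""
    S = learn_sequence([line for line, _ in examples])
    F = learn_scalar(examples)
    return lambda lines: [F(x) for x in S(lines)]


# Hypothetical lines; the last record was not part of the examples.
LINES = [
    "Ana Trujillo", "357 21st Place SE", "Redmond, WA", "",
    "Antonio Moreno", "515 93rd Lane", "Renton, WA", "",
    "Maria Anders", "12 Pine Street", "Seattle, WA",
]
program = learn_map([("Redmond, WA", "Redmond"), ("Renton, WA", "Renton")])
cities = program(LINES)
print(cities)
```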

instantiations
Text files
Web pages
Spreadsheets

We instantiated our framework for three domains.

demo

Now let me give you a quick demo of the web page instantiation.

evaluation
Can FlashExtract extract data from real-world files?
How many interactions typically required?
How efficient/real-time is FlashExtract?

We want to answer the following questions via our evaluation.

expressiveness Can FlashExtract extract data from real-world files? How many interactions typically required? How efficient/real-time is FlashExtract?

benchmarks
25 text files: System log files; Copied texts from web pages and PDFs; Samples from “Pro Perl Parsing”
25 webpages from [1]: Add two more test cases for each web page
25 spreadsheets: 7 from [2] that are applicable for extracting; 18 from EUSES corpus

To answer the first question, we selected 25 real-world benchmarks for each domain. FlashExtract is able to extract data from all of them.

[1] E. Oro, M. Ruffolo, and S. Staab. SXPath: extending XPath towards spatial querying on web documents. Proc. VLDB Endow., 2010.
[2] B. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011.

effectiveness
Can FlashExtract extract data from real-world files? Yes
How many interactions typically required? 2.36 examples
How efficient/real-time is FlashExtract?

For the second question, FlashExtract needed 2.36 examples on average to extract a field. The majority are positive examples.

efficiency
Can FlashExtract extract data from real-world files? Yes
How many interactions typically required? 2.36 examples
How efficient/real-time is FlashExtract? 0.82s last interaction

And for the third question, FlashExtract took 0.82s on average for the last interaction. The last interaction corresponds to the last example, and it is the one for which FlashExtract takes the longest to synthesize, because it learns from more examples.

conclusion
Inductive meta-synthesis
FlashExtract is general: Text file, web page, spreadsheet instantiations
FlashExtract is practical: Extract real-world data, in real time, within a few examples

In summary, in this talk, I have presented the inductive meta-synthesis methodology and its instantiation, FlashExtract. FlashExtract is general because it can extract data from text files, web pages, and spreadsheets. FlashExtract is practical because it is able to extract real-world data, in real time, within a few examples.

thank you Questions?